Pandas

Pandas Refresher1

#First things first
import numpy as np
import pandas as pd

Object Creation

s = pd.Series([1,3,5,np.nan,6,8])

s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Question: Are we allowed to insert non-basic types into a Series/Dataframe directly?

Answer: Yes! np.nan is a special value that occasionally appears when you work with numerical data through numpy.

Dates and Random

dates = pd.date_range("20130101", periods=6)

print(dates)

rng = np.random.default_rng()

df = pd.DataFrame(rng.random((6,4)), index=dates, columns=list("ABCD"))

print(df)
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
                   A         B         C         D
2013-01-01  0.429159  0.894459  0.276385  0.606649
2013-01-02  0.994782  0.849224  0.760898  0.360695
2013-01-03  0.665977  0.758237  0.380970  0.840581
2013-01-04  0.933263  0.080145  0.283548  0.398697
2013-01-05  0.174812  0.394146  0.154096  0.191397
2013-01-06  0.021537  0.426432  0.325085  0.163449

Question: What has happened when we ran date_range? What is a DatetimeIndex?

Answer: It produces a DatetimeIndex of equally spaced points. It’s an immutable array-like object containing datetime64 objects.

Question: When we created the dataframe, how can we only pass one function to populate the whole table?

Answer: rng.random((6,4)) returns enough results to populate the whole table.

Populating many columns at once:

df2 = pd.DataFrame(
  {
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
  }
)


print(df2,"\n")

df2.dtypes
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo 
A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object

Question: Quickly describe each of the columns and their types.

Answer: ...

Viewing Data

df.head()
df.tail(3)
df.index
df.to_numpy()
df.describe()
df.T
df.sort_index(axis=1, ascending=False)
df.sort_values(by="B")
A B C D
2013-01-04 0.933263 0.080145 0.283548 0.398697
2013-01-05 0.174812 0.394146 0.154096 0.191397
2013-01-06 0.021537 0.426432 0.325085 0.163449
2013-01-03 0.665977 0.758237 0.380970 0.840581
2013-01-02 0.994782 0.849224 0.760898 0.360695
2013-01-01 0.429159 0.894459 0.276385 0.606649

Selection

df["A"]
df[0:3]
df["20130102":"20130104"]
A B C D
2013-01-02 0.994782 0.849224 0.760898 0.360695
2013-01-03 0.665977 0.758237 0.380970 0.840581
2013-01-04 0.933263 0.080145 0.283548 0.398697

By label:

df.loc[dates[0]]
df.loc[:, ["A", "B"]]
df.loc["20130102":"20130104", ["A", "B"]]
df.loc[dates[0], "A"]
df.at[dates[0], "A"]
np.float64(0.429159232554452)

By position

df.iloc[3]
df.iloc[3:5, 0:2]
df.iloc[[1, 2, 4], [0, 2]]
df.iloc[1:3, :]
df.iloc[1, 1]
df.iat[1, 1]
np.float64(0.849223948484039)

Honestly, if you’re relying on this frequently, maybe it’s time to reconsider the form of your data…

Boolean Indexing

df[df["A"] > 0]
A B C D
2013-01-01 0.429159 0.894459 0.276385 0.606649
2013-01-02 0.994782 0.849224 0.760898 0.360695
2013-01-03 0.665977 0.758237 0.380970 0.840581
2013-01-04 0.933263 0.080145 0.283548 0.398697
2013-01-05 0.174812 0.394146 0.154096 0.191397
2013-01-06 0.021537 0.426432 0.325085 0.163449
df[df>0]
A B C D
2013-01-01 0.429159 0.894459 0.276385 0.606649
2013-01-02 0.994782 0.849224 0.760898 0.360695
2013-01-03 0.665977 0.758237 0.380970 0.840581
2013-01-04 0.933263 0.080145 0.283548 0.398697
2013-01-05 0.174812 0.394146 0.154096 0.191397
2013-01-06 0.021537 0.426432 0.325085 0.163449
df2 = df.copy()

df2["E"] = ["one", "one", "two", "three", "four", "three"]

df2
A B C D E
2013-01-01 0.429159 0.894459 0.276385 0.606649 one
2013-01-02 0.994782 0.849224 0.760898 0.360695 one
2013-01-03 0.665977 0.758237 0.380970 0.840581 two
2013-01-04 0.933263 0.080145 0.283548 0.398697 three
2013-01-05 0.174812 0.394146 0.154096 0.191397 four
2013-01-06 0.021537 0.426432 0.325085 0.163449 three
df2[df2["E"].isin(["two", "four"])]
A B C D E
2013-01-03 0.665977 0.758237 0.380970 0.840581 two
2013-01-05 0.174812 0.394146 0.154096 0.191397 four

Homework

  • Go to https://enviro.epa.gov/query-builder/advanced/radnet/

  • Under 1: Select “Types of Radionuclides and Radiation”

  • Under 3: Select “Sampling Location”

  • Under 4:

    • Select under “Types of Radionuclides and Radiation”: “Analyte Name”, “Result in SI”

    • Select under “Sampling Location”: “City Name”, “State Abbreviation”, “Station”, “Surface Water Source

  • Click “Get Data”

  • Download as CSV

  • Read into DataFrame

  • Provide both the following information and the code that produces the answer:

    • What are the column names of DataFrame taken from the .csv?

    • How many of the readings were taken in Ohio?

    • What is the most common measurement (analyte_name) over the entire dataset?

    • How many of the data points produced by the most common measurement came from Ohio?

    • Create a separate dataframe that has three columns :

      1. city_names from Ohio
      2. Total radiation observed in the city
      3. Percentage of radiation observed in Ohio
    • Create a bar graph showing each city’s percentage contribution to the total radiation observed in Ohio.

For the purposes of this exercise, you may ignore the units provided in ‘result_unit’. Though, what might happen if you were to ignore units in a real analysis?

df = pd.read_csv(r"./Data/radnet.csv")
df.columns
Index(['city_name', 'state_abbr', 'station', 'surface_water_source',
       'ana_proc_name', 'analyte_name', 'result_in_si', 'result_unit'],
      dtype='object')

Footnotes

  1. Parts of this refresher are derived from the Pandas documentation, go check it out! https://pandas.pydata.org/docs/user_guide/10min.html↩︎