Pandas

Pandas Refresher¹

#First things first
import numpy as np
import pandas as pd

Object Creation

s = pd.Series([1,3,5,np.nan,6,8])

s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Question: Are we allowed to insert non-basic types into a Series/Dataframe directly?

Answer: Yes! np.nan is a special value that occasionally appears when you work with numerical data through numpy.

Dates and Random

dates = pd.date_range("20130101", periods=6)

print(dates)

rng = np.random.default_rng()

df = pd.DataFrame(rng.random((6,4)), index=dates, columns=list("ABCD"))

print(df)

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
                   A         B         C         D
2013-01-01  0.429159  0.894459  0.276385  0.606649
2013-01-02  0.994782  0.849224  0.760898  0.360695
2013-01-03  0.665977  0.758237  0.380970  0.840581
2013-01-04  0.933263  0.080145  0.283548  0.398697
2013-01-05  0.174812  0.394146  0.154096  0.191397
2013-01-06  0.021537  0.426432  0.325085  0.163449

Question: What has happened when we ran date_range? What is a DatetimeIndex?

Answer: It produces a DatetimeIndex of equally spaced points. It’s an immutable array-like object containing datetime64 objects.

Question: When we created the dataframe, how can we only pass one function to populate the whole table?

Answer: rng.random((6,4)) returns enough results to populate the whole table.

Populating many columns at once:

df2 = pd.DataFrame(
  {
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
  }
)


print(df2,"\n")

df2.dtypes

     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object

Question: Quickly describe each of the columns and their types.

Answer: ...

Viewing Data

df.head()
df.tail(3)
df.index
df.to_numpy()
df.describe()
df.T
df.sort_index(axis=1, ascending=False)
df.sort_values(by="B")

	A	B	C	D
2013-01-04	0.933263	0.080145	0.283548	0.398697
2013-01-05	0.174812	0.394146	0.154096	0.191397
2013-01-06	0.021537	0.426432	0.325085	0.163449
2013-01-03	0.665977	0.758237	0.380970	0.840581
2013-01-02	0.994782	0.849224	0.760898	0.360695
2013-01-01	0.429159	0.894459	0.276385	0.606649

Selection

df["A"]
df[0:3]
df["20130102":"20130104"]

	A	B	C	D
2013-01-02	0.994782	0.849224	0.760898	0.360695
2013-01-03	0.665977	0.758237	0.380970	0.840581
2013-01-04	0.933263	0.080145	0.283548	0.398697

By label:

df.loc[dates[0]]
df.loc[:, ["A", "B"]]
df.loc["20130102":"20130104", ["A", "B"]]
df.loc[dates[0], "A"]
df.at[dates[0], "A"]

np.float64(0.429159232554452)

By position

df.iloc[3]
df.iloc[3:5, 0:2]
df.iloc[[1, 2, 4], [0, 2]]
df.iloc[1:3, :]
df.iloc[1, 1]
df.iat[1, 1]

np.float64(0.849223948484039)

Honestly, if you’re relying on this frequently, maybe it’s time to reconsider the form of your data…

Boolean Indexing

df[df["A"] > 0]

	A	B	C	D
2013-01-01	0.429159	0.894459	0.276385	0.606649
2013-01-02	0.994782	0.849224	0.760898	0.360695
2013-01-03	0.665977	0.758237	0.380970	0.840581
2013-01-04	0.933263	0.080145	0.283548	0.398697
2013-01-05	0.174812	0.394146	0.154096	0.191397
2013-01-06	0.021537	0.426432	0.325085	0.163449

df[df>0]

	A	B	C	D
2013-01-01	0.429159	0.894459	0.276385	0.606649
2013-01-02	0.994782	0.849224	0.760898	0.360695
2013-01-03	0.665977	0.758237	0.380970	0.840581
2013-01-04	0.933263	0.080145	0.283548	0.398697
2013-01-05	0.174812	0.394146	0.154096	0.191397
2013-01-06	0.021537	0.426432	0.325085	0.163449

df2 = df.copy()

df2["E"] = ["one", "one", "two", "three", "four", "three"]

df2

	A	B	C	D	E
2013-01-01	0.429159	0.894459	0.276385	0.606649	one
2013-01-02	0.994782	0.849224	0.760898	0.360695	one
2013-01-03	0.665977	0.758237	0.380970	0.840581	two
2013-01-04	0.933263	0.080145	0.283548	0.398697	three
2013-01-05	0.174812	0.394146	0.154096	0.191397	four
2013-01-06	0.021537	0.426432	0.325085	0.163449	three

df2[df2["E"].isin(["two", "four"])]

	A	B	C	D	E
2013-01-03	0.665977	0.758237	0.380970	0.840581	two
2013-01-05	0.174812	0.394146	0.154096	0.191397	four

Homework

Go to https://enviro.epa.gov/query-builder/advanced/radnet/
Under 1: Select “Types of Radionuclides and Radiation”
Under 3: Select “Sampling Location”
Under 4:
- Select under “Types of Radionuclides and Radiation”: “Analyte Name”, “Result in SI”
- Select under “Sampling Location”: “City Name”, “State Abbreviation”, “Station”, “Surface Water Source
Click “Get Data”

Download as CSV
Read into DataFrame
Provide both the following information and the code that produces the answer:
- What are the column names of DataFrame taken from the .csv?
- How many of the readings were taken in Ohio?
- What is the most common measurement (analyte_name) over the entire dataset?
- How many of the data points produced by the most common measurement came from Ohio?
- Create a separate dataframe that has three columns :
- 1. city_names from Ohio
  2. Total radiation observed in the city
  3. Percentage of radiation observed in Ohio
- Create a bar graph showing each city’s percentage contribution to the total radiation observed in Ohio.

For the purposes of this exercise, you may ignore the units provided in ‘result_unit’. Though, what might happen if you were to ignore units in a real analysis?

df = pd.read_csv(r"./Data/radnet.csv")
df.columns

Index(['city_name', 'state_abbr', 'station', 'surface_water_source',
       'ana_proc_name', 'analyte_name', 'result_in_si', 'result_unit'],
      dtype='object')

Footnotes

Parts of this refresher are derived from the Pandas documentation, go check it out! https://pandas.pydata.org/docs/user_guide/10min.html ↩︎

Pandas Refresher1