#First things first
import numpy as np
import pandas as pd
Pandas
Pandas Refresher1
Object Creation
= pd.Series([1,3,5,np.nan,6,8])
s
s
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Question: Are we allowed to insert non-basic types into a Series/Dataframe directly?
Answer: Yes! np.nan
is a special value that occasionally appears when you work with numerical data through numpy.
Dates and Random
= pd.date_range("20130101", periods=6)
dates
print(dates)
= np.random.default_rng()
rng
= pd.DataFrame(rng.random((6,4)), index=dates, columns=list("ABCD"))
df
print(df)
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
A B C D
2013-01-01 0.429159 0.894459 0.276385 0.606649
2013-01-02 0.994782 0.849224 0.760898 0.360695
2013-01-03 0.665977 0.758237 0.380970 0.840581
2013-01-04 0.933263 0.080145 0.283548 0.398697
2013-01-05 0.174812 0.394146 0.154096 0.191397
2013-01-06 0.021537 0.426432 0.325085 0.163449
Question: What has happened when we ran date_range
? What is a DatetimeIndex
?
Answer: It produces a DatetimeIndex
of equally spaced points. It’s an immutable array-like object containing datetime64 objects.
Question: When we created the dataframe, how can we only pass one function to populate the whole table?
Answer: rng.random((6,4))
returns enough results to populate the whole table.
Populating many columns at once:
= pd.DataFrame(
df2
{"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
print(df2,"\n")
df2.dtypes
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
A float64
B datetime64[s]
C float32
D int32
E category
F object
dtype: object
Question: Quickly describe each of the columns and their types.
Answer: ...
Viewing Data
df.head()3)
df.tail(
df.index
df.to_numpy()
df.describe()
df.T=1, ascending=False)
df.sort_index(axis="B") df.sort_values(by
A | B | C | D | |
---|---|---|---|---|
2013-01-04 | 0.933263 | 0.080145 | 0.283548 | 0.398697 |
2013-01-05 | 0.174812 | 0.394146 | 0.154096 | 0.191397 |
2013-01-06 | 0.021537 | 0.426432 | 0.325085 | 0.163449 |
2013-01-03 | 0.665977 | 0.758237 | 0.380970 | 0.840581 |
2013-01-02 | 0.994782 | 0.849224 | 0.760898 | 0.360695 |
2013-01-01 | 0.429159 | 0.894459 | 0.276385 | 0.606649 |
Selection
"A"]
df[0:3]
df["20130102":"20130104"] df[
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 0.994782 | 0.849224 | 0.760898 | 0.360695 |
2013-01-03 | 0.665977 | 0.758237 | 0.380970 | 0.840581 |
2013-01-04 | 0.933263 | 0.080145 | 0.283548 | 0.398697 |
By label:
0]]
df.loc[dates["A", "B"]]
df.loc[:, ["20130102":"20130104", ["A", "B"]]
df.loc[0], "A"]
df.loc[dates[0], "A"] df.at[dates[
np.float64(0.429159232554452)
By position
3]
df.iloc[3:5, 0:2]
df.iloc[1, 2, 4], [0, 2]]
df.iloc[[1:3, :]
df.iloc[1, 1]
df.iloc[1, 1] df.iat[
np.float64(0.849223948484039)
Honestly, if you’re relying on this frequently, maybe it’s time to reconsider the form of your data…
Boolean Indexing
"A"] > 0] df[df[
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 0.429159 | 0.894459 | 0.276385 | 0.606649 |
2013-01-02 | 0.994782 | 0.849224 | 0.760898 | 0.360695 |
2013-01-03 | 0.665977 | 0.758237 | 0.380970 | 0.840581 |
2013-01-04 | 0.933263 | 0.080145 | 0.283548 | 0.398697 |
2013-01-05 | 0.174812 | 0.394146 | 0.154096 | 0.191397 |
2013-01-06 | 0.021537 | 0.426432 | 0.325085 | 0.163449 |
>0] df[df
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 0.429159 | 0.894459 | 0.276385 | 0.606649 |
2013-01-02 | 0.994782 | 0.849224 | 0.760898 | 0.360695 |
2013-01-03 | 0.665977 | 0.758237 | 0.380970 | 0.840581 |
2013-01-04 | 0.933263 | 0.080145 | 0.283548 | 0.398697 |
2013-01-05 | 0.174812 | 0.394146 | 0.154096 | 0.191397 |
2013-01-06 | 0.021537 | 0.426432 | 0.325085 | 0.163449 |
= df.copy()
df2
"E"] = ["one", "one", "two", "three", "four", "three"]
df2[
df2
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-01 | 0.429159 | 0.894459 | 0.276385 | 0.606649 | one |
2013-01-02 | 0.994782 | 0.849224 | 0.760898 | 0.360695 | one |
2013-01-03 | 0.665977 | 0.758237 | 0.380970 | 0.840581 | two |
2013-01-04 | 0.933263 | 0.080145 | 0.283548 | 0.398697 | three |
2013-01-05 | 0.174812 | 0.394146 | 0.154096 | 0.191397 | four |
2013-01-06 | 0.021537 | 0.426432 | 0.325085 | 0.163449 | three |
"E"].isin(["two", "four"])] df2[df2[
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-03 | 0.665977 | 0.758237 | 0.380970 | 0.840581 | two |
2013-01-05 | 0.174812 | 0.394146 | 0.154096 | 0.191397 | four |
Homework
Under 1: Select “Types of Radionuclides and Radiation”
Under 3: Select “Sampling Location”
Under 4:
Select under “Types of Radionuclides and Radiation”: “Analyte Name”, “Result in SI”
Select under “Sampling Location”: “City Name”, “State Abbreviation”, “Station”, “Surface Water Source
Click “Get Data”
Download as CSV
Read into DataFrame
Provide both the following information and the code that produces the answer:
What are the column names of DataFrame taken from the .csv?
How many of the readings were taken in Ohio?
What is the most common measurement (analyte_name) over the entire dataset?
How many of the data points produced by the most common measurement came from Ohio?
Create a separate dataframe that has three columns :
- city_names from Ohio
- Total radiation observed in the city
- Percentage of radiation observed in Ohio
Create a bar graph showing each city’s percentage contribution to the total radiation observed in Ohio.
For the purposes of this exercise, you may ignore the units provided in ‘result_unit’. Though, what might happen if you were to ignore units in a real analysis?
= pd.read_csv(r"./Data/radnet.csv")
df df.columns
Index(['city_name', 'state_abbr', 'station', 'surface_water_source',
'ana_proc_name', 'analyte_name', 'result_in_si', 'result_unit'],
dtype='object')
Footnotes
Parts of this refresher are derived from the Pandas documentation, go check it out! https://pandas.pydata.org/docs/user_guide/10min.html↩︎