Pandas is our go-to library for loading and manipulating data. Data in Pandas is stored in two fundamental data types:
Series
is the datatype that stores a single column of data.DataFrame
is the datatype that stores an entire dataset.They both behave very similar to numpy arrays, so you can use them as input to anything in numpy, scipy, or scikit-learn. The main difference from numpy arrays is indexing.
Series
have a single index.
DataFrame
has distinct column- and row-indices.
Pandas operations will align data on indices, adding NaN
wherever data is missing.
If the index is not important, this might be the wrong thing to do.
df.set_index("indexcolumn")
This code snippet will set the row-index to be a given column in the data (as opposed to a simple 1,2,3,... row-number, or something else entirely).
Important to craft good Kaggle submissions - where the row-index you want is the one given in test.csv
.
Getting data into Pandas can be done in several ways.
The pd.Series
and pd.DataFrame
constructors take numpy arrays, or Python lists, or dictionaries, or (for pd.DataFrame
) dictionaries of dictionaries.
Primarily useful: pd.read_csv
will read in a CSV file and create a DataFrame
.
Getting data out again can also be done in several ways. s.to_numpy()
and df.to_numpy()
create numpy arrays; also notable: to.html()
, .to_markdown()
and .to_latex()
for typesetting.
Primarily useful: s.to_csv("filename.csv")
and df.to_csv("filename.csv")
will write out to a CSV file.
Pandas operations usually chain together: each function returns the Series
/DataFrame
it works on, so that you can call a new method on the return value:
(
test[["read_id"]].assign({"target": ypred})
.set_index("read_id")
.to_csv("submission.csv")
)
(the initial and final brackets "trick" Python into allowing us to do row-initial indent+.
)
DataFrame
¶Pandas enables many different ways to access your data. Do you want "the first" row? or the row with index value "A"
?
On a Series
, using []
works similar to numpy:
s[3]
s[5:7]
On a DataFrame
, you can pick what you are trying to do using the df.loc
and df.iloc
accessors. With df.loc
, you use index keys, with df.iloc
you use integer positions.
Danger lies in constructions like df[a][b] = c
- these are not guaranteed to work. Also problem is df.loc[a][b] = c
- these will not work.
Instead, use df.loc[a,b] = c
inplace=True
¶Many Pandas functions can take an argument of inplace = True
.
The difference between in-place and "normal" function calls is that in-place modifies the data object directly while "normal" functions return a copy (or a view).
What goes wrong here?
test = pd.read_csv("test.csv")
testX = test.drop(["row_id"], inplace=True)
submission = test[["row_id"]]
df.query("(a < b) & (b < c)")
s.fillna()
, df.fillna()
df.fillna(0)
df.fillna("pad")
df.fillna("backfill")
df.interpolate()
(...also time-series aware!)