Quick and dirty introduction to Pandas

Pandas is our go-to library for loading and manipulating data. Data in Pandas is stored in two fundamental data types:

  • Series is the datatype that stores a single column of data.
  • DataFrame is the datatype that stores an entire dataset.

They both behave very similar to numpy arrays, so you can use them as input to anything in numpy, scipy, or scikit-learn. The main difference from numpy arrays is indexing.

Indexes

Series have a single index. DataFrame has distinct column- and row-indices.

Pandas operations will align data on indices, adding NaN wherever data is missing. If the index is not important, this might be the wrong thing to do.

df.set_index("indexcolumn")

This code snippet will set the row-index to be a given column in the data (as opposed to a simple 1,2,3,... row-number, or something else entirely). Important to craft good Kaggle submissions - where the row-index you want is the one given in test.csv.

Input / Output

Getting data into Pandas can be done in several ways. The pd.Series and pd.DataFrame constructors take numpy arrays, or Python lists, or dictionaries, or (for pd.DataFrame) dictionaries of dictionaries.

Primarily useful: pd.read_csv will read in a CSV file and create a DataFrame.

Getting data out again can also be done in several ways. s.to_numpy() and df.to_numpy() create numpy arrays; also notable: to.html(), .to_markdown() and .to_latex() for typesetting.

Primarily useful: s.to_csv("filename.csv") and df.to_csv("filename.csv") will write out to a CSV file.

Chaining

Pandas operations usually chain together: each function returns the Series/DataFrame it works on, so that you can call a new method on the return value:

(
  test[["read_id"]].assign({"target": ypred})
  .set_index("read_id")
  .to_csv("submission.csv")
)

(the initial and final brackets "trick" Python into allowing us to do row-initial indent+.)

Common pitfall: assigning to DataFrame

Pandas enables many different ways to access your data. Do you want "the first" row? or the row with index value "A"?

On a Series, using [] works similar to numpy:

s[3]
s[5:7]

On a DataFrame, you can pick what you are trying to do using the df.loc and df.iloc accessors. With df.loc, you use index keys, with df.iloc you use integer positions.

Danger lies in constructions like df[a][b] = c - these are not guaranteed to work. Also problem is df.loc[a][b] = c - these will not work. Instead, use df.loc[a,b] = c

Common pitfall: when to use inplace=True

Many Pandas functions can take an argument of inplace = True. The difference between in-place and "normal" function calls is that in-place modifies the data object directly while "normal" functions return a copy (or a view).

What goes wrong here?

test = pd.read_csv("test.csv")
testX = test.drop(["row_id"], inplace=True)
submission = test[["row_id"]]

Where Pandas shines - fun things that you can do

  • Select based on "database queries": df.query("(a < b) & (b < c)")
  • Fill in missing values: s.fillna(), df.fillna()
    • Replace with scalar df.fillna(0)
    • Copy most recent present value df.fillna("pad")
    • Copy next present value df.fillna("backfill")
    • Interpolate from known values df.interpolate() (...also time-series aware!)