How to add data¶

Kaggle lets you attach additional data sets to a notebook with the "Add Data" button. The button opens a search interface where you can find data sets and add them to the notebook.

After adding a data set, copy its file path from the side menu and pass it to pd.read_csv to load the data.
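A minimal sketch of the loading step. The file name and contents below are hypothetical stand-ins. On Kaggle you would instead use the path copied from the side menu (added data is typically mounted under /kaggle/input/). So that the sketch runs anywhere, it first writes a tiny CSV to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for a file added through "Add Data"; on Kaggle, use the path
# copied from the side menu instead of this temporary file.
tmp_dir = Path(tempfile.mkdtemp())
weather_path = tmp_dir / "weather.csv"  # hypothetical file name
weather_path.write_text(
    "Time,temp\n"
    "2016-01-01 00:00:00,3.5\n"
    "2016-01-01 01:00:00,4.0\n"
)

# parse_dates converts the time-stamp column to datetime64 instead of strings.
weather = pd.read_csv(weather_path, parse_dates=["Time"])
print(weather.dtypes)
```

Parsing the time-stamp column at load time is optional, but it saves a separate pd.to_datetime call later if the column is going to be used as a key.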

How to add data - after loading¶

Once the new data is loaded, the way you combine it with your main dataframe depends on how the new data is laid out. Two approaches are:

Interpolation, possibly with machine learning models.
Joins of data sets.

Interpolating for data addition¶

If the data you are adding does not share a key column with your main dataset, you will have to interpolate: from the values observed in the new data, estimate the values that correspond to each row of your main dataset.

Example:

Dataset	Main key	Usable key
Main data	ID	Date-time-stamp (`pickup_datetime`)
Weather data	Hourly date-time-stamp (`Time`)	Hourly date-time-stamp (`Time`)

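The following code is inspired by impute.KNNImputer: for each pickup time-stamp it finds the two nearest weather observations in time. Numeric weather values are averaged over those two neighbors; categorical values are chosen by a vote among the same neighbors.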

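from sklearn import neighbors, multioutput

# categorical_weathers and numeric_weathers are lists of weather column names;
# "dt" and "pickup_dt" are assumed to hold the time-stamps as numbers.
# Each wrapper fits one 2-nearest-neighbor model per output column.
categorical_weather_model = multioutput.MultiOutputClassifier(
    neighbors.KNeighborsClassifier(n_neighbors=2), n_jobs=-1)
numeric_weather_model = multioutput.MultiOutputRegressor(
    neighbors.KNeighborsRegressor(n_neighbors=2), n_jobs=-1)

# .to_numpy() drops the column names, which differ between the tables ("dt" vs
# "pickup_dt"); recent scikit-learn versions reject mismatched feature names.
categorical_weather_model.fit(knyc_metars[["dt"]].to_numpy(), knyc_metars[categorical_weathers])
train[categorical_weathers] = categorical_weather_model.predict(train[["pickup_dt"]].to_numpy())
test[categorical_weathers] = categorical_weather_model.predict(test[["pickup_dt"]].to_numpy())

numeric_weather_model.fit(knyc_metars[["dt"]].to_numpy(), knyc_metars[numeric_weathers])
train[numeric_weathers] = numeric_weather_model.predict(train[["pickup_dt"]].to_numpy())
test[numeric_weathers] = numeric_weather_model.predict(test[["pickup_dt"]].to_numpy())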
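The snippet above depends on dataframes and column lists defined elsewhere in the notebook. The following self-contained sketch shows the same idea on synthetic data for a single numeric column. The values, the `temp` column, the `to_seconds` helper, and the conversion of time-stamps to seconds are all assumptions made for illustration, not necessarily how `dt` and `pickup_dt` were built in the original notebook.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hourly weather table (synthetic values).
weather = pd.DataFrame({
    "Time": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 02:00"]),
    "temp": [0.0, 10.0, 20.0],
})
# Main table whose time-stamps fall between the hourly readings.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2016-01-01 00:15", "2016-01-01 01:30"]),
})


def to_seconds(series):
    """Hypothetical helper: seconds elapsed since a fixed reference date."""
    return (series - pd.Timestamp("2016-01-01")).dt.total_seconds()


# Nearest-neighbor models need numeric features, so convert the time-stamps.
weather["dt"] = to_seconds(weather["Time"])
trips["pickup_dt"] = to_seconds(trips["pickup_datetime"])

# Average the temperature of the two weather readings closest in time.
model = KNeighborsRegressor(n_neighbors=2)
model.fit(weather[["dt"]].to_numpy(), weather["temp"])
trips["temp"] = model.predict(trips[["pickup_dt"]].to_numpy())
print(trips[["pickup_datetime", "temp"]])
```

With the default uniform weights, the 00:15 pickup gets the plain mean of the 00:00 and 01:00 readings, regardless of which one is closer. Passing `weights="distance"` to `KNeighborsRegressor` weights each neighbor by inverse distance, which matches linear interpolation when the two neighbors lie on either side of the query time.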

Join of data sets¶

Taking two data tables and putting them "side-by-side", with rows matched on a shared key, is called a join. We distinguish four types of joins, which differ in the keys they keep:

Inner join - keeps only the keys present in both tables (the intersection)
Left join - keeps all keys from the left table
Right join - keeps all keys from the right table
Outer join - keeps all keys from either table (the union)
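The four join types can be compared with the `how` parameter of pandas' `DataFrame.merge`. This is a small sketch on two made-up tables; the column names and values are chosen only for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Collect the keys that survive each type of join.
keys_by_how = {}
for how in ["inner", "left", "right", "outer"]:
    joined = left.merge(right, on="key", how=how)
    keys_by_how[how] = sorted(joined["key"].tolist())
    print(how, keys_by_how[how])
# inner [2, 3]
# left [1, 2, 3]
# right [2, 3, 4]
# outer [1, 2, 3, 4]
```

Rows whose key is missing from the other table get NaN in that table's columns, so left, right, and outer joins usually need a follow-up step to handle missing values.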