Kaggle supports the addition of new data sets, using the "Add Data" button. It brings up a search engine, through which additional data sets can be found and added to a notebook.
After adding, the file path to the new dataset can be copied from the side menu and passed to a `pd.read_csv` call to load the data.
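As a minimal sketch of that loading step: the path in the comment is a placeholder rather than a real dataset, and the column names `Time` and `temp` are made up for illustration. An in-memory CSV stands in for the file so the snippet runs anywhere.

```python
import io

import pandas as pd

# On Kaggle you would pass the path copied from the side menu, e.g.
#   weather = pd.read_csv("/kaggle/input/<dataset-name>/<file-name>.csv")
# (placeholder path). Here an in-memory CSV stands in for that file.
csv_file = io.StringIO(
    "Time,temp\n"
    "2016-01-01 00:00:00,2.0\n"
    "2016-01-01 01:00:00,4.0\n"
)

# parse_dates converts the time-stamp column to datetime64 while loading.
weather = pd.read_csv(csv_file, parse_dates=["Time"])
print(weather.dtypes)
```

Parsing the time-stamp column at load time is optional, but it avoids a separate `pd.to_datetime` call later when the column is used as a key.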
Depending on the layout of the data you are adding, different methods are needed to attach its values to your main dataframe.
If the data you are adding does not share a key column with your main dataset, you will have to interpolate: given the values you have for your new data, figure out what values would be appropriate to pair with your main data set.
Example:
Dataset | Main key | Usable key |
---|---|---|
Main data | ID | Date-time-stamp (`pickup_datetime`) |
Weather data | Hourly date-time-stamp (`Time`) | Hourly date-time-stamp (`Time`) |
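The following code is inspired by `impute.KNNImputer`: for each row of the main data it finds the two nearest weather observations by time-stamp. Numeric weather columns are averaged between those two neighbors, and categorical weather columns are assigned by a nearest-neighbor classifier.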
from sklearn import neighbors, multioutput
categorical_weather_model = multioutput.MultiOutputClassifier(
    neighbors.KNeighborsClassifier(n_neighbors=2), n_jobs=-1)
numeric_weather_model = multioutput.MultiOutputRegressor(
    neighbors.KNeighborsRegressor(n_neighbors=2), n_jobs=-1)
weather_time = knyc_metars[["dt"]].to_numpy()  # plain arrays: recent scikit-learn rejects mismatched column names (dt vs. pickup_dt)
train_time = train[["pickup_dt"]].to_numpy()
test_time = test[["pickup_dt"]].to_numpy()
categorical_weather_model.fit(weather_time, knyc_metars[categorical_weathers])
train[categorical_weathers] = categorical_weather_model.predict(train_time)
test[categorical_weathers] = categorical_weather_model.predict(test_time)
numeric_weather_model.fit(weather_time, knyc_metars[numeric_weathers])
train[numeric_weathers] = numeric_weather_model.predict(train_time)
test[numeric_weathers] = numeric_weather_model.predict(test_time)
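To make the behavior concrete, here is a small self-contained sketch of the same idea on made-up data. The tables `weather` and `trips`, the column `temp`, and the helper `to_seconds` are hypothetical names for illustration; the sketch assumes that `dt` and `pickup_dt` in the code above are numeric time-stamps, which is how they are built here.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Toy hourly weather table (made-up values).
weather = pd.DataFrame({
    "Time": pd.to_datetime(
        ["2016-01-01 00:00", "2016-01-01 01:00", "2016-01-01 02:00"]),
    "temp": [2.0, 4.0, 8.0],
})

# Toy main table whose time-stamps fall between the hourly readings.
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime(
        ["2016-01-01 00:20", "2016-01-01 01:45"]),
})


def to_seconds(series):
    """Convert a datetime column to seconds since the Unix epoch."""
    return (series - pd.Timestamp("1970-01-01")).dt.total_seconds()


# KNN needs numeric features, so both time columns become seconds.
weather["dt"] = to_seconds(weather["Time"])
trips["pickup_dt"] = to_seconds(trips["pickup_datetime"])

# With n_neighbors=2 and the default uniform weights, the prediction is
# the plain mean of the two weather readings closest in time.
model = KNeighborsRegressor(n_neighbors=2)
model.fit(weather[["dt"]].to_numpy(), weather["temp"])
trips["temp"] = model.predict(trips[["pickup_dt"]].to_numpy())

print(trips[["pickup_datetime", "temp"]])
# temp is 3.0 for 00:20 (mean of 2.0 and 4.0)
# and 6.0 for 01:45 (mean of 4.0 and 8.0)
```

Note that this is an unweighted average: the 00:20 pickup gets the same value as a 00:40 pickup would. Passing `weights="distance"` to `KNeighborsRegressor` would favor the closer reading instead.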
The operation of taking two data tables and putting them "side-by-side" is called a join operation. We distinguish 4 types of joins: