First lecture task

Use whichever platform you prefer to work on – we will give references and hints for R and for Python.

The first week’s task will make sure you are up and running with statistical software, and that you try out some initial exploratory data analysis and visualization tasks.

For this task I strongly suggest you do the steps in order.

Recommended software setup

I recommend that you install the following packages for your platform:

| Python       | R         |
|--------------|-----------|
| Matplotlib   | ggformula |
| Seaborn      | ggplot2   |
| NumPy/SciPy  | Mosaic    |
| scikit-learn | caret     |
| pandas       |           |
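
For Python users, one quick way to see whether the packages above are installed is to ask the interpreter whether it can locate each module. The following is a minimal sketch, not an official setup script; the helper name `missing_packages` and the mapping `PYTHON_PACKAGES` are made up for illustration. Note that scikit-learn is imported under the module name `sklearn`.

```python
from importlib.util import find_spec

# Hypothetical mapping: package name as listed above -> importable module name.
PYTHON_PACKAGES = {
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "numpy": "numpy",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
}


def missing_packages(packages=PYTHON_PACKAGES):
    """Return the names of packages whose module cannot be located."""
    return [name for name, module in packages.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All recommended Python packages were found.")
```

Anything reported as missing can be installed with whichever package manager you normally use (for example pip or conda).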

Task: Load data, numerical summaries

  1. Load the task dataset.
  2. Calculate the mean and standard deviation of x and y, as well as the correlation between x and y, both for the entire dataset and separately for each value of the dataset column.
    What do you notice? Recall that a simple linear regression of y on x is completely determined by these values: the regression line passes through the mean point \((\bar{x}, \bar{y})\) and has slope \(r_{x,y} s_y / s_x\), where \(r_{x,y}\) is the correlation coefficient and \(s_x\) and \(s_y\) are the standard deviations of x and y.
  3. Produce sequential plots of x and y (each plotted in row order) to check for time-series type structure in the data.
  4. Produce scatterplots of x against y separately for each value of dataset.
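
The steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the data frame `df` is a synthetic stand-in (the real task file should be loaded in its place, for example with `pd.read_csv`), it is assumed to have columns named `x`, `y`, and `dataset`, and the helper names `summarize` and `plot_dataset` are made up for this example. The sketch also checks numerically that a least-squares fit of y on x has slope \(r_{x,y} s_y / s_x\) and passes through the mean point.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the task data (an assumption, not the real file).
# Replace with the actual task dataset, e.g. via pd.read_csv.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "dataset": np.repeat(["a", "b"], 50),
    "x": rng.normal(50, 15, 100),
})
df["y"] = 0.5 * df["x"] + rng.normal(0, 10, 100)


def summarize(frame):
    """Means and sample standard deviations of x and y, plus their correlation."""
    return pd.Series({
        "mean_x": frame["x"].mean(),
        "mean_y": frame["y"].mean(),
        "sd_x": frame["x"].std(),  # pandas defaults to the sample SD (ddof=1)
        "sd_y": frame["y"].std(),
        "r_xy": frame["x"].corr(frame["y"]),  # Pearson correlation
    })


# Step 2: summaries for the whole dataset and for each value of `dataset`.
overall = summarize(df)
per_dataset = pd.DataFrame(
    {name: summarize(group) for name, group in df.groupby("dataset")}
).T

# Regression of y on x recovered from the summaries alone ...
slope_from_summary = overall["r_xy"] * overall["sd_y"] / overall["sd_x"]
intercept_from_summary = overall["mean_y"] - slope_from_summary * overall["mean_x"]
# ... compared with a direct least-squares fit (polyfit returns slope first).
slope_fit, intercept_fit = np.polyfit(df["x"], df["y"], deg=1)

print(per_dataset.round(2))
print("slope from summaries:", slope_from_summary, "fitted slope:", slope_fit)


def plot_dataset(frame, title=""):
    """Steps 3 and 4: x and y in row order, then a scatterplot of the pair."""
    import matplotlib.pyplot as plt  # imported here so the summaries run without it

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    axes[0].plot(frame["x"].to_numpy())
    axes[0].set_title("x in row order")
    axes[1].plot(frame["y"].to_numpy())
    axes[1].set_title("y in row order")
    axes[2].scatter(frame["x"], frame["y"])
    axes[2].set_xlabel("x")
    axes[2].set_ylabel("y")
    axes[2].set_title("scatterplot")
    fig.suptitle(title)
    return fig
```

To draw the plots for every group, one could loop over `df.groupby("dataset")` and call `plot_dataset(group, title=name)` for each pair, then display or save the returned figures.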