First lecture task

Use whichever platform you prefer to work on – we will give references and hints for R and for Python.

The first week’s task will make sure you are up and running with statistical software, and that you try out some initial exploratory data analysis and visualization tasks.

For this task I strongly suggest you do the steps in order.

Recommended software setup

I recommend making sure the following packages are installed for your platform:

Python	R
Matplotlib	ggformula
seaborn	ggplot2
NumPy/SciPy	mosaic
scikit-learn	caret
pandas

Task: Load data, numerical summaries

Load the task dataset.
Calculate the mean and standard deviation of x and y, as well as the correlation between x and y – both for the entire dataset and separately for each value of the dataset column.
What do you notice? Recall that a simple linear regression of y on x is completely determined by these values: the regression line goes through the mean point \((\bar{x}, \bar{y})\), and has slope \(r_{x,y} s_y / s_x\) for the correlation coefficient \(r_{x,y}\) and the standard deviations \(s_x\) and \(s_y\).
Produce sequential plots of x and y (each variable plotted against its row order) to check for time-series type structures in the data.
Produce scatterplots of x against y separately for each value of dataset.
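
For Python users, the numerical steps above can be sketched roughly as follows. This is only a minimal sketch, not a reference solution: it assumes the task data has already been loaded into a pandas DataFrame with columns named dataset, x and y, and it uses a small made-up toy frame in place of the real task dataset. The helper names summarize and plot_by_dataset are hypothetical, chosen for illustration.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Mean, standard deviation, correlation and regression line,
    for the whole frame and for each value of the `dataset` column."""
    rows = []
    # Treat the full data as one extra group labelled "all".
    groups = [("all", df)] + list(df.groupby("dataset"))
    for name, g in groups:
        r = g["x"].corr(g["y"])                # Pearson correlation
        s_x, s_y = g["x"].std(), g["y"].std()  # sample std (ddof=1)
        slope = r * s_y / s_x                  # slope of y regressed on x
        # The regression line passes through the mean point.
        intercept = g["y"].mean() - slope * g["x"].mean()
        rows.append({
            "dataset": name,
            "mean_x": g["x"].mean(), "mean_y": g["y"].mean(),
            "sd_x": s_x, "sd_y": s_y,
            "corr": r, "slope": slope, "intercept": intercept,
        })
    return pd.DataFrame(rows).set_index("dataset")


def plot_by_dataset(df: pd.DataFrame):
    """One scatterplot panel per value of `dataset` (needs seaborn)."""
    import seaborn as sns  # imported lazily; only needed for plotting
    return sns.relplot(data=df, x="x", y="y", col="dataset", col_wrap=3)


# Made-up toy data standing in for the task dataset (an assumption).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "dataset": np.repeat(["a", "b"], 50),
    "x": rng.normal(50, 15, 100),
})
true_slope = np.where(toy["dataset"] == "a", 2.0, -0.5)
toy["y"] = true_slope * toy["x"] + rng.normal(0, 5, 100)

summary = summarize(toy)
print(summary.round(3))
```

With the real data, replace toy by the loaded DataFrame. Comparing the rows of the summary table with the panels from plot_by_dataset is the point of the exercise.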
