First lecture task
Use whichever platform you prefer to work on – we will give references and hints for R and for Python.
The first week’s task will make sure you are up and running with statistical software, and that you try out some initial exploratory data analysis and visualization tasks.
For this task I strongly suggest you do the steps in order.
Recommended software setup
I recommend that you install the following packages for your platform:
| Python | R |
|---|---|
| Matplotlib | ggformula |
| Seaborn | ggplot2 |
| NumPy/SciPy | mosaic |
| scikit-learn | caret |
| pandas | |
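For the Python column, a quick way to confirm the setup is to ask the interpreter which of these packages it can find. The following is a minimal sketch using only the standard library; the import names (for example `sklearn` for scikit-learn) are the usual ones, and the helper name `missing_packages` is made up for this illustration.

```python
from importlib.util import find_spec

# Display name -> import name for the Python packages listed in the table.
PACKAGES = {
    "Matplotlib": "matplotlib",
    "Seaborn": "seaborn",
    "NumPy": "numpy",
    "SciPy": "scipy",
    "scikit-learn": "sklearn",
    "pandas": "pandas",
}


def missing_packages(packages=PACKAGES):
    """Return the display names of packages that cannot be found for import."""
    return [label for label, module in packages.items() if find_spec(module) is None]


missing = missing_packages()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All recommended Python packages were found.")
```

Anything reported as not installed can then be added with your usual package manager (pip or conda, for instance).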
Task: Load data, numerical summaries
- Load the task dataset.
- Calculate the mean and standard deviation of `x` and `y` in the data, as well as the correlation between `x` and `y`, both for the entire dataset and separately for each value of `dataset`.
  What do you notice? Recall that a simple linear regression of `y` on `x` is completely determined by these values: the regression line goes through the mean point \((\bar{x}, \bar{y})\) and has slope \(r_{x,y} s_y / s_x\), where \(r_{x,y}\) is the correlation coefficient and \(s_x\) and \(s_y\) are the standard deviations.
- Produce sequential plots of `x` and `y` to check for time-series type structures in the data.
- Produce scatterplots of `x` against `y` separately for each value of `dataset`.
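For those working in Python, the steps above could be sketched roughly as follows with pandas and Matplotlib. This is one possible approach, not a reference solution. It assumes the data has columns named `x`, `y`, and `dataset`; the file name in the comment is hypothetical, and the small `demo` frame is made-up placeholder data so that the code runs on its own. It is not the task dataset. The regression slope is computed for `y` on `x` as \(r_{x,y} s_y / s_x\).

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.Series:
    """Means, sample standard deviations, correlation, and the implied line of y on x."""
    r = df["x"].corr(df["y"])                  # Pearson correlation
    s_x, s_y = df["x"].std(), df["y"].std()    # sample standard deviations (ddof=1)
    slope = r * s_y / s_x                      # slope of the regression of y on x
    intercept = df["y"].mean() - slope * df["x"].mean()  # line passes through the mean point
    return pd.Series({
        "mean_x": df["x"].mean(), "mean_y": df["y"].mean(),
        "sd_x": s_x, "sd_y": s_y,
        "r": r, "slope": slope, "intercept": intercept,
    })


def plot_groups(df: pd.DataFrame):
    """One row of plots per value of `dataset`: a sequential plot and a scatterplot."""
    # Imported here so the numerical summaries also work without Matplotlib installed.
    import matplotlib.pyplot as plt

    groups = list(df.groupby("dataset"))
    fig, axes = plt.subplots(len(groups), 2, figsize=(8, 3 * len(groups)), squeeze=False)
    for (name, g), (ax_seq, ax_sc) in zip(groups, axes):
        # Sequential plot: values in the order they appear in the data.
        ax_seq.plot(g["x"].to_numpy(), label="x")
        ax_seq.plot(g["y"].to_numpy(), label="y")
        ax_seq.set_title(f"{name}: row order")
        ax_seq.legend()
        # Scatterplot of the pairs (x, y).
        ax_sc.scatter(g["x"], g["y"])
        ax_sc.set_xlabel("x")
        ax_sc.set_ylabel("y")
        ax_sc.set_title(f"{name}: scatter")
    fig.tight_layout()
    return fig


# Placeholder data for illustration only. Replace with the real data, for example
# df = pd.read_csv("task_dataset.csv")   # hypothetical file name and format
demo = pd.DataFrame({
    "dataset": ["a"] * 4 + ["b"] * 4,
    "x": [1, 2, 3, 4, 1, 2, 3, 4],
    "y": [2, 4, 6, 8, 8, 6, 4, 2],
})

overall = summarize(demo)
per_group = pd.DataFrame({name: summarize(g) for name, g in demo.groupby("dataset")}).T

print(overall)
print(per_group)
```

Calling `plot_groups(demo)` followed by `matplotlib.pyplot.show()` displays the figures. Comparing `overall` with the rows of `per_group` is the numerical half of the "What do you notice?" question; the plots are the visual half.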