First lecture task
Use whichever platform you prefer to work on; we will give references and hints for R and for Python.
The first week’s task will make sure you are up and running with statistical software, and that you try out some initial exploratory data analysis and visualization tasks.
For this task I strongly suggest you do the steps in order.
Recommended software setup
I recommend installing the following packages for your platform:
Python | R |
---|---|
Matplotlib | ggformula |
Seaborn | ggplot2 |
NumPy/SciPy | mosaic |
scikit-learn | caret |
pandas | |
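One quick way to confirm the Python side of this setup is to ask the interpreter whether each package can be found. The sketch below uses only the standard library; the helper name `check_packages` is an illustration, and the list assumes the usual import names (scikit-learn is imported as `sklearn`).

```python
from importlib.util import find_spec

# Import names for the Python column of the table above
# (assumption: scikit-learn is imported under the name "sklearn").
PACKAGES = ["matplotlib", "seaborn", "numpy", "scipy", "sklearn", "pandas"]


def check_packages(names):
    """Return {import_name: True/False} without actually importing anything."""
    return {name: find_spec(name) is not None for name in names}


status = check_packages(PACKAGES)
for name, found in status.items():
    print(f"{name:12s} {'ok' if found else 'MISSING'}")
```

Anything reported as `MISSING` can typically be installed with your usual package manager (for example `pip` or `conda`) before you start on the task.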
Task: Load data, numerical summaries
- Load the task dataset.
- Calculate the mean and standard deviation of `x` and `y` in the data, as well as the correlation between `x` and `y`, both for the entire dataset and separately for each value of `dataset`. What do you notice? Recall that a simple linear regression is completely determined by these values: the regression line of \(y\) on \(x\) goes through the mean point \((\bar{x}, \bar{y})\) and has slope \(r_{x,y} s_y / s_x\), for the correlation coefficient \(r_{x,y}\) and the standard deviations \(s_x\) and \(s_y\).
- Produce sequential plots of `x` and `y` to check for time-series type structures in the data.
- Produce scatterplots of `x` against `y` separately for each value of `dataset`.
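As a hint for the Python route, the steps above could be approached roughly as follows. This is a sketch under stated assumptions, not a reference solution: the small data frame is made-up stand-in data (the real task file is not named here, so load it yourself, for instance with `pd.read_csv`), and the helper names `summarize` and `plot_task` are hypothetical. The last part checks numerically that the slope of \(y\) regressed on \(x\) equals \(r_{x,y} s_y / s_x\).

```python
import numpy as np
import pandas as pd

# Made-up stand-in data with the same column names as the task dataset.
# Replace this with the real data, e.g. loaded via pd.read_csv.
df = pd.DataFrame({
    "dataset": ["a"] * 5 + ["b"] * 5,
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 4.0, 6.0, 8.0, 10.0],
    "y": [2.1, 3.9, 6.2, 7.8, 10.1, 9.0, 7.5, 6.1, 4.2, 3.0],
})


def summarize(frame):
    """Mean, sample standard deviation, and Pearson correlation of x and y."""
    return pd.Series({
        "mean_x": frame["x"].mean(),
        "mean_y": frame["y"].mean(),
        "sd_x": frame["x"].std(),  # pandas uses the sample SD (ddof=1) by default
        "sd_y": frame["y"].std(),
        "corr_xy": frame["x"].corr(frame["y"]),
    })


# Summaries for the entire data and separately for each value of `dataset`.
overall = summarize(df)
per_group = pd.DataFrame(
    {name: summarize(group) for name, group in df.groupby("dataset")}
).T
print(overall, per_group, sep="\n\n")

# Regression line of y on x, rebuilt from the summary values alone.
slope = overall["corr_xy"] * overall["sd_y"] / overall["sd_x"]
intercept = overall["mean_y"] - slope * overall["mean_x"]

# Cross-check against a direct least-squares fit.
fit_slope, fit_intercept = np.polyfit(df["x"].to_numpy(), df["y"].to_numpy(), deg=1)
print(slope, fit_slope)


def plot_task(frame):
    """Sequential plots of x and y, then one scatterplot per dataset value."""
    import matplotlib.pyplot as plt  # imported here so the summaries run without it

    fig, axes = plt.subplots(2, 1, sharex=True)
    axes[0].plot(frame["x"].to_numpy())
    axes[0].set_ylabel("x")
    axes[1].plot(frame["y"].to_numpy())
    axes[1].set_ylabel("y")
    axes[1].set_xlabel("row order")

    groups = list(frame.groupby("dataset"))
    fig2, axes2 = plt.subplots(1, len(groups), squeeze=False)
    for ax, (name, group) in zip(axes2[0], groups):
        ax.scatter(group["x"], group["y"])
        ax.set_title(str(name))
    plt.show()
```

Calling `plot_task(df)` draws the figures. Comparing the rows of `per_group` with the scatterplots is where the "What do you notice?" question becomes interesting.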