29 January, 2018

  • I am: Mikael Vejdemo-Johansson
  • Office: 1S-208; Office Hours: MW 12.45-14.15
  • Lectures: MW 14.30 - 16.30

Course webpage: http://www.math.csi.cuny.edu/~mvj/MTH410

  • All course details: syllabus, grading scheme, report requirements, schedule, …
  • All additional course content: lecture slides, homework, …
  • Linked from Blackboard

Course will be graded on a written report:

  • critical analysis of statistical methods in a published research paper, or
  • explanation and illustration of a statistical method or concept not covered in the course

Statistics is…

Statistics is the grammar of science.

Karl Pearson

Statistics is…

Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.

John Wilder Tukey

Statistics is…

All models are wrong, but some are useful.

George E. P. Box

Statistics is…

Prediction is very difficult, especially about the future.

Niels Bohr

Statistics is…

On two occasions I have been asked [by members of Parliament], ‘Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?’ I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

Charles Babbage

Statistics is…

…what do you think?

Statistics is…

The complement to probability theory: in probability, a distribution leads to random draws; in statistics, observations lead to a distribution.

Crash course in modern probability theory

A crash course

The book we use builds its arguments on measure-theoretic probability. Most of the time we will abstract away from this as much as possible, but to help you digest the text, this week is a high-speed crash course in the differences from what you might have encountered before.

Measures

Measures give an axiomatic way to reason about and work with sizes of sets. A measure \(\mu\) on a set \(X\) is a function \(\Sigma\to[0,\infty]\) defined on a system of subsets \(\Sigma\subset2^X\).

\(\Sigma\) needs to fulfill a few axioms, making it what is called a σ-algebra (a small checker is sketched in code after this list):

  1. \(X\in\Sigma\) and \(\emptyset\in\Sigma\)
  2. If \(S\in\Sigma\) then the complement \(S^c = X\setminus S \in \Sigma\).
  3. For any countably infinite sequence \(S_i\) of sets in \(\Sigma\), their union \(\bigcup_{i=1}^\infty S_i\in\Sigma\).
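To make the axioms concrete, here is a minimal Python sketch (function names are ours, not from the text) that checks them for a finite set \(X\), where countable unions reduce to finite unions:

```python
def is_sigma_algebra(X, Sigma):
    """Check the sigma-algebra axioms for a finite set X and a family
    Sigma of subsets; for finite X, countable unions are finite unions."""
    X = frozenset(X)
    Sigma = {frozenset(S) for S in Sigma}
    if X not in Sigma or frozenset() not in Sigma:       # axiom 1
        return False
    if any(X - S not in Sigma for S in Sigma):           # axiom 2: complements
        return False
    # axiom 3: closure under pairwise unions suffices in the finite case
    return all(A | B in Sigma for A in Sigma for B in Sigma)

X = {1, 2, 3, 4}
print(is_sigma_algebra(X, [set(), {1, 2}, {3, 4}, X]))   # True
print(is_sigma_algebra(X, [set(), {1}, X]))              # False: {2,3,4} missing
```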

Measures

Measures give an axiomatic way to reason about and work with sizes of sets. A measure \(\mu\) on a set \(X\) is a function \(\Sigma\to[0,\infty]\) defined on a system of subsets \(\Sigma\subset2^X\).

\(\mu\) also needs to fulfill axioms:

  • \(\mu(\emptyset) = 0\)
  • For a countable sequence of pairwise disjoint sets \(S_i\) in \(\Sigma\), \[ \mu\left(\bigcup_{i=1}^\infty S_i\right) = \sum_{i=1}^\infty\mu(S_i) \] (countable additivity)

Examples

If \(X\) is finite or countable, then \(\mu(S) = |S|\) is a measure. This is called the counting measure on \(X\). Here \(\Sigma\) is the set of all subsets of \(X\).

If \(X=\mathbb{R}^n\) then we can define \(\mu(S) = \int\dots\int_S dx_1\dots dx_n\). This measures the length/area/volume/hypervolume in the usual sense and is called the Lebesgue measure. Here \(\Sigma\) consists of the Borel sets: these are all rectangles \((a_1,b_1)\times\dots\times(a_n,b_n)\) and anything we can write with countable unions and complements of these.
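A hedged Python illustration of both examples (all names are ours): the counting measure on a finite set can be checked exactly, and the Lebesgue measure (area) of the unit disc can be estimated by uniform sampling:

```python
import random

# Counting measure on a finite set: mu(S) = |S|.
mu = len
A, B = {1, 2}, {3}                     # disjoint sets
print(mu(set()) == 0)                  # mu(emptyset) = 0
print(mu(A | B) == mu(A) + mu(B))      # additivity for disjoint sets

# Lebesgue measure of the unit disc, estimated by Monte Carlo:
# sample uniformly from the square [-1,1]^2 (area 4) and count hits.
random.seed(0)
n = 200_000
hits = sum(random.uniform(-1, 1) ** 2 + random.uniform(-1, 1) ** 2 <= 1
           for _ in range(n))
print(4 * hits / n)                    # approximately pi = 3.14159...
```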

Probability measure; limits

A measure \(\mu\) is a probability measure if \(\mu(X)=1\). In this case, the triple \((X,\Sigma,\mu)\) is called a probability space. Elements \(x\in X\) are outcomes and sets \(S\in\Sigma\) are events.

For all measures, if \(S_1\subset S_2\subset\dots\) have union \(S=\bigcup_{i=1}^\infty S_i\), then \[ \mu(S) = \lim_{i\to\infty} \mu(S_i) \]

For all finite measures, where \(\mu(X)<\infty\), an intersection version holds: if \(S_1\supset S_2\supset\dots\) and \(S=\bigcap_{i=1}^\infty S_i\), then \[ \mu(S) = \lim_{i\to\infty} \mu(S_i) \]
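To see why finiteness is needed here, consider a standard counterexample under Lebesgue measure on \(\mathbb{R}\): the decreasing sets \(S_i=(i,\infty)\) each have \(\mu(S_i)=\infty\), yet \[ \mu\left(\bigcap_{i=1}^\infty S_i\right) = \mu(\emptyset) = 0 \neq \lim_{i\to\infty}\mu(S_i) \]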

Integrals and measurable functions

A major function of introducing measures is to get a more robust theory of integrals.

If \(f:X\to\mathbb{R}\) is a function on a measurable space \((X,\Sigma)\) (i.e., a space on which a measure can be defined), then \(f\) is measurable if \[ f^{-1}(B) = \{ x\in X : f(x)\in B \} \in \Sigma \] for every Borel set \(B\). In other words, the preimage of any nice set \(B\) needs to be nice.

Indicator functions; simple functions

For \(A\in\Sigma\), the indicator function \(1_A(x)\) takes the value 0 if \(x\not\in A\) and the value 1 if \(x\in A\). This function is measurable.

A function of the form \[ f(x) = \sum_{i=1}^m a_i 1_{A_i}(x) \] with \(A_i\in\Sigma\) is called a simple function.

Theorem If \(f\) is nonnegative and measurable, there exist nonnegative simple functions \(f_1\leq f_2\leq\dots\) with \(f=\lim_{n\to\infty} f_n\) pointwise.
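The standard construction behind this theorem truncates at \(n\) and rounds down to multiples of \(2^{-n}\), giving \(f_n(x) = \min(n, \lfloor 2^n f(x)\rfloor 2^{-n})\). A minimal Python sketch (names are ours):

```python
import math

def simple_approx(f, n):
    """The n-th standard simple approximation of a nonnegative f:
    f_n(x) = min(n, floor(2^n f(x)) / 2^n); f_n increases to f pointwise."""
    return lambda x: min(n, math.floor(2 ** n * f(x)) / 2 ** n)

f = lambda x: x ** 2                     # a nonnegative measurable function
for n in (1, 2, 4, 8, 16):
    print(n, simple_approx(f, n)(1.3))   # climbs toward f(1.3) = 1.69
```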

Axioms for integrals

We expect an integral to follow the rules we learned in calculus. We restate them here, for the case where we have a measure we care about.

  1. \(\int 1_A d\mu = \mu(A)\)
  2. \(\int (af+bg)\, d\mu = a\int f\, d\mu + b\int g\, d\mu\) if \(a,b\in[0,\infty)\) and \(f, g\) are nonnegative measurable functions
  3. If \(f_1\leq f_2\leq\dots\) are nonnegative measurable functions and \(f(x) = \lim_{n\to\infty}f_n(x)\) then \[ \int f d\mu = \lim_{n\to\infty}\int f_n d\mu \]

Integrating any measurable function

By the theorem earlier, the integral properties define what the integral of any non-negative measurable function is. To integrate a generic measurable function, write \(f^+(x) = \max(f(x),0)\) and \(f^-(x)=-\min(f(x),0)\). Then \(f = f^+ - f^-\) and \(\int f d\mu = \int f^+ d\mu - \int f^- d\mu\).

This works as long as \(\int f^+ d\mu\) and \(\int f^- d\mu\) are not both infinite.

A function \(f\) such that \(\int |f| d\mu < \infty\) is called integrable.
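A minimal sketch of the positive/negative-part split, integrating \(f(x)=x\) over \([-1,2]\) against Lebesgue measure with a plain Riemann sum (exact value \(3/2\)); the names are ours:

```python
import numpy as np

f = lambda x: x
fplus  = lambda x: np.maximum(f(x), 0.0)    # f^+ = max(f, 0)
fminus = lambda x: np.maximum(-f(x), 0.0)   # f^- = -min(f, 0)

x = np.linspace(-1, 2, 300_000, endpoint=False)
dx = 3 / 300_000
int_plus  = (fplus(x) * dx).sum()           # approximately 2
int_minus = (fminus(x) * dx).sum()          # approximately 1/2
print(int_plus - int_minus)                 # approximately 3/2
```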

Probability

A probability space is defined by the triple \((E, B, \mathbb{P})\) of outcomes, events and the probability measure itself.

A measurable function \(X:E\to\mathbb{R}\) is called a random variable. Any random variable defines a probability measure on Borel sets by \[ \mathbb{P}_X(A) = \mathbb{P}\{e\in E: X(e)\in A\}= \mathbb{P}(X\in A) \]

We write \(X \sim Q\) to denote that the variable \(X\) follows the distribution \(Q\), or that \(\mathbb{P}_X = Q\).

The cumulative distribution function is a powerful description of \(X\), defined by \[ F_X(x) = \mathbb{P}(X \leq x) = \mathbb{P}(\{e\in E : X(e) \leq x\}) = \mathbb{P}_X((-\infty, x]) \]
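For example, a minimal Python sketch of the CDF of a fair six-sided die, a step function with jumps of \(1/6\) at \(x=1,\dots,6\) (the function name is ours):

```python
import math
from fractions import Fraction

def F(x):
    """CDF of a fair D6: F(x) = P(X <= x) = min(max(floor(x), 0), 6) / 6."""
    return Fraction(min(max(math.floor(x), 0), 6), 6)

print(F(0.5), F(3), F(3.7), F(10))   # 0, 1/2, 1/2, 1
```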

Densities and Radon-Nikodym

At least as often as the CDF, we will use probability density functions.

Theorem (Radon-Nikodym) If \(\mathbb{P}\) is a finite measure, absolutely continuous with respect to a σ-finite measure \(\mu\), then there is a nonnegative measurable \(f\) such that \[ \mathbb{P}(A) = \int_A f d\mu = \int f 1_A d\mu \]

The function \(f\) is called the Radon-Nikodym derivative of \(\mathbb{P}\) with respect to \(\mu\), or the density of \(\mathbb{P}\) with respect to \(\mu\). We write it \(d\mathbb{P}/d\mu\) when we wish to emphasize the derivative perspective. With \(\mu\) the counting measure, the density is the familiar probability mass function; with \(\mu\) the Lebesgue measure, it is the familiar probability density function.

Absolutely continuous here means that \(\mu(A)=0\) implies \(\mathbb{P}(A)=0\).
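As a hedged sanity check in Python: for the standard normal distribution the density with respect to Lebesgue measure is the usual pdf, so \(\mathbb{P}(A)=\int_A f\,d\mu\) can be verified numerically on an interval:

```python
from scipy.stats import norm
from scipy.integrate import quad

a, b = -1, 2                        # the event A = [-1, 2]
lhs = norm.cdf(b) - norm.cdf(a)     # P(X in A) from the CDF
rhs, _ = quad(norm.pdf, a, b)       # integral of the density over A
print(lhs, rhs)                     # both approximately 0.8186
```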

Expectation

Let \(X\) be a random variable on \((E,B,\mathbb{P})\).

The expectation or expected value of \(X\) is defined as \[ \mathbb{E}X = \int X d\mathbb{P} = \int xd\mathbb{P}_X(x) \]

If \(Y=f(X)\), then \[ \mathbb{E}Y = \mathbb{E}f(X) = \int f d\mathbb{P}_X \]

If \(d\mathbb{P}/d\mu = p(x)\), then these integrals reduce to \(\int fpd\mu\).
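A minimal sketch of the continuous case: \(\mathbb{E}f(X)=\int f(x)p(x)\,dx\) for \(X\sim N(0,1)\) and \(f(x)=x^2\), where the exact answer is \(\mathbb{E}X^2=1\):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Quadrature: integrate f(x) p(x) over the real line.
val, _ = quad(lambda x: x ** 2 * norm.pdf(x), -np.inf, np.inf)
print(val)                                         # approximately 1.0

# The same expectation by Monte Carlo averaging.
rng = np.random.default_rng(0)
print((rng.standard_normal(100_000) ** 2).mean())  # approximately 1.0
```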

Example

Rolling one D6

  • Outcomes: \(\{1,2,3,4,5,6\}\)
  • Events: all subsets
  • Probability measure: normalized counting measure – \(\mathbb{P}(S)=|S|/6\)

Random variable: \(X(e) = e\)

The expected value is \[ \int x d\mathbb{P}_X(x) = \sum_{x=1}^6 x\,\mathbb{P}(X=x) = \frac{1}{6}\sum_{x=1}^6x = \frac{21}{6} = \frac{7}{2} \]
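The same computation in exact arithmetic, as a minimal Python sketch:

```python
from fractions import Fraction

# E X = sum over x of x * P(X = x) for a fair D6.
EX = sum(x * Fraction(1, 6) for x in range(1, 7))
print(EX)    # 7/2
```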

Variance and Covariance

The covariance of random variables \(X\) and \(Y\) with finite expectations is defined as

\[ Cov(X,Y) = \mathbb{E}\left[(X-\mathbb{E}X)(Y-\mathbb{E}Y)\right] = \mathbb{E}(XY)-(\mathbb{E}X)(\mathbb{E}Y) \]

The variance of \(X\) is the covariance of \(X\) with itself \[ Var(X) = \mathbb{V}X = Cov(X,X) = \mathbb{E}\left((X-\mathbb{E}X)^2\right) = \mathbb{E}X^2 - (\mathbb{E}X)^2 \]

Correlation is a normalized kind of covariance \[ Cor(X,Y) = \frac{Cov(X,Y)}{\sqrt{\mathbb{V}X\cdot\mathbb{V}Y}} \]

Correlations only take values in \([-1,1]\).
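A hedged numerical illustration: build correlated samples and compute covariance and correlation from the formulas above. For this construction the exact values are \(Cov=1\) and \(Cor=1/\sqrt{2}\approx0.707\):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100_000)
y = x + rng.standard_normal(100_000)        # Y = X + noise, so Cov(X,Y) = VX = 1

cov = (x * y).mean() - x.mean() * y.mean()  # E(XY) - (EX)(EY)
cor = cov / np.sqrt(x.var() * y.var())      # normalized covariance
print(cov, cor)                             # approximately 1 and 0.707
```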

Linearity

Expectation is linear

\[ \mathbb{E}(aX+bY) = a\mathbb{E}X + b\mathbb{E}Y \]

Covariance (and variance) are more complicated \[ Cov(aX+bY, cW+dV) = ac\cdot Cov(X,W) + ad\cdot Cov(X,V) \\ + bc\cdot Cov(Y,W) + bd\cdot Cov(Y,V) \] \[ \mathbb{V}\left(\sum_i a_i X_i\right) = \sum_i a_i^2\mathbb{V}X_i + 2\sum_{i<j}a_i a_j\cdot Cov(X_i,X_j) \]
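Since empirical moments satisfy the same identities exactly, the variance-of-a-sum formula can be checked on raw data; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
x, y = rng.standard_normal(1000), rng.standard_normal(1000)
a, b = 3.0, -2.0

cov = (x * y).mean() - x.mean() * y.mean()
lhs = np.var(a * x + b * y)                     # V(aX + bY) on the sample
rhs = a**2 * np.var(x) + b**2 * np.var(y) + 2 * a * b * cov
print(lhs, rhs)                                 # agree up to rounding
```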

Example

Variance of rolling one D6:

\[ \mathbb{V}X = \mathbb{E}X^2 - (\mathbb{E}X)^2 = \sum_{x=1}^6 x^2\,\mathbb{P}(X=x) - 49/4 = \\ = 91/6 - 49/4 = 70/24 = 35/12 \simeq 2.9167 \]

Expectation and variance of sum of two D6

\[ \mathbb{E}(X+X') = \mathbb{E}X + \mathbb{E}X' = 7/2 + 7/2 = 7 \]

\[ \mathbb{V}(X+X') = \mathbb{V}X + \mathbb{V}X' + 2 Cov(X,X') \]

Independence

Two random variables (vector-valued if we feel like it) are independent if \[ \mathbb{P}(X\in A, Y\in B) = \mathbb{P}(X\in A)\cdot\mathbb{P}(Y\in B) \] for all Borel sets \(A\), \(B\).

If \(X\) and \(Y\) are independent, their joint distribution \(\mathbb{P}_{(X,Y)}\) is the product measure \(\mathbb{P}_X\times\mathbb{P}_Y\), defined by \[ (\mu\times\nu)(A\times B) = \mu(A)\cdot\nu(B) \]

Fubini's theorem: integrating against a product measure corresponds to integrating by one measure, then the other. For integrable \(f\), \[ \int f\, d(\mu\times\nu) = \int\left(\int f(x,y)\,d\mu(x)\right)d\nu(y) = \int\left(\int f(x,y)\,d\nu(y)\right)d\mu(x) \]
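A hedged numerical check of Fubini for \(f(x,y)=xy^2\) on \([0,1]^2\) (exact value \(1/6\)), using nested one-dimensional quadrature:

```python
from scipy.integrate import quad

f = lambda x, y: x * y ** 2

x_first = lambda y: quad(lambda x: f(x, y), 0, 1)[0]   # inner integral in x
y_first = lambda x: quad(lambda y: f(x, y), 0, 1)[0]   # inner integral in y
print(quad(x_first, 0, 1)[0],    # integrate in x, then y: 1/6
      quad(y_first, 0, 1)[0])    # integrate in y, then x: 1/6
```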

Covariance and independence

If \(X\) and \(Y\) are independent random variables, then \[ \mathbb{E}(XY) = \int xy\, d(\mathbb{P}_X\times\mathbb{P}_Y) \\ = \int\int xy\, d\mathbb{P}_X(x)d\mathbb{P}_Y(y) = \int\left(\int x\,d\mathbb{P}_X(x)\right)y\,d\mathbb{P}_Y(y) \\ = \int x\,d\mathbb{P}_X(x) \cdot \int y\,d\mathbb{P}_Y(y) = (\mathbb{E}X)(\mathbb{E}Y) \] so \[ Cov(X,Y) = \mathbb{E}(XY)-(\mathbb{E}X)(\mathbb{E}Y) = 0 \]

Example

We established that for two D6, captured as random variables \(X\) and \(X'\), \[ \mathbb{V}(X+X') = \mathbb{V}X + \mathbb{V}X' + 2\,Cov(X,X') \]

If dice throws are independent, \(Cov(X,X')=0\), and the variance comes out as \[ \mathbb{V}(X+X') = \mathbb{V}X + \mathbb{V}X' = 35/12 + 35/12 = 35/6 \simeq 5.83 \]
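A minimal sketch verifying both results exactly by enumerating all 36 equally likely outcomes of two independent dice:

```python
from fractions import Fraction
from itertools import product

p = Fraction(1, 36)                    # each outcome pair is equally likely
s = [x + y for x, y in product(range(1, 7), repeat=2)]
ES  = sum(p * v for v in s)            # E(X + X') = 7
ES2 = sum(p * v ** 2 for v in s)
print(ES, ES2 - ES ** 2)               # 7 and 35/6
```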