The review is comprehensive, though more weight will be given to the material on confidence intervals and significance tests.
A quick review of the formulas in Ch 7 on two-sample tests:
\[ T = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE} = \frac{observed - expected}{SE} \]
This will have a \(t\) distribution.
If we assume matched samples, then we subtract the paired data to produce a single sample and use the one-sample \(t\) test with \(n-1\) degrees of freedom.
If we assume independent samples and normal populations (or \(n\) large enough given how far the population is from normal) then we have
\[ SD = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \]
With no assumptions on the variances, we have:
\[ SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}. \]
and use the smaller of \(n_1-1\) or \(n_2-1\) for the degrees of freedom.
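As a sketch, this case can be computed directly from summary statistics (the numbers below are made up for illustration):

```r
# Two-sample t statistic with no assumption on the variances.
# Illustrative summary statistics (not from the text):
xbar1 <- 10; s1 <- 3; n1 <- 15
xbar2 <- 8;  s2 <- 4; n2 <- 20
SE <- sqrt(s1^2/n1 + s2^2/n2)       # standard error of xbar1 - xbar2
T  <- (xbar1 - xbar2) / SE          # testing H0: mu1 - mu2 = 0
df <- min(n1 - 1, n2 - 1)           # conservative degrees of freedom
p_value <- 2 * pt(-abs(T), df)      # two-sided p-value
```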
If we assume \(\sigma = \sigma_1 = \sigma_2\) (but unknown value for \(\sigma\)) then
\[ SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}. \]
and use \(n_1+ n_2 - 2\) for the degrees of freedom.
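A sketch of the pooled case, using the standard pooled estimate \(s_p\) on the same made-up summary numbers:

```r
# Pooled two-sample t statistic, assuming sigma1 = sigma2.
# Illustrative summary statistics (not from the text):
xbar1 <- 10; s1 <- 3; n1 <- 15
xbar2 <- 8;  s2 <- 4; n2 <- 20
sp <- sqrt(((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2))  # pooled s
SE <- sp * sqrt(1/n1 + 1/n2)
T  <- (xbar1 - xbar2) / SE
df <- n1 + n2 - 2
p_value <- 2 * pt(-abs(T), df)
```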
Finally, for completeness, though not a reasonable assumption, if we assume we know \(\sigma_1\) and \(\sigma_2\), then \(SE=SD\) and \(T\) will have a normal distribution (a \(t\) distribution with \(\infty\) degrees of freedom).
Related, but different, Chapter 8 deals with sample proportions for which we have:
\[ \hat{p} = \frac{X}{n}, \quad \hat{p}_1 = \frac{X_1}{n_1}, \quad \hat{p}_2 = \frac{X_2}{n_2}. \]
The \(X\)s above are binomial, so the \(\hat{p}\)s are approximately normal if \(np\) and \(n(1-p)\) are both greater than 10.
The main test statistics are (for one and two samples):
\[ Z = \frac{\hat{p} - p}{SE}, \quad Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{SE} \]
If we assume both \(np\) and \(n(1-p)\) are greater than 10 (for each sample in the two-sample case), then \(Z\) is approximately standard normal.
The standard deviations are:
\[ SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}, \quad SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]
The SE used will depend on the assumptions:
For a one sample CI, we use \(\hat{p}\) to estimate \(p\).
For a two sample CI, we use \(\hat{p}_1\) and \(\hat{p}_2\) to estimate \(p_1\) and \(p_2\).
For a one sample significance test, we use \(SE=SD\), as \(p\) is an assumed value under \(H_o\).
For a two sample significance test, we pool the data and use \(\hat{p} = (n_1\hat{p}_1 + n_2\hat{p}_2)/(n_1+n_2)\) as an estimate for both \(p_1\) and \(p_2\).
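A sketch of the two-sample significance test with the pooled estimate (the counts below are made up for illustration):

```r
# Two-sample test of H0: p1 = p2, pooling the data for the SE.
# Illustrative counts (not from the text):
x1 <- 40; n1 <- 200
x2 <- 30; n2 <- 250
phat1 <- x1/n1; phat2 <- x2/n2
phat  <- (x1 + x2)/(n1 + n2)                 # pooled estimate
SE <- sqrt(phat*(1 - phat)*(1/n1 + 1/n2))    # SE under H0
Z  <- (phat1 - phat2)/SE
p_value <- 2 * pnorm(-abs(Z))                # two-sided
```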
The regression model is that the mean response for \(y\) given \(x\) is linear: \(\mu_{y\mid x} = \beta_0 + \beta_1 x\). A given value \(y_i\) is modeled by the mean plus an error, or \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\), where our assumptions are that the \(\epsilon_1, \epsilon_2, \dots\) are an i.i.d. sample from a normal population with mean 0 and variance \(\sigma^2\).
We estimate the \(\beta\)s and \(\sigma\) with:
\[ b_1 = r \frac{s_y}{s_x}, \quad b_0 = \bar{y} - b_1 \bar{x}, \quad \hat{y}_i = b_0 + b_1 x_i, \quad \text{residual} = e_i = y_i - \hat{y}_i, \quad s = \sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-2}} \]
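These formulas can be checked by hand on a small made-up data set and compared against what lm() computes:

```r
# Regression estimates by hand, on illustrative data (not from the text).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
b1 <- cor(x, y) * sd(y) / sd(x)       # slope
b0 <- mean(y) - b1 * mean(x)          # intercept
yhat <- b0 + b1 * x                   # fitted values
e <- y - yhat                         # residuals
s <- sqrt(sum(e^2) / (length(x) - 2)) # estimate of sigma
res <- lm(y ~ x)                      # should agree with b0, b1, s
```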
We focus here on just \(\beta_1\). Its estimate \(b_1\) has
\[ SD(b_1) = \frac{\sigma}{\sqrt{\sum(x_i - \bar{x})^2}} \]
with the \(SE\) given by estimating \(\sigma\) with \(s\), from above.
The following statistic has a \(t\) distribution with \(n-2\) degrees of freedom:
\[ T = \frac{b_1 - \beta_1}{SE}. \]
This allows CIs (\(b_1 \pm t^* SE\)) and significance tests to be performed.
From \(y_i -\bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)\), we can use the following names:
\[ SSTotal = \sum(y_i -\bar{y})^2,\quad SSModel = \sum(\hat{y}_i - \bar{y})^2, \quad SSError = \sum(y_i - \hat{y}_i)^2, \]
and the formula \(SST = SSM + SSE\). Further, for degrees of freedom \(DFT =DFM + DFE\) and \(DFT=n-1\) and \(DFM=1\) so \(DFE = n-2\).
The mean square is the “sum of squares” divided by the degrees of freedom; in particular, \(MSE = SSE/(n-2)\) is our estimate for \(\sigma^2\).
The square of the Pearson correlation coefficient can be expressed as \(r^2 = SSM / SST\), which makes precise the statement that \(r^2\) is the proportion of the total variation explained by the model.
The \(F\) statistic is \(MSM/MSE\). This is small (close to 0) if the model does not explain much variation; and large if it does. It is used to test if \(\beta_1 = 0\), and is output in the software.
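The decomposition and the \(F\) statistic can be verified numerically on a small made-up data set:

```r
# Checking SST = SSM + SSE and F = MSM/MSE on illustrative data.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
res <- lm(y ~ x)
yhat <- fitted(res)
SST <- sum((y - mean(y))^2)
SSM <- sum((yhat - mean(y))^2)
SSE <- sum((y - yhat)^2)
Fstat <- (SSM/1) / (SSE/(length(x) - 2))  # MSM/MSE with DFM=1, DFE=n-2
# SST equals SSM + SSE, and SSM/SST equals R-squared from summary(res)
```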
The next question involves the birth weight (wt) and the gestation time (gestation). Perform a two-sided test of \(H_o: \beta_1 = 1/2\).
res = lm(wt ~ gestation)
summary(res)
##
## Call:
## lm(formula = wt ~ gestation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.125 -9.767 -3.743 12.085 42.684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -42.1204 76.9746 -0.547 0.5895
## gestation 0.5988 0.2710 2.209 0.0374 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.05 on 23 degrees of freedom
## Multiple R-squared: 0.175, Adjusted R-squared: 0.1392
## F-statistic: 4.88 on 1 and 23 DF, p-value: 0.0374
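Using the estimate, standard error, and degrees of freedom printed in the output above, the test of \(H_o: \beta_1 = 1/2\) can be done by hand:

```r
# Values taken from the summary() output above.
b1 <- 0.5988; SE <- 0.2710; df <- 23
T <- (b1 - 1/2) / SE            # observed minus hypothesized, over SE
p_value <- 2 * pt(-abs(T), df)  # two-sided p-value
```

Note that the t value and p-value printed by summary() test \(H_o: \beta_1 = 0\), not \(\beta_1 = 1/2\); here \(T\) is small and the p-value is well above 0.05.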
A plot of residuals versus gestation is given below:
plot(gestation, resid(res))
Does this plot indicate if the value of \(\sigma\) depends on \(x\)?
A qqplot of the residuals is given below:
qqnorm(resid(res))
Does this plot indicate normally distributed errors?
The data is summarized through:
res = lm(neck ~ wrist, fat)
summary(res)
##
## Call:
## lm(formula = neck ~ wrist, data = fat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9980 -1.0889 -0.0192 1.1149 7.0595
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.6370 2.0058 1.315 0.19
## wrist 1.9394 0.1099 17.649 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.625 on 250 degrees of freedom
## Multiple R-squared: 0.5548, Adjusted R-squared: 0.553
## F-statistic: 311.5 on 1 and 250 DF, p-value: < 2.2e-16
Is there evidence that neck size and wrist size are dependent (i.e., that \(\beta_1 \neq 0\))?
A plot of residuals versus wrist size is given by
plot(fat$wrist, resid(res))
Based on this, does it appear that \(\sigma\) does not depend on wrist size?
A qqplot of the residuals is shown:
qqnorm(resid(res))
Based on this, does the assumption that \(\epsilon_i\) are normally distributed seem reasonable?
The rest of the questions are made up but center around the CUNY budget request to the city and state (https://www.cuny.edu/wp-content/uploads/sites/4/page-assets/about/trustees/meetings-of-the-board/2020-2021-Operating-Budget-Request-and-Four-Year-Financial-Plan.pdf).
In 2016, it was written “Eight in 10 CUNY students university-wide graduate with no federal education loans.” Suppose in 2019, a survey of 1000 students is made and it is found that 24 percent have federal education loans. Is there evidence, at the \(\alpha=0.05\) level, that students in 2019 have a greater percentage holding federal loans?
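One approach is a one-sided, one-sample proportion test of \(H_o: p = 0.20\) (the 2016 figure, since 8 in 10 with no loans means 20% with loans) against \(H_A: p > 0.20\), with the SE computed under \(H_o\):

```r
# One-sample proportion test: H0: p = 0.20 vs HA: p > 0.20.
phat <- 0.24; p0 <- 0.20; n <- 1000
SE <- sqrt(p0 * (1 - p0) / n)   # SE uses the assumed value under H0
Z <- (phat - p0) / SE
p_value <- 1 - pnorm(Z)         # one-sided (greater)
```

The resulting p-value can then be compared with \(\alpha = 0.05\).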
Suppose a survey of CSI and Baruch students was made to assess their support of a proposed tuition increase of 200 dollars per year. The data is summarized below:
        X    n
CSI     20   225
Baruch  25   250
Construct a 90% CI for \(p_1 - p_2\).
For the same data, carry out a significance test with \(\alpha=0.05\) testing the assumption that the population proportions are equal against a two sided alternative.
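A sketch for both parts, using the counts from the table above; the CI estimates each \(p_i\) separately, while the test pools the data:

```r
# CSI vs Baruch: 90% CI for p1 - p2, and a two-sided pooled test.
x1 <- 20; n1 <- 225      # CSI
x2 <- 25; n2 <- 250      # Baruch
phat1 <- x1/n1; phat2 <- x2/n2
# CI: use phat1, phat2 in the SE
SE_ci <- sqrt(phat1*(1-phat1)/n1 + phat2*(1-phat2)/n2)
zstar <- qnorm(0.95)                       # for a 90% CI
ci <- (phat1 - phat2) + c(-1, 1) * zstar * SE_ci
# Test H0: p1 = p2: pool the data for the SE
phat <- (x1 + x2)/(n1 + n2)
SE_test <- sqrt(phat*(1-phat)*(1/n1 + 1/n2))
Z <- (phat1 - phat2)/SE_test
p_value <- 2 * pnorm(-abs(Z))              # two-sided
```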
X n
25 800
Construct a 90% confidence interval for the population proportion \(p\).
               mu   s   n
Verazzano      125  25  8
Non-Verazzano  100  35  12
Is there evidence – at the \(\alpha=0.10\) level – that Verazzano students can expect to pay a bigger increase than non-Verazzano students?
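A sketch of one approach: a one-sided two-sample \(t\) test from the summary table, making no assumption on the variances and using the conservative degrees of freedom:

```r
# One-sided two-sample t test from summary statistics.
xbar1 <- 125; s1 <- 25; n1 <- 8    # Verazzano
xbar2 <- 100; s2 <- 35; n2 <- 12   # non-Verazzano
SE <- sqrt(s1^2/n1 + s2^2/n2)
T  <- (xbar1 - xbar2) / SE
df <- min(n1 - 1, n2 - 1)          # conservative df
p_value <- 1 - pt(T, df)           # HA: mu1 > mu2
```

The p-value is then compared with \(\alpha = 0.10\).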
Suppose that currently a survey is held asking students about the above. The data is summarized
X n
75 100
Find a 90% CI for the population proportion. Does it include \(0.80\)?
                        xbar  s   n
Before enrolling        55    15  22
3 yrs after graduation  65    25  26
Find a 90% CI for the difference of population means.
Now, assuming \(\sigma_1 = \sigma_2\), perform a significance test with \(\alpha=0.05\) that the post graduation mean is more than the pre-enrollment population mean.
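A sketch for both parts, using the summary table above; the CI makes no variance assumption (conservative df), while the test pools the variances:

```r
# Before vs after: 90% CI, then a one-sided pooled test.
xbar1 <- 55; s1 <- 15; n1 <- 22    # before enrolling
xbar2 <- 65; s2 <- 25; n2 <- 26    # 3 yrs after graduation
# 90% CI with no variance assumption:
SE <- sqrt(s1^2/n1 + s2^2/n2)
df <- min(n1 - 1, n2 - 1)          # conservative df
tstar <- qt(0.95, df)
ci <- (xbar2 - xbar1) + c(-1, 1) * tstar * SE
# Pooled test of H0: mu2 = mu1 vs HA: mu2 > mu1, assuming sigma1 = sigma2:
sp <- sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1 + n2 - 2))
SE_p <- sp * sqrt(1/n1 + 1/n2)
T <- (xbar2 - xbar1) / SE_p
p_value <- 1 - pt(T, n1 + n2 - 2)
```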
Do CSI students agree? A survey of students on credit accumulation by year 2 is taken, with cohorts taken from those who have taken online instruction and those who haven't. The data is summarized below:
                       xbar  s   n
Have taken online      45    20  10
Have not taken online  48    15  25
At the \(\alpha=0.05\) level, perform a two-sided significance test that the mean number of credits is equal. Do not assume equal variances.
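A sketch, again using the conservative degrees of freedom since the variances are not assumed equal:

```r
# Two-sided two-sample t test, no equal-variance assumption.
xbar1 <- 45; s1 <- 20; n1 <- 10    # have taken online
xbar2 <- 48; s2 <- 15; n2 <- 25    # have not taken online
SE <- sqrt(s1^2/n1 + s2^2/n2)      # sqrt(40 + 9) = 7 here
T  <- (xbar1 - xbar2) / SE
df <- min(n1 - 1, n2 - 1)
p_value <- 2 * pt(-abs(T), df)     # two-sided
```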
In a survey, MTH 214 students are asked if the skills they learned are aligned with labor demands. The results are:
X n
aligned 18 30
For a 90% confidence interval, what is the margin of error?
What size sample is needed to have a margin of error no more than \(0.05\)?
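A sketch for both parts; the sample-size calculation below uses \(\hat{p} = 18/30\) from the survey (using \(0.5\) instead would give a more conservative answer):

```r
# Margin of error for a 90% CI, then the sample size needed
# for a margin of error of at most 0.05.
phat <- 18/30; n <- 30
zstar <- qnorm(0.95)                     # for 90% confidence
ME <- zstar * sqrt(phat * (1 - phat) / n)
n_needed <- ceiling((zstar / 0.05)^2 * phat * (1 - phat))
```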