Review for final exam (12/16)

The review is comprehensive, though more weight will come from the material on confidence intervals and significance tests.

Quick review of formulas in Ch 7 on two sample tests:

\[ T = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE} = \frac{\text{observed} - \text{expected}}{SE} \]

This will have a \(t\) distribution.


If we assume matched samples, then we subtract the paired data to produce a single sample and use the one-sample \(t\) test with \(n-1\) degrees of freedom.
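As a small sketch of the matched-pairs reduction, the following R code subtracts the paired values and runs a one-sample \(t\) test on the differences. The vectors `before` and `after` are made-up numbers used only for illustration.

```r
# Hypothetical paired measurements on the same 6 subjects (made-up data).
before <- c(12, 15, 11, 14, 13, 16)
after  <- c(14, 15, 13, 17, 13, 18)

# Reduce to a single sample of differences.
d <- after - before
n <- length(d)

# One-sample t statistic for H_o: mean difference = 0, with n - 1 df.
se <- sd(d) / sqrt(n)
t_stat <- (mean(d) - 0) / se
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# The built-in paired test should agree with the by-hand computation.
builtin <- t.test(after, before, paired = TRUE)
c(by_hand = t_stat, builtin = unname(builtin$statistic))
```

The point of the comparison is that the paired test is nothing more than the one-sample test applied to the differences.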


If we assume independent samples and normal populations (or \(n\) large enough given how far the population is from normal) then we have

\[ SD = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \]

With no assumptions on the variances, we have:

\[ SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}. \]

and use the smaller of \(n_1-1\) or \(n_2-1\) for the degrees of freedom.
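A minimal sketch of this computation from summary statistics is given here in R. The function name `two_sample_t` and the numbers in the example call are hypothetical; only the SE formula and the conservative degrees of freedom come from the notes.

```r
# Two-sample t statistic without assuming equal variances.
# 'two_sample_t' is a hypothetical helper; delta0 is the value of
# mu_1 - mu_2 assumed under H_o (0 by default).
two_sample_t <- function(xbar1, s1, n1, xbar2, s2, n2, delta0 = 0) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)           # unpooled standard error
  t_stat <- ((xbar1 - xbar2) - delta0) / se   # (observed - expected) / SE
  df <- min(n1 - 1, n2 - 1)                   # conservative degrees of freedom
  list(se = se, t = t_stat, df = df,
       p_value = 2 * pt(-abs(t_stat), df))    # two-sided p-value
}

# Made-up summary statistics for illustration.
out <- two_sample_t(xbar1 = 10, s1 = 3, n1 = 16, xbar2 = 8, s2 = 4, n2 = 25)
out
```

Note that R's own `t.test()` uses the Welch approximation for the degrees of freedom by default, so its \(p\)-value will usually differ slightly from the conservative choice used here.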


If we assume \(\sigma = \sigma_1 = \sigma_2\) (but unknown value for \(\sigma\)) then

\[ SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}. \]

and use \(n_1+ n_2 - 2\) for the degrees of freedom.
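The pooled version can be sketched the same way. Here \(s_p\) is taken to be the usual pooled estimate, \(s_p^2 = ((n_1-1)s_1^2 + (n_2-1)s_2^2)/(n_1+n_2-2)\), which is not spelled out above. The function name `pooled_t` and the example numbers are hypothetical.

```r
# Two-sample t statistic assuming sigma_1 = sigma_2.
# 'pooled_t' is a hypothetical helper; delta0 is mu_1 - mu_2 under H_o.
pooled_t <- function(xbar1, s1, n1, xbar2, s2, n2, delta0 = 0) {
  df <- n1 + n2 - 2
  sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)  # pooled SD
  se <- sp * sqrt(1 / n1 + 1 / n2)
  t_stat <- ((xbar1 - xbar2) - delta0) / se
  list(sp = sp, se = se, t = t_stat, df = df,
       p_value = 2 * pt(-abs(t_stat), df))              # two-sided p-value
}

# Made-up summary statistics for illustration.
out_pooled <- pooled_t(xbar1 = 10, s1 = 3, n1 = 16, xbar2 = 8, s2 = 4, n2 = 25)
out_pooled
```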


Finally, for completeness, though not a reasonable assumption, if we assume we know \(\sigma_1\) and \(\sigma_2\), then \(SE=SD\) and \(T\) will have a normal distribution (a \(t\) distribution with \(\infty\) degrees of freedom).

New formulas in Ch 8

Related, but different, Chapter 8 deals with sample proportions for which we have:

\[ \hat{p} = \frac{X}{n}, \quad \hat{p}_1 = \frac{X_1}{n_1}, \quad \hat{p}_2 = \frac{X_2}{n_2}. \]

The “\(X\)” above are binomial, so we can say \(\hat{p}\) are approximately normal if \(np\) and \(n(1-p)\) are greater than 10.

The main test statistics are (for one and two samples):

\[ Z = \frac{\hat{p} - p}{SE}, \quad Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{SE} \]

If both \(np\) and \(n(1-p)\) are greater than 10, then we treat \(Z\) as having a standard normal distribution.

The standard deviations are:

\[ SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}, \quad SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]

The SE used will depend on the assumptions:

For a one sample CI, we use \(\hat{p}\) to estimate \(p\).

For a two sample CI, we use \(\hat{p}_1\) and \(\hat{p}_2\) to estimate \(p_1\) and \(p_2\).

For a one sample significance test, we use \(SE=SD\), as \(p\) is an assumed value under \(H_o\).

For a two sample significance test, we pool the data and use \(\hat{p} = (n_1\hat{p}_1 + n_2\hat{p}_2)/(n_1+n_2)\) as an estimate for both \(p_1\) and \(p_2\).
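The four cases above differ only in what is plugged into the \(SD\) formulas. A sketch in R follows; the helper names `one_prop` and `two_prop` and the counts in the example calls are hypothetical. The two sample test is assumed to have \(H_o: p_1 = p_2\), so the expected difference is 0.

```r
# One sample: the CI plugs p-hat into the SD; the test plugs in p0 from H_o.
one_prop <- function(x, n, p0, conf = 0.95) {
  phat <- x / n
  se_ci <- sqrt(phat * (1 - phat) / n)
  se_test <- sqrt(p0 * (1 - p0) / n)
  zstar <- qnorm(1 - (1 - conf) / 2)
  list(ci = phat + c(-1, 1) * zstar * se_ci,
       z = (phat - p0) / se_test)
}

# Two samples: the CI uses p1-hat and p2-hat; the test of p1 = p2 pools.
two_prop <- function(x1, n1, x2, n2, conf = 0.95) {
  p1 <- x1 / n1
  p2 <- x2 / n2
  se_ci <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
  pooled <- (x1 + x2) / (n1 + n2)   # same as (n1*p1 + n2*p2) / (n1 + n2)
  se_test <- sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
  zstar <- qnorm(1 - (1 - conf) / 2)
  list(ci = (p1 - p2) + c(-1, 1) * zstar * se_ci,
       z = (p1 - p2) / se_test)
}

# Made-up counts for illustration.
res1 <- one_prop(x = 60, n = 100, p0 = 0.5)
res2 <- two_prop(x1 = 40, n1 = 100, x2 = 30, n2 = 120)
```

A two-sided \(p\)-value is then `2 * pnorm(-abs(z))`; for a one-sided alternative use the appropriate single tail.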

New (and old) formulas from Ch 10

The regression model is that the mean response for \(y\) given \(x\) is linear: \(\mu_{y\mid x} = \beta_0 + \beta_1 x\). A given value \(y_i\) is modeled by the mean plus an error, or \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\), where our assumptions are that the \(\epsilon_1, \epsilon_2, \dots\) are an i.i.d. sample from a normal population with mean 0 and variance \(\sigma^2\).

We estimate the \(\beta\)s and \(\sigma\) with:

\[ b_1 = r \frac{s_y}{s_x}, \quad b_0 = \bar{y} - b_1 \bar{x}, \quad \hat{y}_i = b_0 + b_1 x_i, \quad \text{residual} = e_i = y_i - \hat{y}_i, \quad s = \sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-2}} \]

Okay, we focused on just \(\beta_1\) and saw that

\[ SD(b_1) = \frac{\sigma}{\sqrt{\sum(x_i - \bar{x})^2}} \]

with the \(SE\) given by estimating \(\sigma\) with \(s\), from above.

The following statistic has a \(t\) distribution with \(n-2\) degrees of freedom:

\[ T = \frac{b_1 - \beta_1}{SE}. \]

This allows CIs (\(b_1 \pm t^* SE\)) and significance tests to be performed.
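To see these formulas in action, here is a sketch in R. It uses the built-in `cars` data set as a stand-in (an assumption; any data with a numeric `x` and `y` would do), and the null value `beta10` is a hypothetical choice.

```r
# Built-in 'cars' data stands in for course data (an assumption).
fit <- lm(dist ~ speed, data = cars)
x <- cars$speed
n <- length(x)

# Estimate sigma with s, then form the SE of the slope.
s <- sqrt(sum(resid(fit)^2) / (n - 2))
se_b1 <- s / sqrt(sum((x - mean(x))^2))
b1 <- unname(coef(fit)["speed"])

# 95% CI: b1 +/- t* SE, with n - 2 degrees of freedom.
tstar <- qt(0.975, df = n - 2)
ci <- b1 + c(-1, 1) * tstar * se_b1

# Two-sided test of H_o: beta_1 = beta10 (beta10 is a hypothetical value).
beta10 <- 3
t_stat <- (b1 - beta10) / se_b1
p_value <- 2 * pt(-abs(t_stat), df = n - 2)
```

The `t value` column printed by `summary()` only tests \(\beta_1 = 0\); for any other null value the statistic has to be recomputed from the `Estimate` and `Std. Error` columns as above.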

From \(y_i -\bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)\), we can use the following names:

\[ SSTotal = \sum(y_i -\bar{y})^2,\quad SSModel = \sum(\hat{y}_i - \bar{y})^2, \quad SSError = \sum(y_i - \hat{y}_i)^2, \]

and the formula \(SST = SSM + SSE\). Further, the degrees of freedom satisfy \(DFT = DFM + DFE\), with \(DFT=n-1\) and \(DFM=1\), so \(DFE = n-2\).

The mean square is the “sum of squares” over the degrees of freedom. In particular, \(MSE = SSE/(n-2)\) is our estimate for \(\sigma^2\).

The square of the Pearson correlation coefficient can be expressed as \(r^2 = SSM / SST\), which makes precise the statement that \(r^2\) is the proportion of the total variation explained by the model.

The \(F\) statistic is \(MSM/MSE\). This is small (close to 0) if the model does not explain much variation; and large if it does. It is used to test if \(\beta_1 = 0\), and is output in the software.
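The sums of squares, \(r^2\), and \(F\) can all be checked by hand. The sketch below again uses the built-in `cars` data as a stand-in for course data (an assumption).

```r
# Built-in 'cars' data stands in for course data (an assumption).
fit <- lm(dist ~ speed, data = cars)
y <- cars$dist
n <- length(y)
yhat <- fitted(fit)

# Sums of squares from the decomposition of y_i - ybar.
sst <- sum((y - mean(y))^2)
ssm <- sum((yhat - mean(y))^2)
sse <- sum((y - yhat)^2)

# Mean squares: sum of squares over degrees of freedom.
msm <- ssm / 1
mse <- sse / (n - 2)

f_stat <- msm / mse
r2 <- ssm / sst

# Compare with what the software reports.
c(f_by_hand = f_stat, f_software = unname(summary(fit)$fstatistic["value"]))
c(r2_by_hand = r2, r2_software = summary(fit)$r.squared)
```

With a single predictor, \(F\) is the square of the \(t\) statistic for testing \(\beta_1 = 0\), which is why the two \(p\)-values in the software output agree.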

Sample questions

  • A standard rule of thumb is that soon-to-be-born babies grow a half pound per week in the womb. We will test this hypothesis using the weight data in the variable wt and the gestation time gestation. That is, perform a two-sided test of \(H_o: \beta_1 = 1/2\).
res = lm(wt ~ gestation)
summary(res)
## 
## Call:
## lm(formula = wt ~ gestation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.125  -9.767  -3.743  12.085  42.684 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -42.1204    76.9746  -0.547   0.5895  
## gestation     0.5988     0.2710   2.209   0.0374 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.05 on 23 degrees of freedom
## Multiple R-squared:  0.175,  Adjusted R-squared:  0.1392 
## F-statistic:  4.88 on 1 and 23 DF,  p-value: 0.0374

A plot of residuals versus gestation is given below:

plot(gestation, resid(res))

Does this plot indicate if the value of \(\sigma\) depends on \(x\)?

A qqplot of the residuals is given below:

qqnorm(resid(res))

Does this plot indicate normally distributed errors?

  • A model of neck size predicted by wrist size is tested using a certain data set, assumed to contain a random sample of individuals. The assumption is that neck size is 2 times the wrist size. Using \(\alpha=0.05\), perform a two-sided test of significance.

The data is summarized through:

res = lm(neck ~ wrist, fat)
summary(res)
## 
## Call:
## lm(formula = neck ~ wrist, data = fat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9980 -1.0889 -0.0192  1.1149  7.0595 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.6370     2.0058   1.315     0.19    
## wrist         1.9394     0.1099  17.649   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.625 on 250 degrees of freedom
## Multiple R-squared:  0.5548, Adjusted R-squared:  0.553 
## F-statistic: 311.5 on 1 and 250 DF,  p-value: < 2.2e-16

Is there evidence that neck size and wrist size are not dependent (i.e., that \(\beta_1 = 0\))?


A plot of residuals versus wrist size is given by

plot(fat$wrist, resid(res))

Based on this, does it appear that \(\sigma\) does not depend on wrist size?


A qqplot of the residuals is shown:

qqnorm(resid(res))

Based on this, does the assumption that \(\epsilon_i\) are normally distributed seem reasonable?

  • A survey designer would like to ensure that his survey has a margin of error of no more than 2 percentage points for a 95% confidence interval. How large must \(n\) be so that this will be true? Assume the worst case scenario of \(\hat{p}=1/2\).
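For sample-size questions of this kind, the margin of error \(z^* \sqrt{p(1-p)/n}\) can be solved for \(n\). A sketch in R follows; the helper name `sample_size_for_margin` is hypothetical, and the example call uses a 3 percentage point margin rather than the one in the question.

```r
# Smallest n whose margin of error z* sqrt(p(1-p)/n) is at most m.
# 'sample_size_for_margin' is a hypothetical helper name.
sample_size_for_margin <- function(m, conf = 0.95, p = 0.5) {
  zstar <- qnorm(1 - (1 - conf) / 2)
  ceiling(zstar^2 * p * (1 - p) / m^2)   # round up to a whole subject
}

# Illustration with a margin of 0.03 and the worst case p = 1/2.
n_needed <- sample_size_for_margin(m = 0.03)
n_needed
```

Rounding up, never down, keeps the margin of error at or below the target.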

The rest of the questions are made up but center around the CUNY budget request to the city and state (https://www.cuny.edu/wp-content/uploads/sites/4/page-assets/about/trustees/meetings-of-the-board/2020-2021-Operating-Budget-Request-and-Four-Year-Financial-Plan.pdf).


  • In 2016, it was written “Eight in 10 CUNY students university-wide graduate with no federal education loans.” Suppose in 2019, a survey of 1000 students is made and it is found that 24 percent have federal education loans. Is there evidence, at the \(\alpha=0.05\) level, that students in 2019 have a greater percentage holding federal loans?

  • Suppose a survey of CSI and Baruch students was made to assess their support of a proposed tuition increase of 200 dollars per year. The data is summarized below:

            X     n
CSI        20   225
Baruch     25   250

Construct a 90% CI for \(p_1 - p_2\).

For the same data, carry out a significance test with \(\alpha=0.05\) testing the assumption that the population proportions are equal against a two sided alternative.

  • A survey is held to determine if students are aware that a new health and wellness fee of 60 dollars per semester is being proposed. The data collected is:
X   n
25  800

Construct a 90% confidence interval for the population proportion \(p\).

  • CUNY is proposing a $200 per year increase in tuition. But not all students will pay this increase, as some are funded through other programs. Suppose a survey is performed to see what the increase would be for Verazzano students compared to non-Verazzano students. This fictitious survey is summarized by:
               xbar  s    n
Verazzano      125   25   8
Non-Verazzano  100   35   12

Is there evidence, at the \(\alpha=0.10\) level, that Verazzano students can expect to pay a bigger increase than non-Verazzano students?

  • CUNY proposes that if its budget request is funded “80% of enrolled students will know about, use appropriately, and report satisfaction with campus mental health, wellness, food security, or clinical health care services.”

Suppose that currently a survey is held asking students about the above. The data is summarized by:

X   n
75  100

Find a 90% CI for the population proportion. Does it include \(0.80\)?

  • In CUNY’s budget request, it is stated: “CUNY is a national model in promoting and enhancing social and economic mobility.” To assess this, data is taken on students’ household income prior to enrolling at CUNY, and the students’ household income 3 years after graduating from CUNY. The data looks like:
                         xbar   s   n
Before enrolling         55     15  22
3yrs after graduation    65     25  26

Find a 90% CI for the difference of population means.

Now, assuming \(\sigma_1 = \sigma_2\), perform a significance test with \(\alpha=0.05\) that the post graduation mean is more than the pre-enrollment population mean.

  • CUNY promises in its budget request that “Online instruction has enormous potential for CUNY’s current students, for whom access to online courses can increase credit accumulation and fast-track completion, provide scheduling flexibility and greater course availability, and save students commuting and textbook costs”

Do CSI students agree? A survey of students on credit accumulation by year 2 is taken with cohorts taken from those who have taken online instruction and those who haven’t. The data is summarized below:

                        xbar   s    n
Have taken online        45    20   10
Have not taken online    48    15   25

At the \(\alpha=0.05\) level, perform a two-sided significance test that the amount of credits is equal. Do not assume equal variances.

  • In CUNY’s budget request, it is written: “In an era of upskilling and reskilling, a college education must respond both to the labor demands and the unique circumstances of students pursuing a degree. By developing modernized career engagement centers and offering further opportunities for paid internships and co-ops, CUNY will align students with the skills and professional networks necessary for the current and future labor demands of growing industries.” In a NY Times op-ed (https://www.nytimes.com/2019/12/07/opinion/sunday/student-success-advice.html) Nicholas Kristof says “1. Take a class in economics and in statistics.”

MTH 214 students are surveyed and asked if the skills they learned are aligned with labor demands. The results are:

             X    n
aligned     18   30

For a 90% confidence interval, what is the margin of error?

What size sample is needed to have a margin of error no more than \(0.05\)?