Review for final exam (12/16)

The review is comprehensive, though the exam will put more weight on the material on confidence intervals and significance tests.

Quick review of formulas in Ch 7 on two sample tests:

\[ T = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{SE} = \frac{observed - expected}{SE} \]

This will have a \(t\) distribution.


If we assume matched samples, then we subtract the paired data to produce a single sample and use the one-sample \(t\) test with \(n-1\) degrees of freedom.


If we assume independent samples and normal populations (or \(n\) large enough given how far the population is from normal) then we have

\[ SD = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}. \]

With no assumptions on the variances, we have:

\[ SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \]

and use the smaller of \(n_1-1\) or \(n_2-1\) for the degrees of freedom.


If we assume \(\sigma = \sigma_1 = \sigma_2\) (but unknown value for \(\sigma\)) then

\[ SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}, \quad s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}, \]

and use \(n_1+ n_2 - 2\) for the degrees of freedom.


Finally, for completeness, though not a reasonable assumption, if we assume we know \(\sigma_1\) and \(\sigma_2\), then \(SE=SD\) and \(T\) will have a normal distribution (a \(t\) distribution with \(\infty\) degrees of freedom).
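
As a minimal sketch of the two cases above (not part of the course materials), the helper below returns the SE and the degrees of freedom from summary statistics. The function name `two_sample_se`, its argument names, and the numbers passed to it are assumptions made for illustration.

```r
# Hypothetical helper: SE and df for the two-sample t statistic.
two_sample_se <- function(s1, n1, s2, n2, equal_var = FALSE) {
  if (equal_var) {
    # assume sigma_1 = sigma_2: pooled estimate s_p of the common sigma
    sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
    list(SE = sp * sqrt(1 / n1 + 1 / n2), df = n1 + n2 - 2)
  } else {
    # no assumption on the variances: conservative df
    list(SE = sqrt(s1^2 / n1 + s2^2 / n2), df = min(n1 - 1, n2 - 1))
  }
}

unpooled <- two_sample_se(15, 22, 25, 26)                    # illustrative values
pooled   <- two_sample_se(15, 22, 25, 26, equal_var = TRUE)
unpooled
pooled
```

The matched-samples case needs no helper: subtract the pairs and treat the differences as one sample.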

New formulas in Ch 8

Related, but different, Chapter 8 deals with sample proportions for which we have:

\[ \hat{p} = \frac{X}{n}, \quad \hat{p}_1 = \frac{X_1}{n_1}, \quad \hat{p}_2 = \frac{X_2}{n_2} \]

The “\(X\)” above are binomial, so we can say \(\hat{p}\) are approximately normal if \(np\) and \(n(1-p)\) are greater than 10.

The main test statistics are (for one and two samples):

\[ Z = \frac{\hat{p} - p}{SE}, \quad Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{SE} \]

If we assume both \(np\) and \(n(1-p)\) are greater than 10 then we will assume \(Z\) has a standard normal distribution.

The standard deviations are:

\[ SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}, \quad SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]

The SE used will depend on the assumptions:

For a one sample CI, we use \(\hat{p}\) to estimate \(p\).

For a two sample CI, we use \(\hat{p}_1\) and \(\hat{p}_2\) to estimate \(p_1\) and \(p_2\).

For a one sample significance test, we use \(SE=SD\), as \(p\) is an assumed value under \(H_o\).

For a two sample significance test, we pool the data and use \(\hat{p} = (n_1\hat{p}_1 + n_2\hat{p}_2)/(n_1+n_2)\) as an estimate for both \(p_1\) and \(p_2\).
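
To make the difference between the two two-sample SEs concrete, here is a small sketch. The counts are hypothetical and chosen only for illustration.

```r
# Hypothetical counts, for illustration only
x1 <- 30; n1 <- 200
x2 <- 45; n2 <- 250
phat1 <- x1 / n1
phat2 <- x2 / n2

# Confidence interval: estimate p1 and p2 separately
SE_ci <- sqrt(phat1 * (1 - phat1) / n1 + phat2 * (1 - phat2) / n2)

# Significance test of p1 = p2: pool the data into a single estimate
phat <- (x1 + x2) / (n1 + n2)   # same as (n1*phat1 + n2*phat2)/(n1 + n2)
SE_test <- sqrt(phat * (1 - phat) * (1 / n1 + 1 / n2))

c(SE_ci = SE_ci, SE_test = SE_test)
```

The two values are usually close, but only the pooled one is consistent with the null hypothesis that the proportions are equal.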

New (and old) formulas from Ch 10

The regression model is that the mean response for \(y\) given \(x\) is linear: \(\mu_{y\mid x} = \beta_0 + \beta_1 x\). A given value \(y_i\) is modeled by the mean plus an error, or \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\), where our assumptions are that the \(\epsilon_1, \epsilon_2, \dots\) are an i.i.d. sample from a normal population with mean 0 and variance \(\sigma^2\).

We estimate the \(\beta\)s and \(\sigma\) with:

\[ b_1 = r \frac{s_y}{s_x}, \quad b_0 = \bar{y} - b_1 \bar{x}, \quad \hat{y}_i = b_0 + b_1 x_i, \quad \text{residual} = e_i = y_i - \hat{y}_i, \quad s = \sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-2}} \]

Okay, we focused on just \(\beta_1\) and saw that

\[ SD(b_1) = \frac{\sigma}{\sqrt{\sum(x_i - \bar{x})^2}} \]

with the \(SE\) given by estimating \(\sigma\) with \(s\), from above.

The following statistic has a \(t\) distribution with \(n-2\) degrees of freedom:

\[ T = \frac{b_1 - \beta_1}{SE}. \]

This allows CIs (\(b_1 \pm t^* SE\)) and significance tests to be performed.

From \(y_i -\bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)\), we can use the following names:

\[ SSTotal = \sum(y_i -\bar{y})^2,\quad SSModel = \sum(\hat{y}_i - \bar{y})^2, \quad SSError = \sum(y_i - \hat{y}_i)^2, \]

and the formula \(SST = SSM + SSE\). Further, for degrees of freedom \(DFT =DFM + DFE\) and \(DFT=n-1\) and \(DFM=1\) so \(DFE = n-2\).

The mean square is the “sum of squares” over the degrees of freedom. In particular, \(MSE = SSE/(n-2)\) is our estimate for \(\sigma^2\).

The square of the Pearson correlation coefficient can be expressed as \(r^2 = SSM / SST\), which makes precise the statement that \(r^2\) is the proportion of the total variation explained by the model.

The \(F\) statistic is \(MSM/MSE\). This is small (close to 0) if the model does not explain much variation; and large if it does. It is used to test if \(\beta_1 = 0\), and is output in the software.
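
The identities above can be checked numerically. The sketch below uses simulated data (made up for illustration, not from the course) and also checks that, with a single predictor, the \(F\) statistic is the square of the \(t\) statistic for testing \(\beta_1 = 0\).

```r
# Simulated data, for illustration only
set.seed(1)
x <- 1:20
y <- 3 + 0.5 * x + rnorm(20, sd = 2)
res <- lm(y ~ x)

yhat <- fitted(res)
SST <- sum((y - mean(y))^2)
SSM <- sum((yhat - mean(y))^2)
SSE <- sum((y - yhat)^2)

all.equal(SST, SSM + SSE)              # SST = SSM + SSE
all.equal(SSM / SST, cor(x, y)^2)      # r^2 = SSM / SST

Fstat <- (SSM / 1) / (SSE / (length(y) - 2))   # MSM / MSE
tval <- coef(summary(res))["x", "t value"]
all.equal(Fstat, tval^2)               # F equals t^2 when testing beta_1 = 0
```

The same relation shows up in the first sample question below: \(2.209^2 \approx 4.88\).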

Sample questions

  • QUESTION A standard rule of thumb is that soon-to-be-born babies grow a half pound per week in the womb. We will test this hypothesis using the weight data in the variable wt and the gestation time gestation. That is, perform a two-sided test of \(H_o: \beta_1 = 1/2\).
res = lm(wt ~ gestation)
summary(res)
## 
## Call:
## lm(formula = wt ~ gestation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.125  -9.767  -3.743  12.085  42.684 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -42.1204    76.9746  -0.547   0.5895  
## gestation     0.5988     0.2710   2.209   0.0374 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.05 on 23 degrees of freedom
## Multiple R-squared:  0.175,  Adjusted R-squared:  0.1392 
## F-statistic:  4.88 on 1 and 23 DF,  p-value: 0.0374

ANS: We use \(T = (obs-exp)/SE\), assuming the model applies. We read off these values to get:

b1 = 0.5988
SE = 0.2710
df  = 23
beta1 = 1/2
T_obs = (b1 - beta1)/SE
T_obs
## [1] 0.3645756

The critical value for a two-sided test with 23 degrees of freedom is:

alpha = 0.05
tstar = qt(alpha/2, lower.tail=FALSE,  df=df)
tstar

The observed \(|T|\) of about 0.36 is well below this critical value, so the \(p\)-value is greater than \(\alpha\) and the difference is not statistically significant.
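
As a cross-check, the \(p\)-value can be computed directly instead of comparing with the critical value, and a 95% CI for \(\beta_1\) tells the same story. This is a sketch using only the numbers read from the summary output; the names `p_value` and `ci` are introduced here.

```r
# Values read from the summary output above
b1 <- 0.5988; SE <- 0.2710; df <- 23
T_obs <- (b1 - 1/2) / SE

# Two-sided p-value: area in both tails beyond |T_obs|
p_value <- 2 * pt(abs(T_obs), df = df, lower.tail = FALSE)
p_value

# 95% CI for beta_1: b1 +/- t* SE
tstar <- qt(0.975, df = df)
ci <- b1 + c(-1, 1) * tstar * SE
ci
```

The interval contains 1/2, in agreement with not rejecting \(H_o\) at \(\alpha = 0.05\).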


A plot of residuals versus gestation is given below:

plot(gestation, resid(res))

Does this plot indicate if the value of \(\sigma\) depends on \(x\)?

ANS: No. The plot does not show systematic widening or narrowing.

A qqplot of the residuals is given below:

qqnorm(resid(res))

Does this plot indicate normally distributed errors?

ANS: Yes, the bulk of the points fall along a rough line.

  • QUESTION A model of neck size predicted by wrist size is tested using a certain data set, assumed to contain a random sample of individuals. The assumption is that neck size is 2 times the wrist size. Using \(\alpha=0.05\), perform a two-sided test of significance.

The data is summarized through:

res = lm(neck ~ wrist, fat)
summary(res)
## 
## Call:
## lm(formula = neck ~ wrist, data = fat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9980 -1.0889 -0.0192  1.1149  7.0595 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.6370     2.0058   1.315     0.19    
## wrist         1.9394     0.1099  17.649   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.625 on 250 degrees of freedom
## Multiple R-squared:  0.5548, Adjusted R-squared:  0.553 
## F-statistic: 311.5 on 1 and 250 DF,  p-value: < 2.2e-16

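Is there evidence that neck size and wrist size are not dependent (i.e., that \(\beta_1 = 0\))?

ANS: We just need to look at the test carried out in the summary, with a t-value of 17.649 and a \(p\)-value of <2e-16 (which is super tiny). The data strongly contradict \(\beta_1 = 0\), so there is strong evidence that neck size does depend on wrist size.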
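
The two-sided test of \(\beta_1 = 2\) posed at the start of this question can be sketched from the same summary output, again with \(T = (obs - exp)/SE\) and \(n-2 = 250\) degrees of freedom. The variable names are introduced here for illustration.

```r
# Values read from the summary output above
b1 <- 1.9394; SE <- 0.1099; df <- 250
beta1 <- 2
T_obs <- (b1 - beta1) / SE
T_obs

alpha <- 0.05
tstar <- qt(alpha/2, lower.tail = FALSE, df = df)    # two-sided critical value
p_value <- 2 * pt(abs(T_obs), df = df, lower.tail = FALSE)
c(tstar = tstar, p_value = p_value)
```

Since \(|T_{obs}|\) is smaller than the critical value, the data do not contradict the claim that neck size is 2 times wrist size.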


A plot of residuals versus wrist size is given by

plot(fat$wrist, resid(res))

Based on this, does it appear that \(\sigma\) does not depend on wrist size?

ANS: Yes, there is no systematic widening or narrowing as x increases, so \(\sigma\) does not appear to depend on wrist size.


A qqplot of the residuals is shown:

qqnorm(resid(res))

Based on this, does the assumption that \(\epsilon_i\) are normally distributed seem reasonable?

ANS: The bulk of the data falls along a line. The tails might hint of longer tails than a normal, but large \(n\) would mean this is reasonable.

  • QUESTION A survey designer would like to ensure that his survey has a margin of error of no more than 2 percentage points for a 95% confidence interval. How large must \(n\) be so that this will be true? Assume the worst case scenario of \(\hat{p}=1/2\).

ANS: we have

\[ moe = 0.02 = z^* \sqrt{\hat{p}(1-\hat{p})/n} \]

Solving for \(n\), we have:

\[ \sqrt{n} = \frac{z^*}{0.02}\sqrt{\hat{p}(1 - \hat{p})} \]

The value on the right is largest when \(\hat{p}= 1 /2\), so the smallest \(n\) guaranteeing this for all possible \(\hat{p}\) is:

phat = 1/2
zstar =  1.96  #  for alpha=0.05
ceiling((zstar/0.02  * sqrt(phat*(1-phat)))^2)  # round up
## [1] 2401
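
The same calculation can be wrapped in a small function so that other margins of error and confidence levels are easy to try. This is a sketch; the name `n_for_moe` and its arguments are hypothetical, and it uses `qnorm` for \(z^*\) rather than the rounded 1.96.

```r
# Hypothetical helper: smallest n whose margin of error is at most `moe`
n_for_moe <- function(moe, conf = 0.95, phat = 1/2) {
  zstar <- qnorm(1 - (1 - conf) / 2)
  ceiling((zstar / moe)^2 * phat * (1 - phat))   # round up
}

n_for_moe(0.02)                # the survey designer's problem
n_for_moe(0.05, conf = 0.90)   # a 90% interval with a 5 point margin
```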

The rest of the questions are made up but center around the CUNY budget request to the city and state (https://www.cuny.edu/wp-content/uploads/sites/4/page-assets/about/trustees/meetings-of-the-board/2020-2021-Operating-Budget-Request-and-Four-Year-Financial-Plan.pdf).


  • QUESTION In 2016, it was written “Eight in 10 CUNY students university-wide graduate with no federal education loans.” Suppose in 2019, a survey of 1000 students is made and it is found that 24 percent have federal education loans. Is there evidence, at the \(\alpha=0.05\) level, that students in 2019 have a greater percentage holding federal loans?

ANS: we use the one-sample test of proportions which has the Z statistic:

phat =  0.24
p =   2/10
SE = sqrt(p*(1-p)/1000)
Zobs = (phat - p)/SE
Zobs

This is right tailed (“greater”), so we use this to find the \(p\)-value:

pnorm(Zobs, lower.tail=FALSE)

This is less than \(\alpha =0.05\), so the difference IS statistically significant.
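
The hand computation can be cross-checked against R's built-in `prop.test`. This is a sketch: the count 240 is 24 percent of 1000, and `correct = FALSE` turns off the continuity correction so that the result matches the \(Z\) statistic used here.

```r
# By hand, as above
Zobs <- (0.24 - 0.2) / sqrt(0.2 * (1 - 0.2) / 1000)

# Built-in one-sample test of a proportion, right-tailed
res <- prop.test(x = 240, n = 1000, p = 0.2,
                 alternative = "greater", correct = FALSE)

sqrt(unname(res$statistic))   # the reported X-squared is Zobs^2
res$p.value                   # matches pnorm(Zobs, lower.tail=FALSE)
```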

  • QUESTION Suppose a survey of CSI and Baruch students was made to assess their support of a proposed tuition increase of 200 dollars per year. The data is summarized below
            X     n
CSI        20   225
Baruch     25   250

Construct a 90% CI for \(p_1 - p_2\).

For the same data, carry out a significance test with \(\alpha=0.05\) testing the assumption that the population proportions are equal against a two sided alternative.

ANS: This is a two-sample test of proportion with two-sided alternative. The SE comes from the SD by pooling the data:

phat1 = 20/225
phat2 = 25/250
phat =   (20 + 25)/(225 + 250)
SE = sqrt(phat*(1-phat)) *  sqrt(1/225 +  1/250)
Zobs = (phat1 - phat2) /  SE
Zobs
## [1] -0.4128812

This is two tailed. Since \(Zobs\) is negative here, the area to the left of \(Zobs\) is doubled.

2 * pnorm(Zobs)
## [1] 0.6796937

This value is greater than \(\alpha\), so the difference is not statistically significant.
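
For the 90% CI for \(p_1 - p_2\) asked for in the first part, the data are not pooled; \(\hat{p}_1\) and \(\hat{p}_2\) estimate \(p_1\) and \(p_2\) separately. A sketch, with the names `SE_ci` and `ci` introduced here:

```r
phat1 <- 20/225
phat2 <- 25/250
# unpooled SE for the confidence interval
SE_ci <- sqrt(phat1*(1-phat1)/225 + phat2*(1-phat2)/250)
zstar <- qnorm(0.90 + 0.10/2)
MOE <- zstar * SE_ci
ci <- (phat1 - phat2) + c(-MOE, MOE)
ci
```

The interval runs from about -0.055 to 0.033 and contains 0, which agrees with the non-significant test.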

  • QUESTION: A survey is held to determine if students are aware that a new health and wellness fee of 60 dollars per semester is being proposed. The data collected is:
X   n
25  800

Construct a 90% confidence interval for the population proportion \(p\).

ANS:

We have to fill in \(\hat{p} \pm z^*SE\). To that end we have:

phat = 25/800
zstar   =  qnorm(0.90   + 0.10/2)
SE = sqrt(phat*(1-phat)/800)
MOE = zstar * SE
phat  + c(-MOE,  MOE)
## [1] 0.02113157 0.04136843
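
Since this interval comes up repeatedly, the steps can be collected into one function. This is only a sketch; the name `prop_ci` and its arguments are hypothetical.

```r
# Hypothetical helper: large-sample CI for a proportion, phat +/- z* SE
prop_ci <- function(x, n, conf = 0.90) {
  phat <- x / n
  zstar <- qnorm(conf + (1 - conf) / 2)
  SE <- sqrt(phat * (1 - phat) / n)
  phat + c(-1, 1) * zstar * SE
}

ci <- prop_ci(25, 800)   # reproduces the interval above
ci
```
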
  • QUESTION CUNY is proposing a $200 per year increase in tuition. But not all students will pay this increase, as some are funded through other programs. Suppose a survey is performed to see what the increase would be for Verazzano students compared to non-Verazzano students. This fictitious survey is summarized by:
               xbar  s    n
Verazzano      125  25    8
Non-Verazzano  100  35    12

Is there evidence, at the \(\alpha=0.10\) level, that Verazzano students can expect to pay a bigger increase than non-Verazzano students?

ANS: This is a two sample \(t\) test with no assumption of equal population variances. We have:

xbar1 = 125
s1 =  25
n1  =   8
xbar2  =  100
s2  =  35
n2 =   12
SE =   sqrt(s1^2/n1   +  s2^2/n2)
df=  min(n1-1, n2-1)
Tobs =  (xbar1 - xbar2) / SE
Tobs
## [1] 1.862313

We compare this with the critical value which is found for a one-sided test to be:

qt(0.9,  df=df)
## [1] 1.414924

This is less than the observed value, so the \(p\)-value is less than \(\alpha\), hence the difference is statistically significant.
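
The right-tail \(p\)-value can also be computed directly rather than inferred from the critical value. This sketch repeats the setup so that it runs on its own; only the name `p_value` is new.

```r
xbar1 <- 125; s1 <- 25; n1 <- 8
xbar2 <- 100; s2 <- 35; n2 <- 12
SE <- sqrt(s1^2/n1 + s2^2/n2)
df <- min(n1-1, n2-1)
Tobs <- (xbar1 - xbar2) / SE

# one-sided ("bigger") alternative: area to the right of Tobs
p_value <- pt(Tobs, df = df, lower.tail = FALSE)
p_value
```

The value falls between 0.05 and 0.10, so the conclusion depends on the choice of \(\alpha\): significant at 0.10, but not at 0.05.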

  • QUESTION CUNY proposes that if its budget request is funded “80% of enrolled students will know about, use appropriately, and report satisfaction with campus mental health, wellness, food security, or clinical health care services.”

Suppose that currently a survey is held asking students about the above. The data is summarized

X   n
75  100

Find a 90% CI for the population proportion. Does it include \(0.80\)?

ANS: This is done above for a similar problem. We need to fill in \(\hat{p} \pm z^* SE\):

phat = 75/100
zstar  = qnorm(0.90 +  0.10/2)
SE = sqrt(phat  *  (1-phat)/100)
MOE = zstar * SE
phat + c(-MOE, MOE)
## [1] 0.6787757 0.8212243

Yes, the interval includes \(0.80\).

  • QUESTION In CUNY’s budget request, it is stated: “CUNY is a national model in promoting and enhancing social and economic mobility.” To assess this, data is taken on students' household income prior to enrolling at CUNY, and the students' household income 3 years after graduating from CUNY. The data looks like:
                         xbar   s   n
Before enrolling         55     15  22
3yrs after graduation    65     25  26

Find a 90% CI for the difference of population means.

ANS: We have to fill in \((\bar{x}_1 -\bar{x}_2) \pm t^* SE\). To that end:

xbar1 = 55
s1 = 15
n1 = 22
xbar2 =   65
s2 =  25
n2 = 26
df = min(n1-1, n2 - 1)
tstar = qt(0.90 + 0.10/2, df=df)
SE = sqrt(s1^2/n1 + s2^2/n2)
MOE = tstar * SE
(xbar1 - xbar2)  + c(-MOE, MOE)
## [1] -20.07270253   0.07270253

Now, assuming \(\sigma_1 = \sigma_2\), perform a significance test with \(\alpha=0.05\) that the post graduation mean is more than the pre-enrollment population mean.

ANS: Now the SE is computed differently:

sp = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1 + n2 - 2))
SE = sp * sqrt(1/n1 +  1/n2)
Tobs  = (xbar2  - xbar1)/SE  # switched order to match question

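We compare this to the critical value. The alternative is one-sided ("more than"), so all of \(\alpha\) goes in the right tail:

df = n1 + n2  -  2
tstar =  qt(0.05, lower.tail=FALSE, df =  df)
tstar
## [1] 1.67866

The observed value (about 1.64) is less than the critical value, so the \(p\)-value \(> \alpha= 0.05\) and the difference is not statistically significant.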
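
The observed statistic is not printed above, so here is a self-contained sketch that shows it together with the right-tail \(p\)-value for the one-sided ("more than") alternative. Only the name `p_value` is new.

```r
xbar1 <- 55; s1 <- 15; n1 <- 22
xbar2 <- 65; s2 <- 25; n2 <- 26

# pooled SE, assuming sigma_1 = sigma_2
sp <- sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1 + n2 - 2))
SE <- sp * sqrt(1/n1 + 1/n2)
Tobs <- (xbar2 - xbar1) / SE
df <- n1 + n2 - 2

# area to the right of Tobs
p_value <- pt(Tobs, df = df, lower.tail = FALSE)
c(Tobs = Tobs, p_value = p_value)
```

The \(p\)-value is only slightly above 0.05, so this is a close call rather than a clear absence of an effect.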

  • QUESTION CUNY promises in its budget request that “Online instruction has enormous potential for CUNY’s current students, for whom access to online courses can increase credit accumulation and fast-track completion, provide scheduling flexibility and greater course availability, and save students commuting and textbook costs”

Do CSI students agree? A survey of students on credit accumulation by year 2 is taken with cohorts taken from those who have taken online instruction and those who haven't. The data is summarized below:

                        xbar   s    n
Have taken online        45    20   10
Have not taken online    48    15   25

At the \(\alpha=0.05\) level, perform a two-sided significance test that the amount of credits is equal. Do not assume equal variances.

ANS: Using a two-sample test, we have:

xbar1 =    45
s1 = 20
n1 =  10
xbar2 =  48
s2  =  15
n2 = 25
SE = sqrt(s1^2/n1 + s2^2/n2)
df =  min(n1 - 1, n2 - 1)

Tobs  = (xbar1 - xbar2)/SE
Tobs
## [1] -0.4285714

Compare this to the critical value:

alpha = 0.05
tstar  =  qt(alpha/2,  lower.tail=FALSE,  df=df)
tstar

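We see that \(|Tobs|\) is well below the critical value, so the observed value is not unusual: the \(p\)-value > \(\alpha\) and the difference is not statistically significant.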
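
For reference, the critical value and the two-sided \(p\)-value can be printed side by side. This sketch repeats the setup so that it runs on its own; only the name `p_value` is new.

```r
xbar1 <- 45; s1 <- 20; n1 <- 10
xbar2 <- 48; s2 <- 15; n2 <- 25
SE <- sqrt(s1^2/n1 + s2^2/n2)
df <- min(n1 - 1, n2 - 1)
Tobs <- (xbar1 - xbar2) / SE

alpha <- 0.05
tstar <- qt(alpha/2, lower.tail = FALSE, df = df)
# two-sided: double the area beyond |Tobs|
p_value <- 2 * pt(abs(Tobs), df = df, lower.tail = FALSE)
c(tstar = tstar, p_value = p_value)
```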

  • QUESTION: In CUNY’s budget request, it is written: “In an era of upskilling and reskilling, a college education must respond both to the labor demands and the unique circumstances of students pursuing a degree. By developing modernized career engagement centers and offering further opportunities for paid internships and co-ops, CUNY will align students with the skills and professional networks necessary for the current and future labor demands of growing industries.” In a NY Times op-ed (https://www.nytimes.com/2019/12/07/opinion/sunday/student-success-advice.html) Nicholas Kristof says “1. Take a class in economics and in statistics.”

MTH 214 students are surveyed on whether the skills they learned are aligned with labor demands. The results are:

             X    n
aligned     18   30

For a 90% confidence interval, what is the margin of error?

What size sample is needed to have a margin of error no more than \(0.05\)?

ANS: We have seen this before, and need to fill in \(\hat{p} \pm z^* SE\); the margin of error is \(z^* SE\):

n = 30
phat = 18/30
SE =  sqrt(phat*(1-phat)/n)
zstar = qnorm(0.9 +  0.10/2)
MOE = zstar * SE
phat + c(-MOE,  MOE)
## [1] 0.4528798 0.7471202

The margin of error is MOE, about 0.147.

Finally, we solve:

\[ \sqrt{n} = \frac{z^*}{MOE}\sqrt{\hat{p}(1-\hat{p})}. \]

As we don’t know \(\hat{p}\) in advance, we take the worst case value \(\hat{p}=1/2\), giving

\[ \sqrt{n} \geq \frac{z^*}{2MOE} \]

That is

zstar  =  qnorm(0.9 +  0.10/2)
MOE = 0.05
ceiling((zstar /(2 * MOE))^2)
## [1] 271