New formulas in Ch 8

Related, but different, Chapter 8 deals with sample proportions for which we have:

\[ \hat{p} = \frac{X}{n}, \quad \hat{p}_1 = \frac{X_1}{n_1}, \quad \hat{p}_2 = \frac{X_2}{n_2} \]

Each $X$ above is binomial, so each $\hat{p}$ is approximately normal when both $np$ and $n(1-p)$ are greater than 10.

The main test statistics are (for one and two samples):

\[ Z = \frac{\hat{p} - p}{SE}, \quad Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{SE} \]

If we assume both $np$ and $n(1-p)$ are greater than 10 then we will assume $Z$ has a standard normal distribution.

The standard deviations are:

\[ SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}, \quad SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]

The SE used will depend on the assumptions:

For a one sample CI, we use $\hat{p}$ to estimate $p$.

For a two sample CI, we use $\hat{p}_1$ and $\hat{p}_2$ to estimate $p_1$ and $p_2$.

For a one sample significance test, we use $SE=SD$, as $p$ is an assumed value under $H_o$.

For a two sample significance test, we pool the data and use $\hat{p} = (n_1\hat{p}_1 + n_2\hat{p}_2)/(n_1+n_2)$ as an estimate for both $p_1$ and $p_2$.
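
As a sketch of how these four choices differ in practice, the snippet below computes each SE in R. The counts `x1`, `n1`, `x2`, `n2` and the null value `p0` are made-up values for illustration, not data from the course.

```r
# Hypothetical counts, for illustration only
x1 = 40; n1 = 200
x2 = 30; n2 = 180
p0 = 0.25                      # hypothetical null value for a one-sample test

phat1 = x1 / n1
phat2 = x2 / n2

# One-sample CI: estimate p with phat
se_ci_one = sqrt(phat1 * (1 - phat1) / n1)

# Two-sample CI: estimate p1 and p2 separately
se_ci_two = sqrt(phat1 * (1 - phat1) / n1 + phat2 * (1 - phat2) / n2)

# One-sample test: use the value of p assumed under the null hypothesis
se_test_one = sqrt(p0 * (1 - p0) / n1)

# Two-sample test: pool the data to estimate the common p
phat_pool = (x1 + x2) / (n1 + n2)
se_test_two = sqrt(phat_pool * (1 - phat_pool) * (1 / n1 + 1 / n2))

c(se_ci_one, se_ci_two, se_test_one, se_test_two)
```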

New (and old) formulas from Ch 10

The regression model is that the mean response for $y$ given $x$ is linear: $\mu_{y\mid x} = \beta_0 + \beta_1 x$. A given value $y_i$ is modeled by the mean plus an error, or $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where our assumptions are that the $\epsilon_1, \epsilon_2, \dots$ are an i.i.d. sample from a normal population with mean 0 and variance $\sigma^2$.

We estimate the $\beta$s and $\sigma$ with:

\[ b_1 = r \frac{s_y}{s_x}, \quad b_0 = \bar{y} - b_1 \bar{x}, \quad \hat{y}_i = b_0 + b_1 x_i, \quad \text{residual} = e_i = y_i - \hat{y}_i, \quad s = \sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-2}} \]

We focused on just the slope estimate $b_1$ and saw that

\[ SD(b_1) = \frac{\sigma}{\sqrt{\sum(x_i - \bar{x})^2}} \]

with the $SE$ given by estimating $\sigma$ with $s$, from above.

The following statistic has a $t$ distribution with $n-2$ degrees of freedom:

\[ T = \frac{b_1 - \beta_1}{SE}. \]

This allows CIs ($b_1 \pm t^* SE$) and significance tests to be performed.

From $y_i -\bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$, we can use the following names:

\[ SSTotal = \sum(y_i -\bar{y})^2,\quad SSModel = \sum(\hat{y}_i - \bar{y})^2, \quad SSError = \sum(y_i - \hat{y}_i)^2, \]

and the formula $SST = SSM + SSE$. Further, for degrees of freedom $DFT =DFM + DFE$ and $DFT=n-1$ and $DFM=1$ so $DFE = n-2$.

The mean square is the “sum of squares” over the degrees of freedom. We have $SSE/(n-2)$ is our estimate for $\sigma^2$.

The square of the Pearson correlation coefficient can be expressed as $r^2 = SSM / SST$, which makes precise the statement that $r^2$ is the proportion of the total variation explained by the model.

The $F$ statistic is $MSM/MSE$. This is small (close to 0) if the model does not explain much variation; and large if it does. It is used to test if $\beta_1 = 0$, and is output in the software.
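
The identities above can be checked numerically. The following is a minimal sketch on simulated data (the data and the seed are made up for illustration). It also checks the standard fact that, for simple linear regression, $F$ equals the square of the $t$ statistic for testing $\beta_1 = 0$.

```r
# Simulated data, for illustration only (not course data)
set.seed(1)
x = 1:20
y = 3 + 0.5 * x + rnorm(20, sd = 2)

res = lm(y ~ x)
yhat = fitted(res)
n = length(y)

SST = sum((y - mean(y))^2)
SSM = sum((yhat - mean(y))^2)
SSE = sum((y - yhat)^2)

all.equal(SST, SSM + SSE)                    # SST = SSM + SSE
all.equal(SSM / SST, cor(x, y)^2)            # r^2 = SSM / SST
all.equal(sqrt(SSE / (n - 2)), sigma(res))   # s is the residual standard error

Fstat = (SSM / 1) / (SSE / (n - 2))          # F = MSM / MSE
tval = coef(summary(res))["x", "t value"]
all.equal(Fstat, tval^2)                     # F = t^2 in simple regression
```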

Sample questions

QUESTION A standard rule of thumb is that soon-to-be-born babies grow a half pound per week in the womb. We will test this hypothesis using the weight data in the variable wt and the gestation time gestation. That is, perform a two-sided test of $H_o: \beta_1 = 1/2$.

res = lm(wt ~ gestation)
summary(res)

## 
## Call:
## lm(formula = wt ~ gestation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.125  -9.767  -3.743  12.085  42.684 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -42.1204    76.9746  -0.547   0.5895  
## gestation     0.5988     0.2710   2.209   0.0374 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.05 on 23 degrees of freedom
## Multiple R-squared:  0.175,  Adjusted R-squared:  0.1392 
## F-statistic:  4.88 on 1 and 23 DF,  p-value: 0.0374

ANS: We use $T = (obs-exp)/SE$, assuming the model applies. We read off these values to get:

b1 = 0.5988
SE = 0.2710
df  = 23
beta1 = 1/2
T_obs = (b1 - beta1)/SE
T_obs

## [1] 0.3645756

The critical value for a two-sided test with 23 degrees of freedom is:

alpha = 0.05
tstar = qt(alpha/2, lower.tail=FALSE,  df=df)
tstar

Since $|T_{obs}| \approx 0.36$ is less than the critical value (about $2.07$), the $p$-value is greater than $\alpha$, so the difference is not statistically significant.
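
The same conclusion can be reached by computing the $p$-value directly rather than comparing with the critical value. A short sketch, using the values read off the summary output above:

```r
b1 = 0.5988      # slope estimate from the summary output
SE = 0.2710      # its standard error
df = 23          # residual degrees of freedom
beta1 = 1/2      # value of the slope under the null hypothesis

T_obs = (b1 - beta1) / SE
# Two-sided p-value: double the area beyond |T_obs|
p_value = 2 * pt(abs(T_obs), df = df, lower.tail = FALSE)
p_value

alpha = 0.05
p_value > alpha  # TRUE means the null hypothesis is not rejected
```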

A plot of residuals versus gestation is given below:

plot(gestation, resid(res))

Does this plot indicate that the value of $\sigma$ depends on $x$?

ANS: No. The plot does not show systematic widening or narrowing.

A qqplot of the residuals is given below:

qqnorm(resid(res))

Does this plot indicate normally distributed errors?

ANS: Yes, the bulk of the points fall along a rough line.

QUESTION A model of neck size predicted by wrist size is tested using a certain data set, assumed to contain a random sample of individuals. The assumption is that neck size is 2 times the wrist size. Using $\alpha=0.05$, perform a two-sided test of significance.

The data is summarized through:

res = lm(neck ~ wrist, fat)
summary(res)

## 
## Call:
## lm(formula = neck ~ wrist, data = fat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9980 -1.0889 -0.0192  1.1149  7.0595 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.6370     2.0058   1.315     0.19    
## wrist         1.9394     0.1099  17.649   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.625 on 250 degrees of freedom
## Multiple R-squared:  0.5548, Adjusted R-squared:  0.553 
## F-statistic: 311.5 on 1 and 250 DF,  p-value: < 2.2e-16

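Is there evidence that neck size and wrist size are not dependent (i.e., that $\beta_1 = 0$)?

ANS: We just need to look at the test carried out in the summary, with $t$-value of 17.649 and $p$-value of <2e-16 (which is super tiny). Such a small $p$-value is strong evidence against $\beta_1 = 0$, so the data do not support the claim that neck size does not depend on wrist size.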
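
The two-sided test that the question opens with is not worked out here. A minimal sketch, reading the assumption "neck size is 2 times the wrist size" as $H_o: \beta_1 = 2$ (this reading is an assumption, and it ignores the intercept) and using the values from the summary output:

```r
b1 = 1.9394      # slope estimate from the summary output
SE = 0.1099      # its standard error
df = 250         # residual degrees of freedom
beta1 = 2        # assumed value of the slope under the null hypothesis

T_obs = (b1 - beta1) / SE
tstar = qt(0.05/2, lower.tail = FALSE, df = df)            # two-sided critical value
p_value = 2 * pt(abs(T_obs), df = df, lower.tail = FALSE)  # two-sided p-value
c(T_obs = T_obs, tstar = tstar, p_value = p_value)
```

Under this reading, $|T_{obs}|$ is about $0.55$, well below the critical value, so the data are consistent with a slope of 2.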

A plot of residuals versus wrist size is given by

plot(fat$wrist, resid(res))

Based on this, does it appear that $\sigma$ does not depend on wrist size?

ANS: Yes, there is no systematic widening or narrowing as $x$ increases.

A qqplot of the residuals is shown:

qqnorm(resid(res))

Based on this, does the assumption that $\epsilon_i$ are normally distributed seem reasonable?

ANS: The bulk of the data falls along a line. The tails might hint of longer tails than a normal, but large $n$ would mean this is reasonable.

QUESTION A survey designer would like to ensure that his survey has a margin of error of no more than 2 percentage points for a 95% confidence interval. How large must $n$ be so that this will be true? Assume the worst case scenario of $\hat{p}=1/2$.

ANS: we have

\[ moe = 0.02 = z^* \sqrt{\hat{p}(1-\hat{p})/n} \]

Solving for $n$, we have:

\[ \sqrt{n} = \frac{z^*}{0.02}\sqrt{\hat{p}(1 - \hat{p})} \]

The value on the right is largest when $\hat{p}= 1 /2$, so the smallest $n$ guaranteeing this for all possible $\hat{p}$ is:

phat = 1/2
zstar =  1.96  #  for alpha=0.05
ceiling((zstar/0.02  * sqrt(phat*(1-phat)))^2)  # round up

## [1] 2401
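
The same calculation can be wrapped in a small function so that other margins of error and confidence levels can be tried. This is only a sketch: the name `sample_size` and its arguments are made up for illustration, and it uses `qnorm` for $z^*$ rather than the rounded value 1.96.

```r
# Hypothetical helper: smallest n giving a margin of error of at most `moe`
# at confidence level `conf`, using phat = 1/2 as the worst case by default
sample_size = function(moe, conf = 0.95, phat = 1/2) {
  zstar = qnorm(1 - (1 - conf) / 2)            # critical value for the level
  ceiling((zstar / moe)^2 * phat * (1 - phat)) # round up to a whole number
}

sample_size(0.02)                # the survey designer's problem above
sample_size(0.05, conf = 0.90)   # a wider margin at 90% confidence
```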

The rest of the questions are made up but center around the CUNY budget request to the city and state (https://www.cuny.edu/wp-content/uploads/sites/4/page-assets/about/trustees/meetings-of-the-board/2020-2021-Operating-Budget-Request-and-Four-Year-Financial-Plan.pdf).

QUESTION In 2016, it was written “Eight in 10 CUNY students university-wide graduate with no federal education loans.” Suppose in 2019, a survey of 1000 students is made and it is found that 24 percent have federal education loans. Is there evidence, at the $\alpha=0.05$ level, that students in 2019 have a greater percentage holding federal loans?

ANS: we use the one-sample test of proportions which has the Z statistic:

phat =  0.24
p =   2/10
SE = sqrt(p*(1-p)/1000)
Zobs = (phat - p)/SE
Zobs

This is right tailed (“greater”), so we use this to find the $p$-value:

pnorm(Zobs, lower.tail=FALSE)

This is less than $\alpha =0.05$, so the difference IS statistically significant.
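
As a cross-check, the built-in `prop.test` should give the same answer when its continuity correction is turned off. A sketch, assuming the 24 percent corresponds to exactly 240 of the 1000 students:

```r
n = 1000
x = 240                        # assumed count: 24 percent of 1000 students
p0 = 2/10

# By hand, as above
Zobs = (x / n - p0) / sqrt(p0 * (1 - p0) / n)
p_by_hand = pnorm(Zobs, lower.tail = FALSE)

# correct = FALSE turns off the continuity correction so the results match;
# prop.test reports X-squared, which is Z^2 for one sample
res = prop.test(x, n, p = p0, alternative = "greater", correct = FALSE)
c(Zobs = Zobs, Z_from_prop_test = sqrt(unname(res$statistic)))
c(p_by_hand = p_by_hand, p_from_prop_test = res$p.value)
```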

QUESTION Suppose a survey of CSI and Baruch students was made to assess their support of a proposed tuition increase of 200 dollars per year. The data is summarized below

            X     n
CSI        20   225
Baruch     25   250

Construct a 90% CI for $p_1 - p_2$.

For the same data, carry out a significance test with $\alpha=0.05$ testing the assumption that the population proportions are equal against a two sided alternative.

ANS: This is a two-sample test of proportion with two-sided alternative. The SE comes from the SD by pooling the data:

phat1 = 20/225
phat2 = 25/250
phat =   (20 + 25)/(225 + 250)
SE = sqrt(phat*(1-phat)) *  sqrt(1/225 +  1/250)
Zobs = (phat1 - phat2) /  SE
Zobs

## [1] -0.4128812

This is two tailed. The area to the left of $Zobs$ is doubled. (To the left, as $Zobs$ is negative here.)

2 * pnorm(Zobs)

## [1] 0.6796937

This value is greater than $\alpha$, so the difference is not statistically significant.
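
The first part of the question, the 90% CI for $p_1 - p_2$, is not worked out above. A sketch following the two-sample CI rule from the formula section, where $\hat{p}_1$ and $\hat{p}_2$ are used separately rather than pooled:

```r
phat1 = 20/225
phat2 = 25/250

# For a CI the proportions are not pooled
SE = sqrt(phat1 * (1 - phat1) / 225 + phat2 * (1 - phat2) / 250)
zstar = qnorm(0.90 + 0.10/2)   # critical value for 90% confidence
MOE = zstar * SE
ci = (phat1 - phat2) + c(-MOE, MOE)
ci
```

The interval contains 0, which agrees with the significance test not finding a difference.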

QUESTION: A survey is held to determine if students are aware that a new health and wellness fee of 60 dollars per semester is being proposed. The data collected is:

X   n
25  800

Construct a 90% confidence interval for the population proportion $p$.

ANS:

We have to fill in $\hat{p} \pm z^*SE$. To that end we have:

phat = 25/800
zstar   =  qnorm(0.90   + 0.10/2)
SE = sqrt(phat*(1-phat)/800)
MOE = zstar * SE
phat  + c(-MOE,  MOE)

## [1] 0.02113157 0.04136843

QUESTION CUNY is proposing a $200 per year increase in tuition. But not all students will pay this increase, as some are funded through other programs. Suppose a survey is performed to see what the increase would be for Verazzano students compared to non-Verazzano students. This fictitious survey is summarized by:

               xbar  s    n
Verazzano      125  25    8
Non-Verazzano  100  35    12

Is there evidence, at the $\alpha=0.10$ level, that Verazzano students can expect to pay a bigger increase than non-Verazzano students?

ANS: This is a two sample $t$ test with no assumption of equal population variances. We have:

xbar1 = 125
s1 =  25
n1  =   8
xbar2  =  100
s2  =  35
n2 =   12
SE =   sqrt(s1^2/n1   +  s2^2/n2)
df=  min(n1-1, n2-1)
Tobs =  (xbar1 - xbar2) / SE
Tobs

## [1] 1.862313

We compare this with the critical value which is found for a one-sided test to be:

qt(0.9,  df=df)

## [1] 1.414924

This is less than the observed value, so the $p$-value is less than $\alpha$, hence the difference is statistically significant.
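
The $p$-value itself can also be computed, which makes the comparison with $\alpha$ explicit. A sketch using the same conservative degrees of freedom as above:

```r
xbar1 = 125; s1 = 25; n1 = 8     # Verazzano
xbar2 = 100; s2 = 35; n2 = 12    # non-Verazzano

SE = sqrt(s1^2/n1 + s2^2/n2)
df = min(n1 - 1, n2 - 1)         # conservative degrees of freedom
Tobs = (xbar1 - xbar2) / SE

# Right-tailed test ("bigger increase"): area to the right of Tobs
p_value = pt(Tobs, df = df, lower.tail = FALSE)
p_value

alpha = 0.10
p_value < alpha                  # TRUE means statistically significant
```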

QUESTION CUNY proposes that if its budget request is funded “80% of enrolled students will know about, use appropriately, and report satisfaction with campus mental health, wellness, food security, or clinical health care services.”

Suppose that currently a survey is held asking students about the above. The data is summarized

X   n
75  100

Find a 90% CI for the population proportion. Does it include $0.80$?

ANS: This is done above for a similar problem. We need to fill in $\hat{p} \pm z^* SE$:

phat = 75/100
zstar  = qnorm(0.90 +  0.10/2)
SE = sqrt(phat  *  (1-phat)/100)
MOE = zstar * SE
phat + c(-MOE, MOE)

## [1] 0.6787757 0.8212243

The interval includes $0.80$.

QUESTION In CUNY’s budget request, it is stated: “CUNY is a national model in promoting and enhancing social and economic mobility.” To assess this, data is taken on students’ household income prior to enrolling at CUNY, and the students’ household income 3 years after graduating from CUNY. The data looks like:

                         xbar   s   n
Before enrolling         55     15  22
3yrs after graduation    65     25  26

Find a 90% CI for the difference of population means.

ANS: We have to fill in $(\bar{x}_1 -\bar{x}_2) \pm t^* SE$. To that end:

xbar1 = 55
s1 = 15
n1 = 22
xbar2 =   65
s2 =  25
n2 = 26
df = min(n1-1, n2 - 1)
tstar = qt(0.90 + 0.10/2, df=df)
SE = sqrt(s1^2/n1 + s2^2/n2)
MOE = tstar * SE
(xbar1 - xbar2)  + c(-MOE, MOE)

## [1] -20.07270253   0.07270253

Now, assuming $\sigma_1 = \sigma_2$, perform a significance test with $\alpha=0.05$ that the post graduation mean is more than the pre-enrollment population mean.

ANS: Now the SE is computed differently:

sp = sqrt(((n1-1)*s1^2 + (n2-1)*s2^2) / (n1 + n2 - 2))
SE = sp * sqrt(1/n1 +  1/n2)
Tobs  = (xbar2  - xbar1)/SE  # switched order to match question

We compare this to the critical value, which for a one-sided test uses $\alpha$ rather than $\alpha/2$:

df = n1 + n2  -  2
tstar =  qt(0.05, lower.tail=FALSE, df =  df)
tstar

## [1] 1.67866

The observed value, about $1.64$, is less than the critical value, so the $p$-value $> \alpha= 0.05$ and the difference is not statistically significant.
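
As a self-contained sketch of the pooled test, the following recomputes the statistic and finds the right-tailed $p$-value directly:

```r
xbar1 = 55; s1 = 15; n1 = 22     # before enrolling
xbar2 = 65; s2 = 25; n2 = 26     # 3 years after graduation

# Pooled standard deviation, assuming sigma1 = sigma2
sp = sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
SE = sp * sqrt(1/n1 + 1/n2)
Tobs = (xbar2 - xbar1) / SE      # post-graduation minus pre-enrollment
df = n1 + n2 - 2

# Right-tailed test ("more than"): area to the right of Tobs
p_value = pt(Tobs, df = df, lower.tail = FALSE)
c(Tobs = Tobs, p_value = p_value)
```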

QUESTION CUNY promises in its budget request that “Online instruction has enormous potential for CUNY’s current students, for whom access to online courses can increase credit accumulation and fast-track completion, provide scheduling flexibility and greater course availability, and save students commuting and textbook costs”

Do CSI students agree? A survey of students on credit accumulation by year 2 is taken with cohorts taken from those who have taken online instruction and those who haven’t. The data is summarized below:

                        xbar   s    n
Have taken online        45    20   10
Have not taken online    48    15   25

At the $\alpha=0.05$ level, perform a two-sided significance test that the amount of credits is equal. Do not assume equal variances.

ANS: Using a two-sample test, we have:

xbar1 =    45
s1 = 20
n1 =  10
xbar2 =  48
s2  =  15
n2 = 25
SE = sqrt(s1^2/n1 + s2^2/n2)
df =  min(n1 - 1, n2 - 1)

Tobs  = (xbar1 - xbar2)/SE
Tobs

## [1] -0.4285714

Compare this to the critical value:

alpha = 0.05
tstar  =  qt(alpha/2,  lower.tail=FALSE,  df=df)
tstar

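As $|T_{obs}| \approx 0.43$ is less than the critical value (about $2.26$), the observed value is not unusual and the $p$-value $> \alpha$, so the difference is not statistically significant.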
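
The choice `df = min(n1 - 1, n2 - 1)` is a conservative one. Software such as R's `t.test` uses the Welch-Satterthwaite approximation by default instead. The sketch below computes that alternative for comparison; it is not the method used in this review, and here it does not change the conclusion.

```r
xbar1 = 45; s1 = 20; n1 = 10     # have taken online
xbar2 = 48; s2 = 15; n2 = 25     # have not taken online

v1 = s1^2 / n1
v2 = s2^2 / n2
Tobs = (xbar1 - xbar2) / sqrt(v1 + v2)

df_conservative = min(n1 - 1, n2 - 1)
# Welch-Satterthwaite approximation to the degrees of freedom
df_welch = (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

alpha = 0.05
tstar_conservative = qt(alpha/2, lower.tail = FALSE, df = df_conservative)
tstar_welch = qt(alpha/2, lower.tail = FALSE, df = df_welch)
c(df_conservative = df_conservative, df_welch = df_welch)
c(tstar_conservative = tstar_conservative, tstar_welch = tstar_welch)
abs(Tobs) < tstar_welch          # TRUE means not statistically significant
```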

QUESTION: In CUNY’s budget request, it is written: “In an era of upskilling and reskilling, a college education must respond both to the labor demands and the unique circumstances of students pursuing a degree. By developing modernized career engagement centers and offering further opportunities for paid internships and co-ops, CUNY will align students with the skills and professional networks necessary for the current and future labor demands of growing industries.” In a NY Times op-ed (https://www.nytimes.com/2019/12/07/opinion/sunday/student-success-advice.html) Nicholas Kristof says “1. Take a class in economics and in statistics.”

MTH 214 students are surveyed and asked if the skills they learned are aligned with labor demands. The results are:

             X    n
aligned     18   30

For a 90% confidence interval, what is the margin of error?

What size sample is needed to have a margin of error no more than $0.05$?

ANS: We have seen this before, and need to fill in $\hat{p} \pm z^* SE$:

n = 30
phat = 18/30
SE =  sqrt(phat*(1-phat)/n)
zstar = qnorm(0.9 +  0.10/2)
MOE = zstar * SE
phat + c(-MOE,  MOE)

## [1] 0.4528798 0.7471202

The margin of error is MOE, about $0.147$.

Finally, we solve:

\[ \sqrt{n} = \frac{z^*}{MOE}\sqrt{\hat{p}(1-\hat{p})}. \]

As we don’t know $\hat{p}$ in advance, we take the worst case value $\hat{p}=1/2$, giving

\[ \sqrt{n} \geq \frac{z^*}{2MOE} \]

That is

zstar  =  qnorm(0.9 +  0.10/2)
MOE = 0.05
ceiling((zstar /(2 * MOE))^2)

## [1] 271

Review for final exam (12/16)
