New formulas in Ch 8

Related, but different, Chapter 8 deals with sample proportions for which we have:

\[ \hat{p} = \frac{X}{n}, \quad \hat{p}_1 = \frac{X_1}{n_1}, \quad \hat{p}_2 = \frac{X_2}{n_2} \]

The $X$ above are binomial counts, so each $\hat{p}$ is approximately normal provided $np$ and $n(1-p)$ are both greater than 10.

The main test statistics are (for one and two samples):

\[ Z = \frac{\hat{p} - p}{SE}, \quad Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{SE} \]

If both $np$ and $n(1-p)$ are greater than 10 (for each sample, in the two-sample case), then we treat $Z$ as having a standard normal distribution.

The standard deviations are:

\[ SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}, \quad SD(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} \]

The SE used will depend on the assumptions:

For a one sample CI, we use $\hat{p}$ to estimate $p$.

For a two sample CI, we use $\hat{p}_1$ and $\hat{p}_2$ to estimate $p_1$ and $p_2$.

For a one sample significance test, we use $SE=SD$, as $p$ is an assumed value under $H_o$.

For a two sample significance test, we pool the data and use $\hat{p} = (n_1\hat{p}_1 + n_2\hat{p}_2)/(n_1+n_2)$ as an estimate for both $p_1$ and $p_2$.
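The four choices of SE can be sketched in R as follows. The counts and the null value `p0` are hypothetical, chosen only for illustration and not taken from any question on this sheet; the variable names are ours.

```r
# Hypothetical counts, for illustration only.
x1 <- 40; n1 <- 200
x2 <- 30; n2 <- 250
phat1 <- x1 / n1   # 0.20
phat2 <- x2 / n2   # 0.12

# One-sample 90% CI: estimate p by phat1 inside the SD formula.
se_ci_one <- sqrt(phat1 * (1 - phat1) / n1)
ci_one <- phat1 + c(-1, 1) * qnorm(0.95) * se_ci_one

# One-sample test of H_o: p = p0 (p0 is an assumed value): SE = SD at p0.
p0 <- 0.15
se_test_one <- sqrt(p0 * (1 - p0) / n1)
z_one <- (phat1 - p0) / se_test_one
pval_one <- 2 * pnorm(-abs(z_one))   # two-sided p-value

# Two-sample 90% CI: estimate p1 and p2 by phat1 and phat2.
se_ci_two <- sqrt(phat1 * (1 - phat1) / n1 + phat2 * (1 - phat2) / n2)
ci_two <- (phat1 - phat2) + c(-1, 1) * qnorm(0.95) * se_ci_two

# Two-sample test of H_o: p1 = p2: pool, then use the pooled value for both.
phat_pool <- (n1 * phat1 + n2 * phat2) / (n1 + n2)   # same as (x1 + x2)/(n1 + n2)
se_test_two <- sqrt(phat_pool * (1 - phat_pool) * (1 / n1 + 1 / n2))
z_two <- (phat1 - phat2) / se_test_two
pval_two <- 2 * pnorm(-abs(z_two))   # two-sided p-value
```

As a check, `prop.test(c(x1, x2), c(n1, n2), correct = FALSE)` reports a chi-squared statistic equal to `z_two^2`.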

New (and old) formulas from Ch 10

The regression model is that the mean response for $y$ given $x$ is linear: $\mu_{y\mid x} = \beta_0 + \beta_1 x$. A given value $y_i$ is modeled by the mean plus an error, or $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, where our assumptions are that the $\epsilon_1, \epsilon_2, \dots$ are an i.i.d. sample from a normal population with mean 0 and variance $\sigma^2$.

We estimate the $\beta$s and $\sigma$ with:

\[ b_1 = r \frac{s_y}{s_x}, \quad b_0 = \bar{y} - b_1 \bar{x}, \quad \hat{y}_i = b_0 + b_1 x_i, \quad \text{residual} = e_i = y_i - \hat{y}_i, \quad s = \sqrt{\frac{\sum(y_i - \hat{y}_i)^2}{n-2}} \]
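As a quick check, these estimates can be computed by hand in R and compared with `lm`. The sketch below uses the `cars` data set that ships with R purely as stand-in data; it is not one of the data sets used in the questions.

```r
# 'cars' (datasets package) is stand-in data for illustration only.
x <- cars$speed
y <- cars$dist
n <- length(x)

b1 <- cor(x, y) * sd(y) / sd(x)    # slope: r * s_y / s_x
b0 <- mean(y) - b1 * mean(x)       # intercept
yhat <- b0 + b1 * x                # fitted values
e <- y - yhat                      # residuals
s <- sqrt(sum(e^2) / (n - 2))      # estimate of sigma

# Compare with the values lm() reports.
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(b0, b1))
all.equal(summary(fit)$sigma, s)
```

The value `s` is what `summary` prints as the "Residual standard error".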

We focused on just the slope and saw that the estimator $b_1$ of $\beta_1$ has

\[ SD(b_1) = \frac{\sigma}{\sqrt{\sum(x_i - \bar{x})^2}} \]

with the $SE$ given by estimating $\sigma$ with $s$, from above.

The following statistic has a $t$ distribution with $n-2$ degrees of freedom:

\[ T = \frac{b_1 - \beta_1}{SE}. \]

This allows CIs ($b_1 \pm t^* SE$) and significance tests to be performed.
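The "t value" column printed by `summary` tests $\beta_1 = 0$, so a test of any other null value has to be assembled from the estimate and its standard error. A minimal sketch, again with the built-in `cars` data as stand-in data and a hypothetical null value `beta1_null`:

```r
# Stand-in data; beta1_null is a hypothetical null value for illustration.
fit <- lm(dist ~ speed, data = cars)
n <- nrow(cars)

b1 <- unname(coef(fit)["speed"])
s <- summary(fit)$sigma
se_b1 <- s / sqrt(sum((cars$speed - mean(cars$speed))^2))   # SE of the slope

beta1_null <- 3
t_obs <- (b1 - beta1_null) / se_b1            # T = (b1 - beta1) / SE
p_value <- 2 * pt(-abs(t_obs), df = n - 2)    # two-sided p-value

t_star <- qt(0.975, df = n - 2)               # critical value for a 95% CI
ci <- b1 + c(-1, 1) * t_star * se_b1
```

Here `se_b1` agrees with the "Std. Error" entry for the slope in the `summary` output, and `ci` agrees with `confint(fit)`.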

From $y_i -\bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$, we can use the following names:

\[ SSTotal = \sum(y_i -\bar{y})^2,\quad SSModel = \sum(\hat{y}_i - \bar{y})^2, \quad SSError = \sum(y_i - \hat{y}_i)^2, \]

and the formula $SST = SSM + SSE$. The degrees of freedom add in the same way: $DFT = DFM + DFE$, with $DFT=n-1$ and $DFM=1$, so $DFE = n-2$.

The mean square is the “sum of squares” over the degrees of freedom. In particular, $MSE = SSE/(n-2) = s^2$ is our estimate for $\sigma^2$.

The square of the Pearson correlation coefficient can be expressed as $r^2 = SSM / SST$, which makes precise the statement that $r^2$ is the proportion of the total variation that is explained by the model.

The $F$ statistic is $MSM/MSE$. This is small (close to 0) if the model does not explain much variation; and large if it does. It is used to test if $\beta_1 = 0$, and is output in the software.
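The sums of squares, $r^2$, and $F$ can all be recovered from a fitted model. A sketch, once more with the built-in `cars` data standing in for real data:

```r
# Stand-in data for illustration only.
fit <- lm(dist ~ speed, data = cars)
y <- cars$dist
yhat <- fitted(fit)
n <- length(y)

sst <- sum((y - mean(y))^2)       # SSTotal
ssm <- sum((yhat - mean(y))^2)    # SSModel
sse <- sum((y - yhat)^2)          # SSError

msm <- ssm / 1                    # DFM = 1
mse <- sse / (n - 2)              # DFE = n - 2
f_stat <- msm / mse
r_squared <- ssm / sst

# For simple linear regression, F is the square of the t value for the slope.
t_slope <- summary(fit)$coefficients["speed", "t value"]
all.equal(f_stat, t_slope^2)
```

This is why the $p$-value on the F-statistic line of the output matches the $p$-value for the slope.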

Sample questions

A standard rule of thumb is that soon-to-be-born babies grow a half pound per week in the womb. We will test this hypothesis using the weight data in the variable wt and the gestation time gestation. That is, perform a two-sided test of $H_o: \beta_1 = 1/2$.

res = lm(wt ~ gestation)
summary(res)

## 
## Call:
## lm(formula = wt ~ gestation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.125  -9.767  -3.743  12.085  42.684 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -42.1204    76.9746  -0.547   0.5895  
## gestation     0.5988     0.2710   2.209   0.0374 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.05 on 23 degrees of freedom
## Multiple R-squared:  0.175,  Adjusted R-squared:  0.1392 
## F-statistic:  4.88 on 1 and 23 DF,  p-value: 0.0374

A plot of residuals versus gestation is given below:

plot(gestation, resid(res))

Does this plot indicate if the value of $\sigma$ depends on $x$?

A qqplot of the residuals is given below:

qqnorm(resid(res))

Does this plot indicate normally distributed errors?

A model of neck size predicted by wrist size is tested using a certain data set, assumed to contain a random sample of individuals. The assumption is that neck size is 2 times the wrist size. Using $\alpha=0.05$, perform a two-sided test of significance.

The data is summarized through:

res = lm(neck ~ wrist, fat)
summary(res)

## 
## Call:
## lm(formula = neck ~ wrist, data = fat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9980 -1.0889 -0.0192  1.1149  7.0595 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.6370     2.0058   1.315     0.19    
## wrist         1.9394     0.1099  17.649   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.625 on 250 degrees of freedom
## Multiple R-squared:  0.5548, Adjusted R-squared:  0.553 
## F-statistic: 311.5 on 1 and 250 DF,  p-value: < 2.2e-16

Is there evidence that neck size and wrist size are not dependent (i.e., that $\beta_1 = 0$)?

A plot of residuals versus wrist size is given by

plot(fat$wrist, resid(res))

Based on this, does it appear that $\sigma$ does not depend on wrist size?

A qqplot of the residuals is shown:

qqnorm(resid(res))

Based on this, does the assumption that $\epsilon_i$ are normally distributed seem reasonable?

A survey designer would like to ensure that his survey has a margin of error of no more than 2 percentage points for a 95% confidence interval. How large must $n$ be so that this will be true? Assume the worst case scenario of $\hat{p}=1/2$.
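For questions of this type, the margin of error $m = z^* \sqrt{p(1-p)/n}$ can be solved for $n$, giving $n \geq (z^*/m)^2 p(1-p)$. A small R sketch of that calculation follows; the function name `sample_size_for_margin` is ours, and the example call uses a hypothetical margin of 3 percentage points rather than the value in the question.

```r
# Smallest n for which z* sqrt(p(1-p)/n) <= m.
# The function name and default arguments are our own choices.
sample_size_for_margin <- function(m, conf = 0.95, p = 0.5) {
  z_star <- qnorm(1 - (1 - conf) / 2)    # e.g. about 1.96 for 95%
  ceiling((z_star / m)^2 * p * (1 - p))  # round up to a whole subject
}

sample_size_for_margin(0.03)   # 1068
```

The default `p = 0.5` is the worst case because $p(1-p)$ is largest there.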

The rest of the questions are made up but center around the CUNY budget request to the city and state (https://www.cuny.edu/wp-content/uploads/sites/4/page-assets/about/trustees/meetings-of-the-board/2020-2021-Operating-Budget-Request-and-Four-Year-Financial-Plan.pdf).

In 2016, it was written “Eight in 10 CUNY students university-wide graduate with no federal education loans.” Suppose in 2019, a survey of 1000 students is made and it is found that 24 percent have federal education loans. Is there evidence, at the $\alpha=0.05$ level, that students in 2019 have a greater percentage holding federal loans?

Suppose a survey of CSI and Baruch students was made to assess their support of a proposed tuition increase of 200 dollars per year. The data is summarized below:

            X     n
CSI        20   225
Baruch     25   250

Construct a 90% CI for $p_1 - p_2$.

For the same data, carry out a significance test with $\alpha=0.05$ testing the assumption that the population proportions are equal against a two sided alternative.

A survey is held to determine if students are aware that a new health and wellness fee of 60 dollars per semester is being proposed. The data collected is:

X   n
25  800

Construct a 90% confidence interval for the population proportion $p$.

CUNY is proposing a $200 per year increase in tuition. But not all students will pay this increase, as some are funded through other programs. Suppose a survey is performed to see what the increase would be for Verazzano students compared to non-Verazzano students. This fictitious survey is summarized by:

               xbar   s    n
Verazzano      125   25    8
Non-Verazzano  100   35   12

Is there evidence, at the $\alpha=0.10$ level, that Verazzano students can expect to pay a bigger increase than non-Verazzano students?

CUNY proposes that if its budget request is funded “80% of enrolled students will know about, use appropriately, and report satisfaction with campus mental health, wellness, food security, or clinical health care services.”

Suppose that a survey is currently held asking students about the above. The data is summarized by:

X   n
75  100

Find a 90% CI for the population proportion. Does it include $0.80$?

In CUNY’s budget request, it is stated: “CUNY is a national model in promoting and enhancing social and economic mobility.” To assess this, data is taken on students' household income prior to enrolling at CUNY, and the students' household income 3 years after graduating from CUNY. The data looks like:

                         xbar   s   n
Before enrolling         55     15  22
3yrs after graduation    65     25  26

Find a 90% CI for the difference of population means.

Now, assuming $\sigma_1 = \sigma_2$, perform a significance test with $\alpha=0.05$ that the post graduation mean is more than the pre-enrollment population mean.

CUNY promises in its budget request that “Online instruction has enormous potential for CUNY’s current students, for whom access to online courses can increase credit accumulation and fast-track completion, provide scheduling flexibility and greater course availability, and save students commuting and textbook costs”

Do CSI students agree? A survey of students on credit accumulation by year 2 is taken, with cohorts taken from those who have taken online instruction and those who haven't. The data is summarized below:

                        xbar   s    n
Have taken online        45    20   10
Have not taken online    48    15   25

At the $\alpha=0.05$ level, perform a two-sided significance test that the mean number of credits is equal for the two groups. Do not assume equal variances.
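The two-sample $t$ procedures are not restated on this sheet. As a reminder, the sketch below computes the test statistic from summary statistics, assuming the usual Welch (unequal variances) and pooled (equal variances) forms. The function name `two_sample_t` and the numbers in the example call are hypothetical.

```r
# Two-sample t statistic from summary statistics.
# pooled = FALSE: Welch form with the Welch-Satterthwaite degrees of freedom.
# pooled = TRUE:  assumes sigma_1 = sigma_2 and pools the variances.
two_sample_t <- function(xbar1, s1, n1, xbar2, s2, n2, pooled = FALSE) {
  if (pooled) {
    sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2)
    se <- sqrt(sp2 * (1 / n1 + 1 / n2))
    df <- n1 + n2 - 2
  } else {
    v1 <- s1^2 / n1
    v2 <- s2^2 / n2
    se <- sqrt(v1 + v2)
    df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  }
  t_obs <- (xbar1 - xbar2) / se
  list(t = t_obs, df = df, se = se,
       p_two_sided = 2 * pt(-abs(t_obs), df = df))
}

# Hypothetical summary statistics, for illustration only.
out <- two_sample_t(xbar1 = 10, s1 = 2, n1 = 15, xbar2 = 8, s2 = 3, n2 = 20)
out$t; out$df; out$p_two_sided
```

When working by hand, some courses replace the Welch degrees of freedom with the conservative choice $\min(n_1 - 1, n_2 - 1)$. For a one-sided alternative, use one tail of the $t$ distribution instead of `p_two_sided`.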

In CUNY’s budget request, it is written: “In an era of upskilling and reskilling, a college education must respond both to the labor demands and the unique circumstances of students pursuing a degree. By developing modernized career engagement centers and offering further opportunities for paid internships and co-ops, CUNY will align students with the skills and professional networks necessary for the current and future labor demands of growing industries.” In a NY Times op-ed (https://www.nytimes.com/2019/12/07/opinion/sunday/student-success-advice.html) Nicholas Kristof says “1. Take a class in economics and in statistics.”

In a survey, MTH 214 students are asked if the skills they learned are aligned with labor demands. The results are:

             X    n
aligned     18   30

For a 90% confidence interval, what is the margin of error?

What size sample is needed to have a margin of error no more than $0.05$?

Review for final exam (12/16)
