Import data

library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(plyr) 
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:lubridate':
## 
##     here
options(scipen=10)
nine11<-read.csv("nine11calls.csv")
# make a subset
str(nine11)
## 'data.frame':    105571 obs. of  9 variables:
##  $ lat      : num  40.3 40.3 40.1 40.1 40.3 ...
##  $ lng      : num  -75.6 -75.3 -75.4 -75.3 -75.6 ...
##  $ desc     : Factor w/ 105534 levels " ; ; 2016-03-09 @ 05:15:47;",..: 74667 11193 40243 2916 16901 14364 49924 19167 54811 10518 ...
##  $ zip      : int  19525 19446 19401 19401 NA 19446 19044 19426 19438 19462 ...
##  $ title    : Factor w/ 112 levels "EMS: ABDOMINAL PAINS",..: 10 21 84 16 23 37 44 52 59 110 ...
##  $ timeStamp: Factor w/ 60527 levels "1/1/16 0:12",..: 7247 7247 7247 7247 7247 7247 7247 7247 7247 7247 ...
##  $ twp      : Factor w/ 69 levels "","ABINGTON",..: 36 20 37 37 30 23 21 49 32 42 ...
##  $ addr     : Factor w/ 22524 levels "",".","10TH AVE",..: 16107 2161 8578 537 3473 2892 10605 3988 11553 1952 ...
##  $ e        : int  1 1 1 1 1 1 1 1 1 1 ...
set.seed(1234)
nine11 = nine11[-sample(nrow(nine11))[1:(nrow(nine11)/2)],]
#change character to date
nine11$timeStamp<-as.character(nine11$timeStamp)
nine11$timeStamp <- as.POSIXct(nine11$timeStamp, format="%m/%d/%y %H:%M")
#To simplify this analysis, using my timestamp variable need to create new variables needed for the analysis from old variables. So that all dates can be assigned week days as following sun,mon,tues,wed,thurs,fri,sat.Utilizing this variable  for looking at associaciation between emergencies and weekdays. extract weekdays( mon,tues,wed,thurs,fri,sat,sun) from timestamp
nine11$weekday<-wday(nine11$timeStamp, label=TRUE)
nine11$weekdayunord<-factor(nine11$weekday,ordered = FALSE)

#We want to look for whether or not an event involved EMS. This involves finding all levels in nine11$title that have EMS in the level name.
#grep searches a character string; returns a vector of positions in the string where EMS can be found. Checking the length tells us whether or not it's in there at all. sapply creates a vector out of the individual responses for each entry.
nine11$EMS <- sapply(nine11$title, function(s) length(grep("EMS", s)) > 0)
nine11$EMS = as.factor(nine11$EMS)
levels(nine11$EMS)<-c("nonEMS","EMS")

Let’s take a look at latitudes

summary(nine11$lat)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   32.39   40.10   40.15   40.16   40.23   40.58

Introduction and Background

This data set lists out 911 calls including EMS, Traffic,Fire in Montgomery County,PA. It was collected by montcoalert.com The date range for this data set is from Dec 10, 2015 to August 31, 2016. The data was orginally extracted from 911 call logs by data scientists who are interesting in building statistical models and visualizations of data in the 911 calls.

Data

The original dataset had 8 variables and 105,571 observations.The 8 variables were latitude, longitude, description, zip, title, timestamp, township, address. For current analysis I will be using latitude, weekday (created using the timestamp variable), and EMS (created using title variable).

Hypothesis

Here is a list of of my five null and alternative hypotheses:

Null: The mean latitude of calls is 40 Alternative: The mean latitude of calls is >40

Null:The mean latitude of the calls for EMS vs nonEMS calls are the same Alternative:The mean latitude of the calls for EMS vs nonEMS calls are different

Null:The proportion of nonEMS calls is equal to 0.5 (single) Alternative:The proportion of nonEMS calls is >0.5

Null: There is no difference between Friday’s or Monday’s proportion of EMS calls Alternative: Proportion of EMS calls on Friday is higher than Monday

Null: There is no relationship between EMS and Weekdays Alternative: There is a relationship between EMS and Weekdays

All the above hypotheses will be tested at significance level of 0.05

Methods

To test the above hypotheses in the above section, we will use one sample t-test, two sample t-test, Z-test (single proportion),Z-test (test on difference of proportions for two population) and Chi-squared test (test for independence).

For each of the tests, we will first check the assumptions, perform the statistical tests and interpet the test results.In addition,we will indicate possible alternative tests that could have been performed in each case. Laslty, if evidence exist against null hypothesis, we will reject the null hypothesis and accept the alternative hypothesis.

Results

Before I proceed with conducting tests,I will check and justify the assumptions of each test.

One sample t- test assumption

The standard assumption for one sample t-test: The data is Simple Random Sample(SRS) ,from a normal distribution with mean and variance.

To check for normality of the data for one sample t-test , I will use the QQ-plot and histogram

qqnorm(nine11$lat)
qqline(nine11$lat)

ggplot(nine11,aes(lat)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There is departure from normality at both right and left tails. It appears that smaller values are more likely under the data than under normality. The histogram shows that there is skeweness in the distribution of the latitude of calls. However, because the sample size is large (n=52786), the violation of the normality assumption is not important because of central limit theorem. When the sample size is large the t-test is still valid even if the sample heavily is skewed (pg.441[of the F16 textbook]) This makes the t-test a robust test.

One sample t-test

To test for the mean latitude of calls, we utilize the following function

results = t.test(nine11$lat,mu=40,alternative="greater")
results
## 
##  One Sample t-test
## 
## data:  nine11$lat
## t = 397.18, df = 52785, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 40
## 95 percent confidence interval:
##  40.15891      Inf
## sample estimates:
## mean of x 
##  40.15957

We reject the null hypothesis because the p-value is less than 0.05. From this we can conclude that there is evidence that the mean latitude is >40

We have used one sample t-test here because we are testing whether the unknown population mean of the latitude of calls(which is a qunatitative variable) equal to a hypothetical value.Eventhough the assumptions for normality have not been made,the large sample(n=52786) justifies the use of one sample t-test.

My hypothesis is about a single population mean of a quantitative variable(latitude). The appropriate test would be one sample t-test as long the assumptions for the t-test have been made. Alternatively we could have used the sign test. However, one problem with that is that they are less powerful than one sample t-test (pg.438[of the F16 textbook]).

Two-sample t-test assumption

The standard assumption for two-sample t-test is the one-sample t-test assumption applied to each sample.

To check whether assumptions for two sample t-test have been made. We will use QQ-plot for each sample

qqnorm(nine11$lat[nine11$EMS=="EMS"],main="QQ-plot of lat of EMS calls")
qqline(nine11$lat[nine11$EMS=="EMS"])

We can see from the QQ-plot the quantiles are not lying on a straight line. In particular in the tails the quantiles are far away from the line. As a result, the normality assumption does not hold.

qqnorm(nine11$lat[nine11$EMS=="nonEMS"],main="QQ-plot of lat of non-EMS calls")
qqline(nine11$lat[nine11$EMS=="nonEMS"])

The results are similar as in the case of nonEMS. We can see from the QQ-plot of the quantiles that the quantiles are not lying on 45 degree line. In particular in the tails the quantiles are far away from the 45 degree line. As a result, the normality assumption does not hold.However, since both samples are large (n for EMS=26131 , n for nonEMS=26655 ) we can still use t-test.

Two-sample t-test

To to test the mean latitude of calls between EMS and nonEMS are diffrent, we utilize the following function :

results = t.test(nine11$lat~nine11$EMS)
results
## 
##  Welch Two Sample t-test
## 
## data:  nine11$lat by nine11$EMS
## t = -19.515, df = 52210, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.01717391 -0.01403903
## sample estimates:
## mean in group nonEMS    mean in group EMS 
##             40.15185             40.16745

Analyzing my result we reject the null hypothesis because p-value is less than 0.05. We can conclude mean latitude of the calls for EMS vs nonEMS calls are different thus supporting the alternative hypothesis.

My hypothesis is about a mean difference in a quantitative variable(latitude) of two populations (EMS and nonEMS). The appropriate test would be a two sample t-test as long the assumptions for the t-test have been made. Alternatively we could have used the Wilcoxon Rank Sum test (http://biostat.mc.vanderbilt.edu/wiki/pub/Main/AnesShortCourse/NonParametrics.pdf) However, one problem with that is that they are less powerful than two-sample t-test (pg.438[of the F16 textbook])

We have conducted a two-sample t-test because we are dealing with means of two groups. A prop test or Chisquared would not have worked because we are not dealing with proportions or categorical variables.

Z-test(1 -sample proportion) assumption

To verify the suitability of z-test(two proportions) we will check whether the subsequent assumptions have been met.

I.The EMS calls data is a simple randmom sample(SRS) of size from a large population that contains an unknown population proportion of nonEMS calls. II.np0 and nq0 are greater or equal to 10 (where p0 hypothesized value of the proportion of nonEMS calls).

I am assuming that condition I was met( data collected by others)

To check condition II ,we do the following :
p0=0.5 n=52786 n*p0=26393 q0=1-p0 n*q0=26393 As we can see both are greater than 10 , thus we can use z-test.

Z-test

To test if the proportion of nonEMS calls is >0.5 , we will use the following function

table(nine11$EMS)
## 
## nonEMS    EMS 
##  26655  26131
sum(table(nine11$EMS))
## [1] 52786
results = prop.test(table(nine11$EMS)[1],sum(table(nine11$EMS)))
results
## 
##  1-sample proportions test with continuity correction
## 
## data:  table(nine11$EMS)[1] out of sum(table(nine11$EMS)), null probability 0.5
## X-squared = 5.1818, df = 1, p-value = 0.02282
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.5006886 0.5092376
## sample estimates:
##         p 
## 0.5049634

Analyzing my results, we can reject the null hypothesis because p-value is less than 0.05. We can conclude that there is evidence that supports the alternative hypothesis in which the proportion of nonEMS calls are in fact greater than 0.5.

We have used a Z-test because the sample size is large. Z-test is quicker and valid. Alternatively, we could have used a binomial test but we have a large sample and thus it would be computionally demanding.If we had a small sample size the binomial test would have been a better test because in this case the Z-test would not have been valid.

Z-test(two proportions) assumptions

I. The probability for any one call to be EMS or non-EMS is independent between calls, and approximately constant over each day.

II.The number of nonEMS calls and EMS calls in each sample is at least 10

sum(table(nine11$EMS,nine11$weekday)[,"Fri"]) #friday
## [1] 7992
sum(table(nine11$EMS,nine11$weekday)[,"Mon"]) #monday 
## [1] 7638

The number of EMS calls on Mondays =3779 and Friday = 3938 The number of non-EMS calls on Mondays =3859 and Friday = 4054

Z-test(proportions) To test two proportions we will utilize the following functions:

n.fri = sum(table(nine11$EMS,nine11$weekday)[,"Fri"]) #friday
n.mon = sum(table(nine11$EMS,nine11$weekday)[,"Mon"]) #monday 
x.fri = table(nine11$EMS,nine11$weekday)["EMS","Fri"] #friday
x.mon = table(nine11$EMS,nine11$weekday)["EMS","Mon"] #monday 

x.of.EMS = c(Mon=x.mon, Fri=x.fri)
n.of.EMS = c(Mon=n.mon, Fri=n.fri)

Now x.of.EMS is a vector of counts of successes for Monday and Friday respectively. n.of.EMS is a vector of counts of trials for the two days.

Z-test(two proportions)

To run a prop.test we use:

prop.test(x.of.EMS,n.of.EMS,alternative="greater")
## 
##  2-sample test for equality of proportions with continuity
##  correction
## 
## data:  x.of.EMS out of n.of.EMS
## X-squared = 0.055946, df = 1, p-value = 0.4065
## alternative hypothesis: greater
## 95 percent confidence interval:
##  -0.01126679  1.00000000
## sample estimates:
##    prop 1    prop 2 
## 0.4947630 0.4927427

The results for the test showed that the null cannot be rejected because the p-value is larger than 0.05. Therefore, there is not enough evidence to claim that friday’s proportion of EMS calls is higher than that of Monday’s.

An alternative test for this is to explicitly use a Pearson’s Chi-squared test(contigency table) or to use a binomial test. However, the binomial test is not practical and computationally time consuming.

Chi-squared assumptions

The assumptions for Chi-squared test for independence are as following

I.The data is a random sample of single population where on each on each individual in SRS , we have observed the EMS and Weekday variable

II.Each observation is independent of the other

  1. All expected counts are at least 5

I am assuming the data was collected in a way that condition I and II are satisfied.Hence,it is sufficient to look at the expected count table. In R, the easiest way to get this count is by running the chisq.test function and reading out the result:

weekdayvsEMS<-with(data=nine11, chisq.test(weekday,EMS))
weekdayvsEMS$expected
##        EMS
## weekday   nonEMS      EMS
##   Sun   3260.044 3195.956
##   Mon   3856.911 3781.089
##   Tues  4003.855 3925.145
##   Wed   4042.737 3963.263
##   Thurs 3912.457 3835.543
##   Fri   4035.668 3956.332
##   Sat   3543.328 3473.672

All the entries for the expected count table are more than 5. Thus, we can proceed with the test.

Chi-squared Test

We already ran the test above. Here is the result.

weekdayvsEMS
## 
##  Pearson's Chi-squared test
## 
## data:  weekday and EMS
## X-squared = 87.114, df = 6, p-value < 2.2e-16

The null hypothesis is rejected because p-value is less than 0.05.Therefore, we can conclude that there is evidence for a relationship between EMS and Weekdays. To use another test, we could consider utilizing logistic regression which can be used to perform the test for independence. (http://web.pdx.edu/~newsomj/pa551/lectur21.htm)

To investigate the difference we computed percentages for EMS and nonEMS

with(nine11, table(weekday,EMS))
##        EMS
## weekday nonEMS  EMS
##   Sun     2957 3499
##   Mon     3859 3779
##   Tues    4113 3816
##   Wed     4220 3786
##   Thurs   3998 3750
##   Fri     4054 3938
##   Sat     3454 3563
100*with(nine11, table(weekday,EMS))/(nrow(nine11))
##        EMS
## weekday   nonEMS      EMS
##   Sun   5.601864 6.628652
##   Mon   7.310651 7.159095
##   Tues  7.791839 7.229190
##   Wed   7.994544 7.172356
##   Thurs 7.573978 7.104156
##   Fri   7.680067 7.460311
##   Sat   6.543402 6.749896

Conclusion

All the tests used in this report assume that dataset was a random sample. However, the dataset that we worked on was collected by somebody else. It appears the sampling method was convenience sampling. Hence, the random sampling assumption may not hold true.

Works cited :

Moore, David S., George P. McCabe, and Bruce A. Craig. Introduction to the Practice of Statistics. New York: W.H. Freeman and, a Macmillan Higher Education, 2014. Print.

http://web.pdx.edu/~newsomj/pa551/lectur21.htm

http://biostat.mc.vanderbilt.edu/wiki/pub/Main/AnesShortCourse/NonParametrics.pdf