Import data
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(plyr)
##
## Attaching package: 'plyr'
## The following object is masked from 'package:lubridate':
##
## here
options(scipen=10)
nine11<-read.csv("nine11calls.csv")
# make a subset
str(nine11)
## 'data.frame': 105571 obs. of 9 variables:
## $ lat : num 40.3 40.3 40.1 40.1 40.3 ...
## $ lng : num -75.6 -75.3 -75.4 -75.3 -75.6 ...
## $ desc : Factor w/ 105534 levels " ; ; 2016-03-09 @ 05:15:47;",..: 74667 11193 40243 2916 16901 14364 49924 19167 54811 10518 ...
## $ zip : int 19525 19446 19401 19401 NA 19446 19044 19426 19438 19462 ...
## $ title : Factor w/ 112 levels "EMS: ABDOMINAL PAINS",..: 10 21 84 16 23 37 44 52 59 110 ...
## $ timeStamp: Factor w/ 60527 levels "1/1/16 0:12",..: 7247 7247 7247 7247 7247 7247 7247 7247 7247 7247 ...
## $ twp : Factor w/ 69 levels "","ABINGTON",..: 36 20 37 37 30 23 21 49 32 42 ...
## $ addr : Factor w/ 22524 levels "",".","10TH AVE",..: 16107 2161 8578 537 3473 2892 10605 3988 11553 1952 ...
## $ e : int 1 1 1 1 1 1 1 1 1 1 ...
set.seed(1234)
nine11 = nine11[-sample(nrow(nine11))[1:(nrow(nine11)/2)],]
#change character to date
nine11$timeStamp<-as.character(nine11$timeStamp)
nine11$timeStamp <- as.POSIXct(nine11$timeStamp, format="%m/%d/%y %H:%M")
#To simplify this analysis, using my timestamp variable need to create new variables needed for the analysis from old variables. So that all dates can be assigned week days as following sun,mon,tues,wed,thurs,fri,sat.Utilizing this variable for looking at associaciation between emergencies and weekdays. extract weekdays( mon,tues,wed,thurs,fri,sat,sun) from timestamp
nine11$weekday<-wday(nine11$timeStamp, label=TRUE)
nine11$weekdayunord<-factor(nine11$weekday,ordered = FALSE)
#We want to look for whether or not an event involved EMS. This involves finding all levels in nine11$title that have EMS in the level name.
#grep searches a character string; returns a vector of positions in the string where EMS can be found. Checking the length tells us whether or not it's in there at all. sapply creates a vector out of the individual responses for each entry.
nine11$EMS <- sapply(nine11$title, function(s) length(grep("EMS", s)) > 0)
nine11$EMS = as.factor(nine11$EMS)
levels(nine11$EMS)<-c("nonEMS","EMS")
Let’s take a look at latitudes
summary(nine11$lat)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 32.39 40.10 40.15 40.16 40.23 40.58
This data set lists out 911 calls including EMS, Traffic,Fire in Montgomery County,PA. It was collected by montcoalert.com The date range for this data set is from Dec 10, 2015 to August 31, 2016. The data was orginally extracted from 911 call logs by data scientists who are interesting in building statistical models and visualizations of data in the 911 calls.
The original dataset had 8 variables and 105,571 observations.The 8 variables were latitude, longitude, description, zip, title, timestamp, township, address. For current analysis I will be using latitude, weekday (created using the timestamp variable), and EMS (created using title variable).
Here is a list of of my five null and alternative hypotheses:
Null: The mean latitude of calls is 40 Alternative: The mean latitude of calls is >40
Null:The mean latitude of the calls for EMS vs nonEMS calls are the same Alternative:The mean latitude of the calls for EMS vs nonEMS calls are different
Null:The proportion of nonEMS calls is equal to 0.5 (single) Alternative:The proportion of nonEMS calls is >0.5
Null: There is no difference between Friday’s or Monday’s proportion of EMS calls Alternative: Proportion of EMS calls on Friday is higher than Monday
Null: There is no relationship between EMS and Weekdays Alternative: There is a relationship between EMS and Weekdays
All the above hypotheses will be tested at significance level of 0.05
To test the above hypotheses in the above section, we will use one sample t-test, two sample t-test, Z-test (single proportion),Z-test (test on difference of proportions for two population) and Chi-squared test (test for independence).
For each of the tests, we will first check the assumptions, perform the statistical tests and interpet the test results.In addition,we will indicate possible alternative tests that could have been performed in each case. Laslty, if evidence exist against null hypothesis, we will reject the null hypothesis and accept the alternative hypothesis.
Before I proceed with conducting tests,I will check and justify the assumptions of each test.
The standard assumption for one sample t-test: The data is Simple Random Sample(SRS) ,from a normal distribution with mean and variance.
To check for normality of the data for one sample t-test , I will use the QQ-plot and histogram
qqnorm(nine11$lat)
qqline(nine11$lat)
ggplot(nine11,aes(lat)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There is departure from normality at both right and left tails. It appears that smaller values are more likely under the data than under normality. The histogram shows that there is skeweness in the distribution of the latitude of calls. However, because the sample size is large (n=52786), the violation of the normality assumption is not important because of central limit theorem. When the sample size is large the t-test is still valid even if the sample heavily is skewed (pg.441[of the F16 textbook]) This makes the t-test a robust test.
To test for the mean latitude of calls, we utilize the following function
results = t.test(nine11$lat,mu=40,alternative="greater")
results
##
## One Sample t-test
##
## data: nine11$lat
## t = 397.18, df = 52785, p-value < 2.2e-16
## alternative hypothesis: true mean is greater than 40
## 95 percent confidence interval:
## 40.15891 Inf
## sample estimates:
## mean of x
## 40.15957
We reject the null hypothesis because the p-value is less than 0.05. From this we can conclude that there is evidence that the mean latitude is >40
We have used one sample t-test here because we are testing whether the unknown population mean of the latitude of calls(which is a qunatitative variable) equal to a hypothetical value.Eventhough the assumptions for normality have not been made,the large sample(n=52786) justifies the use of one sample t-test.
My hypothesis is about a single population mean of a quantitative variable(latitude). The appropriate test would be one sample t-test as long the assumptions for the t-test have been made. Alternatively we could have used the sign test. However, one problem with that is that they are less powerful than one sample t-test (pg.438[of the F16 textbook]).
The standard assumption for two-sample t-test is the one-sample t-test assumption applied to each sample.
To check whether assumptions for two sample t-test have been made. We will use QQ-plot for each sample
qqnorm(nine11$lat[nine11$EMS=="EMS"],main="QQ-plot of lat of EMS calls")
qqline(nine11$lat[nine11$EMS=="EMS"])
We can see from the QQ-plot the quantiles are not lying on a straight line. In particular in the tails the quantiles are far away from the line. As a result, the normality assumption does not hold.
qqnorm(nine11$lat[nine11$EMS=="nonEMS"],main="QQ-plot of lat of non-EMS calls")
qqline(nine11$lat[nine11$EMS=="nonEMS"])
The results are similar as in the case of nonEMS. We can see from the QQ-plot of the quantiles that the quantiles are not lying on 45 degree line. In particular in the tails the quantiles are far away from the 45 degree line. As a result, the normality assumption does not hold.However, since both samples are large (n for EMS=26131 , n for nonEMS=26655 ) we can still use t-test.
To to test the mean latitude of calls between EMS and nonEMS are diffrent, we utilize the following function :
results = t.test(nine11$lat~nine11$EMS)
results
##
## Welch Two Sample t-test
##
## data: nine11$lat by nine11$EMS
## t = -19.515, df = 52210, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.01717391 -0.01403903
## sample estimates:
## mean in group nonEMS mean in group EMS
## 40.15185 40.16745
Analyzing my result we reject the null hypothesis because p-value is less than 0.05. We can conclude mean latitude of the calls for EMS vs nonEMS calls are different thus supporting the alternative hypothesis.
My hypothesis is about a mean difference in a quantitative variable(latitude) of two populations (EMS and nonEMS). The appropriate test would be a two sample t-test as long the assumptions for the t-test have been made. Alternatively we could have used the Wilcoxon Rank Sum test (http://biostat.mc.vanderbilt.edu/wiki/pub/Main/AnesShortCourse/NonParametrics.pdf) However, one problem with that is that they are less powerful than two-sample t-test (pg.438[of the F16 textbook])
We have conducted a two-sample t-test because we are dealing with means of two groups. A prop test or Chisquared would not have worked because we are not dealing with proportions or categorical variables.
To verify the suitability of z-test(two proportions) we will check whether the subsequent assumptions have been met.
I.The EMS calls data is a simple randmom sample(SRS) of size from a large population that contains an unknown population proportion of nonEMS calls. II.np0 and nq0 are greater or equal to 10 (where p0 hypothesized value of the proportion of nonEMS calls).
I am assuming that condition I was met( data collected by others)
To check condition II ,we do the following :
p0=0.5
n=52786
n*p0=26393
q0=1-p0
n*q0=26393
As we can see both are greater than 10 , thus we can use z-test.
To test if the proportion of nonEMS calls is >0.5 , we will use the following function
table(nine11$EMS)
##
## nonEMS EMS
## 26655 26131
sum(table(nine11$EMS))
## [1] 52786
results = prop.test(table(nine11$EMS)[1],sum(table(nine11$EMS)))
results
##
## 1-sample proportions test with continuity correction
##
## data: table(nine11$EMS)[1] out of sum(table(nine11$EMS)), null probability 0.5
## X-squared = 5.1818, df = 1, p-value = 0.02282
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.5006886 0.5092376
## sample estimates:
## p
## 0.5049634
Analyzing my results, we can reject the null hypothesis because p-value is less than 0.05. We can conclude that there is evidence that supports the alternative hypothesis in which the proportion of nonEMS calls are in fact greater than 0.5.
We have used a Z-test because the sample size is large. Z-test is quicker and valid. Alternatively, we could have used a binomial test but we have a large sample and thus it would be computionally demanding.If we had a small sample size the binomial test would have been a better test because in this case the Z-test would not have been valid.
I. The probability for any one call to be EMS or non-EMS is independent between calls, and approximately constant over each day.
II.The number of nonEMS calls and EMS calls in each sample is at least 10
sum(table(nine11$EMS,nine11$weekday)[,"Fri"]) #friday
## [1] 7992
sum(table(nine11$EMS,nine11$weekday)[,"Mon"]) #monday
## [1] 7638
The number of EMS calls on Mondays =3779 and Friday = 3938 The number of non-EMS calls on Mondays =3859 and Friday = 4054
Z-test(proportions) To test two proportions we will utilize the following functions:
n.fri = sum(table(nine11$EMS,nine11$weekday)[,"Fri"]) #friday
n.mon = sum(table(nine11$EMS,nine11$weekday)[,"Mon"]) #monday
x.fri = table(nine11$EMS,nine11$weekday)["EMS","Fri"] #friday
x.mon = table(nine11$EMS,nine11$weekday)["EMS","Mon"] #monday
x.of.EMS = c(Mon=x.mon, Fri=x.fri)
n.of.EMS = c(Mon=n.mon, Fri=n.fri)
Now x.of.EMS
is a vector of counts of successes for Monday and Friday respectively. n.of.EMS
is a vector of counts of trials for the two days.
To run a prop.test we use:
prop.test(x.of.EMS,n.of.EMS,alternative="greater")
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: x.of.EMS out of n.of.EMS
## X-squared = 0.055946, df = 1, p-value = 0.4065
## alternative hypothesis: greater
## 95 percent confidence interval:
## -0.01126679 1.00000000
## sample estimates:
## prop 1 prop 2
## 0.4947630 0.4927427
The results for the test showed that the null cannot be rejected because the p-value is larger than 0.05. Therefore, there is not enough evidence to claim that friday’s proportion of EMS calls is higher than that of Monday’s.
An alternative test for this is to explicitly use a Pearson’s Chi-squared test(contigency table) or to use a binomial test. However, the binomial test is not practical and computationally time consuming.
The assumptions for Chi-squared test for independence are as following
I.The data is a random sample of single population where on each on each individual in SRS , we have observed the EMS and Weekday variable
II.Each observation is independent of the other
I am assuming the data was collected in a way that condition I and II are satisfied.Hence,it is sufficient to look at the expected count table. In R, the easiest way to get this count is by running the chisq.test
function and reading out the result:
weekdayvsEMS<-with(data=nine11, chisq.test(weekday,EMS))
weekdayvsEMS$expected
## EMS
## weekday nonEMS EMS
## Sun 3260.044 3195.956
## Mon 3856.911 3781.089
## Tues 4003.855 3925.145
## Wed 4042.737 3963.263
## Thurs 3912.457 3835.543
## Fri 4035.668 3956.332
## Sat 3543.328 3473.672
All the entries for the expected count table are more than 5. Thus, we can proceed with the test.
We already ran the test above. Here is the result.
weekdayvsEMS
##
## Pearson's Chi-squared test
##
## data: weekday and EMS
## X-squared = 87.114, df = 6, p-value < 2.2e-16
The null hypothesis is rejected because p-value is less than 0.05.Therefore, we can conclude that there is evidence for a relationship between EMS and Weekdays. To use another test, we could consider utilizing logistic regression which can be used to perform the test for independence. (http://web.pdx.edu/~newsomj/pa551/lectur21.htm)
To investigate the difference we computed percentages for EMS and nonEMS
with(nine11, table(weekday,EMS))
## EMS
## weekday nonEMS EMS
## Sun 2957 3499
## Mon 3859 3779
## Tues 4113 3816
## Wed 4220 3786
## Thurs 3998 3750
## Fri 4054 3938
## Sat 3454 3563
100*with(nine11, table(weekday,EMS))/(nrow(nine11))
## EMS
## weekday nonEMS EMS
## Sun 5.601864 6.628652
## Mon 7.310651 7.159095
## Tues 7.791839 7.229190
## Wed 7.994544 7.172356
## Thurs 7.573978 7.104156
## Fri 7.680067 7.460311
## Sat 6.543402 6.749896
All the tests used in this report assume that dataset was a random sample. However, the dataset that we worked on was collected by somebody else. It appears the sampling method was convenience sampling. Hence, the random sampling assumption may not hold true.
Moore, David S., George P. McCabe, and Bruce A. Craig. Introduction to the Practice of Statistics. New York: W.H. Freeman and, a Macmillan Higher Education, 2014. Print.
http://web.pdx.edu/~newsomj/pa551/lectur21.htm
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/AnesShortCourse/NonParametrics.pdf