Project Description

Abstract

     Traffic is among the biggest causes of injury and death in today's megacities. Navigation tools such as Google Maps have made travel more convenient, but not safer: there is still no product that alerts drivers to the risk of traffic accidents on the streets they are driving through. I therefore propose a Traffic Warning System to help predict the risk of traffic accidents on the streets of Brooklyn, New York.

     I collected datasets from the Google BigQuery public datasets and the NYC Open Data website. A Generalized Linear Mixed Effect Model (GLMM) is used to predict the probability of a traffic accident based on time, traffic speed, weather, road construction, and 311 service reports. After evaluating the preliminary GLMM, I find that speed is not significant, while weather conditions are highly significant in predicting the occurrence of traffic accidents. With the final model I obtain about 22% power at roughly a 10% Type I error rate when predicting the occurrence of traffic accidents on 1023 streets of Brooklyn in the testing set.

Introduction

Domain Problem

     The New York City Council passed a local law in 2011 that allows the collection and publication of NYC traffic collision data, which enables researchers to study the temporal and spatial patterns of traffic accidents. Based on these data, I'd like to apply the GLMM methods I learned in class and see whether they can be used for a real-life purpose.

     The collision dataset by itself is not adequate for building the model. After searching the internet for other available metrics, I found three additional datasets: NYC road construction records, the NYC 311 service log, and historical weather for Brooklyn. These datasets are useful because they relate directly to traffic conditions and because they contain time and location information, so they can be matched to the traffic collision dataset.

     Given the time and scope of this paper, I filter the collision dataset to the range 2017-03-01 to 2018-01-21. The end date is determined both by the incompleteness of the Google BigQuery data (it is claimed to be up to date, but in fact it is not) and by the need for overlap among the datasets (I want most observations to contain all the metrics). The cleaned dataset contains 25796 observations with 51 variables.

Exploratory Analysis

     First I'd like to explore where these traffic collisions happened. Both a point map and a density map are drawn to visualize the accidents in the borough of Brooklyn. For the density map, I use the cumulative dataset to better show how collisions accumulate over time and to pinpoint the collision hot zones. The Sample Collision Information Map on 2017-03-22 shows the severity of each accident and the number of people injured in the accidents that happened on a single day, March 22, 2017. Note that the weather condition that day was snow, which I will later show to be a significant factor in traffic accidents.

Figure 1: Sample Collision Information Map on 2017-3-22
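As a rough illustration, such a point map can be drawn with ggplot2. This is only a sketch, and the column names (timestamp, longitude, latitude, number_of_persons_injured) are assumptions about the collision data frame:

# Sketch: point map of one day's collisions, sized by the number of persons injured
library(ggplot2)
library(dplyr)

df_day <- df_collision %>%
  filter(as.Date(timestamp) == as.Date("2017-03-22"))

ggplot(df_day, aes(x = longitude, y = latitude)) +
  geom_point(aes(size = number_of_persons_injured), colour = "red", alpha = 0.6) +
  coord_quickmap() +
  labs(title = "Collisions in Brooklyn, 2017-03-22", size = "Persons injured")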

     The Dynamic Density Map of Cumulative Collisions till 2018-01-21 shows the cumulative collision density from 2017-03-01 to 2018-01-21. Note that, from top to bottom of the map, Williamsburg, Atlantic Avenue and Flatbush Avenue are among the areas with the highest frequency of collisions.

Dynamic Density Map of Cumulative Collisions

     Next I want to visualize the contributing factors by how frequently they appear in collisions. I use treemaps and select contributing factors of interest: Day of Week, Hour, 311 Service Complaint Type, etc. From the treemap we can see that the previously mentioned Atlantic Avenue and Flatbush Avenue are indeed the most dangerous streets in Brooklyn. We can also see that the leading contributing factor is Driver Inattention/Distraction, and the city should consider putting more emphasis on warning drivers about this danger. In terms of weather conditions, Fog and Smoke appear most often among the accidents.

Treemap of Contributing Factors
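As an example, the treemap of contributing factors could be drawn with the treemap package along these lines; the column name contributing_factor_vehicle_1 is an assumption:

# Sketch: treemap of collision counts by contributing factor
library(treemap)
library(dplyr)

factor_counts <- df_collision %>%
  count(contributing_factor_vehicle_1, name = "n_accidents")

treemap(factor_counts,
        index = "contributing_factor_vehicle_1",
        vSize = "n_accidents",
        title = "Collisions by contributing factor")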

Methods

Data Collection & Manipulation

     In order to perform the analysis, I have to merge the collision, weather, construction, 311 service log, and speed datasets into a single data frame. I first downloaded the full dataset for each metric, each in a separate spreadsheet. I convert the time variable in each sheet to POSIXct (timezone = EST) and filter each dataset to the range 2017-03-01 to 2018-01-22. Next I merge the collision events with the metrics recorded at the same location, using the geographic coordinates (longitude and latitude) for spatial matching with a precision of roughly 300 meters in radius.

     The speed dataset consists of readings from all traffic cameras with speed detectors in NYC, recorded every 5 minutes. The raw speed data come in large spreadsheets, which I filter to the Brooklyn borough and merge into a single sheet; the resulting dataset contains readings from 18 speed detectors in Brooklyn. I first match the street of each collision to the camera's unique linkid, then match readings within 10 minutes of the accident (using the average speed). The other metrics are matched to collision events in the same way. Given the relatively small number of speed cameras in Brooklyn, the model dataset with speed information narrows down to 304 observations across 7 linkids (i.e., 7 street levels).
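A minimal sketch of the time standardization and spatio-temporal matching, under assumed object and column names (df_collision and df_speed with columns time, longitude, latitude, speed):

# Standardize timestamps and match speed readings within ~300 m and 10 minutes
library(dplyr)
library(geosphere)   # distHaversine() for great-circle distances in meters

df_collision <- df_collision %>%
  mutate(time = as.POSIXct(time, tz = "EST")) %>%
  filter(time >= as.POSIXct("2017-03-01", tz = "EST"),
         time <  as.POSIXct("2018-01-22", tz = "EST"))

# Average speed recorded within 10 minutes and ~300 m of a single collision
match_speed <- function(collision, df_speed) {
  near <- df_speed %>%
    filter(abs(as.numeric(difftime(time, collision$time, units = "mins"))) <= 10,
           distHaversine(cbind(longitude, latitude),
                         c(collision$longitude, collision$latitude)) <= 300)
  mean(near$speed, na.rm = TRUE)
}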

Preliminary Model

     My first instinct is to fit a one-level Generalized Linear Mixed Effect Model (GLMM):

\(
\begin{aligned}
logit(E[y_{ij}|u_i]) = \beta_0 &+ u_i + \beta_1*Speed_{ij} + \beta_{2}*1\{Time_{ij} = Morning\} + \beta_{3}*1\{Time_{ij} = Afternoon\} \\
&+ \beta_{4}*1\{Time_{ij} = Night\} + \beta_{5}*1\{Time_{ij} = Midnight\} + \beta_6*Rainfall_{ij}\\
& + \beta_7*Snowdepth_{ij} + \beta_{8}*1\{Weather\ Type_{ij}=Fog\}+ \beta_{9}*1\{Weather\ Type_{ij}=Thunder\} \\
&+ \beta_{10}*1\{Weather\ Type_{ij}=Smoke\}+ \beta_{11}*1\{Weather\ Type_{ij}=Ice\ Pellet\}\\
& + \beta_{12}*1\{Roadwork_{ij}=True\} + \beta_{13}*1\{311Service_{ij}=True\}
\end{aligned}
\)
\(
i:\ street,\ \ j:\ accident\ at\ street\ i,\ u_i\sim N(0, \sigma^2)
\)
\(
\begin{aligned}
y_{ij} &= 1,\ if\ accident\ j\ caused\ by\ nondriver\ factors\ happens\ at\ street\ i\\
y_{ij} &= 0,\ otherwise
\end{aligned}
\)

     Then I split the dataset into training and testing sets with a stratified 8:2 ratio, write the formula into glmer() to fit on the training set, and examine the result:
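The fitting call can be reconstructed from the summary output that follows; a minimal sketch (the object name mod.1 is an assumption):

# Fit the preliminary one-level GLMM with a random intercept per camera linkid
library(lme4)
mod.1 <- glmer(y ~ speed + factor(hour_div) + RainFall + SnowDepth + Fog +
                 Thumder + Smoke + Ice_Pellets + (Construction == "Yes") +
                 call4road + (1 | linkid),
               data = df_train, family = binomial("logit"),
               nAGQ = 50, control = glmerControl(optimizer = "bobyqa"))
summary(mod.1)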

Generalized linear mixed model fit by maximum likelihood (Adaptive Gauss-Hermite Quadrature, nAGQ = 50) ['glmerMod']
 Family: binomial  ( logit )
Formula: y ~ speed + factor(hour_div) + RainFall + SnowDepth + Fog + Thumder +  
    Smoke + Ice_Pellets + (Construction == "Yes") + call4road +      (1 | linkid)
   Data: df_train
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
   352.7    401.6   -162.3    324.7      229 

Scaled residuals: 
       Min         1Q     Median         3Q        Max 
-1.9268315 -0.9206406 -0.6487477  1.0131595  1.4761933 

Random effects:
 Groups Name        Variance Std.Dev.
 linkid (Intercept) 0        0       
Number of obs: 243, groups:  linkid, 7

Fixed effects:
                              Estimate   Std. Error  z value Pr(>|z|)  
(Intercept)               -0.098198586  0.411153523 -0.23884 0.811232  
speed                      0.007086588  0.008555082  0.82835 0.407473  
factor(hour_div)Midnight   0.246678600  0.539778893  0.45700 0.647672  
factor(hour_div)Morning   -0.062978348  0.302321136 -0.20832 0.834982  
factor(hour_div)Night      0.047951092  0.448991589  0.10680 0.914950  
RainFall                   1.985541626  0.957263614  2.07418 0.038062 *
SnowDepth                  0.187680034  0.172223470  1.08975 0.275825  
FogTRUE                   -0.230906735  0.309801201 -0.74534 0.456067  
ThumderTRUE               -0.481695638  0.385351257 -1.25002 0.211293  
SmokeTRUE                 -0.078852210  0.298247460 -0.26439 0.791483  
Ice_PelletsTRUE            0.246364007  0.772278225  0.31901 0.749719  
Construction == "Yes"TRUE  0.964580224  1.235333665  0.78083 0.434905  
call4roadTRUE             -0.305557640  0.280056232 -1.09106 0.275247  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

     Note from the p-values of the fixed effects that, except for RainFall, none of the factors is significant. In particular, the estimated coefficient for speed is 0.007, which contributes almost nothing to the model. In addition, the estimated variance of the random effect is 0, which would mean there is no street-level random effect, and that is clearly not what I expected.

     One thing worth further discussion is the insignificance of speed. Intuition says that higher speed leaves drivers less time to react to changes in traffic conditions, and should therefore be a major factor in predicting collisions. The model, however, says the opposite: weather conditions (such as RainFall, measured in mm) play a much more important part. Possible reasons for this are:

  • There are only 7 speed cameras in the training data, and they are located mostly along the I-278 expressway and around the Red Hook and Greenpoint areas (see the camera distribution map below). Given this limited coverage, it is reasonable that road speed is not very significant: on the expressway, for example, traffic speed is high but the gaps between vehicles are also large, so drivers have a comparable amount of time to respond to sudden changes in traffic, and a higher speed is not significantly more dangerous.

  • The speed information is the average speed recorded by the camera on that road and has no direct relationship with the speed of the vehicles involved in the accident; accidents involve only a small fraction of vehicles compared with the overall volume of traffic.

  • There may be multicollinearity among the predictors, which should be checked.

Multicollinearity
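The overall diagnostics below are produced with the mctest package; a minimal sketch of the call (df_vif is assumed to hold the dummy-coded fixed-effect predictors and df_model the merged model data):

# Overall multicollinearity diagnostics on the fixed-effect predictors
library(mctest)
omcdiag(x = df_vif, y = as.numeric(df_model$y))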

Call:
omcdiag(x = df_vif, y = as.numeric(df_model$y))

Overall Multicollinearity Diagnostics

                       MC Results detection
Determinant |X'X|:         0.3511         0
Farrar Chi-Square:       314.6253         1
Red Indicator:             0.1253         0
Sum of Lambda Inverse:    14.2751         0
Theil's Method:            1.4489         1
Condition Number:          8.3896         0

1 --> COLLINEARITY is detected by the test 
0 --> COLLINEARITY is not detected by the test

     From the result we see that the determinant of |X'X| (0.3511) is not small enough to trigger a detection, and 4 of the 6 diagnostics do not detect collinearity. Based on this majority vote I conclude that there is most likely no serious multicollinearity in the model dataset.

Modified Model

     I then decide to remove the speed variable from the dataset. By removing it I can make use of a much larger dataset (25796 observations) with the other factors such as weather and construction. I again do a stratified 8:2 split of the dataset, this time by on_street_name, and assign a random intercept to each street in the one-level generalized linear mixed effect model.

\(
\begin{aligned}
logit(E[y_{ij}|u_i]) = \beta_0 &+ u_i + \beta_{1}*1\{Time_{ij} = Morning\} + \beta_{2}*1\{Time_{ij} = Afternoon\} + \beta_{3}*1\{Time_{ij} = Night\} \\
&+ \beta_{4}*1\{Time_{ij} = Midnight\} + \beta_5*Rainfall_{ij} + \beta_6*Snowdepth_{ij} + \beta_{7}*1\{Weather\ Type_{ij}=Fog\}\\
&+ \beta_{8}*1\{Weather\ Type_{ij}=Thunder\} + \beta_{9}*1\{Weather\ Type_{ij}=Smoke\}\\
&+ \beta_{10}*1\{Weather\ Type_{ij}=Ice\ Pellet\} + \beta_{11}*1\{Roadwork_{ij}=True\} \\
&+ \beta_{12}*1\{311Service_{ij}=True\}
\end{aligned}
\)
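The stratified split mentioned above could be done roughly as follows; this is a dplyr-based sketch, with df_model as an assumed name for the merged data and an arbitrary seed:

# 80/20 stratified split by street, so every street appears in the training set
library(dplyr)
set.seed(1)   # arbitrary seed for reproducibility

df_model <- df_model %>% mutate(row_id = row_number())
df_train <- df_model %>%
  group_by(on_street_name) %>%
  sample_frac(0.8) %>%
  ungroup()
df_test <- anti_join(df_model, df_train, by = "row_id")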

Fitting the GLMM on the training set gives:

Generalized linear mixed model fit by maximum likelihood (Adaptive Gauss-Hermite Quadrature, nAGQ = 50) ['glmerMod']
 Family: binomial  ( logit )
Formula: y ~ factor(hour_div) + RainFall + SnowDepth + Fog + Thumder +      Smoke + Ice_Pellets + (Construction == "Yes") + call4road +  
    (1 | on_street_name)
   Data: df_train
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
 29871.5  29976.3 -14922.8  29845.5    23294 

Scaled residuals: 
       Min         1Q     Median         3Q        Max 
-1.4545914 -0.7893971 -0.5833948  1.1017576  2.0418863 

Random effects:
 Groups         Name        Variance  Std.Dev. 
 on_street_name (Intercept) 0.1436875 0.3790614
Number of obs: 23307, groups:  on_street_name, 1023

Fixed effects:
                              Estimate   Std. Error  z value   Pr(>|z|)    
(Intercept)               -0.352620101  0.039628157 -8.89822 < 2.22e-16 ***
factor(hour_div)Midnight  -0.189438446  0.048541798 -3.90258 9.5171e-05 ***
factor(hour_div)Morning   -0.107810768  0.033677525 -3.20127 0.00136824 ** 
factor(hour_div)Night      0.012758000  0.039408976  0.32373 0.74613992    
RainFall                  -0.018646139  0.050830662 -0.36683 0.71374689    
SnowDepth                  0.066616718  0.013212039  5.04212 4.6040e-07 ***
FogTRUE                    0.121616365  0.032064870  3.79282 0.00014894 ***
ThumderTRUE               -0.010666984  0.038142516 -0.27966 0.77973740    
SmokeTRUE                  0.073333952  0.031371865  2.33757 0.01940953 *  
Ice_PelletsTRUE           -0.283226486  0.099584507 -2.84408 0.00445396 ** 
Construction == "Yes"TRUE -0.185226448  0.092260275 -2.00765 0.04468037 *  
call4roadTRUE              0.002981955  0.029691397  0.10043 0.92000167    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
              (Intr) fctr(hr_dv)Md fctr(hr_dv)Mr fc(_)N RanFll SnwDpt FgTRUE ThTRUE SmTRUE I_PTRU C=="Y"
fctr(hr_dv)Md -0.237                                                                                    
fctr(hr_dv)Mr -0.356  0.286                                                                             
fctr(hr_d)N   -0.297  0.247         0.351                                                               
RainFall       0.020 -0.009         0.004        -0.003                                                 
SnowDepth     -0.146  0.012         0.003         0.009 -0.048                                          
FogTRUE       -0.279 -0.003        -0.005        -0.003 -0.327  0.072                                   
ThumderTRUE   -0.053  0.011         0.006        -0.002 -0.189  0.053 -0.159                            
SmokeTRUE     -0.127  0.003        -0.002        -0.001  0.028 -0.093 -0.260 -0.030                     
Ic_PlltTRUE    0.026 -0.007        -0.004        -0.009 -0.051 -0.380 -0.070  0.008 -0.068              
C=="Ys"TRUE   -0.094  0.007         0.011         0.017 -0.005  0.016 -0.008 -0.011  0.002 -0.001       
call4rdTRUE   -0.461 -0.004        -0.001        -0.003 -0.017  0.102 -0.006 -0.046 -0.007  0.014 -0.013

     Note from the result that, after removing the speed variable, I now obtain many more significant coefficients than in the previous model. Among the variables, the hour of day and weather conditions such as SnowDepth and Fog have the most significant coefficients, and the coefficient for whether construction is ongoing on the street is also statistically significant. This corresponds with my assumption that traffic accidents caused by non-driver factors stem from bad weather conditions and roadworks. The 311 service calls related to road conditions, however, do not seem to be related to the accidents, even though such calls are made before the accidents happen.

     Next, I'd like to know whether it is legitimate to remove the variables with non-significant coefficients. A Wald test is used to perform the hypothesis testing.

Wald's Test

     The null hypothesis is that there is no difference between the current model and a simplified version. \(\beta_5\), \(\beta_8\) and \(\beta_{12}\) denote the coefficients for RainFall, Thunder and the 311 service call, respectively.

\(
H_0:\ \beta_5=\beta_8=\beta_{12}=0
\)
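The statistic computed in the code below is the usual Wald form, which follows a chi-square distribution with 3 degrees of freedom under the null hypothesis:

\(
W = (L\hat{\beta})^\top (L\hat{V}L^\top)^{-1} (L\hat{\beta}) \sim \chi^2_3\ under\ H_0
\)

where \(L\) is the \(3\times12\) matrix selecting the three coefficients and \(\hat{V}\) is the estimated covariance matrix of the fixed effects.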
# Wald test of H0: the coefficients for RainFall, Thunder and the 311 call are all 0
summary_mod2 <- summary(mod.2)

# contrast matrix L picks out the 5th, 8th and 12th fixed-effect coefficients
L <- matrix(0, nrow = 3, ncol = 12)
L[1,5] <- L[2,8] <- L[3,12] <- 1

# Wald statistic (L*beta)' (L V L')^{-1} (L*beta), chi-square with 3 df under H0
Fstar <- t(L %*% summary_mod2$coefficients[,1]) %*% 
  solve(L %*% summary_mod2$vcov %*% t(L)) %*%
  (L %*% summary_mod2$coefficients[,1])

pchisq(as.numeric(Fstar), df = 3, lower.tail = FALSE)

          [,1]
[1,] 0.9661755

     The returned value is the p-value of the Wald test. Since the p-value is far greater than 0.05, I fail to reject the null hypothesis that these coefficients are all equal to 0, i.e., it is legitimate to remove RainFall, Thunder and the 311 service call from the model simultaneously.
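As a complementary check (not part of the original analysis), a likelihood-ratio test between the nested fits gives the same kind of comparison; assuming the reduced refit is stored as mod_reduced, lme4 supports this directly:

# likelihood-ratio test: reduced model (without RainFall, Thunder, call4road) vs. full model
anova(mod_reduced, mod.2)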

Final Model

     In my final model, I keep the remaining variables from the last step and am ready to test the model on the testing dataset.

\(
\begin{aligned}
logit(E[y_{ij}|u_i]) = \beta_0 &+ u_i + \beta_{1}*1\{Time_{ij} = Morning\} + \beta_{2}*1\{Time_{ij} = Afternoon\} + \beta_{3}*1\{Time_{ij} = Night\} \\
&+ \beta_{4}*1\{Time_{ij} = Midnight\} + \beta_5*Snowdepth_{ij} + \beta_6*1\{Weather\ Type_{ij}=Fog\}\\
&+ \beta_{7}*1\{Weather\ Type_{ij}=Smoke\} + \beta_{8}*1\{Weather\ Type_{ij}=Ice\ Pellet\}\\
&+ \beta_{9}*1\{Roadwork_{ij}=True\}
\end{aligned}
\)
and obtain

Generalized linear mixed model fit by maximum likelihood (Adaptive Gauss-Hermite
  Quadrature, nAGQ = 50) [glmerMod]
 Family: binomial  ( logit )
Formula: y ~ factor(hour_div) + SnowDepth + Fog + Smoke + Ice_Pellets +  
    (Construction == "Yes") + (1 | on_street_name)
   Data: df_train
Control: glmerControl(optimizer = "bobyqa")

     AIC      BIC   logLik deviance df.resid 
 29865.8  29946.4 -14922.9  29845.8    23297 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.4560 -0.7903 -0.5828  1.1011  2.0478 

Random effects:
 Groups         Name        Variance Std.Dev.
 on_street_name (Intercept) 0.1435   0.3788  
Number of obs: 23307, groups:  on_street_name, 1023

Fixed effects:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)               -0.35163    0.03504 -10.035  < 2e-16 ***
factor(hour_div)Midnight  -0.18943    0.04854  -3.903 9.51e-05 ***
factor(hour_div)Morning   -0.10768    0.03368  -3.198  0.00139 ** 
factor(hour_div)Night      0.01269    0.03941   0.322  0.74738    
SnowDepth                  0.06649    0.01311   5.070 3.97e-07 ***
FogTRUE                    0.11527    0.02943   3.916 8.98e-05 ***
SmokeTRUE                  0.07339    0.03135   2.341  0.01923 *  
Ice_PelletsTRUE           -0.28526    0.09949  -2.867  0.00414 ** 
Construction == "Yes"TRUE -0.18569    0.09223  -2.013  0.04409 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Correlation of Fixed Effects:
              (Intr) fctr(hr_dv)Md fctr(hr_dv)Mr fc(_)N SnwDpt FgTRUE SmTRUE I_PTRU
fctr(hr_dv)Md -0.269                                                               
fctr(hr_dv)Mr -0.403  0.286                                                        
fctr(hr_d)N   -0.337  0.247         0.351                                          
SnowDepth     -0.108  0.012         0.003         0.010                            
FogTRUE       -0.363 -0.004        -0.002        -0.005  0.075                     
SmokeTRUE     -0.149  0.004        -0.001        -0.001 -0.090 -0.280              
Ic_PlltTRUE    0.037 -0.008        -0.004        -0.009 -0.388 -0.094 -0.067       
C=="Ys"TRUE   -0.114  0.007         0.011         0.017  0.018 -0.014  0.002 -0.001

     Now that all the coefficients are significant, we can move on to evaluating the predictive quality of the GLMM.

Prediction Results

Prediction on Testing Set

     First we plot the response against the link (the linear predictor) on the training set in order to choose a decision threshold. The red points are those with \(y_{ij}=1\) and the black points are those with \(y_{ij}=0\). Some red and black points mix together, but there is still a noticeable trend: when the response is above 0.47, red points are in the majority.

Both link and response scores put cases in the same order
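A sketch of how this plot might be produced, assuming the final fit is stored as mod_final:

# fitted probabilities (response) against the linear predictor (link) on the training set
p_link <- predict(mod_final, type = "link")
p_resp <- predict(mod_final, type = "response")
plot(p_link, p_resp,
     col = ifelse(df_train$y == 1, "red", "black"),
     xlab = "link (log-odds)", ylab = "response (predicted probability)")
abline(h = 0.47, lty = 2)   # candidate decision threshold on the response scale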

     Another way to evaluate the binary classifier is the ROC curve. Note that the AUC (area under the curve) is 0.6268.

ROC Curve
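The ROC curve and AUC can be computed with the pROC package; a sketch under the same assumed names:

# ROC curve of the fitted probabilities against the observed outcomes
library(pROC)
roc_obj <- roc(df_train$y, predict(mod_final, type = "response"))
plot(roc_obj)
auc(roc_obj)   # reported as 0.6268 above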

     As mentioned before, we take a classification threshold of 0.47 as the decision criterion. Using this threshold we can evaluate the predictions on the testing set.

# prediction on df_test: flag an accident when the predicted probability > 0.47
y_pred = predict(mod.2, newdata = df_test, type = "response") > 0.47 # decision criteria
(tab_eval = table(y_pred, df_test$y))
##
## y_pred FALSE TRUE
## FALSE 1371 731
## TRUE 183 204

# Type I error rate: fraction of actual negatives predicted as positives
(alpha = tab_eval[2]/(tab_eval[1] + tab_eval[2]))
## [1] 0.1177606
# power (sensitivity): fraction of actual positives predicted as positives
(power = tab_eval[4]/(tab_eval[3] + tab_eval[4]))
## [1] 0.2181818

     Note from the table that when the Type I error rate \(\alpha\) is set to around 10%, the power of the model is about 22%; that is, out of every 5 car accidents caused by non-driver factors, the model successfully predicts about one.

Discussion

     There are both important findings and things to improve in my model. One of the most important things I learned is that, contrary to what I expected, speed is not a significant factor in the model. It should be emphasized that the speed data I use is not the speed of the vehicle involved in the accident but the average speed on the street at the time of the accident; it is very likely that the vehicles involved in accidents are outliers relative to the street's average speed.

     My model tries to predict the probability that vehicles will be involved in a traffic accident caused mainly by non-driver factors, and the final model performs largely as I expected: bad weather and poor road conditions increase the probability of an accident, and the time of day also plays a role.

     The best way to improve the model is to gather more non-driver data related to the accidents, or information about the drivers, which would help distinguish accidents caused by non-driver factors from those caused directly by driver misbehavior. In reality, most accidents are caused by driver-related issues, so the model's predictive performance has a natural limitation.