Age and Crash

The number of traffic accidents versus the age of the driver

Dataset Overview

The two datasets used in this byte (the 2015 Accident dataset and the 2015 Involved Person dataset) were found via Fusion Table search and come from the FARS (Fatality Analysis Reporting System) Encyclopedia maintained by the National Highway Traffic Safety Administration (NHTSA). Accessed on Feb 6, 2017, 15:05 EST.

  • The Accident dataset contains detailed information about crash characteristics and environmental conditions at the time of the crash. There is one record per accident.

    Figure 1: Distribution of traffic accidents in the United States in 2015

  • The Involved Person dataset contains information describing all persons involved in the crash, including motorists (i.e., drivers and passengers of in-transport motor vehicles) and non-motorists (e.g., pedestrians and pedalcyclists). It provides information such as age, sex, vehicle occupant restraint use, and injury severity. There is one record per person, so there may be multiple records for each accident.

    Figure 2: Number of drivers by age

    Figure 3: Number of drunk drivers by age

Based on information from these two datasets, we can observe the correlation between the occurrence of traffic accidents and drivers' age, and form some basic hypotheses about its causes.

From Figure 2 we can see a sharp rise in the number of accidents between ages 14 and 21 (the number of drivers is used here as a proxy for the number of accidents). Most of the rise between ages 14 and 18 can be attributed to drivers becoming eligible to acquire a driver's license. From age 20 onward, however, it is a different story.

Figure 3 shows that, at any given age, roughly 20% of accidents involve a drunk driver. The effect of alcohol is even more pronounced for young drivers between ages 19 and 21: the ratios between the increase in drunk drivers and the growth in accidents are about 0.41, 0.83, and 1.33 at ages 19, 20, and 21, respectively. Not all of the increase in drunk drivers translates directly into additional accidents (the two groups may overlap), but we can still infer that drinking is one of the main contributors to the growth in accidents. A likely reason is that the legal drinking age in most of the United States, set by the National Minimum Drinking Age Act of 1984, is 21, which would explain why the increase is most significant at age 21. Alcohol also has a notable impact at ages 19 and 20, perhaps because some states allow a lower minimum drinking age, and some bars or restaurants may not check the exact age of customers who look old enough.
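The ratios quoted above compare, for each age, the year-over-year increase in drunk drivers with the increase in all drivers. A minimal sketch of that calculation follows. The counts are made-up placeholders, not values from the FARS data, and `increase_ratios` is a hypothetical helper name.

```python
def increase_ratios(drivers_by_age, drunk_by_age):
    """For each age a, return (drunk[a] - drunk[a-1]) / (drivers[a] - drivers[a-1]).

    Ages whose predecessor is missing, or where the driver count did not
    change, are skipped because the ratio is undefined there.
    """
    ratios = {}
    for age in sorted(drivers_by_age):
        prev = age - 1
        if prev not in drivers_by_age or prev not in drunk_by_age or age not in drunk_by_age:
            continue
        growth = drivers_by_age[age] - drivers_by_age[prev]
        if growth == 0:
            continue
        ratios[age] = (drunk_by_age[age] - drunk_by_age[prev]) / growth
    return ratios


# Placeholder counts for illustration only (NOT taken from the dataset).
drivers_by_age = {18: 1000, 19: 1100, 20: 1150, 21: 1180}
drunk_by_age = {18: 150, 19: 190, 20: 230, 21: 270}

for age, ratio in increase_ratios(drivers_by_age, drunk_by_age).items():
    print(age, round(ratio, 2))
```

A ratio above 1 means the number of drunk drivers grew by more than the total number of drivers at that age, so the number of sober drivers actually fell.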

As a person grows older, mental stability and driving experience increase, so it is reasonable that the number of accidents declines with age after 21. There is also a small rise around age 50; a midlife crisis might be pulling the strings.

The above analysis is based on the available datasets, and the relationship between these factors and accidents remains speculative. More data and more detailed analysis are needed to establish causal effects. For example, with dates of birth instead of ages in whole years, we could estimate the causal effect of turning 21 on the occurrence of accidents using a regression discontinuity design.
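To illustrate the idea, here is a rough sketch of a sharp regression discontinuity estimate: fit one straight line to observations just below the cutoff, another to observations just above it, and take the gap between the two fitted values at the cutoff. The data are synthetic (a linear trend with a built-in jump of 5 at age 21), and the function names, bandwidth, and outcome variable are all assumptions for illustration. A real analysis would need exact ages, a justified bandwidth, and standard errors.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rd_estimate(ages, outcomes, cutoff=21.0, bandwidth=2.0):
    """Estimate the jump in the outcome at `cutoff`.

    Ages are centered on the cutoff, so each fitted intercept is the
    predicted outcome exactly at the cutoff from that side.
    """
    left = [(a - cutoff, y) for a, y in zip(ages, outcomes)
            if cutoff - bandwidth <= a < cutoff]
    right = [(a - cutoff, y) for a, y in zip(ages, outcomes)
             if cutoff <= a <= cutoff + bandwidth]
    if len(left) < 2 or len(right) < 2:
        raise ValueError("need at least two observations on each side of the cutoff")
    left_intercept, _ = fit_line(*zip(*left))
    right_intercept, _ = fit_line(*zip(*right))
    return right_intercept - left_intercept


# Synthetic data: monthly ages from 19 to 23, linear trend plus a jump of 5 at 21.
ages = [19 + i / 12 for i in range(49)]
outcomes = [10 + 2 * (a - 21) + (5 if a >= 21 else 0) for a in ages]

print(round(rd_estimate(ages, outcomes), 3))
```

With ages recorded only in whole years, each side of the cutoff would contain just a handful of distinct values, which is why the date of birth matters for this method.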

Data Quality Analysis

This dataset contains all reported fatal traffic accidents in 2015 (data for past years are available, but would be too large for this web application to load efficiently without any subsampling), along with detailed information about the individuals involved. There are 80,588 involved person records and 32,167 accident records. The overall completeness of the dataset is very satisfactory. However, some information is still missing:
  • The Age field is missing for 2.0% of all drivers. Strictly speaking, the value is not missing but recorded as Unreported or Unknown. There are many possible reasons: for example, the person involved did not report it, or died in the accident without any ID, so the age could not be recorded in time. The same is true of some other fields, but none of them are relevant to the question in this byte. The proportion of such records is very small, and after carefully examining the raw data I found that the missing ages are distributed quite randomly, so it is safe to simply remove these records.

    Figure 4: Valid/Invalid Age Ratio for Drivers



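    import pandas as pd

    persons = pd.read_csv("../data/person.csv")
    # PER_TYP == 1 marks a driver; copy so the new column is added to an independent frame
    drivers = persons[persons['PER_TYP'] == 1].copy()
    # Ages above 120 are the codes for Unknown/Unreported (998, 999)
    drivers['AGE_TYP'] = pd.cut(drivers['AGE'], bins=[-1, 120, 999],
                                labels=['Valid Age', 'Unknown/Unreported'])
    age_counts = drivers['AGE_TYP'].value_counts()

    # Pie chart of valid vs. unknown/unreported ages (Figure 4)
    age_counts.plot.pie(colors=['yellowgreen', 'r'], figsize=(6, 6),
                        fontsize=15, shadow=True, autopct='%1.1f%%',
                        title='Age Validity', startangle=45)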
    Table 1: Datalab Notebook Log
  • The date of birth (DOB) of the persons involved in the accident. This is not collected because it is not needed for most accident-recording purposes. But the exact date of birth is important for analyzing the relationship between accidents and age, because the effects of some factors can only be evaluated precisely with exact ages. For example, the legal driving age and the legal drinking age, which switch on cleanly at a cutoff age, have a significant impact on the relationship between age and accidents, but without this data we can hardly make any causal statements about why the relationship changes the way it does.
I examined all fields with Python and the Google Fusion Tables API. Most fields relevant to this byte are categorical (nominal) variables, and all fields adhere strictly to the specifications in the official guidebook.
Also, as the overview section above shows, all data points in the age-accident figure (Figure 2) and the age-drunk figure (Figure 3) are consistent with the overall trend. There are no notable outliers except for records with Age equal to 998 or 999, which mean Unreported or Unknown in the data.
In terms of data quality, the datasets match most of my expectations. It would be better if they had a DOB field instead of age.
Lastly, as discussed at the end of the Dataset Overview section, the age-accident distribution makes perfect sense. But examining whether there is a causal effect between the two requires more data about the persons involved.
The data are officially collected by NHTSA; all traffic accidents that happen in the United States are reported to it. NHTSA uses the National Automotive Sampling System (NASS) to collect data.
NASS is composed of two systems: the Crashworthiness Data System (CDS) and the General Estimates System (GES). CDS data focus on passenger vehicle crashes and are used to investigate injury mechanisms in order to identify potential improvements in vehicle design. GES data focus on the bigger overall crash picture and are used for problem size assessments and tracking trends. Data for CDS are collected randomly, and data for GES come from a nationally representative sample of police-reported motor vehicle crashes of all types, from minor to fatal. However, because NHTSA does not clearly state the criteria for being representative, we cannot be sure whether this is a source of bias.
To further check the correctness of this data, I found another source of traffic accident data on the IIHS HLDI (Insurance Institute for Highway Safety, Highway Loss Data Institute) website. The data provided by HLDI are collected via post-crash car insurance reports. Comparing statistics from the two sources supports the correctness of the NHTSA data. Below is one of the features I compared: the number of drivers who died in the accident or en route to the hospital per 100,000 people. Figure 5(1) shows this number for each age. Because HLDI only provides this number by age group (Figure 5(3)), I used Datalab to group the ages and obtained Figure 5(2). The figures show that the two datasets are closely consistent with each other.

(1) Number of drivers who died, by age (NHTSA data)

(2) Number of drivers who died, by age group (NHTSA data)

(3) Number of drivers who died, by age group (HLDI data)

Figure 5: Number of drivers who died in the accident or en route to the hospital per 100,000 people

import pandas as pd

persons = pd.read_csv("../data/person.csv")
# PER_TYP == 1: driver; INJ_SEV == 4: fatally injured
drivers = persons[(persons['PER_TYP'] == 1) & (persons['INJ_SEV'] == 4)].copy()
# Group ages into brackets for comparison with the HLDI figures;
# the codes 998/999 fall outside the bins and become NaN
drivers['AGE_GRP'] = pd.cut(drivers['AGE'],
            bins=[-1, 12, 15, 19, 24, 29, 34, 39, 44, 49, 54, 59, 64, 69, 74, 79, 84, 120],
            labels=['<13', '13-15', '16-19', '20-24', '25-29', '30-34', '35-39', '40-44',
                    '45-49', '50-54', '55-59', '60-64', '65-69', '70-74', '75-79', '80-84', '85+'])

# sort=False keeps the age groups in bin order rather than ordering by count
age_counts = drivers['AGE_GRP'].value_counts(sort=False)

age_counts.plot.bar(fontsize=15, alpha=0.8)

Table 2: Code used to aggregate records into age groups
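The code above counts fatally injured drivers per age group. To express those counts per 100,000 people, as in Figure 5, each count has to be divided by the population of its age group. The sketch below shows that step only. The population source is not stated in this byte, so the numbers are placeholders and `deaths_per_100k` is a hypothetical helper name.

```python
def deaths_per_100k(deaths_by_group, population_by_group):
    """Return {group: deaths per 100,000 people} for groups present in both inputs."""
    rates = {}
    for group, deaths in deaths_by_group.items():
        population = population_by_group.get(group)
        if population:  # skip groups with missing or zero population
            rates[group] = deaths * 100_000 / population
    return rates


# Placeholder numbers for illustration only (NOT real counts or census figures).
deaths_by_group = {'16-19': 50, '20-24': 120}
population_by_group = {'16-19': 1_000_000, '20-24': 1_500_000}

print(deaths_per_100k(deaths_by_group, population_by_group))
```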

Source and description for the datasets in this byte are discussed at the beginning of the Dataset Overview section. The following focuses on preprocessing and accessibility of the data, as well as privacy.
Preprocessing: The datasets themselves are of high quality. All I have done is:
  1. Removing the columns that are not needed for byte 2 in both datasets, in order to examine the raw data more clearly.
  2. Filtering out non-driver records in the Person dataset, because I want to focus on the direct relationship between the driver's age and traffic accidents; the other poor souls do not really matter in this case.
  3. Removing rows with Age being Unreported or Unknown (explained in the completeness section).
  4. Joining the two tables by the accident case ID ST_CASE for further use.
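The four steps above can be sketched in pandas roughly as follows. This is an illustration rather than the exact notebook code: the two small tables are stand-ins for the real Person and Accident files, their values are invented, and the columns kept in step 1 are an assumption based on the fields mentioned in this byte.

```python
import pandas as pd

# Tiny stand-in tables; the real inputs are the FARS Person and Accident files.
persons = pd.DataFrame({
    'ST_CASE': [1, 1, 2, 3],
    'PER_TYP': [1, 2, 1, 1],      # 1 = driver (as in the snippets above)
    'AGE': [25, 30, 999, 40],     # 998/999 = Unreported/Unknown
    'SEX': [1, 2, 1, 2],
})
accidents = pd.DataFrame({
    'ST_CASE': [1, 2, 3],
    'LATITUDE': [40.4, 41.0, 39.9],
    'LONGITUD': [-79.9, -80.1, -75.2],
    'WEATHER': [1, 2, 1],
})

# 1. Keep only the columns needed for the analysis.
persons = persons[['ST_CASE', 'PER_TYP', 'AGE']]
accidents = accidents[['ST_CASE', 'LATITUDE', 'LONGITUD']]

# 2. Keep driver records only.
drivers = persons[persons['PER_TYP'] == 1]

# 3. Drop rows whose age is coded as Unreported or Unknown.
drivers = drivers[~drivers['AGE'].isin([998, 999])]

# 4. Join each driver to their accident on the case ID.
joined = drivers.merge(accidents, on='ST_CASE', how='inner')
print(joined)
```

An inner join is used here so that only drivers with a matching accident record survive; whether the original notebook used an inner or a left join is not stated.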
Accessibility: The datasets are all publicly accessible here.
Privacy: No personally identifiable information is included in the cleaned datasets. All fields in the Person dataset are directly relevant to the accident except for AGE, the location fields LATITUDE and LONGITUD, and the injury severity INJ_SEV. Combined with the time of the accident, these fields could make it possible to identify a person. To be safe, I kept only the minimum number of fields needed and removed the fields representing the month and day of the accident, as well as the injury severity of the driver. With only AGE, LATITUDE, and LONGITUD remaining, it should no longer be possible to identify the driver.