How to use Python and Pandas to join historical weather data to road accident data

I am currently investigating the relationship between weather conditions and accidents by using a dataset of road accidents in Fairfax County, Virginia obtained from the Virginia Department of Motor Vehicles.

The first step to perform the analysis is to combine the accident data with the historical weather conditions. In this article, we will discuss how we take the raw accident data and integrate that data with historical weather data.

As we have over 3000 rows of accident data, we are going to automate the combination of the accident data with the weather data to produce a final dataset that we can use for continued analysis. To do this, we will write a Python script that uses Pandas as a way to easily load and process the accident data and combine it with the historical weather data.

The weather data will be sourced from the Visual Crossing Weather API that makes it simple to load accurate weather history observations for the time and locations specified in the accident reports.

Source Accident Data

The source accident data we will be using covers the first half of 2020 in Fairfax County, Virginia, USA. The data includes the location of the accident expressed as a latitude and longitude, and the date and time of the accident. In addition, the data includes various attributes about the accident such as the amount of damage and whether or not there were any injuries:

| Document Number | Crash Date | Crash Type | GPS Latitude | GPS Longitude | # of Fatalities | # of Injuries |
| --- | --- | --- | --- | --- | --- | --- |
| 200175064 | 1/16/2020 0:11 | Property Damage | 38.948 | -77.30635 | 0 | 0 |
| 200695123 | 3/4/2020 7:57 | Property Damage | 38.95545 | -77.40207 | 0 | 0 |

Historical weather data lookup

To look up the weather observations for each row of accident data, we need the location, which we can obtain using the latitude and longitude columns, and the date and time. The Weather API provides sub-hourly temporal resolution from many weather stations, so we can look up the weather conditions very accurately.

In addition to the weather, we will also request the sunrise and sunset times, so we can determine whether the state of the sun (daylight or darkness) has any relationship to accidents.

The following represents the weather API query to look up the weather for our first row of accident data above:

https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/weatherdata/history?&contentType=json&aggregateMinutes=15&unitGroup=us&includeAstronomy=true&collectStationContributions=true&key=API_KEY&startDateTime=2020-01-16T00:11:00&endDateTime=2020-01-16T00:11:00&locations=38.94800,-77.30635

We specify the location using the locations parameter: &locations=38.94800,-77.30635

The date and time using the start and end date time parameters: &startDateTime=2020-01-16T00:11:00&endDateTime=2020-01-16T00:11:00

We then include a set of parameters to request data at the 15 minute resolution and to include astronomy details (sunrise and sunset in this case): &aggregateMinutes=15&unitGroup=us&includeAstronomy=true

For more information on the parameters above, see the Weather API Documentation.
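Rather than concatenating the query string by hand, the same request URL can be assembled with the standard library. This is a sketch using `urllib.parse.urlencode` with the parameters described above; `API_KEY` is a placeholder for your own key:

```python
import urllib.parse

# History endpoint from the example query above.
base = ('https://weather.visualcrossing.com/VisualCrossingWebServices'
        '/rest/services/weatherdata/history')

# Parameters for the first accident row; API_KEY is a placeholder.
params = {
    'contentType': 'json',
    'aggregateMinutes': 15,
    'unitGroup': 'us',
    'includeAstronomy': 'true',
    'collectStationContributions': 'true',
    'key': 'API_KEY',
    'startDateTime': '2020-01-16T00:11:00',
    'endDateTime': '2020-01-16T00:11:00',
    'locations': '38.94800,-77.30635',
}

# urlencode handles escaping (e.g. the comma in the locations value).
url = base + '?' + urllib.parse.urlencode(params)
print(url)
```

Note that `urlencode` percent-escapes the comma in the `locations` value; the API accepts either form.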

The result of the weather API request is a simple JSON structure that includes the requested weather data.
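For orientation, here is a heavily trimmed sketch of the response shape, written as a Python dict. The field names match those used by the extraction code later in this article; the values shown are illustrative, not a real API response:

```python
# Trimmed, illustrative shape of the history response (values are made up).
weatherData = {
    "locations": {
        # Keyed by the location string we requested.
        "38.94800,-77.30635": {
            "values": [
                {
                    "temp": 43.4,        # temperature, deg F (unitGroup=us)
                    "precip": 0.0,       # precipitation, inches
                    "wspd": 2.8,         # wind speed, mph
                    "sunrise": "2020-01-16T07:24:00-05:00",
                    "sunset": "2020-01-16T17:12:00-05:00",
                    "stationinfo": "EW4235 Fairfax VA US (Id=E4235,2.2mi)",
                }
            ]
        }
    }
}
print(weatherData["locations"]["38.94800,-77.30635"]["values"][0]["temp"])  # 43.4
```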

We can use this data to select the weather data values that we are interested in investigating.

Using Python and Pandas to request weather data in bulk

As we have many rows of accident data, rather than manually look up the data for each row, we will automate the data retrieval. This will also position the next phase of research to analyze whether any weather or astronomical conditions correlate with accidents.

We will use a Python script to perform the automation. Within the script, we will use a Pandas DataFrame to load and manipulate the data. This will help position the script to both store the joined accident-weather dataset and also be ready for more sophisticated analysis of the data.

The Python script

We will use the following libraries to help simplify the Python script:

import pandas as pd
import math
import datetime
import urllib.request
import json

Pandas will be used for the DataFrame and the math and datetime libraries will be used for some simple error checking, parsing and transformation. The urllib and json libraries will be used to submit and parse the weather API requests.

We first use the pandas library to load the accident data from a CSV file into a DataFrame:

 raw_accident_data=pd.read_csv('Motor_Vehicle_Collisions_FairfaxCountyVa.csv') 

In our first test we will create a list of records that combines the crash and weather data. Here is the empty output structure:

records=[]
labels = ['Crash Date', 'GPS Latitude', 'GPS Longitude','Document Number','Crash Type','temp','precip','wspd','sunrise','sunset','stationinfo'] 

Now that we have the input data loaded and our output structure defined, we can iterate through each row of the data to load the weather data:

start_date = datetime.datetime(2020, 1, 1)  # earliest crash date we accept

for i, row in raw_accident_data.iterrows():

    latitude=row['GPS Latitude']
    longitude=row['GPS Longitude']
    documentNumber=row['Document Number']
    crashType=row['Crash Type']
    datetimeStr=row['Crash Date']

    if math.isnan(latitude) or math.isnan(longitude) or (latitude==0 and longitude==0):
        print("Bad latlon {},{}".format(latitude,longitude))
        continue

    try:
        date_time=datetime.datetime.strptime(datetimeStr, '%m/%d/%Y %H:%M')
    except ValueError:
        print("Bad date format {}".format(datetimeStr))
        continue

    if date_time<start_date:
        print("Bad date {}".format(datetimeStr))
        continue

    latitude='{:.5f}'.format(latitude)
    longitude='{:.5f}'.format(longitude)

In the first part of the loop we read the core information from the Pandas DataFrame row. We then perform checks to make sure the input data is complete and useful. This includes ensuring the latitude, longitude and date time are valid, and that the date falls on or after our start date. Finally, we format the latitude and longitude to five decimal places.

Note that we parse the date time into an actual date time value so that we can then reformat it for use in the query in the next step.
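The parse-then-reformat step looks like this in isolation (the timestamp here is the first accident row from the sample data above):

```python
import datetime

# Crash timestamp as it appears in the CSV (month/day/year, 24-hour time).
raw = '1/16/2020 0:11'

# Parse into a real datetime, then reformat as ISO 8601 for the API query.
dt = datetime.datetime.strptime(raw, '%m/%d/%Y %H:%M')
print(dt.isoformat())  # 2020-01-16T00:11:00
```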

We can now submit the Weather API request for the row values we have read above:

    # weather_api_key holds your Visual Crossing API key, set earlier in the script
    weather_api_endpoint='https://weather.visualcrossing.com/VisualCrossingWebServices/rest/services/weatherdata/history'

    query_params = '&contentType=json&aggregateMinutes=15&unitGroup=us&includeAstronomy=true&collectStationContributions=true&key={}&startDateTime={}&endDateTime={}&locations={},{}'
    query_params=query_params.format(weather_api_key, date_time.isoformat(), date_time.isoformat(), latitude, longitude)

    try:
        response = urllib.request.urlopen(weather_api_endpoint+"?"+query_params)
        data = response.read()
    except Exception:
        print("Error reading from {}".format(weather_api_endpoint+"?"+query_params))
        continue

    weatherData = json.loads(data.decode('utf-8'))

    errorCode=weatherData["errorCode"] if 'errorCode' in weatherData else 0

    if errorCode>0:
        print("Error code {} returned by {}".format(errorCode, weather_api_endpoint+"?"+query_params))
        continue

We insert the row parameters into the preformatted query parameters string. Then we submit the Weather API request.

We parse the response using the json library into a variable called ‘weatherData’. We then check that no error code was returned (if one was, we skip to the next row).

We are now ready to extract the weather data we are interested in.

    locations=weatherData["locations"]
    for locationid in locations:
        location=locations[locationid]
        for value in location["values"]:
            records.append((date_time.isoformat(), latitude, longitude,
                            documentNumber, crashType,
                            value["temp"], value["precip"], value["wspd"],
                            value["sunrise"], value["sunset"],
                            value["stationinfo"]))

The JSON is a collection of locations, each of which in turn includes a collection of weather data values. In our case there is simply a single location (the latitude, longitude we requested for the crash location) and a single value (for the time of the crash).

We can therefore create a single row of output data from the input fields plus the temperature (“temp”), precipitation (“precip”), wind speed (“wspd”), the sunrise and sunset times, and the “stationinfo”.

The station info tells us exactly where the weather data was found. This will be useful for tracing the data and consistency checking.

The final part of the script in this first step is to save the data as a CSV that includes the crash and weather data.

output_df = pd.DataFrame.from_records(records, columns=labels)
output_df.to_csv('output.csv', index=False) 

We use a Pandas DataFrame again to perform this conversion and save to disk. Here’s what the output looks like:

| Crash Date | GPS Latitude | GPS Longitude | temp | precip | wspd | sunrise | sunset | stationinfo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2020-01-01T00:01:00 | 38.83146 | -77.23923 | 43.4 | | 2.8 | 2020-01-01T07:27:27-05:00 | 2020-01-01T16:57:28-05:00 | EW4235 Fairfax VA US (Id=E4235,2.2mi), EW5536 Springfield VA US (Id=E5536,2.8mi) |
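The saved file can be loaded straight back into pandas for the next phase of analysis. Here is a minimal sketch; to keep it self-contained it reads an inline sample in the same shape as output.csv rather than the file itself, and the second row's values are hypothetical:

```python
import io
import pandas as pd

# Inline stand-in for output.csv (trimmed to a few columns; second row invented).
sample = io.StringIO(
    "Crash Date,temp,precip,wspd\n"
    "2020-01-01T00:01:00,43.4,0.0,2.8\n"
    "2020-03-04T07:57:00,38.1,0.02,5.1\n"
)

# parse_dates turns the ISO timestamps back into real datetime values.
df = pd.read_csv(sample, parse_dates=['Crash Date'])
print(df['temp'].mean())
```

In the real script, `pd.read_csv('output.csv', parse_dates=['Crash Date'])` does the same job.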

Closing thoughts

We are now ready to perform analysis on the data to see how weather affects accident events.
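For example, the sunrise and sunset columns make it straightforward to flag whether a crash happened in darkness. A quick sketch, using hypothetical timestamps in the same format as the output above:

```python
import datetime

# Hypothetical crash time and astronomy values, ISO 8601 with a UTC offset
# as produced in the output table above.
crash = datetime.datetime.fromisoformat('2020-01-01T00:01:00-05:00')
sunrise = datetime.datetime.fromisoformat('2020-01-01T07:27:27-05:00')
sunset = datetime.datetime.fromisoformat('2020-01-01T16:57:28-05:00')

# Dark if the crash happened before sunrise or after sunset.
is_dark = crash < sunrise or crash > sunset
print(is_dark)  # True (00:01 is before sunrise)
```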

If you have data that you would like to combine with weather data to perform analysis and investigation, check out the Weather Data Services page where you can sign up for a free trial, view and download historical and forecast weather data and even build Weather API queries like the ones we have used in this example. You can also see how to include weather data in Power BI, Microsoft Excel, MySQL or any programming language.

If you have any questions or comments, please post them below.
