At Visual Crossing, we are always looking for interesting new datasets to use as we explore different ways to create engaging and informative maps. We recently found the excellent open data from Capital Bikeshare, the bike sharing service in the Washington, DC metro area. They have various datasets available, but the largest one looked the most fun to map: the full list of individual bike rental trips. It includes the bike pick-up and drop-off locations as well as the time and date. By taking this raw data and combining it with OpenStreetMap bike routing data, we were able to create various maps containing both interesting and potentially actionable results based on how bikes travel around Washington, DC.
Finding the data
Often the hardest part of making an interesting map is finding interesting data. So, we were pleased to find this complete Trip History Data from the Capital Bikeshare. You can get your own copy here. The data includes the following information for every bike rental trip:
Duration – Duration of trip
Start Date – Includes start date and time
End Date – Includes end date and time
Start Station – Includes starting station name and number
End Station – Includes ending station name and number
Bike Number – Includes ID number of bike used for the trip
Member Type – Indicates whether user was a "registered" member (Annual Member, 30-Day Member or Day Key Member) or a "casual" rider (Single Trip, 24-Hour Pass, 3-Day Pass or 5-Day Pass)
We then added the longitude and latitude for each station to make it easy to perform calculations. We used the station data found on the DC.gov web site here. To merge these two datasets we put the data into a database and used SQL to make the join. The resultant dataset structure looked like this:
"ID","Duration","Start date","End date","Start station number","Start station","End station number","End station","Bike number","Member type","Start Latitude","Start Longitude","End Longitude","End Latitude"
1,679,"2018-05-01 00:00:00","2018-05-01 00:11:19",31302,"Wisconsin Ave & Newark St NW",31307,"3000 Connecticut Ave NW / National Zoo","W22771","Member",38.93,-77.07,-77.05,38.93
Note that these datasets are fairly large. For example, the month of May 2018 alone contains 375,000 rows.
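As a rough illustration of the join described above, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and the join_trips_with_stations function are our own assumptions for the example, not the schema we actually used, and the trip columns are trimmed down to keep it short.

```python
import sqlite3


def join_trips_with_stations(trips, stations):
    """Attach start and end coordinates to each trip.

    trips:    list of (id, duration, start_station_number, end_station_number)
    stations: list of (station_number, latitude, longitude)
    Table and column names are hypothetical, chosen for this sketch only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trips (id INTEGER, duration INTEGER, "
        "start_station INTEGER, end_station INTEGER)"
    )
    conn.execute(
        "CREATE TABLE stations (number INTEGER PRIMARY KEY, lat REAL, lon REAL)"
    )
    conn.executemany("INSERT INTO trips VALUES (?, ?, ?, ?)", trips)
    conn.executemany("INSERT INTO stations VALUES (?, ?, ?)", stations)

    # Join the stations table twice: once for the start, once for the end.
    rows = conn.execute(
        """
        SELECT t.id, t.duration, s.lat, s.lon, e.lat, e.lon
        FROM trips AS t
        JOIN stations AS s ON s.number = t.start_station
        JOIN stations AS e ON e.number = t.end_station
        ORDER BY t.id
        """
    ).fetchall()
    conn.close()
    return rows
```

Note that an inner join like this silently drops any trip whose station number is missing from the station list, so it is worth comparing row counts before and after the merge.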
Adding the predicted route
We decided to focus on analyses based around the predicted route between the pick-up and drop-off locations. Since the data gave us the pick-up and drop-off times, we could make an educated guess whether people took a direct route or a slow, circuitous route. The actual route taken is not available directly in the Capital Bikeshare data (understandably!) but by comparing expected time vs actual time, we could estimate whether the rider was trying to get to their destination as quickly as possible (maybe as a commuter) or taking a more leisurely route (such as a tourist).
We tried various approaches to calculate the routes. Given the size of the dataset, we found that using our own in-house software to calculate the routes from the OpenStreetMap dataset was the best and fastest solution for such bulk routing. You could also consider commercial routing options such as Google, but for this project they were too slow to give us the quick results that we needed. In some cases we used cycling routing and in other cases walking-based routing, because the bike route sometimes omitted likely shortcuts and therefore implied a high-speed bike ride that was nearly impossible. In addition to the route itself, calculating the route gave us the expected speed and distance, which we would use in our analysis.
Exploring the data
Our next step was to start exploring the data by plotting the information on a map. Using an animation, we could quickly see a rather entertaining flow of bikes riding throughout the day across the city.
Using expected speed to predict tourists vs commuters
One thing that we identified was that there were clear patterns based on the time of day. During the morning and evening, particular routes appeared busy around some of the commuter hubs such as Union Station. During the later morning and afternoon, the tourist locations appeared busier. To drill down on this observation further, we plotted the top drop-off locations on the map and let them grow in size based on the number of riders who dropped off at that location.
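The drop-off counts behind a map like this can be sketched in a few lines of Python. This is an illustrative version only: the function names and the (end date, end station) input shape are assumptions for the example, with the timestamp format taken from the sample row shown earlier.

```python
from collections import Counter
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # format seen in the trip data


def dropoffs_by_hour(trips):
    """Count drop-offs per (hour of day, end station).

    trips: iterable of (end_date, end_station) pairs.
    """
    counts = Counter()
    for end_date, end_station in trips:
        hour = datetime.strptime(end_date, TIMESTAMP_FORMAT).hour
        counts[(hour, end_station)] += 1
    return counts


def top_dropoffs(counts, hour, n=3):
    """Return the n busiest drop-off stations for one hour of the day."""
    at_hour = Counter(
        {station: c for (h, station), c in counts.items() if h == hour}
    )
    return at_hour.most_common(n)
```

The count for each station and hour can then drive the size of the marker on the map.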
This helped us understand which sites were the most popular stops and when they were most busy. However, we also noticed that the data seemed to show two classes of rider - slow riders and fast riders. Did the apparent speed of the rider predict anything about the rider themselves?
To test this hypothesis, we split the drop-off location counts into two categories: fast riders, who dropped off within a factor of two of the expected ride time from our routing calculation, and leisure riders, who took longer. The resulting animation shows a clear correlation: slower riders tend to visit tourist locations, while fast riders tend to visit centers of work or local shopping destinations. Among other things, it clearly shows the morning commute, where fast riders drop off at more employment-centric locations in the morning, while slower riders visit tourist locations later in the day.
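A minimal sketch of this split in Python might look like the following. It reads the threshold as a factor of two between actual and expected ride time. The function names, the "fast" and "leisure" labels as strings, and the expected_seconds input (which would come from the routing step) are assumptions for illustration.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # format seen in the trip data
FAST_FACTOR = 2.0  # assumed threshold: actual time within 2x the expected time


def actual_seconds(start, end):
    """Elapsed ride time in seconds between two timestamp strings."""
    delta = datetime.strptime(end, TIMESTAMP_FORMAT) - datetime.strptime(
        start, TIMESTAMP_FORMAT
    )
    return delta.total_seconds()


def classify_rider(start, end, expected_seconds, factor=FAST_FACTOR):
    """Label a trip 'fast' or 'leisure' by comparing actual vs expected time."""
    if expected_seconds <= 0:
        raise ValueError("expected_seconds must be positive")
    ratio = actual_seconds(start, end) / expected_seconds
    return "fast" if ratio <= factor else "leisure"
```

For example, the sample trip shown earlier took 679 seconds. If routing predicted 400 seconds (a made-up figure), the ratio is about 1.7 and the trip counts as fast. With a predicted 300 seconds it would count as leisure.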
Analyzing rider speed
We were also interested to see just how fast some of the riders were riding (assuming they took the route we calculated and didn't find a more direct way!). We filtered the data to find the top trip speeds across the dataset. The speeds were calculated using the predicted route length and the pick-up and drop-off times recorded in the original dataset.
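The speed calculation itself is simple: predicted route length divided by the elapsed time between pick-up and drop-off. Here is a hedged Python sketch. The function names, the choice of metres and km/h, and the tuple layout are assumptions for the example rather than a description of our actual pipeline.

```python
import heapq
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # format seen in the trip data


def trip_speed_kmh(route_length_m, start, end):
    """Average speed in km/h over the predicted route length (metres)."""
    seconds = (
        datetime.strptime(end, TIMESTAMP_FORMAT)
        - datetime.strptime(start, TIMESTAMP_FORMAT)
    ).total_seconds()
    if seconds <= 0:
        raise ValueError("trip must have a positive duration")
    return route_length_m / seconds * 3.6  # m/s to km/h


def fastest_trips(trips, n=10):
    """Return the n highest (speed_kmh, trip_id) pairs, fastest first.

    trips: iterable of (trip_id, route_length_m, start, end).
    """
    scored = (
        (trip_speed_kmh(length, start, end), trip_id)
        for trip_id, length, start, end in trips
    )
    return heapq.nlargest(n, scored)
```

Because the speed depends on the predicted route rather than the real one, an implausibly high value is often a sign that the rider found a shortcut the router did not know about.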
This resulted in the following video.
This highlights that some people achieve pretty good speeds biking around the DC metro area. As predicted, these fast riders tend to visit locations primarily of interest to local residents, such as places of employment.
There are many possibilities to further explore this data. For example, by simply observing the full, animated map you can notice many interesting trips that must have had a backstory: riders riding quickly across Washington, DC at 3am, or riders taking much, much longer to ride between locations than predicted.
When looking at the fastest trip dataset, we also noticed that many of the fastest routes are repeated every day at the same time. This probably indicates a particular rider that uses the bike share every day for their commute.
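One way to surface these repeated trips is to group them by start station, end station, and hour of day, then count how many distinct days each combination appears on. The sketch below is an illustration under assumptions: the function name, the hourly bucket, and the min_days threshold are hypothetical choices. The trip data does not identify riders, so a repeated route only suggests, rather than proves, a single commuter.

```python
from collections import defaultdict
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # format seen in the trip data


def repeated_routes(trips, min_days=3):
    """Find routes ridden in the same hour on at least min_days distinct days.

    trips: iterable of (start_date, start_station, end_station).
    Returns {(start_station, end_station, hour): number_of_distinct_days}.
    """
    days_seen = defaultdict(set)
    for start_date, start_station, end_station in trips:
        started = datetime.strptime(start_date, TIMESTAMP_FORMAT)
        key = (start_station, end_station, started.hour)
        days_seen[key].add(started.date())
    return {
        key: len(days) for key, days in days_seen.items() if len(days) >= min_days
    }
```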
Both of the above maps would be interesting to view in the context of whether the rider is a registered, long-term member or a casual rider for a short period of time. Such information would likely predict commuters and locally resident riders vs tourists and visitors. If we plotted the maps based around this Member Type information, would we see additional patterns emerge?
We could also consider combining this data with additional external data such as weather. It is easy to assume that tourists change their destination based on the weather, but how about the commuters? Are there ways to use this insight to better manage the bikes based on weather predictions? Perhaps it may even help future tourists better plan and adjust their trips to account for various weather events.
Open data such as this bike share data is a fascinating starting point for exploration. Maps, and particularly animations, are a visually interesting way to explore the data and reveal patterns that are hard to see outside of a map. Do you have questions about this dataset?