In my previous post I started looking at the NY CitiBike data, and the differences between the way men and women use the service. It’s clear men use the service more, and that women’s journeys are generally longer; however it’s not clear if this is because they simply travel slower, or because they aren’t always taking the most direct route.
This is a difficult question to answer. Whilst it is easier to know whether the journey time from A to B is longer for men or women, it is not easy to know if the journey was direct. We could simply average it, and hope for the best; given 11 million records, this would probably be good enough, but there is another way making use of the Google maps API.
The Google maps API (accessed using the excellent ggmap) can be queried for distance and cycling journey time between two points. By using this method for a subsample of journeys, we can filter those journeys which take less time than Google suggests it should take, hence we can be more sure (although not absolutely so) that these journeys were direct. We can then have a look at the differences between journey times for men and women.
Note that this turned into a bit of a code heavy post. All the source code to produce this page is available here. You’ll see that I chunked up the steps and saved them out to multiple .RData files to make compilation of the .Rmd quicker.
Step One
Aggregate all observation by the journey made, and the hour when it was made:
Step Two
Calculate all the possible journeys, and pick a subset of those.
Load the stations data:
Create a table of all the unique journeys:
This gives 105606 journeys out of a possible 1.16281 × 105; plus it is possible for a journey to start and end at the same station, but we are not interested in those, as there is no way to know where someone has been.
Now merge journeys and stns dataframes to give a table with the complete station and journey data:
Step Three
Query the Google API and get journey times and distance for a subset of journeys.
The Google API limits you to 2500 queries a day, so 2500 seems like a good subset of queries to make. Note that since we can just query the journey dataframe we created in the previous step, not the individual journeys themselves, so we will actually be looking at many more journeys.
It’s also possible to query to Google routing API to get a route for each of these journeys. We can plot this to ensure that we have a good coverage across Manhattan Island, but it can also look really nice.
Now apply it…
And plot it out superimposed over a map of NYC using ggmap.
So it looks like we have a reasonable coverage of Manhattan island. You could always query the API on multiple days and combine the output if you wanted to include more than 2500 of the 110,000 possible journeys.
Step Four
Join journeys_merge data with actual journey data from bikes_agg.
Step Five
Now that we have that data combined, we can produce some plots and start to answer the question of whether women are really riding slower than men.
First, how does all this data look when separated by gender? In the following plot the solid line is a 1:1 lined between the journey time predicted by the Google API, and the actual journey time in seconds. The dashed line is an actual regression line from the data.
The first thing we can take away from this plot is that the Google API is quite optimistic about its predictions of cycling journey time sin New York City: most journeys take longer than it predicts; although of course, it is not clear whether this is simply because people are not taking the most direct route.
The second thing that comes out is that for men, it appears that for journeys of longer than 2000 seconds during the week, actual journey times are shorter than the Google prediction. Note that this is just a guide, this model would not satisfy the usual assumptions of a linear model.
The regression lines tend to suggest that mens’ journey time on the same journeys are simply shorter than womens’. We can see this more clearly in a boxplot:
However, it would better if we could look at the average of each journey time for each journey against each other for men and women, as the plots above are not from a uniform number of journeys for each sex. We can look at the distance taken, or the speed:
Almost across the board men simply are riding faster than women for the given journeys. But I was also finding something strange with a couple of journeys (cropped out of the above plot). Both appear to be taking place at spends of greater than 50 or 100 km/h…
So that is a bit bizarre…according to Google, two journeys of over 8 km are being completed in between 200 and 600 seconds. These are the journeys:
It’s not immediately clear what’s going on here. This is probably an issue with the way I hashed the data, as it appears to affect journeys from one particular station. I’ll investigate another day.