Exploring 1.5 Billion NYC Taxi/Uber rides with LL Notebook and BigQuery


The key challenge for exploring big data is that it often feels more like an e-mail thread than an in-person conversation. If you know exactly what questions to ask, you’ll get nice, precise answers. What you don’t have is the speed, dynamism, and nuance of a real conversation. In this article, I’ll explain how I used LL Notebook and Google BigQuery to explore 1.5 billion taxi and ride share trips in NYC from 2009 to 2016.

As a data scientist and ex-New Yorker, I find the NYC Taxi dataset especially appealing. Its detail and completeness have the potential to reveal characteristics of the great city that are both interesting and actionable. Many articles have been written on the subject – including taxis at the airport, revealing the hottest nightlife spots, and Taxi vs. Uber vs. Lyft – and this certainly won’t be the last. It will, however, be the first where results aren’t presented as a static, curated end-point. You will be able to interact with the data using the same tool as I did to replicate the findings, tweak the parameters, and even ask new questions!

LL Notebook was designed to be data-size agnostic. Plotting 1.5 billion points is both computationally impractical and overwhelming for the user to perceive, which is why LL Notebook uses binned visualizations that work from aggregates. It queries the database for pre-aggregated summaries and caches them; these summaries are capable of answering questions that weren’t known a priori. This enables a fast and interactive experience that’s otherwise not possible for such a large dataset. Google BigQuery is a great database for this purpose because of its speed, scale, and cost. Computing the pre-aggregated summaries only takes BigQuery on the order of 10 seconds for 1.5 billion records, enabling LL Notebook to support an iterative analysis workflow. I’ve outlined my experience consolidating the dataset and loading it into BigQuery in part II of this series. If this is your first time seeing LL Notebook, take one minute to read these steps.
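To make the idea of working from aggregates concrete, here is a minimal sketch in Python. It is not LL Notebook's actual implementation, and the function names, the 5-minute bin width, and the handful of sample trips are all assumptions made up for illustration. It bins trips into a small table of counts once, then answers a question chosen afterwards (the duration histogram for a given pickup hour) from those counts alone, without touching the raw rows again.

```python
from collections import Counter

def bin_index(value, width):
    """Map a value to the integer index of its fixed-width bin."""
    return int(value // width)

def build_summary(trips, duration_bin_minutes=5):
    """Aggregate (pickup_hour, duration_minutes) rows into counts per 2-D bin.

    In a real setup this step would run inside the database; here it is
    plain Python so the sketch stays self-contained.
    """
    summary = Counter()
    for pickup_hour, duration_minutes in trips:
        key = (pickup_hour, bin_index(duration_minutes, duration_bin_minutes))
        summary[key] += 1
    return summary

def duration_histogram(summary, hours):
    """Duration histogram for the selected pickup hours, using only the
    cached counts (the raw trips are never re-read)."""
    hist = Counter()
    for (hour, duration_bin), count in summary.items():
        if hour in hours:
            hist[duration_bin] += count
    return dict(sorted(hist.items()))

# Made-up rows purely for illustration: (pickup hour, trip minutes).
trips = [(8, 32.0), (8, 47.5), (8, 33.1), (14, 21.0), (14, 24.9), (17, 58.2)]

summary = build_summary(trips)
morning = duration_histogram(summary, hours={8})
print(morning)
```

The size of the summary depends on the number of bins rather than the number of trips, which is why the same approach can stay interactive whether the table holds six rows or 1.5 billion.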

After doing some initial exploration of everything from tipping behavior to the effect of rainy days, I want to share with you a few of the most interesting findings.

8 years of Airport Trips – launch Notebook

Taking the taxi from Manhattan to the Airport was always a little nerve-racking. At times I got stuck on the BQE for 2 hours, and other times I got to the airport 1.5 hours early. I learned the hard way that airport trip durations are highly variable. With this dataset, I hypothesized that it’s possible to make a more informed decision about when to leave for the airport.

Making such a decision is a lot like risk management in finance. It’s prudent to pay extra attention to the tail end of the distribution as that’s where the big negative outcomes live. In finance, a risk manager may look at the 1% VaR to determine the loss that is expected to be exceeded only 1% of the time. With an airport trip, I can make a similar determination for how much time to allot if I want a 95% chance of making a flight. Or if I had an important meeting to make, perhaps I’d be more comfortable with a 99% chance.
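The time to allot for a given chance of arriving on time is simply a percentile of the trip-duration distribution. The sketch below uses the nearest-rank definition of a percentile; the `percentile` helper and the 20 sample durations are hypothetical and exist only to show the arithmetic, not to report anything from the taxi data.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest observed value such that at
    least pct percent of the observations are at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up trip durations in minutes, for illustration only.
durations = [22, 25, 27, 28, 30, 31, 33, 35, 38, 41,
             44, 47, 52, 58, 63, 71, 80, 95, 110, 120]

typical = percentile(durations, 50)    # the median trip
safe = percentile(durations, 95)       # budget for a 95% chance
very_safe = percentile(durations, 99)  # budget for a 99% chance
print(typical, safe, very_safe)
```

With a right-skewed sample like this one, the 95% budget is far above the median, which is exactly why the tail matters more than the typical trip when deciding when to leave.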

One of my favorite visualizations on the subject is Todd Schneider’s chart, showing the variation in trip duration over time.

http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/

Let’s see how this manifests in LL Notebook.

Variation in trip duration (from Union Square to LGA on a weekday) over time. The fat tail during commute hours is striking. Try other neighborhood/airport combinations.

If we fix the y-axis on trip duration, we can get a feel for what the ebb and flow of people from Union Square to JFK looks like.

And vice versa.

Let’s step back and explore the system as a whole.

We can easily see where the bulk of people travel to the airport from, and when. As anyone who has ever lived in a big city would expect, the trip duration is skewed right.
Weekday trips to LGA. Different neighborhoods have very different profiles with respect to time and trip duration.

How different are weekdays and weekends?

Median trip duration from Manhattan to LGA doesn’t look that different between weekdays and weekends. However, the tail is much fatter for the weekend.
The tail is even fatter for weekday trips to JFK.

Explore these notebooks on your own: to airports & from airports

The Taxi-Uber War for New York City – launch Notebook

Another of Todd Schneider’s analyses that piqued my interest was the rise of Uber. I wondered how for-hire vehicles (FHVs – which include ride sharing services Uber and Lyft) are changing the game, and whether they compete differently in different market segments.

FHVs overtook taxis in the second half of 2016, with Uber as the dominant FHV player.
We can see big dips for taxis at 5am & 5pm when shift changes occur. FHVs don’t experience this dip; in fact, they’re effectively picking up the slack.
Yellow Taxi is king in Manhattan, with Uber posing a challenge.
In Manhattan, Uber is doing especially well between 5pm and 5am, whereas Yellow Taxi is steadfast between 5am and 5pm.
The outer boroughs, on the other hand, are FHV territory, with Uber as the breakout leader.
In the battle for Queens, Uber’s doing well, though Lyft has been taking (and maintaining) market share.
In the battle for Brooklyn, Lyft was gaining market share until June 2016, then Uber started regaining it.

By the end of 2016, it appears the ride share companies are doing quite well in the boroughs, and are closing in on Manhattan. Though Uber is the clear FHV leader, Lyft is proving to be a formidable contender in key battlegrounds.

There’s so much more in the data. Explore for yourself and tweet us your findings at @lqdlandscape!

As you can tell from the example Notebooks, there are many more paths to explore with this dataset than is possible in one sitting. I was able to experience full empirical distributions and how they change with respect to one another. In doing so, I detected nuances in the data that led me down interesting exploratory paths. LL Notebook makes the exploration of 1.5 billion records feel fluid, inviting me to ask more questions as if I were having a real conversation with the data. If you’re not having conversations with your big data, drop us a message!


Published by

David Lin

Founder at LiquidLandscape
