Exploring 1.5 Billion NYC Taxi/Uber rides with LL Notebook and BigQuery

The key challenge for exploring big data is that it often feels more like an e-mail thread than an in-person conversation. If you know exactly what questions to ask, you’ll get nice, precise answers. What you don’t have is the speed, dynamism, and nuance of a real conversation. In this article, I’ll explain how I used LL Notebook and Google BigQuery to explore 1.5 billion taxi and ride share trips in NYC from 2009 to 2016.

As a data scientist and ex-New Yorker, I find the NYC Taxi dataset especially appealing. Its detail and completeness have the potential to reveal characteristics of the great city that are both interesting and actionable. Many articles have been written on the subject – including taxis at the airport, revealing the hottest nightlife spots, and Taxi vs. Uber vs. Lyft – and this certainly won’t be the last. It will, however, be the first where results aren’t presented as a static, curated end-point. You will be able to interact with the data using the same tool as I did to replicate the findings, tweak the parameters, and even ask new questions!

LL Notebook was designed to be data-size agnostic. Plotting 1.5 billion points is both computationally impractical and overwhelming for the user to perceive, which is why LL Notebook uses binned visualizations that work from aggregates. It queries the database for pre-aggregated summaries and caches them; these summaries are capable of answering questions that weren’t known a priori. This enables a fast and interactive experience that’s otherwise not possible for such a large dataset. Google BigQuery is a great database for this purpose because of its speed, scale, and cost. Computing the pre-aggregated summaries only takes BigQuery on the order of 10 seconds for 1.5 billion records, enabling LL Notebook to support an iterative analysis workflow. I’ve outlined my experience consolidating the dataset and loading it into BigQuery in part II of this series. If this is your first time seeing LL Notebook, take one minute to read these steps.
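
To make the idea of working from aggregates concrete, here is a minimal sketch in Python. It is not LL Notebook’s actual implementation: the function names, bin widths, and sample rows are all assumptions made up for illustration. Raw trips are collapsed once into a small table of bin counts, and a later question (the duration distribution for selected pickup hours) is answered from that table alone, without touching the raw rows again.

```python
from collections import Counter

def bin_index(value, lo, width, n_bins):
    """Map a value to a bin index, clamping out-of-range values to the edge bins."""
    idx = int((value - lo) // width)
    return max(0, min(n_bins - 1, idx))

def summarize(trips, dur_lo=0.0, dur_width=10.0, dur_bins=12):
    """Collapse raw (pickup_hour, duration_minutes) rows into 2-D bin counts."""
    counts = Counter()
    for hour, duration in trips:
        key = (hour % 24, bin_index(duration, dur_lo, dur_width, dur_bins))
        counts[key] += 1
    return counts

def duration_histogram(counts, hours, dur_bins=12):
    """Answer a new question from the summary only: durations for selected hours."""
    hist = [0] * dur_bins
    for (hour, dbin), n in counts.items():
        if hour in hours:
            hist[dbin] += n
    return hist

# Made-up sample rows: (pickup hour, trip duration in minutes).
trips = [(8, 42.0), (8, 55.5), (8, 118.0), (14, 31.0), (14, 28.5), (23, 22.0)]
summary = summarize(trips)
rush_hour = duration_histogram(summary, hours={7, 8, 9})
```

The summary table has at most 24 x 12 cells no matter how many trips go in, which is why the size of the raw data stops mattering once the aggregation is done. In a database, the same collapse would typically be a single grouped count.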

After doing some initial exploration of everything from tipping behavior to the effect of rainy days, I want to share with you a few of the most interesting findings.

8 years of Airport Trips – launch Notebook

Taking the taxi from Manhattan to the Airport was always a little nerve-racking. At times I got stuck on the BQE for 2 hours, and other times I got to the airport 1.5 hours early. I learned the hard way that airport trip durations are highly variable. With this dataset, I hypothesized that it’s possible to make a more informed decision about when to leave for the airport.

Making such a decision is a lot like risk management in finance. It’s prudent to pay extra attention to the tail end of the distribution, as that’s where the big negative outcomes live. In finance, a risk manager may look at the 1% VaR, the loss that is exceeded only 1% of the time. With an airport trip, I can make a similar determination for how much time to allot if I want a 95% chance of making a flight. Or if I had an important meeting to make, perhaps I’d be more comfortable with a 99% chance.
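
The time to allot for a given chance of arriving on time is simply a percentile of the trip duration distribution. The sketch below uses the nearest-rank definition of a percentile; the durations are made-up numbers for illustration, not values from the taxi dataset, and the function name is my own.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up trip durations in minutes, with a long right tail.
durations = [22, 24, 25, 27, 28, 30, 31, 33, 35, 36,
             38, 40, 42, 45, 48, 52, 58, 65, 80, 115]

median = percentile(durations, 50)  # the typical trip
p95 = percentile(durations, 95)     # time to allot for a 95% chance of arriving
p99 = percentile(durations, 99)     # time to allot for a 99% chance of arriving
```

With a fat-tailed distribution like this one, the 95th and 99th percentiles sit far above the median, which is exactly why planning around the typical trip is risky.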

One of my favorite visualizations on the subject is Todd Schneider’s chart, showing the variation in trip duration over time.

http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/

Let’s see how this manifests in LL Notebook.

Variation in trip duration (from Union Square to LGA on a weekday) over time. The fat tail during commute hours is striking. Try other neighborhood/airport combinations

If we fix the y-axis on trip duration, we can get a feel for what the ebb and flow of people from Union Square to JFK looks like.

And vice versa.

Let’s step back and explore the system as a whole.

We can easily see where the bulk of people travel to the airport from, and when. As anyone who has ever lived in a big city would expect, the trip duration is skewed right.
Weekday trips to LGA. Different neighborhoods have very different profiles with respect to time and trip duration

How different are weekdays and weekends?

Median trip duration from Manhattan to LGA doesn’t look that different between weekdays and weekends. However, the tail is much fatter for the weekend.
The tail is even fatter for weekday trips to JFK.

Explore these notebooks on your own: to airports & from airports

The Taxi-Uber War for New York City – launch Notebook

Another of Todd Schneider’s analyses that piqued my interest was the rise of Uber. I wondered how the for-hire vehicles (FHVs – which include ride sharing services Uber and Lyft) are changing the game, and whether they compete differently in different market segments.

FHVs overtook taxis in the second half of 2016, with Uber as the dominant FHV player.
We can see big dips for taxis at 5am & 5pm when shift changes occur. FHVs don’t experience this dip; in fact, they’re effectively picking up the slack.
Yellow Taxi is king in Manhattan, with Uber posing a challenge.
In Manhattan, Uber is doing especially well between 5pm and 5am, whereas Yellow Taxi is steadfast between 5am and 5pm.
The outer boroughs, on the other hand, are FHV territory, with Uber as the breakout leader.
In the battle for Queens, Uber’s doing well, though Lyft has been taking (and maintaining) market share.
In the battle for Brooklyn, Lyft was gaining market share until June 2016, when Uber started regaining it.

By the end of 2016, it appears the ride share companies are doing quite well in the boroughs, and are closing in on Manhattan. Though Uber is the clear FHV leader, Lyft is proving to be a formidable contender in key battlegrounds.
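
The market share comparisons above come down to dividing each provider’s trip count by the total for the same period. A minimal sketch, with made-up trip counts and a hypothetical `market_share` helper (none of these numbers come from the dataset):

```python
def market_share(trips_by_provider):
    """Return each provider's fraction of the total trips in one period."""
    total = sum(trips_by_provider.values())
    if total == 0:
        return {provider: 0.0 for provider in trips_by_provider}
    return {provider: n / total for provider, n in trips_by_provider.items()}

# Made-up monthly trip counts for one borough.
monthly_trips = {
    "2016-05": {"taxi": 600, "uber": 300, "lyft": 100},
    "2016-06": {"taxi": 500, "uber": 350, "lyft": 150},
}

shares = {month: market_share(counts) for month, counts in monthly_trips.items()}
lyft_trend = [shares[month]["lyft"] for month in sorted(shares)]
```

Comparing shares rather than raw counts keeps seasonal swings in overall ridership from masking which provider is actually gaining ground.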

There’s so much more in the data. Explore for yourself and tweet us your findings at @lqdlandscape!

As you can tell from the example Notebooks, there are many more paths to explore with this dataset than is possible in one sitting. I was able to experience full empirical distributions and how they change with respect to one another. In doing so, I detected nuances in the data that led me down interesting exploratory paths. LL Notebook makes the exploration of 1.5 billion records feel fluid, inviting me to ask more questions as if I were having a real conversation with the data. If you’re not having conversations with your big data, drop us a message!

No Unitaskers in Your Data Exploration Kitchen

What does Alton Brown, a Food Network personality, have to do with data exploration? More than you’d expect – his wise maxim that unitaskers don’t belong in your kitchen transcends the culinary world.

Alton Brown holds up a strawberry slicer as he reviews Amazon’s dumbest kitchen gadgets.

The strawberry slicer is the quintessential unitasker. The only thing it does well is produce a fancy-looking garnish, and it’s hardly the only tool for the job. In the kitchen, every tool makes every other tool a little less accessible, thus only the most versatile tools belong in a well-run kitchen. There, the chef is often seen with the chef’s knife, never the strawberry slicer.

What is a unitasker in the data exploration kitchen? It is a chart that may be attractive, but is really only capable of conveying one message. Like the strawberry slicer, it can make a good garnish – for example, when it’s time to present results in a newspaper article or in a PowerPoint presentation. However, it doesn’t belong in the main preparation, where data exploration is done. That work is done with the chef’s knife.

A unitasker – more pizzazz than substance. Looks pretty, communicates the overall shape of the data, but the difference between the years is nearly impossible to see. The layers are there more for form than function.

Data visualization is a vital part of the data analysis workflow. Early on in the process, the goal is for the analyst to get a broad sense of the dataset and to discover potential paths to insight. The search space is huge at this point, as there are a million ways to slice and dice the data using a plethora of techniques. To make this problem tractable, we must take advantage of our keen sense of visual perception and domain expertise to reduce the search space. Charts offer a quick and effective way to spot the most relevant prospects.

At this stage, you want to look for possible relationships between many data dimensions. To observe many data dimensions simultaneously, we need to achieve a high information density using limited screen real estate. The screen here is our metaphorical kitchen; it’s where we can’t afford to be cluttered with unitaskers if we want to walk away with something useful. Each chart must be able to reveal multiple aspects of the data to justify the space it takes up.

Our goal with LL Notebook is to provide users with the chef’s knife of data exploration. To the casual bystander, LL Notebook’s charts look admittedly plain, but like the chef’s knife, they are multi-purpose, built so that form follows function, and optimized for human cognition. With a little bit of practice, they are the general-purpose tools that you can rely on every single time.

Let’s see how LL Notebook differs in approach to the unitasker chart. The underlying dataset is from a smart meter capturing hourly electricity usage (exported via pge.com) from a single family household over 6 years. Instead of trying to gratuitously munge data dimensions together, each dimension can stand in its own right. Relationships between dimensions are revealed by applying filters and dragging the filters around. Think of it as a pivot table on steroids.
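
For readers who think in code, brushing can be sketched as a filter on one dimension followed by a histogram on another. This is only an illustration of the idea, not how LL Notebook is built; the `brush` and `histogram` helpers and the meter readings are made up.

```python
from collections import Counter

def brush(records, filters):
    """Keep records whose value in each brushed dimension falls inside [lo, hi)."""
    return [r for r in records
            if all(lo <= r[dim] < hi for dim, (lo, hi) in filters.items())]

def histogram(records, dim, width):
    """Count records per bin along one dimension, keyed by the bin's lower edge."""
    return Counter(int(r[dim] // width) * width for r in records)

# Made-up hourly smart meter readings.
records = [
    {"year": 2011, "hour": 19, "kwh": 3.0},
    {"year": 2011, "hour": 20, "kwh": 1.7},
    {"year": 2011, "hour": 3, "kwh": 0.4},
    {"year": 2016, "hour": 19, "kwh": 1.1},
    {"year": 2016, "hour": 20, "kwh": 0.9},
    {"year": 2016, "hour": 3, "kwh": 0.3},
]

# Drag the brush over the year dimension and watch the usage histogram change.
early = histogram(brush(records, {"year": (2011, 2013)}), "kwh", 0.5)
late = histogram(brush(records, {"year": (2015, 2017)}), "kwh", 0.5)
```

Because every dimension keeps its own chart, the same two helpers can answer a different question just by swapping which dimension is brushed and which is counted.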

We can see the same overall shape of usage, and if we brush over the year dimension, the qualitative change over time becomes clear. The modes around 1.7 kWh and 3.0 kWh disappeared in the later years. Clearly, something was done in this household in terms of energy efficiency.

Since we’re dealing with multi-purpose charts, we can proceed to do so much more.

We can see the relationship between year and usage from another perspective. What immediately jumps out is the strong relationship between hour of day and usage. This makes intuitive sense as usage should follow the circadian rhythm of the household inhabitants. The linear relationship between usage and cost is clear as well, though it’s interesting that the two modes diverge in cost as usage increases. This reveals the tiered pricing of electricity.
We can also filter on year to see if these effects persist over time. The modes diverge more strongly in the early years than in the later years, where the two cost modes barely emerge. Maybe the pricing tiers converged over time, or maybe the higher tiers were just not reached at all in the later years? Ask more questions!
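
Why would tiered pricing show up as two diverging modes? If an hour’s cost is its usage multiplied by the rate of whichever tier is in effect, then each tier traces its own straight line through the origin, and the gap between the lines grows in proportion to usage. The sketch below assumes exactly that simple model; the two rates are hypothetical and are not the utility’s actual prices.

```python
BASE_RATE = 0.20   # hypothetical price per kWh in the lower tier
UPPER_RATE = 0.30  # hypothetical price per kWh in the higher tier

def hourly_cost(kwh, rate_per_kwh):
    """Cost of one hour of usage billed at a single tier's rate."""
    return kwh * rate_per_kwh

def tier_gap(kwh):
    """Cost difference between the two tiers for the same hourly usage."""
    return hourly_cost(kwh, UPPER_RATE) - hourly_cost(kwh, BASE_RATE)

# The gap widens as usage increases, so the two cost modes fan out.
gaps = [tier_gap(kwh) for kwh in (0.5, 1.0, 2.0, 3.0)]
```

Under this model, if the rates of the two tiers move closer together, or if the household stops reaching the higher tier, the second line fades, which matches the two explanations raised in the caption above.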

In this brief example, you can already see how easy and intuitive it is to interact with and to perceive relationships in the data using only simple charts. In this instance, the household was able to verify that their energy efficiency investments were working, and to further tweak their energy usage. The instantaneous feedback in LL Notebook nudges the analyst to ask more questions. If you’re on your computer, visit the LL Notebook demo to explore this dataset using the chef’s knife of data exploration!