The key challenge in exploring big data is that it often feels more like an e-mail thread than an in-person conversation. If you know exactly what questions to ask, you’ll get nice, precise answers. What you don’t have is the speed, dynamism, and nuance of a real conversation. In this article, I’ll explain how I used LL Notebook and Google BigQuery to explore 1.5 billion taxi and ride share trips in NYC from 2009 to 2016.
As a data scientist and ex-New Yorker, I found the NYC Taxi dataset especially appealing. Its detail and completeness have the potential to reveal characteristics of the great city that are both interesting and actionable. Many articles have been written on the subject – including taxis at the airport, revealing the hottest nightlife spots, and Taxi vs. Uber vs. Lyft – and this certainly won’t be the last. It will, however, be the first where results aren’t presented as a static, curated end-point. You will be able to interact with the data using the same tool I did to replicate the findings, tweak the parameters, and even ask new questions!
LL Notebook was designed to be data-size agnostic. Plotting 1.5 billion points is both computationally impractical and overwhelming for the user to perceive, which is why LL Notebook uses binned visualizations that work from aggregates. It queries for and caches pre-aggregated summaries from the database, summaries that are capable of answering questions that weren’t known a priori. This enables a fast and interactive experience that’s otherwise not possible for such a large dataset. Google BigQuery is a great database for this purpose because of its speed, scale, and cost. Computing the pre-aggregated summaries takes BigQuery only on the order of 10 seconds for 1.5 billion records, enabling LL Notebook to support an iterative analysis workflow. I’ve outlined my experience consolidating the dataset and loading it into BigQuery in part II of this series. If this is your first time seeing LL Notebook, take one minute to read these steps.
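To make the idea of working from aggregates concrete, here is a minimal sketch in Python. It is not LL Notebook's actual implementation or its BigQuery queries; the function names, bin widths, and sample rows are all made up for illustration. It bins trips into a small two-dimensional table of counts once, then answers a filtered question from that table alone, without touching the raw rows again.

```python
from collections import Counter

def bin_index(value, lo, width, n_bins):
    """Map a value to a bin index, clamping out-of-range values to the edge bins."""
    idx = int((value - lo) // width)
    return max(0, min(n_bins - 1, idx))

def build_summary(trips, dur_width=10, n_dur_bins=12):
    """Aggregate (pickup_hour, duration_minutes) pairs into a 2-D table of counts."""
    summary = Counter()
    for hour, duration in trips:
        dur_bin = bin_index(duration, 0, dur_width, n_dur_bins)
        summary[(hour, dur_bin)] += 1
    return summary

def duration_histogram(summary, hours, n_dur_bins=12):
    """Answer a filtered question from the summary alone:
    duration-bin counts for trips picked up in the selected hours."""
    hist = [0] * n_dur_bins
    for (hour, dur_bin), count in summary.items():
        if hour in hours:
            hist[dur_bin] += count
    return hist

# Made-up sample rows: (pickup hour, trip duration in minutes).
trips = [(8, 42), (8, 55), (8, 47), (17, 71), (17, 64), (23, 28), (23, 31)]

summary = build_summary(trips)                       # computed once
rush = duration_histogram(summary, hours={8, 17})    # answered from the summary
```

The summary's size depends on the number of bins, not the number of trips, which is why the same approach can stay responsive whether the table holds seven rows or 1.5 billion.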
After doing some initial exploration of everything from tipping behavior to the effect of rainy days, I want to share with you a few of the most interesting findings.
8 years of Airport Trips – launch Notebook
Taking a taxi from Manhattan to the airport was always a little nerve-racking. Sometimes I got stuck on the BQE for 2 hours, and other times I got to the airport 1.5 hours early. I learned the hard way that airport trip durations are highly variable. With this dataset, I hypothesized that it’s possible to make a more informed decision about when to leave for the airport.
Making such a decision is a lot like risk management in finance. It’s prudent to pay extra attention to the tail end of the distribution, as that’s where the big negative outcomes live. In finance, a risk manager may look at the 1% VaR (value at risk) to estimate the loss that should be exceeded only 1% of the time. With an airport trip, I can make a similar determination of how much time to allot if I want a 95% chance of making a flight. Or if I had an important meeting to make, perhaps I’d be more comfortable with a 99% chance.
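The "how much time to allot" question is a quantile lookup on the empirical distribution of trip durations. Below is a rough sketch using the nearest-rank method; the function name and the sample durations are hypothetical, and in practice the quantile would come from the binned summaries rather than from a raw list.

```python
def time_to_allot(durations, percent):
    """Return the smallest observed duration d such that at least
    `percent` percent of the observed trips took d or less
    (nearest-rank empirical quantile)."""
    if not durations:
        raise ValueError("need at least one observed duration")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    ordered = sorted(durations)
    # Integer ceiling of percent * n / 100, avoiding floating-point rounding.
    rank = (percent * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Made-up trip durations in minutes for one route and time of day.
sample_durations = [38, 41, 44, 45, 47, 49, 50, 52, 53, 55,
                    56, 58, 60, 62, 65, 68, 72, 80, 95, 120]

typical = time_to_allot(sample_durations, 50)    # median trip
cautious = time_to_allot(sample_durations, 95)   # 95% chance of arriving in time
meeting = time_to_allot(sample_durations, 99)    # 99% chance of arriving in time
```

With these made-up values, the median trip is 55 minutes, but the 95% budget is 95 minutes and the 99% budget is 120 minutes. The gap between the median and the tail is exactly what makes planning around the average trip risky.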
One of my favorite visualizations on the subject is Todd Schneider’s chart, showing the variation in trip duration over time.
Let’s see how this manifests in LL Notebook.
If we fix the y-axis on trip duration, we can get a feel for what the ebb and flow of people from Union Square to JFK looks like.
And vice versa.
Let’s step back and explore the system as a whole.
How different are weekdays and weekends?
The Taxi-Uber War for New York City – launch Notebook
Another of Todd Schneider’s analyses that piqued my interest was the rise of Uber. I wondered how the for-hire vehicles (FHVs – which include ride sharing services Uber and Lyft) are changing the game, and whether they compete differently in different market segments.
By the end of 2016, it appears the ride share companies are doing quite well in the boroughs, and are closing in on Manhattan. Though Uber is the clear FHV leader, Lyft is proving to be a formidable contender in key battlegrounds.
As you can tell from the example Notebooks, there are many more paths to explore with this dataset than are possible in one sitting. I was able to experience full empirical distributions and how they change with respect to one another. In doing so, I detected nuances in the data that led me down interesting exploratory paths. LL Notebook makes the exploration of 1.5 billion records feel fluid, inviting me to ask more questions as if I were having a real conversation with the data. If you’re not having conversations with your big data, drop us a message!