Exploring 1.5 Billion NYC Taxi/Uber rides with LL Notebook and BigQuery

The key challenge of exploring big data is that it often feels more like an e-mail thread than an in-person conversation. If you know exactly what questions to ask, you’ll get nice, precise answers. What you don’t have is the speed, dynamism, and nuance of a real conversation. In this article, I’ll explain how I used LL Notebook and Google BigQuery to explore 1.5 billion taxi and ride share trips in NYC from 2009 to 2016.

As a data scientist and ex-New Yorker, I found the NYC Taxi dataset especially appealing. Its detail and completeness have the potential to reveal characteristics of the great city that are both interesting and actionable. Many articles have been written on the subject – including taxis at the airport, revealing the hottest nightlife spots, and Taxi vs. Uber vs. Lyft – and this certainly won’t be the last. It will, however, be the first where results aren’t presented as a static, curated end-point. You will be able to interact with the data using the same tool as I did to replicate the findings, tweak the parameters, and even ask new questions!

LL Notebook was designed to be data-size agnostic. Plotting 1.5 billion points is both computationally impractical and overwhelming for the user to perceive, which is why LL Notebook uses binned visualizations that work from aggregates. It queries for and caches pre-aggregated summaries from the database, summaries that are capable of answering questions that weren’t known a priori. This enables a fast and interactive experience that’s otherwise not possible for such a large dataset. Google BigQuery is a great database for this purpose because of its speed, scale, and cost. Computing the pre-aggregated summaries only takes BigQuery on the order of 10 seconds for 1.5 billion records, enabling LL Notebook to support an iterative analysis workflow. I’ve outlined my experience consolidating the dataset and loading it into BigQuery in part II of this series. If this is your first time seeing LL Notebook, take one minute to read these steps.
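The general idea behind binned, pre-aggregated summaries can be sketched in a few lines of Python. This is an illustration of the technique, not LL Notebook’s actual implementation; the function names, bin sizes, and data are made up:

```python
from collections import Counter

def bin_trips(trips, duration_bin_mins=5):
    """Aggregate raw trips into counts keyed by (pickup hour, duration bin).

    `trips` is an iterable of (pickup_hour, duration_minutes) pairs.
    The aggregate is tiny relative to the raw data, so it can be cached
    client-side and re-queried interactively.
    """
    counts = Counter()
    for hour, duration in trips:
        counts[(hour, int(duration // duration_bin_mins))] += 1
    return counts

def trips_in_hour(counts, hour):
    """Answer a question the aggregate wasn't specifically built for:
    total trips picked up in a given hour."""
    return sum(n for (h, _), n in counts.items() if h == hour)

trips = [(8, 22), (8, 24), (8, 41), (17, 65), (17, 8)]
counts = bin_trips(trips)
assert counts[(8, 4)] == 2          # two morning trips share the 20-25 min bin
assert trips_in_hour(counts, 17) == 2
```

In practice the binning runs inside BigQuery over all 1.5 billion rows, and only the counts travel over the wire.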

After doing some initial exploration of everything from tipping behavior to the effect of rainy days, I want to share with you a few of the most interesting findings.

8 years of Airport Trips – launch Notebook

Taking a taxi from Manhattan to the airport was always a little nerve-racking. At times I got stuck on the BQE for 2 hours, and other times I got to the airport 1.5 hours early. I learned the hard way that airport trip durations are highly variable. With this dataset, I hypothesized that it’s possible to make a more informed decision about when to leave for the airport.

Making such a decision is a lot like risk management in finance. It’s prudent to pay extra attention to the tail end of the distribution as that’s where the big negative outcomes live. In finance, a risk manager may look at the 1% VaR to determine the worst loss 1% of the time. With an airport trip, I can make a similar determination for how much time to allot if I want a 95% chance of making a flight. Or if I had an important meeting to make, perhaps I’d be more comfortable with a 99% chance.
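As a toy illustration of this percentile-based time budgeting (with made-up durations, not figures from the actual dataset), in Python:

```python
def duration_at_confidence(durations_minutes, confidence=0.95):
    """Return the trip duration to budget for a `confidence` chance of
    arriving in time -- an empirical percentile, analogous to VaR in
    risk management."""
    xs = sorted(durations_minutes)
    # index of the smallest duration that covers `confidence` of the trips
    k = min(len(xs) - 1, int(confidence * len(xs)))
    return xs[k]

# hypothetical observed durations (minutes) for one route and time slot
durations = [30, 32, 35, 35, 38, 40, 42, 45, 55, 120]
assert duration_at_confidence(durations, 0.50) == 40   # typical trip
assert duration_at_confidence(durations, 0.95) == 120  # budget for the tail
```

The gap between the median and the 95th percentile is exactly the fat tail discussed below: budgeting by the median risks missing the flight on the bad days.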

One of my favorite visualizations on the subject is Todd Schneider’s chart, showing the variation in trip duration over time.


Let’s see how this manifests in LL Notebook.

Variation in trip duration (from Union Square to LGA on a weekday) over time. The fat tail during commute hours is striking. Try other neighborhood/airport combinations

If we fix the y-axis on trip duration, we can get a feel for what the ebb and flow of people from Union Square to JFK looks like.

And vice versa.

Let’s step back and explore the system as a whole.

We can easily see from where the bulk of people travel to the airport, and when. As anyone who has ever lived in a big city would expect, the trip duration is skewed right.
Weekday trips to LGA. Different neighborhoods have very different profiles with respect to time and trip duration

How different are weekdays and weekends?

Median trip duration from Manhattan to LGA doesn’t look that different between weekdays and weekends; however, the tail is much fatter on the weekend.
The tail is even fatter for weekday trips to JFK.

Explore these notebooks on your own: to airports & from airports

The Taxi-Uber War for New York City – launch Notebook

Another one of Todd Schneider’s analyses that piqued my interest was the rise of Uber. I wondered how for-hire vehicles (FHVs – which include the ride sharing services Uber and Lyft) are changing the game, and whether they compete differently in different market segments.

FHVs overtook taxis in the second half of 2016, with Uber as the dominant FHV player.
We can see big dips for taxis at 5am & 5pm when shift changes occur. FHVs don’t experience this dip; in fact, they’re effectively picking up the slack.
Yellow Taxi is king in Manhattan, with Uber posing a challenge.
In Manhattan, Uber is doing especially well between 5pm and 5am, whereas Yellow Taxi is steadfast between 5am to 5pm.
The outer boroughs, on the other hand, are FHV territory, with Uber as the breakout leader.
In the battle for Queens, Uber is doing well, though Lyft has been taking (and maintaining) market share.
In the battle for Brooklyn, Lyft was gaining market share until June 2016, when Uber began to win it back.

By the end of 2016, it appears the ride share companies are doing quite well in the boroughs, and are closing in on Manhattan. Though Uber is the clear FHV leader, Lyft is proving to be a formidable contender in key battlegrounds.

There’s so much more in the data, explore for yourself and tweet us your findings at @lqdlandscape!

As you can tell from the example Notebooks, there are many more paths to explore with this dataset than is possible in one sitting. I was able to experience full empirical distributions and how they change with respect to one another. In doing so, I detected nuances in the data that led me down interesting exploratory paths. LL Notebook makes the exploration of 1.5 billion records feel fluid, inviting me to ask more questions as if I were having a real conversation with the data. If you’re not having conversations with your big data, drop us a message!

Practical Time Series Visualization using D3 + OM

At LiquidLandscape, we believe interactive visuals are the best way to explore and understand multi-dimensional data. This holds especially true when dealing with time series data, where being able to travel back and forth in time is crucial for understanding how other dimensions relate. When we were tasked with creating an insightful visualization for portfolio (time series) data, we knew that something highly interactive would be necessary.

We wanted to present multiple interactive visualizations on a single page, showing the data and relationships from different perspectives. We found the combination of D3 and OM to be perfect for this task. D3’s ability to create beautiful visualizations and OM’s sensible approach to state makes for an awesome marriage of design and engineering. Realizing this benefit isn’t free – the two libraries come from different worlds, with conflicting philosophies. However, by following a few guidelines and by looking at our code example, developing in (and building on top of) this model becomes fairly simple.

In order of importance, our goals are:

  1. data consistency/correctness – ’nuff said
  2. responsiveness – the benefits of an interactive visualization greatly diminish with decreasing responsiveness. When we can’t interact with the data at human speed, much is lost.
  3. object constancy – with so many changing variables, visual assistance is helpful for following changes in the data


We’re big fans of how much D3 simplified the creation of beautiful visualizations and knew that it would play a big role.

Clojure(Script) was already a key part of our technical arsenal. The language is natural for data manipulation, and its persistent data structures are a great fit for snapshotting real-time data and making it available as a time series. Since accessing the time series data becomes trivial, this opens up the possibility of having playback functionality (play, pause, rewind) for your visualization app.

As for the UI itself, Facebook’s React, and its ClojureScript interface, OM, seemed ideal for developing rich UIs that are easy to reason about.

OM from 10,000 ft

At its core, OM is about having a single, atomic representation of application state, and having hierarchical components render the latest snapshot of that state. DOM manipulation is expensive, so when application state changes, OM calculates and executes the minimum set of corresponding DOM updates. A big value-add of OM over React (besides the fact that it’s part of the Clojure ecosystem) is that changes in application state are detected very efficiently, because you can simply do reference equality checks on persistent data structures.
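Python doesn’t have persistent data structures built in, but the reference-equality idea can be mimicked with shallow copies. This hypothetical `assoc` helper (not Om code) illustrates why unchanged subtrees can be skipped with an O(1) identity check instead of a deep comparison:

```python
def assoc(state, key, value):
    """Persistent-style update: return a new top-level dict, reusing
    (not copying) every unchanged subtree."""
    new_state = dict(state)  # shallow copy: child values are shared by reference
    new_state[key] = value
    return new_state

old = {"portfolio": {"AAPL": 10}, "ui": {"tab": "charts"}}
new = assoc(old, "ui", {"tab": "risk"})

# The untouched subtree is the *same object*, so a component watching
# "portfolio" can skip re-rendering after a constant-time identity check.
assert new["portfolio"] is old["portfolio"]
assert new["ui"] is not old["ui"]
```

Clojure’s persistent maps and vectors do this sharing automatically at every level of nesting, which is what makes Om’s change detection cheap.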

D3 + OM, I’m familiar with both. This should be a no-brainer, right?

This is all fine and well if you’re having OM render static HTML to reflect the data; OM will take care of rendering it efficiently. However, we’re interested in creating highly interactive visuals using D3. D3 has a very powerful, declarative way of expressing how visual elements relate to data using selections. In my opinion, this is beautifully done, and we don’t want OM components to take over that responsibility. In addition, object constancy (via transitions) is crucial for understanding time series data, hence we don’t want to simply replace DOM elements.

That said, a mix of D3 and OM sounds like a good way to approach this. But, OM/cljs works with (immutable) persistent data structures while D3 works with (mutable) native js data structures. Some data marshalling would be necessary for this to work, right? Where’s the data demarcation?

This is the approach that worked for us given our goals:

  • Raw/canonical time series data should be held atomically in OM – to take advantage of efficiently representing time series data in persistent data structures.
  • D3 works with native javascript data structures, and data marshalling (via clj->js and js->clj) is expensive, so we want to minimize this translation.
  • Therefore, it is reasonable to represent a (time) snapshot of the data as a cursor.
  • AND, be comfortable with keeping some mutable data (stored in component state) that’s necessary for a dynamic visualization.

There is a performance hit when naively using multiple visual OM components. Remember that OM renders consistent snapshots, meaning that it will want to render all applicable components on the page in a single render pass. With sophisticated visuals, this may result in an unacceptable refresh-rate. The problem and proposed solution are detailed below in the Inter-Component Communication section.

An Example

Let’s now go through an example to see how building something with D3 + OM would look. The advent of fracking and the recent fall in the price of crude oil have been important macroeconomic drivers. As an example, we’ll create a visualization app showing oil production on a US map and how it changes with the price of oil over time. We’ll continue the rest of the article using this example.

The full source is available on GitHub.

OM Component life-cycle with D3

Let’s create a map-chart component, and see how it uses the OM life-cycle protocols.

map chart

om.core/IInitState – initializes local state. Good for storing core.async channels, D3 mutable state, etc.

(init-state [_]
  (let [width 960 height 500
        projection (-> js/d3 .-geo (.albersUsa)
                       (.scale 1000)
                       (.translate #js [(/ width 2) (/ height 2)]))]
    {:width width :height height
     :path (-> js/d3 .-geo (.path) (.projection projection))
     :color (-> js/d3 .-scale (.linear)
                (.range #js ["green" "red"]))
     :comm (chan)
     :us nil    ; topojson data, loaded in will-mount
     :svg nil})) ; D3 selection of the svg element, created in did-mount
om.core/IWillMount – any set-up tasks and core.async loops to consume from core.async channels. core.async is very important in OM because outside of the render phase, you cannot treat cursors as values. Any user-driven event (keyboard, mouse-click, etc.) is outside of the render phase, so you should relay those events via core.async channels to core.async loops created during IWillMount.

(will-mount [_]
  (c/shared-async->local-state owner [:oil-prod-max-val]) ; subscribe to updates to shared-async-state
  (let [{:keys [comm]} (om/get-state owner)]
    (-> js/d3 (.json "/data/us-named.json"
                     (fn [error us]
                       (put! comm [:us us])))) ; callbacks are not part of the render phase, need to relay
    (go (while true
          (let [[k v] (<! comm)]
            (case k
              :us (om/update-state! owner #(assoc % :us v))))))))

om.core/IRender/om.core/IRenderState – create placeholder div for SVG element

(render [_]
  (dom/div #js {:id "map"} nil)) ; placeholder; the svg is appended here in did-mount

om.core/IDidMount – create SVG element

(did-mount [_]
  (let [{:keys [width height projection comm]} (om/get-state owner)
        svg (-> js/d3 (.select "#map") (.append "svg")
              (.attr "width" width) (.attr "height" height))]
    (om/update-state! owner #(assoc % :svg svg))))

om.core/IDidUpdate – main hook for rendering and transitioning visual via D3.

(did-update [_ prev-snapshot prev-state]
  (let [{:keys [svg us path color oil-prod-max-val]} (om/get-state owner)]
    (when us
      (-> color (.domain #js [0 oil-prod-max-val]))
      (let [states (-> svg (.selectAll ".state")
                       (.data (-> js/topojson (.feature us (-> us .-objects .-states)) .-features)))]
        ;; enter: create a path (with a title child) per state
        (-> states (.enter)
            (.append "path")
            (.attr "class" (fn [d-] (str "state " (-> d- .-properties .-code))))
            (.attr "d" path)
            (.append "title"))
        ;; update: recolor every state from the current snapshot
        (-> states
            (.attr "fill" (fn [d-]
                            (let [code (-> d- .-properties .-code)
                                  prod (or (get snapshot code) 0)]
                              (color prod)))))
        (-> states (.select "title")
            (.text (fn [d-]
                     (let [code (-> d- .-properties .-code)
                           prod (or (get snapshot code) 0)]
                       (NUMBER_FORMAT prod)))))))))

We’ll also create a root component, called app which lays out the entire OM app, and is responsible for loading all data in om/IWillMount. You can check this in the full source.

Where’s the state with D3 + OM?

It’s important to have a clear understanding of the different levels where state can be kept, and what each level is appropriate for.

  • shared state – this is created during om.core/root under :shared and where a global pub/sub core.async channel (used to communicate shared async state changes) can be kept.

    (om/root app app-state {:target (. js/document (getElementById "app"))
                            :shared (let [publisher (chan)]
                                      {:publisher publisher
                                       :publication (pub publisher first)
                                       :init {:oil-prod-max-val 0
                                              :ts (js/Date. 0)}})})
  • app state – the data. Available to OM components as cursors.

    (def app-state
      (atom {:oil-data (sorted-map)})) ; {date1 {:production {"CA" 232 ...} "SpotPrice" 23} date2 {...}}
  • component/local state – state only relevant within the component, appropriate for storing some mutable state for D3 visuals. Initialized in om/IInitState, accessed via om/get-state, and modified via om/set-state! and om/update-state!

  • shared async state – a concept we introduced for Inter-Component Communication. It’s state which is asynchronously communicated between components and eventually consistent. Basically, a more formalized take on Publish & Notification Channels. Usage detailed below.

Inter-Component Communication

Discovering hidden gems in data often requires having multiple visuals on the same page simultaneously to see the data from different perspectives. These visuals must be correlated and dynamic in order for the user to detect complex relationships. Using shared async state as a means of inter-component communication can be very effective.

How is shared async state different than app state?

  • it’s not the data (which is what app state is appropriate for), but rather how to display the data.
  • app state can only be accessed via cursors, which have a couple of limitations:
    • must be a map or vector. Can be unnatural for passing display-related values like color, timestamp, index, etc.
    • cursors are hierarchical, meaning a component’s cursor is a subset of its parent’s cursor. This is partially mitigated via reference cursors, but I’ve personally found this approach to be less explicit about what the dependencies are.
  • following React/OM’s philosophy, we don’t care how the related components are rendered, just that we’ll end up with a consistent view of the related components. Changes in app state will only render consistent views, whereas changes in shared async state will eventually render consistent views.
    • This means that with related components A & B, component A could be rendering changes in :ts at 10 Hz, whereas the more computationally expensive component B could be rendering changes in :ts at a much lower rate of 1 Hz.
    • By relaxing the consistent-rendering constraint, we decouple the visual renderings of related components to achieve a higher (perceived) refresh-rate.

In our example, we’ll add a component, timeseries, which will show a time series of the price of oil, and function as a time-slider for map-chart. They have timestamp :ts as their shared async state.

timeseries graph

shared async state is implemented using core.async channels:

  1. determine the shared async state between components – this could mean time, zoom-level, etc.
  2. every component is capable of changing shared async state via the global pub/sub core.async channel
  3. each component should subscribe to and keep a copy (in local state) of the shared async state it’s interested in. Copying an update to local state will automatically trigger rerenders via IDidUpdate. A convenience function d3om.core/shared-async->local-state does this, and should be called in om/IWillMount.

In this way, visual representations of the components are eventually consistent, and you gain a whole lot of responsiveness.
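The actual implementation uses core.async pub/sub as described above; as a language-neutral sketch of the eventual-consistency idea (a hypothetical class, not part of d3om), a latest-value mailbox per component might look like:

```python
from collections import deque

class SharedAsyncState:
    """Toy pub/sub: each component keeps its own mailbox holding only the
    latest value, so a slow component coalesces bursts of updates instead
    of holding back faster components."""
    def __init__(self):
        self.mailboxes = {}

    def subscribe(self, component):
        self.mailboxes[component] = deque(maxlen=1)  # keep latest value only

    def publish(self, value):
        for box in self.mailboxes.values():
            box.append(value)             # older unread values are dropped

    def poll(self, component):
        box = self.mailboxes[component]
        return box.pop() if box else None

bus = SharedAsyncState()
bus.subscribe("timeseries")  # cheap component, polls on every update
bus.subscribe("map-chart")   # expensive component, polls occasionally

for ts in range(10):         # a burst of :ts updates
    bus.publish(ts)
    assert bus.poll("timeseries") == ts  # fast component sees each one

# the slow component skips intermediate values but converges on the latest
assert bus.poll("map-chart") == 9
```

This is the relaxed constraint in action: `map-chart` renders fewer frames than `timeseries`, yet both eventually display the same `:ts`.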


There is definitely some overhead to using D3 + OM over using just D3 to create a visualization. If you don’t need multiple, correlated visualizations, using D3 alone is often the better choice. However, D3 + OM really excels when you want to build modular visual components that are reusable and easy to combine (and interact) with other components built in the same way. It’s an approach that works well under changing requirements and pays off in the long run.

It’s worth mentioning that some examples in the D3 gallery don’t easily translate to OM-land. Patterns that are natural in js (using mutable data, callbacks, etc.) don’t necessarily translate to the stricter, immutable world of Clojure/OM. In some instances, the more tractable approach is to first fully understand the example, then rewrite it with a Clojure mindset.

That’s all for now. Join us next time as we explore 3D + OM!


Alternative Tools

As usual, we evaluated existing tools for the job – in this case, using D3 from ClojureScript. In the end, we decided to use JavaScript interop directly. Here are some libraries we evaluated:

  • strokes – uses mrhyde to “remove as much of the interop glue and clj->js->clj data marshalling from application code as possible”

    • the library has fallen behind and doesn’t work with the latest cljs. The polyfill approach means that there will always be a maintenance cost, and with a constantly changing/improving library like cljs, that cost is high.
    • seamless interop is very attractive, but this isn’t what mrhyde attempts to do, nor would it be possible. Instead, it has a blurred line approach, which makes the data structures from both (js & cljs) sides partially compatible with APIs from the other side. Ultimately, this means that you need to painstakingly know precisely what “partially” means. Since js data structures are inherently different than cljs data structures, we felt it more appropriate to explicitly separate the two sides.
  • c2 – inspired by d3 to “build declarative mappings from your data to HTML or SVG markup”.

What’s mentioned in this article but not illustrated in our example?

  • object constancy – there’s no movement in the map-chart, but if there were, transitions would be appropriate.
  • using persistent data structures to snapshot streaming, realtime data. The example loads all data on start up.

Coding convention

  • Follow D3’s indentation convention
  • Append - to variable names that hold js data structures (e.g. d-). Again, to be explicit about the differences.
  • Append s to sequential data structures, ss to nested sequential data structures, and so on (e.g. seriess for a 2d vector). Apply this to irregular plurals too, at the cost of improper grammar (e.g. datas).

Other useful links

discuss this on Hacker News