Exploring 1.5 Billion NYC Taxi/Uber rides – Part II: Importing & Building LL Notebook

In Part I, I demonstrated how easily LL Notebook was able to reveal interesting characteristics of the NYC Taxi dataset using highly interactive visualizations. In this article, we’ll take a step back and examine the process of getting the data into BigQuery, importing it into LL Notebook, and finally building an LL Notebook.

Importing to BigQuery

Importing the data into a clean form is always a big part of any data analysis workflow. This was no exception, but parallelizing the work on GCP made the process much faster. There are already a few public NYC Taxi datasets, but none of them were exactly what I wanted: Yellow, Green, and FHV data from 2009 to 2016 with consistent neighborhood mappings.

A good place to start with the raw data is Todd W. Schneider’s nyc-taxi-data repository. Here’s what was needed to clean and enrich the raw data:

  • There were slight schema changes – mostly taken care of by Todd’s import scripts
  • Inconsistency in identifying pickup/dropoff locations
    • Lon/lat coordinates were used through the first half of 2016, and taxi zone location IDs after that
    • The shapefile for taxi location IDs isn’t all that useful here, since it isn’t based on lon/lat.
    • Todd’s scripts map the lon/lat to 2010 Census Tracts (2010ct), Neighborhood Tabulation Areas (NTA), and Public Use Microdata Areas (PUMA) using PostGIS.
    • From most granular to least (number of regions in parentheses): 2010ct (1335), taxi location ID (265), NTA (195), PUMA (55)
    • For this analysis, I decided it’s best to use taxi location IDs and map 2010ct to them using this (approximate) mapping

The next step is running Todd’s import script, which is claimed to take three days on a laptop. I’m impatient, so I ran the jobs in parallel on Compute Engine instances, finishing in a few hours at a cost of about $10.

  1. Provision a Compute Engine instance to download all the files into GCS. I used Ubuntu Trusty for all jobs.

     GCS_BUCKET="gs://<your-gcs-bucket>"
     sudo apt-get install git-core
     sudo apt-get install parallel
     git clone https://github.com/toddwschneider/nyc-taxi-data.git
     cat nyc-taxi-data/raw_data_urls.txt | parallel -j10 wget {}        ## download all files locally
     ls *csv | parallel -j2 gzip {}                                     ## gzip them
     gsutil -m cp *gz ${GCS_BUCKET}                                     ## cp to GCS
    
  2. Provision Compute Engine instances to run Todd’s import script. I parallelized the work by provisioning one VM for each year of Yellow, one for all of Green, and one for all of FHV. All data with lon/lat is processed this way; data that already has location IDs can be loaded directly into BigQuery.

     ## pattern for the dataset to process
     TARGET="yellow_tripdata_2010-02"
     ME=`whoami`
     GCS_BUCKET="gs://<your-gcs-bucket>"
    
     ## install tools
     sudo apt-get update && \
     sudo apt-get -y install postgresql postgresql-client postgresql-contrib && \
     sudo apt-get -y install postgis* `# http://gis.stackexchange.com/questions/71302/error-when-trying-to-run-create-extension-postgis#answer-108401` && \
     sudo apt-get install git-core && \
     sudo apt-get install parallel
    
     ## get code
     git clone https://github.com/toddwschneider/nyc-taxi-data.git
    
     ## get files & unzip
     gsutil -m cp ${GCS_BUCKET}/${TARGET}* nyc-taxi-data/data  
     ls -1 nyc-taxi-data/data/*gz | parallel -j2 gzip -d {}
    
     ## run Todd's script
     ## (note: shell variables like ${ME} are not carried into the postgres login
     ##  shell below, so re-set them there before relying on them)
     sudo -u postgres -i
     cd /home/${ME}/nyc-taxi-data/
     ./initialize_database.sh
     nohup ./import_trip_data.sh &         # long running job
     tail -f /var/lib/postgresql/nohup.out
    
     ## export to CSV (help from http://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html)
     mkdir -p /home/${ME}/nyc-taxi-data/trips && \
     sudo chown -R postgres:postgres /home/${ME}/nyc-taxi-data/trips && \
     sudo -u postgres psql  nyc-taxi-data postgres
    
     COPY (
         SELECT trips.id,
                trips.vendor_id,
                trips.pickup_datetime,
                trips.dropoff_datetime,
                trips.store_and_fwd_flag,
                trips.rate_code_id,
                trips.pickup_longitude,
                trips.pickup_latitude,
                trips.dropoff_longitude,
                trips.dropoff_latitude,
                trips.passenger_count,
                trips.trip_distance,
                trips.fare_amount,
                trips.extra,
                trips.mta_tax,
                trips.tip_amount,
                trips.tolls_amount,
                trips.ehail_fee,
                trips.improvement_surcharge,
                trips.total_amount,
                trips.payment_type,
                trips.trip_type,
                trips.pickup,
                trips.dropoff,
                cab_types.type cab_type,
                weather.precipitation rain,
                weather.snow_depth,
                weather.snowfall,
                weather.max_temperature max_temp,
                weather.min_temperature min_temp,
                weather.average_wind_speed wind,
                pick_up.gid pickup_nyct2010_gid,
                pick_up.ctlabel pickup_ctlabel,
                pick_up.borocode pickup_borocode,
                pick_up.boroname pickup_boroname,
                pick_up.ct2010 pickup_ct2010,
                pick_up.boroct2010 pickup_boroct2010,
                pick_up.cdeligibil pickup_cdeligibil,
                pick_up.ntacode pickup_ntacode,
                pick_up.ntaname pickup_ntaname,
                pick_up.puma pickup_puma,
                drop_off.gid dropoff_nyct2010_gid,
                drop_off.ctlabel dropoff_ctlabel,
                drop_off.borocode dropoff_borocode,
                drop_off.boroname dropoff_boroname,
                drop_off.ct2010 dropoff_ct2010,
                drop_off.boroct2010 dropoff_boroct2010,
                drop_off.cdeligibil dropoff_cdeligibil,
                drop_off.ntacode dropoff_ntacode,
                drop_off.ntaname dropoff_ntaname,
                drop_off.puma dropoff_puma
         FROM trips
         LEFT JOIN cab_types
             ON trips.cab_type_id = cab_types.id
         LEFT JOIN central_park_weather_observations weather
             ON weather.date = trips.pickup_datetime::date
         LEFT JOIN nyct2010 pick_up
             ON pick_up.gid = trips.pickup_nyct2010_gid
         LEFT JOIN nyct2010 drop_off
             ON drop_off.gid = trips.dropoff_nyct2010_gid
     ) TO PROGRAM
         'split -l 20000000 --filter="gzip > /home/${ME}/nyc-taxi-data/trips/trips_\$FILE.csv.gz"'
         WITH CSV;
    
     ## cp CSVs to GCS
     TARGET=yellow_2012                    ## prefix for the exported chunks
     sudo -u postgres -i
     chmod 777 -R /home/${ME}/nyc-taxi-data/trips
     ^d                                    ## Ctrl-D to exit the postgres shell
     cd /home/${ME}/nyc-taxi-data/trips
     ls -1 | parallel mv {} ${TARGET}_{}   ## prepend the prefix to each file
     gsutil -m cp /home/${ME}/nyc-taxi-data/trips/* ${GCS_BUCKET}
    
  3. Import from GCS to BigQuery; this can be done from your local machine.

     ## local load into bq
     TARGET="yellow_2010_trips"
     GCS_BUCKET="gs://<your-gcs-bucket>"
     gsutil ls ${GCS_BUCKET}/${TARGET}* | parallel \
     bq --nosync load <your-bq-project-and-dataset>.${TARGET} \
     {} \
     id,vendor_id,pickup_datetime:timestamp,dropoff_datetime:timestamp,store_and_fwd_flag,rate_code_id,pickup_longitude:float,pickup_latitude:float,dropoff_longitude:float,dropoff_latitude:float,passenger_count:integer,trip_distance:float,fare_amount:float,extra:float,mta_tax:float,tip_amount:float,tolls_amount:float,ehail_fee:float,improvement_surcharge:float,total_amount:float,payment_type,trip_type,pickup,dropoff,cab_type,precipitation:float,snow_depth:float,snowfall:float,max_temp:float,min_temp:float,average_wind_speed:float,pickup_nyct2010_gid,pickup_ctlabel,pickup_borocode,pickup_boroname,pickup_ct2010,pickup_boroct2010,pickup_cdeligibil,pickup_ntacode,pickup_ntaname,pickup_puma,dropoff_nyct2010_gid,dropoff_ctlabel,dropoff_borocode,dropoff_boroname,dropoff_ct2010,dropoff_boroct2010,dropoff_cdeligibil,dropoff_ntacode,dropoff_ntaname,dropoff_puma
    
  4. Once imported, I performed a join in BigQuery to map ct2010 to (taxi) location IDs, along with any other auxiliary columns, into the final tables for the analysis; a rough sketch of that join follows.
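
     Conceptually, the join looks something like the sketch below (BigQuery Standard SQL). The mapping table and its columns (ct2010_to_taxi_zone, boroct2010, location_id) are placeholder names for illustration, not the exact ones I used:

     -- illustrative only: attach taxi zone location IDs via a census-tract mapping table
     SELECT
       t.*,
       pu.location_id AS pickup_location_id,
       dz.location_id AS dropoff_location_id
     FROM `<your-bq-project-and-dataset>.yellow_2010_trips` t
     LEFT JOIN `<your-bq-project-and-dataset>.ct2010_to_taxi_zone` pu
       ON pu.boroct2010 = t.pickup_boroct2010
     LEFT JOIN `<your-bq-project-and-dataset>.ct2010_to_taxi_zone` dz
       ON dz.boroct2010 = t.dropoff_boroct2010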

Importing a BigQuery Table & Building an LL Notebook

Importing is dead simple: just link your Google account, select the table, then click “Import”. LL Notebook will do some pre-computation on the dataset.

Building is just as easy. Select the dimensions you care about, and LL Notebook takes care of querying BigQuery. When the button on the lower right is green, click it and you’re good to go!

As anyone who’s ever done data analysis knows, it’s not uncommon to spend a majority of the time preparing the data, and this was certainly the case here. Once imported, I was able to interrogate the data with ease using LL Notebook. Follow us for the third and final part of this series, as we extract more insights from the NYC Taxi dataset.

Exploring 1.5 Billion NYC Taxi/Uber rides with LL Notebook and BigQuery

The key challenge of exploring big data is that it often feels more like an e-mail thread than an in-person conversation. If you know exactly what questions to ask, you’ll get nice, precise answers. What you don’t get is the speed, dynamism, and nuance of a real conversation. In this article, I’ll explain how I used LL Notebook and Google BigQuery to explore 1.5 billion taxi and ride-share trips in NYC from 2009 to 2016.

As a data scientist and ex-New Yorker, I found the NYC Taxi dataset especially appealing. Its detail and completeness have the potential to reveal characteristics of the great city that are both interesting and actionable. Many articles have been written on the subject – including taxis at the airport, revealing the hottest nightlife spots, and Taxi vs. Uber vs. Lyft – and this certainly won’t be the last. It will, however, be the first where the results aren’t presented as a static, curated end-point. You will be able to interact with the data using the same tool I did to replicate the findings, tweak the parameters, and even ask new questions!

LL Notebook was designed to be data-size agnostic. Plotting 1.5 billion points is both computationally impractical and overwhelming for the user to perceive, which is why LL Notebook uses binned visualizations that work from aggregates. It queries for and caches pre-aggregated summaries from the database, summaries capable of answering questions that weren’t known a priori. This enables a fast and interactive experience that’s otherwise not possible for such a large dataset. Google BigQuery is a great database for this purpose because of its speed, scale, and cost. Computing the pre-aggregated summaries takes BigQuery only on the order of 10 seconds for 1.5 billion records, enabling LL Notebook to support an iterative analysis workflow. I’ve outlined my experience consolidating the dataset and loading it into BigQuery in Part II of this series. If this is your first time seeing LL Notebook, take one minute to read these steps.
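
I don’t know the exact queries LL Notebook generates, but conceptually the pre-aggregated summaries are just binned counts over the dimensions of interest. A rough BigQuery sketch of that kind of query (the table name and the mapped location-ID column are placeholders following the Part II import; the 5-minute duration bin is arbitrary):

    -- illustrative only: binned counts over a few dimensions
    SELECT
      pickup_location_id,
      EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour,
      CAST(FLOOR(TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, MINUTE) / 5) AS INT64) AS duration_bin,
      COUNT(*) AS trips
    FROM `<your-bq-project-and-dataset>.taxi_trips`
    GROUP BY 1, 2, 3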

After doing some initial exploration of everything from tipping behavior to the effect of rainy days, I want to share with you a few of the most interesting findings.

8 years of Airport Trips – launch Notebook

Taking a taxi from Manhattan to the airport was always a little nerve-racking. At times I got stuck on the BQE for 2 hours; other times I got to the airport 1.5 hours early. I learned the hard way that airport trip durations are highly variable. With this dataset, I hypothesized that it’s possible to make a more informed decision about when to leave for the airport.

Making such a decision is a lot like risk management in finance. It’s prudent to pay extra attention to the tail end of the distribution, because that’s where the big negative outcomes live. In finance, a risk manager may look at the 1% VaR, the loss threshold exceeded only 1% of the time. For an airport trip, I can make a similar determination of how much time to allot if I want a 95% chance of making my flight. Or if I had an important meeting to make, perhaps I’d be more comfortable with a 99% chance.
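
To make this concrete, here’s roughly what that percentile question looks like against the trip table in BigQuery (the table name and location-ID columns are placeholders from the Part II import, and the zone IDs in the filter are illustrative; check them against the TLC taxi zone lookup):

    -- illustrative only: how much time to allot for a weekday trip to LGA
    SELECT
      EXTRACT(HOUR FROM pickup_datetime) AS pickup_hour,
      APPROX_QUANTILES(TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, MINUTE), 100)[OFFSET(50)] AS p50_minutes,
      APPROX_QUANTILES(TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, MINUTE), 100)[OFFSET(95)] AS p95_minutes,
      APPROX_QUANTILES(TIMESTAMP_DIFF(dropoff_datetime, pickup_datetime, MINUTE), 100)[OFFSET(99)] AS p99_minutes
    FROM `<your-bq-project-and-dataset>.taxi_trips`
    WHERE pickup_location_id = 234                                  -- e.g. Union Square
      AND dropoff_location_id = 138                                 -- LaGuardia Airport
      AND EXTRACT(DAYOFWEEK FROM pickup_datetime) BETWEEN 2 AND 6   -- weekdays
    GROUP BY pickup_hour
    ORDER BY pickup_hour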

One of my favorite visualizations on the subject is Todd Schneider’s chart, showing the variation in trip duration over time.

http://toddwschneider.com/posts/analyzing-1-1-billion-nyc-taxi-and-uber-trips-with-a-vengeance/

Let’s see how this manifests in LL Notebook.

Variation in trip duration (from Union Square to LGA on a weekday) over time. The fat tail during commute hours is striking. Try other neighborhood/airport combinations

If we fix the y-axis on trip duration, we can get a feel for what the ebb and flow of people from Union Square to JFK looks like.

And vice versa.

Let’s step back and explore the system as a whole.

We can easily see from where the bulk of people travel to the airport, and when. As anyone who has ever lived in a big city would expect, the trip duration distribution is skewed right.
Weekday trips to LGA. Different neighborhoods have very different profiles with respect to time and trip duration

How different are weekdays and weekends?

Median trip duration from Manhattan to LGA doesn’t look that different between weekday and weekend; however, the tail is much fatter on the weekend.
The tail is even fatter for weekday trips to JFK.

Explore these notebooks on your own: to airports & from airports

The Taxi-Uber War for New York City – launch Notebook

Another one of Todd Schneider’s analyses that piqued my interest was the rise of Uber. I wondered how for-hire vehicles (FHVs, which include the ride-sharing services Uber and Lyft) are changing the game, and whether they compete differently in different market segments.

FHVs overtook taxis in the second half of 2016, with Uber as the dominant FHV player.
We can see big dips for taxis at 5am & 5pm when shift changes occur. FHVs don’t experience this dip; in fact, they’re effectively picking up the slack.
Yellow Taxi is king in Manhattan, with Uber posing a challenge.
In Manhattan, Uber is doing especially well between 5pm and 5am, whereas Yellow Taxi is steadfast between 5am and 5pm.
The outer boroughs, on the other hand, are FHV territory, with Uber as the breakout leader.
In the battle for Queens, Uber is doing well, though Lyft has been taking (and maintaining) market share.
In the battle for Brooklyn, Lyft was gaining market share until June 2016, when Uber started regaining it.

By the end of 2016, the ride-share companies appear to be doing quite well in the boroughs and are closing in on Manhattan. Though Uber is the clear FHV leader, Lyft is proving to be a formidable contender in key battlegrounds.
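
If you prefer SQL to sliders, the market-share view boils down to monthly trip counts by service. A minimal sketch, assuming a combined table with a cab_type label per trip (for FHV records that label would have to be derived from the dispatching base, e.g. Uber vs. Lyft; the table name is a placeholder):

    -- illustrative only: monthly trip counts by service
    SELECT
      FORMAT_TIMESTAMP('%Y-%m', pickup_datetime) AS month,
      cab_type,
      COUNT(*) AS trips
    FROM `<your-bq-project-and-dataset>.all_trips`
    GROUP BY month, cab_type
    ORDER BY month, trips DESC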

There’s so much more in the data, explore for yourself and tweet us your findings at @lqdlandscape!

As you can tell from the example Notebooks, there are many more paths to explore with this dataset than is possible in one sitting. I was able to experience full empirical distributions and how they change with respect to one another. In doing so, I detected nuances in the data that led me down interesting exploratory paths. LL Notebook makes the exploration of 1.5 billion records feel fluid, inviting me to ask more questions as if I were having a real conversation with the data. If you’re not having conversations with your big data, drop us a message!

No Unitaskers in Your Data Exploration Kitchen

What does Alton Brown, a Food Network personality, have to do with data exploration? More than you’d expect – his wise maxim that unitaskers don’t belong in your kitchen transcends the culinary world.

Alton Brown holds up a strawberry slicer as he reviews Amazon’s dumbest kitchen gadgets.

The strawberry slicer is the quintessential unitasker. The only thing it does well is produce a fancy-looking garnish, and it’s hardly the only tool for that job. In the kitchen, every tool makes every other tool a little less accessible, so only the most versatile tools belong in a well-run kitchen. There, the chef is often seen with the chef’s knife, never the strawberry slicer.

What is a unitasker in the data exploration kitchen? It is a chart which may be attractive, but is really only capable of conveying one message. Like the strawberry slicer, it can make a good garnish, for example when it’s time to present results in a newspaper article or a PowerPoint presentation. However, it doesn’t belong in the main preparation, where data exploration is done. That work calls for the chef’s knife.

A unitasker – more pizzazz than substance. It looks pretty and communicates the overall shape of the data, but the differences between the years are nearly impossible to see. The layers are there more for form than function.

Data visualization is a vital part of the data analysis workflow. Early in the process, the goal is for the analyst to get a broad sense of the dataset and to discover potential paths to insight. The search space is huge at this point, as there are a million ways to slice and dice the data using a plethora of techniques. To make the problem tractable, we must take advantage of our keen sense of visual perception and our domain expertise to reduce the search space. Charts offer a quick and effective way to spot the most relevant prospects.

At this stage, you want to look for possible relationships between many data dimensions. To observe many dimensions simultaneously, we need to achieve a high information density with limited screen real estate. The screen here is our metaphorical kitchen; we can’t afford to clutter it with unitaskers if we want to walk away with something useful. Each chart must be able to reveal multiple aspects of the data to justify the space it takes up.

Our goal with LL Notebook is to provide users with the chef’s knife of data exploration. To the casual bystander, LL Notebook’s charts look admittedly plain, but like the chef’s knife, they are multi-purpose, their form follows function, and they are optimized for human cognition. With a little bit of practice, they are the general-purpose tools you can rely on every single time.

Let’s see how LL Notebook’s approach differs from the unitasker chart. The underlying dataset is from a smart meter capturing hourly electricity usage (exported via pge.com) for a single-family household over 6 years. Instead of gratuitously munging data dimensions together, LL Notebook lets each dimension stand in its own right. Relationships between dimensions are revealed by applying filters and dragging them around. Think of it as a pivot table on steroids.

We can see the same overall shape of usage, and if we brush over the year dimension, the qualitative change over time becomes clear. The modes around 1.7 kWh and 3.0 kWh disappeared in the later years. Clearly, this household did something about energy efficiency.
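
The computation behind this view is nothing exotic; it’s a binned histogram of hourly usage per year. A sketch of the equivalent query, assuming the meter readings were loaded into a table with hypothetical reading_time and usage_kwh columns:

    -- illustrative only: distribution of hourly usage by year, 0.1 kWh bins
    SELECT
      EXTRACT(YEAR FROM reading_time) AS year,
      ROUND(usage_kwh, 1) AS usage_bin_kwh,
      COUNT(*) AS hours
    FROM `<project>.<dataset>.hourly_electricity`
    GROUP BY year, usage_bin_kwh
    ORDER BY year, usage_bin_kwh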

Since we’re dealing with multi-purpose charts, we can proceed to do so much more.

We can see the relationship between year and usage from another perspective. What immediately jumps out is the strong relationship between hour of day and usage. This makes intuitive sense as usage should follow the circadian rhythm of the household inhabitants. The linear relationship between usage and cost is clear as well, though it’s interesting that the two modes diverge in cost as usage increases. This reveals the tiered pricing of electricity.
We can filter on year as well to see whether these effects persist over time. The modes diverge more in the early years than in the later years, where the two cost modes barely emerge. Maybe the pricing tiers converged over time, or maybe the higher tiers were simply never reached in the later years? Ask more questions!

In this brief example, you can already see how easy and intuitive it is to interact with and to perceive relationships in the data using only simple charts. In this instance, the household was able to verify that their energy efficiency investments were working, and to further tweak their energy usage. The instantaneous feedback in LL Notebook nudges the analyst to ask more questions. If you’re on your computer, visit the LL Notebook demo to explore this dataset using the chef’s knife of data exploration!

Practical Time Series Visualization using D3 + OM

At LiquidLandscape, we believe interactive visuals are the best way to explore and understand multi-dimensional data. This holds especially true when dealing with time series data, where being able to travel back and forth in time is crucial for understanding how other dimensions relate. When we were tasked with creating an insightful visualization for portfolio (time series) data, we knew that something highly interactive would be necessary.

We wanted to present multiple interactive visualizations on a single page, showing the data and relationships from different perspectives. We found the combination of D3 and OM to be perfect for this task. D3’s ability to create beautiful visualizations and OM’s sensible approach to state make for an awesome marriage of design and engineering. Realizing this benefit isn’t free: the two libraries come from different worlds, with conflicting philosophies. However, by following a few guidelines and by looking at our code example, developing in (and building on top of) this model becomes fairly simple.

In order of importance, our goals are:

  1. data consistency/correctness – ’nuff said
  2. responsiveness – the benefits of an interactive visualization greatly diminish with decreasing responsiveness. When we can’t interact with the data at human speed, much is lost.
  3. object constancy – with so many changing variables, visual assistance is helpful for following changes in the data

Tools

We’re big fans of how much D3 simplified the creation of beautiful visualizations and knew that it would play a big role.

Clojure(Script) was already a key part of our technical arsenal. The language is natural for data manipulation, and its persistent data structures are a great fit for snapshotting real-time data and making it available as a time series. Since accessing the time series data becomes trivial, this opens up the possibility of having playback functionality (play, pause, rewind) for your visualization app.

As for the UI itself, Facebook’s React, and its ClojureScript interface, OM, seemed ideal for developing rich UIs that are easy to reason about.

OM from 10,000 ft

At its core, OM is about having a single, atomic representation of application state, and having hierarchical components render the latest snapshot of that state. DOM manipulation is expensive, so when application state changes, OM calculates and executes the minimum set of corresponding DOM updates. A big value-add of OM over React (besides the fact that it’s part of the Clojure ecosystem) is that changes in application state are detected very efficiently, because you can simply do reference equality checks on persistent data structures.

D3 + OM, I’m familiar with both. This should be a no-brainer, right?

This is all fine and well if you’re having OM render static HTML to reflect the data: OM will take care of rendering it efficiently. However, we’re interested in creating highly interactive visuals using D3. D3 has a very powerful, declarative way of expressing how visual elements relate to data using selections. In my opinion, this is beautifully done, and we don’t want OM components to take over that responsibility. In addition, object constancy (via transitions) is crucial for understanding time series data, so we don’t want to simply replace DOM elements.

That said, a mix of D3 and OM sounds like a good way to approach this. But, OM/cljs works with (immutable) persistent data structures while D3 works with (mutable) native js data structures. Some data marshalling would be necessary for this to work, right? Where’s the data demarcation?

This is the approach that worked for us given our goals:

  • Raw/canonical time series data should be held atomically in OM – to take advantage of efficiently representing time series data in persistent data structures.
  • D3 works with native javascript data structures, and data marshalling (via clj->js and js->clj) is expensive, so we want to minimize this translation.
  • Therefore, it is reasonable to represent a (time) snapshot of the data as a cursor
  • AND be comfortable with keeping some mutable data (stored in component state) where it’s necessary for a dynamic visualization.

There is a performance hit when naively using multiple visual OM components. Remember that OM renders consistent snapshots, meaning that it will want to render all applicable components on the page in a single render pass. With sophisticated visuals, this may result in an unacceptable refresh rate. The problem and a proposed solution are detailed below in the Inter-Component Communication section.

An Example

Let’s now go through an example to see what building something with D3 + OM looks like. The advent of fracking and the recent fall in the price of crude oil have been important macro-economic drivers. As an example, we’ll create a visualization app showing oil production on a US map and how it changes with the price of oil over time. We’ll continue the rest of the article using this example.

The full source is available on GitHub.

OM Component life-cycle with D3

Let’s create a map-chart component, and see how it uses the OM life-cycle protocols.

map chart

om.core/IInitState – initializes local state. Good for storing core.async channels, D3 mutable state, etc.

(init-state [_]
  (let [width 960 height 500
        projection (-> js/d3 .-geo .albersUsa (.scale 1000) (.translate #js [(/ width 2) (/ height 2)]))]
    {:width width :height height
     :path (-> js/d3 .-geo .path (.projection projection))
     :color (-> js/d3 .-scale .linear
              (.range #js ["green" "red"]))
     :comm (chan)
     :us nil ;topojson data
     :svg nil
     }))

om.core/IWillMount – any set-up tasks and core.async loops to consume from core.async channels. core.async is very important in OM because outside of the render phase, you cannot treat cursors as values. Any user-driven event (keyboard, mouse click, etc.) happens outside of the render phase, so you should relay those events via core.async channels to the core.async loops created during IWillMount.

(will-mount [_]
  (c/shared-async->local-state owner [:oil-prod-max-val]) ;subscribe to updates to shared-async-state
  (let [{:keys [comm]} (om/get-state owner)]
    (-> js/d3 (.json "/data/us-named.json"
                  (fn [error us]
                    (put! comm [:us us])))) ;callbacks are not part of the render phase, need to relay
    (go (while true
          (let [[k v] (<! comm)]
            (case k
              :us (om/update-state! owner #(assoc % :us v))
              ))))))

om.core/IRender/om.core/IRenderState – create placeholder div for SVG element

(render [_]
  (html
    [:div#map]))

om.core/IDidMount – create SVG element

(did-mount [_]
  (let [{:keys [width height projection comm]} (om/get-state owner)
        svg (-> js/d3 (.select "#map") (.append "svg")
              (.attr "width" width) (.attr "height" height))]
    (om/update-state! owner #(assoc % :svg svg))))

om.core/IDidUpdate – main hook for rendering and transitioning visual via D3.

(did-update [_ prev-snapshot prev-state]
  (let [{:keys [svg us path color oil-prod-max-val]} (om/get-state owner)]
    (when us
      (-> color (.domain #js [0 oil-prod-max-val]))
      (let [states (-> svg (.selectAll ".state")
                     (.data (-> js/topojson (.feature us (-> us .-objects .-states)) .-features)))]
        (-> states
            (.attr "fill" (fn [d-]
                             (let [code (-> d- .-properties .-code)
                                   prod (or (get snapshot code) 0)]
                               (color prod))))
          (.enter)
            (.append "path")
              (.attr "class" (fn [d-] (str "state " (-> d- .-properties .-code))))
              (.attr "fill" (fn [d-]
                              (let [code (-> d- .-properties .-code)
                                    prod (or (get snapshot code) 0)]
                                (color prod))))
              (.attr "d" path)
            (.append "title"))
        (-> states (.select "path title")
            (.text (fn [d-]
                           (let [code (-> d- .-properties .-code)
                                 prod (or (get snapshot code) 0)]
                             (NUMBER_FORMAT prod)))))
        ))))

We’ll also create a root component, called app, which lays out the entire OM app and is responsible for loading all data in om/IWillMount. You can see this in the full source.

Where’s the state with D3 + OM?

It’s important to have a clear understanding of the different levels where state can be kept, and what each level is appropriate for.

  • shared state – created during om.core/root under :shared; this is where a global pub/sub core.async channel (used to communicate shared async state changes) can be kept.

    (om/root app app-state {:target (. js/document (getElementById "app"))
                            :shared (let [publisher (chan)]
                                      {:publisher publisher
                                       :publication (pub publisher first)
                                       :init {:oil-prod-max-val 0
                                              :ts (js/Date. 0)}
                                       })})
    
  • app state – the data. Available to OM components as cursors.

    (def app-state
      (atom {:oil-data (sorted-map) ; {date1 {:production {"CA" 232 ...} "SpotPrice" 23} date2 {...}}
             }))
    
  • component/local state – state only relevant within the component, appropriate for storing some mutable state for D3 visuals. Initialized in om/IInitState, accessed via om/get-state, and modified via om/set-state! and om/update-state!

  • shared async state – a concept we introduced for Inter-Component Communication. It’s state which is asynchronously communicated between components and eventually consistent. Basically, a more formalized take on Publish & Notification Channels. Usage detailed below.

Inter-Component Communication

Discovering hidden gems in data often requires having multiple visuals on the same page simultaneously, to see the data from different perspectives. These visuals must be correlated and dynamic in order for the user to detect complex relationships. Using shared async state as a means of inter-component communication can be very effective.

How is shared async state different from app state?

  • it’s not the data (which is what app state is appropriate for), but rather how to display the data.
  • app state can only be accessed via cursors, which have a couple of limitations:
    • must be a map or vector. Can be unnatural for passing display-related values like color, timestamp, index, etc.
    • cursors are hierarchical, meaning a component’s cursor is a subset of its parent’s cursor. This is partially mitigated via reference cursors, but I’ve personally found this approach to be less explicit about what the dependencies are.
  • following React/OM’s philosophy, we don’t care how the related components are rendered, just that we’ll end up with a consistent view of the related components. Changes in app state will only render consistent views, whereas changes in shared async state will eventually render consistent views.
    • This means that with related components A & B, component A could be rendering changes in :ts at 10 Hz, whereas the more computationally expensive component B could be rendering changes in :ts at a much lower rate of 1 Hz.
    • By relaxing the consistent-rendering constraint, we decouple the visual renderings of related components to achieve a higher (perceived) refresh-rate.

In our example, we’ll add a component, timeseries, which will show a time series of the price of oil, and function as a time-slider for map-chart. They have timestamp :ts as their shared async state.

timeseries graph

shared async state is implemented using core.async channels:

  1. determine the shared async state between components; this could be time, zoom level, etc.
  2. every component is capable of changing shared async state via the global pub/sub core.async channel
  3. each component should subscribe to and keep a copy (in local state) of the shared async state it’s interested in. Copying an update to local state will automatically trigger rerenders via IDidUpdate. A convenience function d3om.core/shared-async->local-state does this, and should be called in om/IWillMount.

In this way, visual representations of the components are eventually consistent, and you gain a whole lot of responsiveness.

Conclusion

There is definitely some amount of overhead to using D3 + OM over using just D3 to create a visualization. If you don’t need multiple, correlated visualizations, using D3 alone is often the better choice. However, D3 + OM really excels when you want to build modular visual components that are reusable and easy to combine (and interact) with other components built in the same way. It’s an approach that works well under changing requirements and pays off in the long run.

It’s worth mentioning that some examples in the D3 gallery don’t easily translate to OM-land. Patterns that are natural in js (using mutable data, callbacks, etc) don’t necessarily translate to the stricter, immutable world of Clojure/OM. In some instances, the more tractable approach is to first fully understand the example, then rewrite it with a Clojure mind-set.

That’s all for now. Join us next time as we explore 3D + OM!

Appendix

Alternative Tools

As usual, we evaluated existing tools for the job – in this case, using D3 from ClojureScript. In the end, we decided to use JavaScript interop directly. Here are some libraries we evaluated:

  • strokes – uses mrhyde to “remove as much of the interop glue and clj->js->clj data marshalling from application code as possible”

    • the library has fallen behind and doesn’t work with the latest cljs. The polyfill approach means there will always be a maintenance cost, and with a constantly changing/improving library like cljs, that cost is high.
    • seamless interop is very attractive, but this isn’t what mrhyde attempts to do, nor would it be possible. Instead, it takes a blurred-line approach, which makes the data structures from both sides (js & cljs) partially compatible with APIs from the other side. Ultimately, this means you need to know painstakingly and precisely what “partially” means. Since js data structures are inherently different from cljs data structures, we felt it more appropriate to explicitly separate the two sides.
  • c2 – inspired by d3, to “build declarative mappings from your data to HTML or SVG markup”.

What’s mentioned in this article but not illustrated in our example?

  • object constancy – there’s no movement in the map-chart, but if there were, transitions would be appropriate.
  • using persistent data structures to snapshot streaming, realtime data. The example loads all data on start up.

Coding convention

  • Follow D3’s indentation convention
  • Appending - to variable names that hold js data structures (e.g. d-). Again, to be explicit about the differences.
  • Appending s to sequential data structures, ss to nested sequential data structures, and so on (e.g. seriess for a 2d vector). Apply this to irregular plurals, at the cost of improper grammar (e.g. datas).

Other useful links

discuss this on Hacker News