Exploring 1.5 Billion NYC Taxi/Uber rides – Part II: Importing & Building LL Notebook

In Part I, I demonstrated how easily LL Notebook revealed interesting characteristics of the NYC Taxi dataset using highly interactive visualizations. In this article, we’ll take a step back and examine the process of getting the data into BigQuery, importing it into LL Notebook, and finally building an LL Notebook.

Importing to BigQuery

Importing the data into a clean form is always a big part of any data analysis workflow. This was no exception, but parallelizing the work on GCP made the process much faster. There are already a few public NYC Taxi datasets, but none of them were exactly what I wanted – Yellow, Green, and FHV data from 2009–2016 with consistent neighborhood mappings.

A good place to start with the raw data is toddwschneider’s nyc-taxi-data repo. This is what was needed to clean and enrich the raw data:

  • There were slight schema changes – mostly taken care of by Todd’s import scripts
  • Inconsistency in identifying pickup/dropoff locations
    • Lon/lat was used prior to 2016H1, and taxi zone location IDs after that
    • The shapefile for taxi location IDs isn’t all that useful here, since it isn’t based on lon/lat.
    • Todd’s scripts will map the lon/lat to 2010 Census Tract (2010ct), Neighborhood Tabulation Area (NTA), and Public Use Microdata Areas (PUMAs) – for more info – using PostGIS.
    • In order from most granular to least: 2010ct (1335 regions), taxi location ID (265), NTA (195), PUMA (55)
    • For this analysis, I decided it’s best to use taxi location IDs and map 2010ct to them using this (approximate) mapping – a sketch of how such a mapping can be derived follows this list.
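
The linked mapping is necessarily approximate. For the curious, here is one way such a mapping could be derived – a sketch only, assuming the taxi zones shapefile has been loaded into the same PostGIS database as nyct2010, with illustrative table and column names: assign each census tract to the taxi zone it overlaps most.

    -- Sketch: assign each 2010 census tract to the taxi zone it overlaps most.
    -- Assumes nyct2010 and a taxi_zones table share the same SRID;
    -- table and column names are illustrative.
    SELECT DISTINCT ON (ct.boroct2010)
           ct.boroct2010,
           tz.locationid
    FROM nyct2010 ct
    JOIN taxi_zones tz
      ON ST_Intersects(ct.geom, tz.geom)
    ORDER BY ct.boroct2010,
             ST_Area(ST_Intersection(ct.geom, tz.geom)) DESC;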

The next step is running Todd’s import script, which reportedly takes 3 days on a laptop. I’m impatient, so I ran the jobs in parallel on Compute Engine VMs and finished in a few hours at a cost of about $10.
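
For reference, provisioning each VM looks roughly like this (a sketch – the instance name, zone, machine type, and disk size below are placeholders, and the exact Ubuntu 14.04 image family name may differ):

     ## sketch: provision a worker VM (name/zone/machine type/disk are placeholders)
     gcloud compute instances create taxi-import-1 \
       --zone us-central1-b \
       --machine-type n1-standard-4 \
       --image-family ubuntu-1404-lts \
       --image-project ubuntu-os-cloud \
       --boot-disk-size 500GB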

  1. Provision a Compute Engine instance to download all the files into GCS. I used Ubuntu Trusty (14.04) for all jobs.

     GCS_BUCKET="gs://<your-gcs-bucket>"
     sudo apt-get install git-core
     sudo apt-get install parallel
     git clone https://github.com/toddwschneider/nyc-taxi-data.git
     cat nyc-taxi-data/raw_data_urls.txt | parallel -j10 wget {}        ## download all files locally
     ls *csv | parallel -j2 gzip {}                                     ## gzip them
     gsutil -m cp *gz ${GCS_BUCKET}                                     ## cp to GCS
    
  2. Provision Compute Engine instances to run Todd’s import script. I parallelized the work by provisioning one VM for each year of Yellow, one for all of Green, and one for all of FHV. All data keyed by lon/lat is processed this way; data keyed by location IDs can be loaded directly into BigQuery.

     ## pattern for the dataset to process
     TARGET="yellow_tripdata_2010-02"
     ME=`whoami`
     GCS_BUCKET="gs://<your-gcs-bucket>"
    
     ## install tools
     sudo apt-get update && \
     sudo apt-get -y install postgresql postgresql-client postgresql-contrib && \
     sudo apt-get -y install postgis* `# http://gis.stackexchange.com/questions/71302/error-when-trying-to-run-create-extension-postgis#answer-108401` && \
     sudo apt-get -y install git-core && \
     sudo apt-get -y install parallel
    
     ## get code
     git clone https://github.com/toddwschneider/nyc-taxi-data.git
    
     ## get files & unzip
     gsutil -m cp ${GCS_BUCKET}/${TARGET}* nyc-taxi-data/data  
     ls -1 nyc-taxi-data/data/*gz | parallel -j2 gzip -d {}
    
     ## run Todd's script
     sudo -u postgres -i                   ## note: ${ME} isn't carried into this login shell
     ME=<your-username>                    ## so re-set it here
     cd /home/${ME}/nyc-taxi-data/
     ./initialize_database.sh
     nohup ./import_trip_data.sh &         # long running job
     tail -f /var/lib/postgresql/nohup.out
    
     ## export to CSV (help from http://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html)
     mkdir -p /home/${ME}/nyc-taxi-data/trips && \
     sudo chown -R postgres:postgres /home/${ME}/nyc-taxi-data/trips && \
     sudo -u postgres psql nyc-taxi-data postgres
    
     COPY (
         SELECT trips.id,
                trips.vendor_id,
                trips.pickup_datetime,
                trips.dropoff_datetime,
                trips.store_and_fwd_flag,
                trips.rate_code_id,
                trips.pickup_longitude,
                trips.pickup_latitude,
                trips.dropoff_longitude,
                trips.dropoff_latitude,
                trips.passenger_count,
                trips.trip_distance,
                trips.fare_amount,
                trips.extra,
                trips.mta_tax,
                trips.tip_amount,
                trips.tolls_amount,
                trips.ehail_fee,
                trips.improvement_surcharge,
                trips.total_amount,
                trips.payment_type,
                trips.trip_type,
                trips.pickup,
                trips.dropoff,
                cab_types.type cab_type,
                weather.precipitation rain,
                weather.snow_depth,
                weather.snowfall,
                weather.max_temperature max_temp,
                weather.min_temperature min_temp,
                weather.average_wind_speed wind,
                pick_up.gid pickup_nyct2010_gid,
                pick_up.ctlabel pickup_ctlabel,
                pick_up.borocode pickup_borocode,
                pick_up.boroname pickup_boroname,
                pick_up.ct2010 pickup_ct2010,
                pick_up.boroct2010 pickup_boroct2010,
                pick_up.cdeligibil pickup_cdeligibil,
                pick_up.ntacode pickup_ntacode,
                pick_up.ntaname pickup_ntaname,
                pick_up.puma pickup_puma,
                drop_off.gid dropoff_nyct2010_gid,
                drop_off.ctlabel dropoff_ctlabel,
                drop_off.borocode dropoff_borocode,
                drop_off.boroname dropoff_boroname,
                drop_off.ct2010 dropoff_ct2010,
                drop_off.boroct2010 dropoff_boroct2010,
                drop_off.cdeligibil dropoff_cdeligibil,
                drop_off.ntacode dropoff_ntacode,
                drop_off.ntaname dropoff_ntaname,
                drop_off.puma dropoff_puma
         FROM trips
         LEFT JOIN cab_types
             ON trips.cab_type_id = cab_types.id
         LEFT JOIN central_park_weather_observations weather
             ON weather.date = trips.pickup_datetime::date
         LEFT JOIN nyct2010 pick_up
             ON pick_up.gid = trips.pickup_nyct2010_gid
         LEFT JOIN nyct2010 drop_off
             ON drop_off.gid = trips.dropoff_nyct2010_gid
     ) TO PROGRAM
         -- replace ${ME} below with your actual username: shell variables aren't expanded inside psql
         'split -l 20000000 --filter="gzip > /home/${ME}/nyc-taxi-data/trips/trips_\$FILE.csv.gz"'
         WITH CSV;
    
     ## cp csv to GCS
     TARGET=yellow_2012                    ## prefix for the exported file names
     sudo -u postgres -i                   ## switch to postgres (re-set ME in this shell)
     chmod -R 777 /home/${ME}/nyc-taxi-data/trips
     ^d                                    ## Ctrl-D to return to your own user
     cd /home/${ME}/nyc-taxi-data/trips
     ls -1 | parallel mv {} ${TARGET}_{}   ## rename with the ${TARGET} prefix
     gsutil -m cp /home/${ME}/nyc-taxi-data/trips/* ${GCS_BUCKET}
    
  3. Import from GCS to BigQuery; this can be done from your local machine.

     ## local load into bq
     TARGET="yellow_2010_trips"
     GCS_BUCKET="gs://<your-gcs-bucket>"
     gsutil ls ${GCS_BUCKET}/${TARGET}* | parallel \
     bq --nosync load <your-bq-project-and-dataset>.${TARGET} \
     {} \
     id,vendor_id,pickup_datetime:timestamp,dropoff_datetime:timestamp,store_and_fwd_flag,rate_code_id,pickup_longitude:float,pickup_latitude:float,dropoff_longitude:float,dropoff_latitude:float,passenger_count:integer,trip_distance:float,fare_amount:float,extra:float,mta_tax:float,tip_amount:float,tolls_amount:float,ehail_fee:float,improvement_surcharge:float,total_amount:float,payment_type,trip_type,pickup,dropoff,cab_type,precipitation:float,snow_depth:float,snowfall:float,max_temp:float,min_temp:float,average_wind_speed:float,pickup_nyct2010_gid,pickup_ctlabel,pickup_borocode,pickup_boroname,pickup_ct2010,pickup_boroct2010,pickup_cdeligibil,pickup_ntacode,pickup_ntaname,pickup_puma,dropoff_nyct2010_gid,dropoff_ctlabel,dropoff_borocode,dropoff_boroname,dropoff_ct2010,dropoff_boroct2010,dropoff_cdeligibil,dropoff_ntacode,dropoff_ntaname,dropoff_puma
    
  4. Once imported, I performed a join in BigQuery to map ct2010 to (taxi) location IDs and pull in any other auxiliary columns, producing the final tables for the analysis. A sketch of that join is shown below.
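
The exact query depends on your table names, but the join looks roughly like this (a sketch – yellow_2010_trips and the ct2010_to_locationid mapping table are illustrative names, and the mapping table is assumed to hold one taxi location ID per boroct2010 code; the result would be written out as a final analysis table):

     -- Sketch: attach taxi location IDs via the approximate ct2010 -> locationID mapping.
     -- Table names are illustrative.
     SELECT
         t.*,
         pu.locationid AS pickup_locationid,
         dm.locationid AS dropoff_locationid
     FROM `<your-bq-project-and-dataset>.yellow_2010_trips` t
     LEFT JOIN `<your-bq-project-and-dataset>.ct2010_to_locationid` pu
         ON pu.boroct2010 = t.pickup_boroct2010
     LEFT JOIN `<your-bq-project-and-dataset>.ct2010_to_locationid` dm
         ON dm.boroct2010 = t.dropoff_boroct2010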

Importing a BigQuery table & Building an LL Notebook

Importing is dead simple: just link your Google account, select the table, and click “Import”. LL Notebook will then run some pre-computations on the dataset.

Building is just as easy. Select the dimensions you care about, and LL Notebook takes care of querying BigQuery. When the button on the lower right turns green, click it and you’re good to go!

As anyone who’s ever done data analysis knows, it’s not uncommon to spend the majority of your time preparing the data, and this was certainly the case here. Once the data was imported, I was able to interrogate it with ease using LL Notebook. Follow us for the third and final part of this series, where we’ll extract more insights from the NYC Taxi dataset.

No Unitaskers in Your Data Exploration Kitchen

What does Alton Brown, a Food Network personality, have to do with data exploration? More than you’d expect – his wise maxim that unitaskers don’t belong in your kitchen transcends the culinary world.

Alton Brown holds up a strawberry slicer as he reviews Amazon’s dumbest kitchen gadgets.

The strawberry slicer is the quintessential unitasker. The only thing it does well is produce a fancy-looking garnish, and even then it’s hardly the only tool for the job. In the kitchen, every tool makes every other tool a little less accessible, so only the most versatile tools belong in a well-run kitchen. There, the chef is often seen with the chef’s knife, never the strawberry slicer.

What is a unitasker in the data exploration kitchen? It is a chart that may be attractive but is really only capable of conveying one message. Like the strawberry slicer, it can make a good garnish – for example, when it’s time to present results in a newspaper article or a PowerPoint presentation. However, it doesn’t belong in the main preparation, where the data exploration is done. That work is done with the chef’s knife.

A unitasker – more pizzazz than substance. It looks pretty and communicates the overall shape of the data, but the difference between the years is nearly impossible to see. The layers are there more for form than function.

Data visualization is a vital part of the data analysis workflow. Early on in the process, the goal is for the analyst to get a broad sense of the dataset and to discover potential paths to insight. The search space is huge at this point, as there are a million ways to slice and dice the data using a plethora of techniques. To make this problem tractable, we must take advantage of our keen sense of visual perception and our domain expertise to reduce the search space. Charts offer a quick and effective way to spot the most relevant prospects.

At this stage, you want to look for possible relationships between many data dimensions. To observe many dimensions simultaneously, we need to achieve high information density within limited screen real estate. The screen here is our metaphorical kitchen; it’s where we can’t afford to be cluttered with unitaskers if we want to walk away with something useful. Each chart must be able to reveal multiple aspects of the data to justify the space it takes up.

Our goal with LL Notebook is to provide users with the chef’s knife of data exploration. To the casual bystander, LL Notebook’s charts look admittedly plain, but like the chef’s knife, they are multi-purpose, their form follows function, and they are optimized for human cognition. With a little bit of practice, they become the general-purpose tools you can rely on every single time.

Let’s see how LL Notebook’s approach differs from the unitasker chart. The underlying dataset is from a smart meter capturing hourly electricity usage (exported via pge.com) for a single-family household over 6 years. Instead of gratuitously munging data dimensions together, LL Notebook lets each dimension stand in its own right. Relationships between dimensions are revealed by applying filters and dragging them around. Think of it as a pivot table on steroids.

We can see the same overall shape of usage, and if we brush over the year dimension, the qualitative change over time becomes clear. The modes around 1.7 kWh and 3.0 kWh disappeared in the later years. Clearly, something was done in this household to improve energy efficiency.

Since we’re dealing with multi-purpose charts, we can proceed to do so much more.

We can see the relationship between year and usage from another perspective. What immediately jumps out is the strong relationship between hour of day and usage. This makes intuitive sense, as usage should follow the circadian rhythm of the household’s inhabitants. The linear relationship between usage and cost is clear as well, though it’s interesting that the two modes diverge in cost as usage increases. This reveals the tiered pricing of electricity.

We can also filter by year to see whether these effects persist over time. The modes diverge more sharply in the early years than in the later years, where the two cost modes barely emerge. Maybe the pricing tiers converged over time, or maybe the higher tiers were simply never reached in the later years? Ask more questions!

In this brief example, you can already see how easy and intuitive it is to interact with the data and perceive relationships in it using only simple charts. In this instance, the household was able to verify that its energy efficiency investments were working and to further tweak its energy usage. The instantaneous feedback in LL Notebook nudges the analyst to ask more questions. If you’re on your computer, visit the LL Notebook demo to explore this dataset using the chef’s knife of data exploration!