Exploring 1.5 Billion NYC Taxi/Uber rides – Part II: Importing & Building LL Notebook


In Part I, I demonstrated how easily LL Notebook can reveal interesting characteristics of the NYC Taxi dataset using highly interactive visualizations. In this article, we’ll take a step back and examine the process of getting the data into BigQuery, importing it into LL Notebook, and finally building an LL Notebook.

Importing to BigQuery

Getting the data into a clean form is always a big part of any data analysis workflow. This was no exception, but parallelizing the work on GCP made the process much faster. There are already a few public NYC Taxi datasets, but none of them were exactly what I wanted: Yellow, Green, and FHV data from 2009–2016 with consistent neighborhood mappings.

A good place to start with the raw data is Todd W. Schneider’s nyc-taxi-data repo. Here is what was needed to clean/enrich the raw data:

  • There were slight schema changes over the years – mostly taken care of by Todd’s import scripts
  • Inconsistency in identifying pickup/dropoff locations
    • Lon/lat was used prior to 2016H1, and taxi zone location IDs after that
    • The shapefile for taxi location IDs isn’t all that useful here, since it isn’t based on lon/lat.
    • Todd’s scripts map the lon/lat to 2010 Census Tracts (2010ct), Neighborhood Tabulation Areas (NTA), and Public Use Microdata Areas (PUMA) using PostGIS.
    • In order from most granular to least (region counts in parentheses): 2010ct (1335) < taxi location ID (265) < NTA (195) < PUMA (55)
    • For this analysis, I decided it’s best to use taxi location IDs and map 2010ct to them using an approximate mapping

The next step is running Todd’s import script, which reportedly takes three days on a laptop. I’m impatient, so I ran the jobs in parallel on Compute Engine instances, finishing in a few hours at a cost of about $10.

  1. Provision a Compute Engine instance to download all the files into GCS (a provisioning sketch follows the script below). I used Ubuntu trusty (14.04) for all jobs.

     GCS_BUCKET="gs://<your-gcs-bucket>"
     sudo apt-get install git-core
     sudo apt-get install parallel
     git clone https://github.com/toddwschneider/nyc-taxi-data.git
     cat nyc-taxi-data/raw_data_urls.txt | parallel -j10 wget {}        ## download all files locally
     ls *csv | parallel -j2 gzip {}                                     ## gzip them
     gsutil -m cp *gz ${GCS_BUCKET}                                     ## cp to GCS
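
     If you need to create the VM itself, a minimal provisioning sketch with gcloud might look like this (the instance name, machine type, and zone are placeholder choices, not necessarily what I used):

     gcloud compute instances create taxi-downloader \
         --image-family=ubuntu-1404-lts --image-project=ubuntu-os-cloud \
         --machine-type=n1-standard-4 --zone=us-east1-b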
    
  2. Provision Compute Engine instances to run Todd’s import script. I parallelized the work by provisioning one VM for each year of Yellow, one for all of Green, and one for all of FHV. Only data with lon/lat needs this processing; data with location IDs can be loaded directly into BigQuery.

     ## pattern for the dataset to process
     TARGET="yellow_tripdata_2010-02"
     ME=`whoami`
     GCS_BUCKET="gs://<your-gcs-bucket>"
    
     ## install tools
     sudo apt-get update && \
     sudo apt-get -y install postgresql postgresql-client postgresql-contrib && \
     sudo apt-get -y install postgis* `# http://gis.stackexchange.com/questions/71302/error-when-trying-to-run-create-extension-postgis#answer-108401` && \
     sudo apt-get install git-core && \
     sudo apt-get install parallel
    
     ## get code
     git clone https://github.com/toddwschneider/nyc-taxi-data.git
    
     ## get files & unzip
     gsutil -m cp ${GCS_BUCKET}/${TARGET}* nyc-taxi-data/data  
     ls -1 nyc-taxi-data/data/*gz | parallel -j2 gzip -d {}
    
     ## run Todd's script
     sudo -u postgres -i                   ## note: shell variables don't carry into this new shell
     ME=<your-username>                    ## so re-set ME here
     cd /home/${ME}/nyc-taxi-data/
     ./initialize_database.sh
     nohup ./import_trip_data.sh &         # long-running job
     tail -f nohup.out                     ## nohup.out lands in the current directory
    
     ## export to CSV (help from http://tech.marksblogg.com/billion-nyc-taxi-rides-redshift.html)
     mkdir -p /home/${ME}/nyc-taxi-data/trips && \
     sudo chown -R postgres:postgres /home/${ME}/nyc-taxi-data/trips && \
     sudo -u postgres psql  nyc-taxi-data postgres
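     -- now inside psql (COPY ... TO PROGRAM requires superuser, hence connecting as postgres);
     -- note that shell variables such as ${ME} are NOT expanded here, so spell paths out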
    
     COPY (
         SELECT trips.id,
                trips.vendor_id,
                trips.pickup_datetime,
                trips.dropoff_datetime,
                trips.store_and_fwd_flag,
                trips.rate_code_id,
                trips.pickup_longitude,
                trips.pickup_latitude,
                trips.dropoff_longitude,
                trips.dropoff_latitude,
                trips.passenger_count,
                trips.trip_distance,
                trips.fare_amount,
                trips.extra,
                trips.mta_tax,
                trips.tip_amount,
                trips.tolls_amount,
                trips.ehail_fee,
                trips.improvement_surcharge,
                trips.total_amount,
                trips.payment_type,
                trips.trip_type,
                trips.pickup,
                trips.dropoff,
                cab_types.type cab_type,
                weather.precipitation rain,
                weather.snow_depth,
                weather.snowfall,
                weather.max_temperature max_temp,
                weather.min_temperature min_temp,
                weather.average_wind_speed wind,
                pick_up.gid pickup_nyct2010_gid,
                pick_up.ctlabel pickup_ctlabel,
                pick_up.borocode pickup_borocode,
                pick_up.boroname pickup_boroname,
                pick_up.ct2010 pickup_ct2010,
                pick_up.boroct2010 pickup_boroct2010,
                pick_up.cdeligibil pickup_cdeligibil,
                pick_up.ntacode pickup_ntacode,
                pick_up.ntaname pickup_ntaname,
                pick_up.puma pickup_puma,
                drop_off.gid dropoff_nyct2010_gid,
                drop_off.ctlabel dropoff_ctlabel,
                drop_off.borocode dropoff_borocode,
                drop_off.boroname dropoff_boroname,
                drop_off.ct2010 dropoff_ct2010,
                drop_off.boroct2010 dropoff_boroct2010,
                drop_off.cdeligibil dropoff_cdeligibil,
                drop_off.ntacode dropoff_ntacode,
                drop_off.ntaname dropoff_ntaname,
                drop_off.puma dropoff_puma
         FROM trips
         LEFT JOIN cab_types
             ON trips.cab_type_id = cab_types.id
         LEFT JOIN central_park_weather_observations weather
             ON weather.date = trips.pickup_datetime::date
         LEFT JOIN nyct2010 pick_up
             ON pick_up.gid = trips.pickup_nyct2010_gid
         LEFT JOIN nyct2010 drop_off
             ON drop_off.gid = trips.dropoff_nyct2010_gid
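     -- stream the output through split, writing gzipped CSV chunks of 20M rows each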
     ) TO PROGRAM
         'split -l 20000000 --filter="gzip > /home/<your-username>/nyc-taxi-data/trips/trips_\$FILE.csv.gz"'
         WITH CSV;
    
     ## cp csv to GCS
     TARGET=yellow_2012                    ## prefix for the exported files
     sudo -u postgres -i
     chmod -R 777 /home/<your-username>/nyc-taxi-data/trips
     ## press Ctrl-D here to exit the postgres shell
     cd /home/${ME}/nyc-taxi-data/trips
     ls -1 | parallel mv {} ${TARGET}_{}   ## prefix each file with the dataset name
     gsutil -m cp /home/${ME}/nyc-taxi-data/trips/* ${GCS_BUCKET}
    
  3. Import from GCS to BigQuery; this can be done from your local machine.

     ## local load into bq
     TARGET="yellow_2010_trips"
     GCS_BUCKET="gs://<your-gcs-bucket>"
     gsutil ls ${GCS_BUCKET}/${TARGET}* | parallel \
     bq --nosync load <your-bq-project-and-dataset>.${TARGET} \
     {} \
     id,vendor_id,pickup_datetime:timestamp,dropoff_datetime:timestamp,store_and_fwd_flag,rate_code_id,pickup_longitude:float,pickup_latitude:float,dropoff_longitude:float,dropoff_latitude:float,passenger_count:integer,trip_distance:float,fare_amount:float,extra:float,mta_tax:float,tip_amount:float,tolls_amount:float,ehail_fee:float,improvement_surcharge:float,total_amount:float,payment_type,trip_type,pickup,dropoff,cab_type,precipitation:float,snow_depth:float,snowfall:float,max_temp:float,min_temp:float,average_wind_speed:float,pickup_nyct2010_gid,pickup_ctlabel,pickup_borocode,pickup_boroname,pickup_ct2010,pickup_boroct2010,pickup_cdeligibil,pickup_ntacode,pickup_ntaname,pickup_puma,dropoff_nyct2010_gid,dropoff_ctlabel,dropoff_borocode,dropoff_boroname,dropoff_ct2010,dropoff_boroct2010,dropoff_cdeligibil,dropoff_ntacode,dropoff_ntaname,dropoff_puma
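
     Since --nosync submits the load jobs asynchronously, you can check on them afterwards; one way is:

     bq ls -j -n 20        ## list the 20 most recent jobs and their states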
    
  4. Once everything was loaded, I performed a join in BigQuery to map ct2010 to (taxi) location IDs and add other auxiliary columns, producing the final tables for the analysis (a sketch follows).
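
     That join might look something like the sketch below. Here ct2010_to_locationid is a hypothetical name for a table holding the approximate 2010ct-to-location-ID mapping mentioned earlier, and the join key is boroct2010, since plain ct2010 values repeat across boroughs; the actual table and column names will depend on your setup.

     -- sketch: enrich trips with taxi location IDs via a ct2010 mapping table
     SELECT
         t.*,
         pu.location_id AS pickup_location_id,
         dr.location_id AS dropoff_location_id
     FROM `<your-bq-project>.<dataset>.yellow_2010_trips` t
     LEFT JOIN `<your-bq-project>.<dataset>.ct2010_to_locationid` pu
         ON t.pickup_boroct2010 = pu.boroct2010
     LEFT JOIN `<your-bq-project>.<dataset>.ct2010_to_locationid` dr
         ON t.dropoff_boroct2010 = dr.boroct2010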

Importing a BigQuery table & Building an LL Notebook

Importing is dead simple: link your Google account, select the table, then click “Import”. LL Notebook will do some pre-computation on the dataset.

Building is just as easy. Select the dimensions you care about, and LL Notebook takes care of querying BigQuery. When the button on the lower right turns green, click it and you’re good to go!

As anyone who’s ever done data analysis knows, it’s not uncommon to spend the majority of the time preparing the data, and that was certainly the case here. Once the data was imported, I was able to interrogate it with ease in LL Notebook. Follow us for the third and final part of this series, where we’ll extract more insights from the NYC Taxi dataset.


Published by David Lin, Founder at LiquidLandscape
