[Student Project] Visualizing New York City Taxi Data
This blog post was written by Yaoyu Cui, an alum of WeCloudData’s Data Science Bootcamp.
Please find the complete dashboard on https://goo.gl/gXGTEw
Tableau has been one of the most popular visualization tools in the data science community. Besides its data preprocessing and programming capabilities, it also provides powerful mapping functionality. For this blog, I was given a task built around one New York taxi company’s pickup data for 2014. The task specifies the use of Python, SQL tools, local weather data, and Tableau. To make it more interesting and to demonstrate Tableau’s mapping functionality, I found a shapefile of New York City (link below).
https://www1.nyc.gov/site/planning/data-maps/open-data/dwn-nynta.page
This is what it looks like in Tableau:
The task data contains four months of pickup records with only three fields: pickup location (latitude and longitude), Date/Time, and Base (see image below). A separate weather file was also provided, with date, temperature, humidity, wind speed, precipitation, etc.
Data Preprocessing:
Before setting the stage, we must ask: what links these data sets, and what useful information can we get out of them? To answer these questions, the data was broken down into more pieces, like so:
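As a rough illustration of this breakdown (a sketch, not the original script), the Date/Time field can be split into separate parts with Python's standard library. The timestamp format `%m/%d/%Y %H:%M:%S` and the output keys other than `week` are assumptions made for the example; adjust them to match the actual file.

```python
from datetime import datetime

# Day names indexed by datetime.weekday() (Monday == 0).
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def split_timestamp(raw, fmt="%m/%d/%Y %H:%M:%S"):
    """Break one Date/Time string into date, month, day of week, and hour.

    The default format is an assumption about the raw file.
    """
    ts = datetime.strptime(raw, fmt)
    return {
        "date": ts.date().isoformat(),
        "month": ts.month,
        "week": WEEKDAYS[ts.weekday()],  # the 'week' column: day of the week
        "hour": ts.hour,
    }


# Hypothetical sample value in the assumed format.
row = split_timestamp("4/1/2014 0:11:00")
print(row)
```

Keeping the parts as separate columns makes it easy to filter or group by day of the week and hour later in Tableau.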
The ‘week’ column represents the day of the week. Note that the raw latitude and longitude were mapped to the name of the matching Neighborhood Tabulation Area (NTA) in the NYC shapefile, for later use. This was done in Python using the GeoPandas package. The shapefile provided by NYC uses an uncommon Coordinate Reference System (CRS), and it took me quite a while to figure out the corresponding CRS code:
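As a rough illustration of this step (a sketch, not the original script), the mapping from coordinates to NTA names can be done with a GeoPandas spatial join. Several things here are assumptions: the column names `Lat`, `Lon`, and `NTAName`, the two toy boxes that stand in for the real NTA polygons, and the use of the `predicate` argument, which needs a reasonably recent GeoPandas. NYC planning shapefiles are commonly distributed in EPSG:2263 (NAD83 / New York Long Island, US feet), but that should be checked against the shapefile's .prj file rather than taken from this example.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box


def assign_nta(pickups, nta, name_col="NTAName"):
    """Label each pickup with the name of the polygon that contains it."""
    points = gpd.GeoDataFrame(
        pickups.copy(),
        geometry=gpd.points_from_xy(pickups["Lon"], pickups["Lat"]),
        crs="EPSG:4326",  # raw GPS latitude/longitude
    )
    # Reproject the polygons to the points' CRS so both layers line up.
    # With the real shapefile this is where a projected CRS
    # (assumed to be EPSG:2263) would be converted.
    nta_wgs84 = nta.to_crs(points.crs)
    joined = gpd.sjoin(
        points, nta_wgs84[[name_col, "geometry"]],
        how="left", predicate="within",
    )
    # Drop the helper columns and return a plain table.
    return pd.DataFrame(
        joined.drop(columns=["geometry", "index_right"], errors="ignore")
    )


# Toy stand-in for the NTA shapefile; in practice the polygons would come
# from gpd.read_file(<path to the downloaded shapefile>).
toy_nta = gpd.GeoDataFrame(
    {"NTAName": ["West box", "East box"]},
    geometry=[box(-74.02, 40.70, -74.00, 40.80),
              box(-74.00, 40.70, -73.98, 40.80)],
    crs="EPSG:4326",
)
pickups = pd.DataFrame({"Lat": [40.75, 40.75, 41.50],
                        "Lon": [-74.01, -73.99, -73.00]})
labelled = assign_nta(pickups, toy_nta)
print(labelled)
```

A left join keeps pickups that fall outside every polygon, with a missing name, so they can be counted or dropped explicitly instead of disappearing silently.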
The three files were then joined in Tableau, and more columns were generated using Tableau functions:
Visualization:
The image below is the final outcome of the dashboard of Manhattan in April:
Note that Tableau provides many powerful interaction options. The dashboard is built from three sheets, and a filter applied on one sheet updates every sheet that uses the same data source. A tooltip with summary information appears on hover. Any neighborhood, day, or hour can act as a filter, and several filters can be active at once (the image below shows the rush hour of a certain day in a certain neighborhood selected):
Data Analysis:
Now let’s talk about the data and what we have found (Tableau provides a data summary at the sheet level, but not on the dashboard):
The data contains 1.8 million pickups over three months; 81% of them are from Manhattan and 18.76% are from Manhattan Midtown South.
From April to June 2014, New York City had seven rainy spells, each lasting about two days. Five of the seven coincided with obvious abnormal pickup peaks in Manhattan. The exceptions were May 10th and June 9th, when pickups showed no increase at all.
Other than the weather, the most influential factor is the day of the week. The lowest pickup counts are always on Mondays, while the peaks are on Fridays and Saturdays. Regarding the hour, a local peak appears during the morning rush hour; by 14:00 the pickups already surpass the morning peak, and by the 17:00 rush hour they are triple the morning peak, at about 3 pickups per minute.
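The analysis itself was done in Tableau, but the day-of-week and hourly patterns could be cross-checked in Python along these lines. This is only a sketch: the timestamp format and the four sample values are made up for illustration and are not taken from the data.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def pickup_profile(timestamps, fmt="%m/%d/%Y %H:%M:%S"):
    """Count pickups per day of the week and per hour of the day."""
    by_weekday = Counter()
    by_hour = Counter()
    for raw in timestamps:
        ts = datetime.strptime(raw, fmt)
        by_weekday[WEEKDAYS[ts.weekday()]] += 1
        by_hour[ts.hour] += 1
    return by_weekday, by_hour


# Hypothetical sample timestamps in the assumed format.
sample = [
    "4/4/2014 17:05:00",
    "4/4/2014 17:45:00",
    "4/5/2014 8:10:00",
    "4/7/2014 8:30:00",
]
by_weekday, by_hour = pickup_profile(sample)
print(by_weekday.most_common())
print(sorted(by_hour.items()))
```

Comparing the largest and smallest counts in each counter is enough to check which days and hours mark the peaks and the lows.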
Conclusion:
Tableau is a convenient tool for data science and analytics tasks, and it works well with SQL databases. Its built-in data preprocessing and programming functions save a considerable amount of editing time. Tableau performs very well with geographic data and visualizations. On top of that, it provides many audience-friendly interaction features.