Building Superset Dashboard and Pipeline using Apache Airflow and Google Cloud SQL

October 28, 2019

The blog is posted by WeCloudData’s Data Science Bootcamp student Ryan Kang

Like Amazon Web Services (AWS), Google Cloud is a popular cloud platform among data analytics companies. It supports continuously automated workflows and big data computation. In this blog, I will briefly introduce how I set up Google Cloud for a workflow.

Each new Google Cloud account includes a $300 free trial credit, and creating an account in the console is as easy as creating a Gmail account. Consider signing up to take advantage of this great tool!

The Google Cloud Console offers a variety of interesting features, such as Compute Engine virtual machines (VMs), App Engine, and Cloud Storage. There are, however, three main components I will expand on. As shown in the graph below, the following components are used for a workflow:

  • Google Cloud VM, the remote server, which functions like a regular computer
  • Google Cloud Storage, the remote hard disk
  • Google Cloud SQL, the remote SQL server

Keys for VM:

  1. Better CPU configurations cost more, so it is important not to set the VM's configuration higher than the workload needs
  2. Setting the VM to allow access to the Cloud APIs is very important for data processing
  3. The connection between the local machine and the VM is SSH (Secure Shell), which is set up through the Google Cloud SDK; we get into the VM through the command line

Next, install the environment just as you would on a local server. The VM runs Linux, which makes installation straightforward (for instance, installing Python packages with pip). After the environment is set up, upload files from your local machine using the “gsutil” command-line tool. We can now run simple workflows.
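The SSH and upload steps can be sketched in Python as follows. This is a minimal sketch, not the exact commands used for this project: it only builds the `gcloud compute ssh` and `gsutil cp` command lines and prints them. The instance name, zone, bucket, and file name are hypothetical placeholders. Note that `gsutil cp` copies to a Cloud Storage bucket, which the VM can then read from.

```python
import shlex


def ssh_command(instance, zone):
    """Return the gcloud command that opens an SSH session on a VM."""
    return ["gcloud", "compute", "ssh", instance, "--zone", zone]


def upload_command(local_path, bucket, dest_name):
    """Return the gsutil command that copies a local file to a bucket."""
    return ["gsutil", "cp", local_path, f"gs://{bucket}/{dest_name}"]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    commands = (
        ssh_command("my-vm", "us-central1-a"),
        upload_command("pipeline.py", "my-unique-bucket", "pipeline.py"),
    )
    for cmd in commands:
        # Print the command; to actually run it (with the Google Cloud SDK
        # installed), use subprocess.run(cmd, check=True) instead.
        print(shlex.join(cmd))
```

Building each command as a list of arguments rather than one string avoids shell quoting problems if a file name contains spaces.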

When the workflow involves SQL or big data, we need a Google Cloud SQL instance to do the job.

Keys for SQL:

  1. Remember to select a Second Generation SQL instance
  2. Also, instance and bucket names must be unique (Cloud Storage bucket names are global across all of Google Cloud), so choose distinctive names
  3. After setting up Google Cloud SQL, connect to Google Cloud SQL as such:
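A connection to Cloud SQL usually comes down to a host, a user, a password, and a database name. The sketch below shows one way to assemble these into a database URL. It assumes a MySQL instance reached over its IP address with the PyMySQL driver; the original setup may differ. The user, password, IP address, and database name are hypothetical.

```python
from urllib.parse import quote


def cloud_sql_url(user, password, host, database, port=3306,
                  driver="mysql+pymysql"):
    """Build a database URL of the form driver://user:password@host:port/db.

    The user and password are percent-encoded so that characters such as
    '@' or '/' cannot be mistaken for URL separators.
    """
    safe_user = quote(user, safe="")
    safe_password = quote(password, safe="")
    return f"{driver}://{safe_user}:{safe_password}@{host}:{port}/{database}"


if __name__ == "__main__":
    # Hypothetical credentials and address, for illustration only.
    print(cloud_sql_url("ryan", "p@ss word", "203.0.113.10", "tweets"))
```

A URL in this form can be passed to a library such as SQLAlchemy (`sqlalchemy.create_engine(url)`), provided the matching driver is installed and the instance accepts connections from your IP address.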

The workflow is orchestrated with the Python package Apache Airflow, which was originally developed by data engineers at Airbnb. Orchestration covers data collection, database computation, documentation, and storage. Airflow has a UI that allows for easy scheduling and monitoring.
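The core idea behind orchestration is that each task runs only after the tasks it depends on have finished. The sketch below illustrates that idea with Python's standard-library `graphlib` (Python 3.9+); it is not Airflow code. The task names and the assumption that the four stages run strictly one after another are mine.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# Hypothetical task names, assuming the four stages form a simple chain.
DEPENDENCIES = {
    "collect_data": set(),
    "compute_in_database": {"collect_data"},
    "document": {"compute_in_database"},
    "store": {"document"},
}


def run_order(dependencies):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, task in enumerate(run_order(DEPENDENCIES), start=1):
        print(step, task)
```

In Airflow, the same dependencies would be declared between operators inside a DAG, and the scheduler would handle timing, retries, and the status view in the UI.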

Finally, we need data visualization. The visualization runs on your local computer, which connects to Google Cloud SQL. Visualization tools include Tableau, D3.js, and Python packages like Superset. The table above demonstrates a real-time visualization of Twitter activity.

You should now have a good idea of the basics of how a workflow runs in Google Cloud. If you have any questions, please visit WeCloudData for more information or comment below.
