
Building a Superset Dashboard and Pipeline using Apache Airflow and Google Cloud SQL

October 28, 2019

This blog was posted by WeCloudData’s Data Science Bootcamp student Ryan Kang.

Like Amazon AWS, Google Cloud is a popular cloud platform used by data analytics companies. It supports continuous workflow automation and big-data computation. In this blog, I will briefly introduce how I set up a workflow on Google Cloud.

Each Google Cloud account includes a $300 free trial credit, and creating a Google Cloud account in the console is as easy as creating a Gmail account, so consider signing up now to take advantage of this great tool!

The Google Cloud Console has a variety of interesting features such as Compute Engine virtual machines (VMs), App Engine, storage, and more. There are, however, three dominant components I will be expanding on. The following components are used for a workflow:

  • Google Cloud VM, the remote server, which functions like a regular computer
  • Google Cloud Storage, the remote hard disk
  • Google Cloud SQL, the remote SQL server

Key points for the VM:

  1. Better configurations cost more: the more CPU and memory you give your virtual machine, the more it costs, so it is important not to set it higher than you need
  2. Setting the VM to allow access to the Cloud APIs (its access scopes) is very important for data processing
  3. The connection between your local machine and the VM is SSH (Secure Shell), which is conducted via the Google Cloud SDK; running “gcloud compute ssh <instance-name>” from the command line gets us into the VM

Then install the environment as we would on a local server. Note that the VM’s operating system is Linux, which makes installation straightforward (for instance, installing Python packages with pip). After the environment is set up, upload files from your local machine to Cloud Storage using the “gsutil” command line tool (for example, “gsutil cp <local-file> gs://<bucket-name>/”), and pull them down onto the VM the same way. We can now run simple workflows.
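If you prefer Python to the gsutil CLI, the google-cloud-storage client library can perform the same upload. Below is a minimal sketch; the bucket and file names are placeholders, not values from this project:

    # Upload a local file to a Cloud Storage bucket using the
    # google-cloud-storage client (pip install google-cloud-storage).
    # The bucket and file names below are placeholders.
    from google.cloud import storage

    client = storage.Client()                  # uses credentials from the environment
    bucket = client.bucket("my-wcd-bucket")    # placeholder bucket name
    blob = bucket.blob("data/tweets.csv")      # destination path inside the bucket
    blob.upload_from_filename("tweets.csv")    # local file to upload
    print(f"Uploaded to gs://{bucket.name}/{blob.name}")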

When the workflow involves SQL or big data, we need a Google Cloud SQL instance to do the job.

Key points for Cloud SQL:

  1. Remember to select a second-generation SQL instance
  2. Because Cloud SQL instance (and storage bucket) names are global, make sure yours are unique
  3. After setting up Google Cloud SQL, connect to it as shown in the sketch below:
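The original connection snippet did not survive the post, so here is a minimal Python sketch using the pymysql driver; the IP address, credentials, and database name are placeholders, and it assumes your local IP is on the instance’s authorized networks list:

    # Connect to a Cloud SQL (MySQL) instance over its public IP
    # (pip install pymysql). All connection values are placeholders.
    import pymysql

    conn = pymysql.connect(
        host="34.123.45.67",      # the instance's public IP (placeholder)
        user="root",
        password="my-password",
        database="twitter_db",
    )
    with conn.cursor() as cur:
        cur.execute("SELECT NOW()")   # simple sanity check
        print(cur.fetchone())
    conn.close()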

The workflow is orchestrated using Apache Airflow, a Python package originally developed by Airbnb’s data engineers. Orchestration includes data collection, database computation, documentation, and storage. Airflow also has a UI that makes scheduling and monitoring jobs easy.
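As an illustration, a DAG for this kind of pipeline might look like the following minimal sketch (assuming Airflow 2.x; the task names, callables, and schedule are placeholders, not the project’s actual code):

    # A minimal Airflow DAG: collect data, then load it into Cloud SQL.
    # The callables, names, and schedule are illustrative placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def collect_tweets():
        print("collecting tweets...")        # stand-in for the real collection step

    def load_to_cloud_sql():
        print("loading into Cloud SQL...")   # stand-in for the real load step

    with DAG(
        dag_id="twitter_pipeline",
        start_date=datetime(2019, 10, 1),
        schedule_interval="@hourly",         # run once an hour
        catchup=False,
    ) as dag:
        collect = PythonOperator(task_id="collect_tweets", python_callable=collect_tweets)
        load = PythonOperator(task_id="load_to_cloud_sql", python_callable=load_to_cloud_sql)
        collect >> load                      # load runs only after collection succeeds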

Finally, we need data visualization. Visualization runs on your local computer, which connects to Google Cloud SQL. Visualization tools include Tableau, d3.js, and Python-based tools like Apache Superset. In this project, the Superset dashboard showed a real-time visualization of Twitter activity.
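For a quick local sanity check before building the dashboard, you can pull the data into pandas; here is a minimal sketch reusing the placeholder credentials from the connection example above (the table and column names are also placeholders):

    # Query Cloud SQL from the local machine and plot the result
    # (pip install pandas sqlalchemy pymysql matplotlib).
    # The URI, table, and column names are placeholders.
    import pandas as pd
    import matplotlib.pyplot as plt
    from sqlalchemy import create_engine

    engine = create_engine("mysql+pymysql://root:my-password@34.123.45.67/twitter_db")

    df = pd.read_sql(
        "SELECT DATE(created_at) AS day, COUNT(*) AS n FROM tweets GROUP BY day",
        engine,
    )
    df.plot(x="day", y="n")   # quick line chart of tweet volume over time
    plt.show()

The same “mysql+pymysql://…” SQLAlchemy URI is what you register as a database connection in Superset when building the dashboard itself.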

You should now have a good idea of the basics of how a workflow runs on Google Cloud. If you have any questions, please visit WeCloudData for more information or comment below.

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, click here to see our upcoming course schedule.
