
Building Superset Dashboard and Pipeline using Apache Airflow and Google Cloud SQL

October 28, 2019

This blog post was written by WeCloudData's Data Science Bootcamp student Ryan Kang.

Like Amazon Web Services (AWS), Google Cloud is a popular cloud platform used by data analytics companies. It supports continuously automated workflows and big-data computation. In this blog, I will briefly introduce how I set up Google Cloud for a data workflow.

Each new Google Cloud account includes a $300 free trial credit, and creating a Google Cloud account in the console is as easy as creating a Gmail account, so consider signing up now to take advantage of this great tool!

The Google Cloud Console offers a variety of interesting features, such as Compute Engine virtual machines (VMs), App Engine, storage, and more. There are, however, three dominant components I will be expanding on. The following three components make up the workflow:

  • Google Cloud VM, the remote server; it functions like a regular computer
  • Google Cloud Storage, the remote hard disk
  • Google Cloud SQL, the remote SQL server

Key points for the VM:

  1. Better CPU configurations cost more: the more powerful the VM you build on Google Cloud, the more it costs, so be careful not to over-provision
  2. Setting the VM to allow access to the Cloud APIs is very important for data processing
  3. The connection between your local machine and the VM is SSH (Secure Shell), which is conducted via the Google Cloud SDK; through the command line we log in to the VM (see the sketch after this list)
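As a rough sketch of these first steps, the commands might look like the following (the VM name, zone, and machine type are placeholder assumptions, not values from the original project):

    # Hypothetical example: create a modest VM; the machine type drives the cost
    gcloud compute instances create my-workflow-vm \
        --zone=us-central1-a \
        --machine-type=n1-standard-1 \
        --scopes=cloud-platform   # grant the VM access to the Cloud APIs

    # SSH into the VM through the Google Cloud SDK
    gcloud compute ssh my-workflow-vm --zone=us-central1-a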

Then, install your environment on the VM just as you would on a local server. Notice that the VM runs Linux, which makes installation straightforward (for instance, installing Python packages with pip). After the environment is set up, upload files from your local machine using the "gsutil" command-line tool, which moves them through Cloud Storage. We can now run simple workflows.
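For example, a minimal sketch of moving a file to the VM through Cloud Storage (the bucket and file names are placeholders):

    # Make a bucket (the name must be globally unique), then upload into it
    gsutil mb gs://my-workflow-bucket
    gsutil cp data.csv gs://my-workflow-bucket/

    # From the VM's shell, pull the file down
    gsutil cp gs://my-workflow-bucket/data.csv .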

When the workflow involves SQL or big data, we need a Google Cloud SQL instance to do the job.

Key points for Cloud SQL:

  1. Remember to select a Second Generation Cloud SQL instance
  2. Also, because the names of Cloud Storage buckets (and some other resources) are global, make sure the names you choose are unique
  3. After setting up Google Cloud SQL, connect to the instance; a minimal sketch follows this list
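Here is a minimal sketch of connecting from Python, assuming a MySQL-flavored Cloud SQL instance and the pymysql package; the IP address, credentials, and database name are placeholders, and your client's IP must be whitelisted in the instance's authorized networks:

    # Minimal sketch: connect to a Cloud SQL (MySQL) instance with pymysql.
    # Host, user, password, and database are placeholders -- use your own.
    import pymysql

    connection = pymysql.connect(
        host="35.123.45.67",       # public IP of the Cloud SQL instance
        user="root",
        password="your-password",
        database="tweets",         # hypothetical database name
    )

    with connection.cursor() as cursor:
        cursor.execute("SELECT NOW()")  # sanity check: ask the server for the time
        print(cursor.fetchone())

    connection.close()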

The workflow is orchestrated using Apache Airflow, a Python package originally developed by Airbnb's data engineers. Orchestration covers data collection, database computation, documentation, and storage. Airflow also has a web UI that makes scheduling and monitoring easy.
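To give a flavor of what this looks like, here is a minimal Airflow DAG sketch, assuming the Airflow 1.x API that was current in 2019; the DAG name, schedule, and task functions are hypothetical placeholders for the real pipeline steps:

    # Minimal sketch of an Airflow 1.x DAG with two chained tasks.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator


    def collect_tweets():
        """Placeholder: pull raw data from the source API."""
        pass


    def load_to_sql():
        """Placeholder: write processed data into Cloud SQL."""
        pass


    default_args = {
        "owner": "airflow",
        "start_date": datetime(2019, 10, 1),
        "retries": 1,
        "retry_delay": timedelta(minutes=5),
    }

    dag = DAG(
        "twitter_pipeline",           # hypothetical DAG name
        default_args=default_args,
        schedule_interval="@hourly",  # run the pipeline every hour
    )

    collect = PythonOperator(
        task_id="collect_tweets",
        python_callable=collect_tweets,
        dag=dag,
    )

    load = PythonOperator(
        task_id="load_to_sql",
        python_callable=load_to_sql,
        dag=dag,
    )

    collect >> load  # collect first, then load into the database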

Finally, we need data visualization. Visualization runs on your local computer, which connects to Google Cloud SQL. Visualization tools include Tableau, d3.js, and Python-based tools like Apache Superset. The dashboard built for this project demonstrates a real-time visualization of Twitter activity.
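For instance, Superset connects to a database through a SQLAlchemy URI; for a MySQL-flavored Cloud SQL instance it would look something like the following (the credentials, IP, and database name are the same placeholders as above):

    mysql+pymysql://root:your-password@35.123.45.67:3306/tweets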

You should now have a good idea of the basics of how workflows run in Google Cloud. If you have any questions, please visit WeCloudData for more information or comment below.

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, click here to see our upcoming course schedule.
