This blog is posted by WeCloudData’s Data Science Bootcamp student Ryan Kang.
Like Amazon Web Services (AWS), Google Cloud is a popular cloud platform used by data analytics companies. Google Cloud allows continuous automation of workflows and big data computation. In this blog, I will briefly introduce how I set up Google Cloud for a workflow.
Each new Google Cloud account includes a $300 free trial credit, and creating a Google Cloud account in the console is as easy as creating a Gmail account. So one should consider signing up now to take advantage of this great tool!
The Google Cloud Console has a variety of interesting features, such as Compute Engine virtual machines (VMs), App Engine, storage, etc. There are, however, three dominant components I will be expanding on. As shown in the diagram below, the following components are used for a workflow:
- Google Cloud VM, the remote server; it functions like a regular computer
- Google Cloud Storage, the remote hard disk
- Google Cloud SQL, the remote SQL server
Key points for the VM:
- Better configurations cost more; a VM with more vCPUs and memory is more expensive, so it is important not to over-provision
- Enabling API access for the VM (the “Allow full access to all Cloud APIs” option) is very important for data processing
- The connection between the local machine and the VM is SSH (Secure Shell), which is conducted via the Google Cloud SDK; through the command line we log into the VM
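As a minimal sketch, assuming the Cloud SDK is installed locally, and using placeholder names for the project (`my-project-id`), the VM (`my-workflow-vm`), and the zone:

```shell
# Authenticate the local gcloud CLI (opens a browser window)
gcloud auth login

# Point the CLI at the project that owns the VM
gcloud config set project my-project-id

# Open an SSH session to the VM; gcloud manages the SSH keys for you
gcloud compute ssh my-workflow-vm --zone us-central1-a
```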
Then install environments just as we do when working on a local server. Notice the VM’s operating system is Linux, which makes installation straightforward (for instance, installing Python packages with pip). After the environment setup, upload files from the local machine to Cloud Storage with the “gsutil” command-line tool and pull them down on the VM. We can now run simple workflows.
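For example, on a Debian-based VM the setup and file transfer might look like this (the bucket name, package names, and file names are placeholders):

```shell
# On the VM: install Python and pip, then the packages the workflow needs
sudo apt-get update
sudo apt-get install -y python3 python3-pip
pip3 install pandas requests

# On the local machine: stage files in a Cloud Storage bucket...
gsutil mb gs://my-workflow-bucket          # bucket names must be globally unique
gsutil cp workflow.py gs://my-workflow-bucket/

# ...then, back on the VM, pull them down
gsutil cp gs://my-workflow-bucket/workflow.py .
```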
When the workflow involves SQL or big data, we need a Google Cloud SQL instance to do the job.
Key points for Cloud SQL:
- Remember to select a Second Generation SQL instance
- Also, because Cloud Storage bucket names are globally unique (and Cloud SQL instance names must be unique within the project), make sure the names you choose are not already taken
- After setting up Google Cloud SQL, connect to it as follows:
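A minimal sketch of connecting from the command line, assuming a MySQL instance with the placeholder name `my-sql-instance`:

```shell
# Connect via the gcloud CLI; it temporarily whitelists your IP
# and opens a mysql client session as the given user
gcloud sql connect my-sql-instance --user=root
```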
The workflow is orchestrated using the Python package Airflow, which was originally developed by Airbnb’s data engineers (and is now Apache Airflow). Orchestration includes data collection, database computation, documentation, and storage. Airflow has a web UI which allows for easy scheduling and checking.
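A minimal sketch of getting Airflow running on the VM (command names follow the modern Apache Airflow releases; older Airbnb-era versions used the `airflow` package name and `airflow initdb`):

```shell
# Install Airflow and initialize its metadata database
pip3 install apache-airflow
airflow db init                 # "airflow initdb" in older versions

# Start the scheduler (runs DAGs on their schedule) and the web UI
airflow scheduler &
airflow webserver -p 8080       # UI is then served on port 8080
```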
Finally, we need data visualization. Visualization runs on your local computer, which connects to Google Cloud SQL. Visualization tools include Tableau, D3.js, and Python-based tools like Superset. The visualization above demonstrates a real-time view of Twitter activity.
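One common way to let a local visualization tool reach Cloud SQL is the Cloud SQL Proxy; a sketch, with the instance connection name as a placeholder:

```shell
# Run the Cloud SQL Proxy locally; Tableau or other tools can then
# connect to 127.0.0.1:3306 as if the database were on this machine
./cloud_sql_proxy -instances=my-project:us-central1:my-sql-instance=tcp:3306
```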
You should now have a good idea of the basics of how a workflow runs on Google Cloud. If you have any questions, please visit WeCloudData for more information or comment below.
To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, click here to see our upcoming course schedule.