This blog series is posted by WeCloudData's Data Science Immersive Bootcamp student Bob Huang (LinkedIn).
OVERVIEW:
The digital marketing project gives you the ability to manage and analyze your marketing data from different platforms such
as Google Analytics, Gmail, Eventbrite, and Google Ads. You can filter your emails by sent status, campaign, and
type to easily create and edit your email content. You can visualize summaries of public marketing event data and analyze
conversion rates. You can also create customized dashboards from the acquired data for your own purposes.
Part 1 of this blog will mainly focus on the tools and data pipeline infrastructure.
FEATURE SUMMARY:
The prospective scope of this project is close to Software as a Service (SaaS). According to Wikipedia, software as a service
is a software licensing and delivery model in which software is licensed on a subscription basis and is centrally hosted.
It is sometimes referred to as “on-demand software”, and was formerly referred to as “software plus services” by Microsoft.
The project features the following aspects:
- Low cost: Monthly GCP service cost starts from $40.
- Customizable: We select the useful data during ingestion, choose the storage method, customize the dashboard layout, run small machine learning projects on the data, merge the marketing data with other data, etc.
- Secure: With Kubernetes clustering, no authentication keys are exposed to the public.
- Data ownership: You own all the data that we retrieve through the APIs of the various social platforms. If you don't want scheduled data updates, we can provide a program that queries all the historical data once, so you can analyze it with a traditional tool such as Excel.
- Open source: All programs, services, and applications are open-source products without licensing cost. If you have a strong technical team, you can maintain the services yourself once they are deployed, without our support.
- Easy to use: Superset is easy to learn, and non-technical people can use it to build customized dashboards.
- Guidance: We have good recommendations on how you can use your data to build meaningful visualizations and perform useful statistical or machine learning analysis. Details will follow in the second part of this blog post.
- Big data: All the components this project uses are scalable. For example, Kubernetes performs scaling and load balancing automatically.
PROCEDURES:
- Collect data from different sources: Build a Docker container that hosts Apache Airflow with various DAGs that gather emails, event registrations, and other information from different sources and store them in Google BigQuery.
- Visualize with Apache Superset: Build a Docker container that hosts Apache Superset, connect Superset to BigQuery, and create dashboards to display the data.
- Host the Docker applications on Google Cloud: Create a Google Compute Engine instance with Kubernetes that hosts the multiple Docker containers serving the entire project (the build-and-deploy commands are sketched after this list). The advantages of Kubernetes include auto-scaling, application isolation, and good security.
- Extensions: Consider creating more Docker containers for applications such as a machine learning model based on the email data, a Dash-Plotly application, etc.
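As a rough sketch of the build-and-deploy workflow just described, the commands below build an image locally, push it to Google Container Registry, and deploy it to the cluster. The project ID, image name, and manifest file name are hypothetical placeholders, not the project's actual values:

```bash
# Build the Airflow image locally (project ID and tag are hypothetical)
docker build -t gcr.io/my-gcp-project/airflow:latest .

# Push the image to Google Container Registry
docker push gcr.io/my-gcp-project/airflow:latest

# Deploy it to the Kubernetes cluster (manifest file name is hypothetical)
kubectl apply -f airflow-deployment.yaml
```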
FLOWCHART:
(Pipeline overview: data sources such as Gmail, Eventbrite, Google Analytics, and Google Ads feed Airflow DAGs, which load the data into BigQuery; Superset reads from BigQuery to serve dashboards, and every application runs as a Docker container on a Kubernetes cluster in GCP.)
PROJECT COMPONENTS:
- Google Cloud Platform – Project monitoring (https://cloud.google.com/): Google Cloud Platform, offered by Google, is a suite of cloud computing services that run on the same infrastructure that Google uses internally for its end-user products. Alongside a set of management tools, it provides a series of modular cloud services including computing, data storage, data analytics, and machine learning.
- Apache Superset – Front end (https://superset.incubator.apache.org/): Superset is an easy-to-use data visualization tool with fantastic templates. Non-technical people can quickly learn it and create customizable dashboards for business purposes. It supports various database connections and has security modules. Superset recently added support for BigQuery connections. To containerize Superset, we refer to this GitHub example (https://github.com/amancevice/superset/blob/master/Dockerfile). For Kubernetes deployment, see the sketch after this list. To set up the Superset-BigQuery connection, create tables … and follow the official Superset documentation.
- BigQuery – Back end (https://cloud.google.com/bigquery/): Follow Google's official instructions to create datasets (the equivalent of a database) in BigQuery, and generate keys with different permissions to read or write data (a short client example appears after this list). For the back end, we could also use Cloud SQL, MySQL, Redshift, MongoDB, PostgreSQL, etc., and will adjust according to customers' needs.
- Apache Airflow – Automation (https://airflow.apache.org/): Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative. We build a Docker image that hosts Airflow and write different DAGs that gather data from the different sources using their credentials (one source per DAG), plus DAGs that delete obsolete data. To build the image, we mainly refer to this Medium article (https://medium.com/@shahnewazk/dockerizing-airflow-58a8888bd72d), which uses supervisor, and follow the documents in the accompanying GitHub repository. Each DAG gets data from one source and stores it in multiple tables whose schemas are predetermined by inspecting the data. In the Python code, we parse the full query response and store the fields into the appropriate columns of the table. DAG properties such as retry count, failure email, and run frequency are specified in the .py script, and all DAGs are stored in AIRFLOW_HOME/dags. Since all the authentication files, passwords, and tokens live inside this Docker container, we cannot expose it to the public; Kubernetes provides a ClusterIP service type that keeps the Airflow container internal. A sample DAG script and the ClusterIP sketch both appear after this list.
- Kubernetes – Container hosting (https://kubernetes.io/): Create a Google Compute Engine instance with Kubernetes that hosts the multiple Docker containers serving this project, one application per container. We build the Docker images locally, push them to Google Container Registry, and then deploy them to the Kubernetes cluster.
- Docker – Containerization (https://www.docker.com/): Docker is a computer program that performs operating-system-level virtualization, also known as "containerization". It is an open platform for developers and sysadmins to build, ship, and run distributed applications, whether on laptops, data center VMs, or the cloud. A sample Dockerfile for Apache Airflow follows below.
In the Dockerfile, we set environment variables, copy files, install dependencies, and run the commands needed to build the image.
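A minimal sketch of such a Dockerfile, assuming a plain Python base image, a pip-installed Airflow 1.x, and supervisor as the process manager; the versions, paths, and config file names are illustrative, not the project's exact files:

```dockerfile
# Base image with Python; the exact version is an assumption
FROM python:3.6-slim

# Airflow reads its configuration and DAGs from this directory
ENV AIRFLOW_HOME=/usr/local/airflow

# Python dependencies (supervisor keeps the webserver and
# scheduler running together inside one container)
RUN pip install --no-cache-dir apache-airflow google-cloud-bigquery supervisor

# Copy DAGs and the supervisor config into the image
COPY dags/ ${AIRFLOW_HOME}/dags/
COPY supervisord.conf /etc/supervisord.conf

# Initialize the Airflow metadata database (SQLite by default)
RUN airflow initdb

# Expose the Airflow web UI port
EXPOSE 8080

# supervisor starts and monitors the webserver and scheduler
CMD ["supervisord", "-c", "/etc/supervisord.conf"]
```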
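For the Superset container, a minimal Kubernetes deployment sketch; the image name, replica count, and port wiring are illustrative assumptions, not the project's actual manifest:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: superset
spec:
  replicas: 1
  selector:
    matchLabels:
      app: superset
  template:
    metadata:
      labels:
        app: superset
    spec:
      containers:
      - name: superset
        image: gcr.io/my-gcp-project/superset:latest   # hypothetical GCR image
        ports:
        - containerPort: 8088                          # Superset's default port
---
apiVersion: v1
kind: Service
metadata:
  name: superset
spec:
  type: LoadBalancer        # expose the dashboards to users
  selector:
    app: superset
  ports:
  - port: 80
    targetPort: 8088
```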
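To illustrate the BigQuery back end, a hedged Python sketch using the google-cloud-bigquery client; the key file, dataset, and table names are hypothetical:

```python
from google.cloud import bigquery

# Authenticate with a service-account key that has write permission
# (key path is hypothetical)
client = bigquery.Client.from_service_account_json("bq-writer-key.json")

# A dataset in BigQuery plays the role of a database
client.create_dataset("marketing_data", exists_ok=True)

# Stream a few rows into a table whose schema was created beforehand
table = client.get_table("marketing_data.email_events")
errors = client.insert_rows_json(table, [
    {"campaign": "spring_launch", "status": "sent", "opens": 12},
])
print(errors or "rows inserted")
```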
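The DAG script sample referenced in the Airflow component above; a minimal sketch written in the Airflow 1.x style, assuming a hypothetical REST endpoint, token, and parsing logic (only the overall structure, retries, failure email, and run frequency mirror what the post describes):

```python
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from google.cloud import bigquery

# DAG-level behaviour: retry count, failure email, and start date
default_args = {
    "owner": "marketing",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,
    "email": ["team@example.com"],       # hypothetical address
    "start_date": datetime(2019, 1, 1),
}

def fetch_and_store():
    """Pull one source's data and write it to BigQuery."""
    # Hypothetical API endpoint and token; credentials never
    # leave the (ClusterIP-protected) container
    resp = requests.get(
        "https://api.example.com/v1/events",
        headers={"Authorization": "Bearer MY_TOKEN"},
    )
    resp.raise_for_status()

    # Parse the full response and map fields onto the table's columns
    rows = [
        {"event_id": e["id"], "name": e["name"], "registrations": e["count"]}
        for e in resp.json()["events"]
    ]

    client = bigquery.Client.from_service_account_json("bq-writer-key.json")
    table = client.get_table("marketing_data.event_registrations")
    client.insert_rows_json(table, rows)

# One source per DAG; the run frequency is set here
with DAG("eventbrite_ingest",
         default_args=default_args,
         schedule_interval="@daily") as dag:
    PythonOperator(
        task_id="fetch_and_store",
        python_callable=fetch_and_store,
    )
```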
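Finally, the ClusterIP service that keeps the Airflow container off the public internet; a minimal sketch with illustrative names. With this service type the webserver gets no external IP and is reachable only from inside the cluster (for example via kubectl port-forward):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: airflow
spec:
  type: ClusterIP           # no external IP; internal access only
  selector:
    app: airflow            # must match the Airflow pod's labels
  ports:
  - port: 8080              # Airflow webserver port
    targetPort: 8080
```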
To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, click here to see our upcoming course schedule.