
Analyzing Kinesis Data Streams of Tweets Using Kinesis Data Analytics

June 23, 2020

This blog post is by WeCloudData student Amany Abdelhalim.

In this article, I show how to collect tweets into a Kinesis data stream and then analyze them using Kinesis Data Analytics.

The steps that I followed:

1. Create a Kinesis data stream.

I created a Kinesis data stream with one shard and called it “twitter”.
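The article does not say how the stream was created. As one possible programmatic route, the sketch below assumes boto3 and a Kinesis client passed in by the caller; the helper name `ensure_stream` is hypothetical, and pagination of `list_streams` is ignored for brevity.

```python
def ensure_stream(kinesis_client, name="twitter", shard_count=1):
    """Create the stream unless it already exists.

    Returns True if a create request was sent, False if the stream was found.
    `kinesis_client` is expected to behave like boto3.client("kinesis").
    """
    # Only the first page of stream names is checked (assumption: few streams).
    existing = kinesis_client.list_streams()["StreamNames"]
    if name in existing:
        return False
    kinesis_client.create_stream(StreamName=name, ShardCount=shard_count)
    return True
```

With boto3 installed and AWS credentials configured, this could be called as `ensure_stream(boto3.client("kinesis"))`. Stream creation is asynchronous, so the stream may take a short while to become active.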

2. Prepare the script that collects the tweets and writes them into the Kinesis data stream.

I prepared the following Python script, which selects 11 attributes from each tweet and writes them into the “twitter” Kinesis data stream created in the first step. I ran the script from my local machine, but you can also run it on an EC2 instance, using nohup so that it keeps running in the background even after the SSH session is disconnected.

Python Script

In the above script I am hard-coding my Twitter credentials, which is not recommended. Safer options include reading them from environment variables or passing them as arguments to the script.
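As a rough illustration of the environment-variable option, here is a minimal sketch, not the original script. The environment variable names, the `FIELDS` tuple (a stand-in for the 11 attributes, which are not listed in the text), and the helper names are all assumptions; the Kinesis client is expected to behave like `boto3.client("kinesis")`.

```python
import json
import os

# Hypothetical subset of tweet fields; the article keeps 11 attributes
# but does not list them here.
FIELDS = ("id", "created_at", "text", "lang")

# Assumed variable names; use whatever names you export in your shell.
CREDENTIAL_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(env=os.environ):
    """Read credentials from the environment instead of hard-coding them."""
    missing = [name for name in CREDENTIAL_VARS if name not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in CREDENTIAL_VARS}


def to_record(tweet):
    """Keep only the selected fields and serialize them as one JSON line."""
    slim = {key: tweet.get(key) for key in FIELDS}
    return (json.dumps(slim) + "\n").encode("utf-8")


def send_tweet(kinesis_client, tweet, stream_name="twitter"):
    """Write one tweet to the stream, using the tweet id as partition key."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=to_record(tweet),
        PartitionKey=str(tweet.get("id", "0")),
    )
```

With a single shard the partition key does not affect routing, but `put_record` still requires one.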

3. Create an application in Kinesis Data Analytics that will be used to analyze the data in the Kinesis data stream.

I created an application in Kinesis Data Analytics and called it “twitter_analysis”. I chose to process the data using SQL, which is the default option, and then clicked “Create application”.

After the application was created, I clicked “Connect streaming data” to choose the source of the data stream. An application can have only one streaming data source.

 

There are two options: you can choose an existing source that you created earlier, or you can configure a new stream.

The default is “choose a source”, so I selected the “twitter” Kinesis data stream that I created earlier.

I hit the “Discover schema” button.

The schema was successfully discovered as shown below.

The in-application name of the “twitter” Kinesis data stream, which I have to use in the SQL editor, is “SOURCE_SQL_STREAM_001”, as shown below.

I clicked on the “Go to SQL editor” button.

I was prompted with a message asking me to start running the Kinesis Data Analytics application “twitter_analysis” that I created. I chose the option “Yes, start application”.

The following shows a sample of the streaming data coming from the source Kinesis data stream “twitter”, which is referred to as “SOURCE_SQL_STREAM_001” inside the application.

Twitter Data Stream

The first tab, “Save and run SQL”, lets you write SQL statements and run them on the streaming source data.

SQL Editor

The following window opens when you select the tab “Add SQL from templates”. It shows ready-made templates for performing analysis on the stream data, such as anomaly detection.

SQL Templates

Below, I show three examples of SQL statements that I wrote in the SQL editor; in each case I clicked “Save and run SQL” to display the results. These examples only illustrate how to write SQL in the SQL editor and view the results. Much more useful queries can be run on the streaming data after cleaning it.

Example 1:

In the following example, I select only the tweet column.

As you can see, I first created a stream to hold the desired output, which I called “TEMP_STREAM”.

Then I created a pump with the keywords “CREATE OR REPLACE PUMP” to insert into the output stream “TEMP_STREAM” the values of the tweet column selected from the source stream “SOURCE_SQL_STREAM_001”.
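For readers who cannot see the screenshot, the two statements described above might look roughly like the following. This is a sketch rather than the original query: the column name `"tweet"`, its `VARCHAR(512)` size, and the pump name `"STREAM_PUMP"` are assumptions. The SQL is kept in a Python string so it can be printed and pasted into the SQL editor.

```python
# Application SQL for Example 1, held as a plain string.
# Assumptions: column "tweet", size VARCHAR(512), pump name "STREAM_PUMP".
EXAMPLE_1_SQL = """
CREATE OR REPLACE STREAM "TEMP_STREAM" ("tweet" VARCHAR(512));

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "TEMP_STREAM"
    SELECT STREAM "tweet"
    FROM "SOURCE_SQL_STREAM_001";
""".strip()

print(EXAMPLE_1_SQL)
```

The first statement declares the in-application output stream; the pump then continuously copies the selected column from the source stream into it.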

TEMP_STREAM

 

The following shows the output of the “TEMP_STREAM”.

 

Example 2:

In the following example, I select only the tweets that contain the word “trump”.

As you can see, I first created a stream to hold the desired output, which I called “TEMP_STREAM”.

Then I created a pump with the keywords “CREATE OR REPLACE PUMP” to insert into the output stream “TEMP_STREAM” the values of the tweet column selected from the source stream “SOURCE_SQL_STREAM_001” that contain the word “trump”.
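A keyword filter of this kind could be generated as in the sketch below. It is an illustration under assumptions, not the query from the screenshot: the helper name `keyword_filter_sql`, the column name `"tweet"`, the pump name, and the choice of a case-insensitive match through `LOWER` are all mine, and `%` or `_` characters inside the keyword are not escaped.

```python
def keyword_filter_sql(keyword, column="tweet", source="SOURCE_SQL_STREAM_001"):
    """Return application SQL that forwards only rows containing `keyword`."""
    # Double any single quote so the keyword stays inside the SQL literal.
    safe = keyword.lower().replace("'", "''")
    return (
        f'CREATE OR REPLACE STREAM "TEMP_STREAM" ("{column}" VARCHAR(512));\n'
        f'CREATE OR REPLACE PUMP "STREAM_PUMP" AS\n'
        f'    INSERT INTO "TEMP_STREAM"\n'
        f'    SELECT STREAM "{column}"\n'
        f'    FROM "{source}"\n'
        f"    WHERE LOWER(\"{column}\") LIKE '%{safe}%';"
    )


print(keyword_filter_sql("trump"))
```

Compared with Example 1, the only change is the `WHERE` clause on the pump's `SELECT`.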

TEMP_STREAM

The following shows the output of the “TEMP_STREAM”.

 

Example 3:

In the following example, I select only the tweets that have a negative sentiment.

As you can see, I first created a stream to hold the desired output, which I called “TEMP_STREAM”.

Then I created a pump with the keywords “CREATE OR REPLACE PUMP” to insert into the output stream “TEMP_STREAM” the values of the tweet column selected from the source stream “SOURCE_SQL_STREAM_001” that have a negative sentiment.

The following shows the output of “TEMP_STREAM”, which is updated every 2 to 10 seconds if new results are available. As you can see, new tweets were added over time as tweets with negative sentiment arrived in the source stream.

Note that the in-application streams such as the “TEMP_STREAM” above can be connected to a Kinesis stream, or to a Firehose delivery stream, to continuously deliver SQL results to AWS destinations.

Each application can have at most three destinations. You can either select an existing destination or create a new one.

You can also choose the output format, either JSON or CSV.
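The same destination settings can also be expressed through the API. The sketch below assumes boto3's `kinesisanalytics` client and a Firehose delivery stream as the destination; the helper names are hypothetical, and the ARNs and version id must come from your own account.

```python
MAX_DESTINATIONS = 3  # per-application limit stated in the article


def build_output(stream_name, firehose_arn, role_arn, record_format="JSON"):
    """Describe one destination for an in-application stream."""
    if record_format not in ("JSON", "CSV"):
        raise ValueError("record_format must be 'JSON' or 'CSV'")
    return {
        "Name": stream_name,  # in-application stream, e.g. "TEMP_STREAM"
        "KinesisFirehoseOutput": {
            "ResourceARN": firehose_arn,
            "RoleARN": role_arn,
        },
        "DestinationSchema": {"RecordFormatType": record_format},
    }


def add_output(analytics_client, app_name, version_id, output, existing_count=0):
    """Attach the destination unless the application is already at the limit.

    `analytics_client` is expected to behave like
    boto3.client("kinesisanalytics").
    """
    if existing_count >= MAX_DESTINATIONS:
        raise RuntimeError("application already has the maximum of 3 destinations")
    return analytics_client.add_application_output(
        ApplicationName=app_name,
        CurrentApplicationVersionId=version_id,
        Output=output,
    )
```

The current application version id can be read from the application's description before making the call.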

Note that if you choose Kinesis Data Firehose as your destination, you can write the results to Redshift and display them on a Superset dashboard.

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, click here to see the learning path. To read more posts from Amany, check out her Medium posts here.
