
Data Analysis on Twitter Data Using DynamoDB and Hive

June 26, 2020

This blog was posted by WeCloudData’s student Amany Abdelhalim.

There are two steps that I followed to create this pipeline:

1) Collect Twitter Feeds and Ingest into DynamoDB

2) Copy the Twitter Data from DynamoDB to Hive

First: Collect Twitter Feeds and Ingest into DynamoDB

To create a pipeline that collects tweets on a specific topic and writes them to a DynamoDB table, I launched an EC2 instance and installed the following:

· Python 3

· Python packages (tweepy, boto, awscli, textblob)

I created a table in DynamoDB to write the tweets to.

I prepared a script, “collect_tweets.py”, that collects tweets related to the “trump” topic and writes them to the DynamoDB table. In the script I extract the following fields from each tweet: the id, user name, screen name, tweet text, followers count, geo, created at, the sentiment of each tweet, and its polarity and subjectivity.

I copied the collect_tweets.py script from my local machine to the EC2 instance.

scp -i ~/privateKeyFile.pem collect_tweets.py ec2-user@ec2-xx-xxx-xxx-xxx.compute-1.amazonaws.com:/home/ec2-user

I ran the script on EC2 using nohup to ensure that the script keeps running in the background even after disconnecting the SSH session.

nohup python3 collect_tweets.py 2>&1 &

I used tail to check that the script was working.

tail nohup.out

[Screenshot: tweets printed to the shell]

I checked twitter_table in DynamoDB; it had 2,816 tweet records written to it before I stopped running the script on the EC2 instance.

[Screenshots: number of records in the DynamoDB table and sample records]

Second: Copy the Twitter Data from DynamoDB to Hive

I launched an EMR cluster with the following tools: Hadoop, Hive, Presto, and HBase.

I connected to Hue, created two external tables, and copied the data from the DynamoDB table “twitter_table” to the Hive table “twitter_hive”.

The following is the “twitter_hive” table:

[Screenshot: twitter_hive table]
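A minimal sketch of what this table’s definition could look like (the column names and types are assumptions based on the fields listed earlier, and the storage location is hypothetical):

create external table twitter_hive (
  id string,
  user_name string,
  screen_name string,
  tweet string,
  followers bigint,
  geo string,
  created_at string,
  sentiment string,
  polarity double,
  subjectivity double
)
row format delimited fields terminated by ','
location 's3://your-bucket/twitter_hive/';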

The following is the “twitter_ddb” table, which was used to copy the data from the DynamoDB table “twitter_table” to the Hive table “twitter_hive”.

[Screenshot: twitter_ddb table]
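A sketch of how such a mapping table can be defined with the EMR DynamoDB storage handler (the column names and the column mapping are assumptions based on the fields listed earlier):

create external table twitter_ddb (
  id string,
  user_name string,
  screen_name string,
  tweet string,
  followers bigint,
  geo string,
  created_at string,
  sentiment string,
  polarity double,
  subjectivity double
)
stored by 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
tblproperties (
  "dynamodb.table.name" = "twitter_table",
  "dynamodb.column.mapping" = "id:id,user_name:user_name,screen_name:screen_name,tweet:tweet,followers:followers,geo:geo,created_at:created_at,sentiment:sentiment,polarity:polarity,subjectivity:subjectivity"
);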

I copied the data from the “twitter_ddb” table to the “twitter_hive” table.
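With both tables in place, a single statement along these lines performs the copy:

insert overwrite table twitter_hive select * from twitter_ddb;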

I tested that the data was copied successfully to the Hive table by running a few queries.

In the following query, I am selecting the first 10 records of the twitter_hive table.

Query:

select * from twitter_hive limit 10;

Output: [screenshot]

In the following query, I am selecting the sentiment, polarity, and subjectivity from the first 10 records of the twitter_hive table.

Query:
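A sketch of the query, assuming the column names above:

select sentiment, polarity, subjectivity from twitter_hive limit 10;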

Output: [screenshot]

In the following query, I am calculating the top 10 most popular hashtags in the Twitter dataset.

Query:
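A sketch of one way to write this in Hive: split the tweet text on whitespace and keep the tokens that start with “#” (the tweet column name is an assumption):

select hashtag, count(*) as cnt
from twitter_hive
lateral view explode(split(tweet, ' ')) t as hashtag
where hashtag like '#%'
group by hashtag
order by cnt desc
limit 10;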

Output: [screenshot]

In the following query, I am checking which locations have the highest number of tweets about “trump”.

Query:
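A sketch, assuming the location is stored in the geo column:

select geo, count(*) as num_tweets
from twitter_hive
where geo is not null and geo != ''
group by geo
order by num_tweets desc
limit 10;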

Output: [screenshot]

In the following query, I am checking which locations have the most negative sentiment about “trump”.

Query:
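A sketch that ranks locations by their average polarity, most negative first (the geo and polarity column names are assumptions, and ranking by average polarity is one plausible reading of “most negative sentiment”):

select geo, avg(polarity) as avg_polarity
from twitter_hive
where geo is not null and geo != ''
group by geo
order by avg_polarity asc
limit 10;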

Output: [screenshot]

In the following two queries, I am doing a word count to find a popular keyword in the dataset and then calculating the average polarity of all the tweets that contain that keyword.

First, I did a word count to find a popular keyword in the dataset.

Query:
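A sketch of the word count, splitting the tweet text on whitespace (the tweet column name is an assumption):

select word, count(*) as cnt
from twitter_hive
lateral view explode(split(tweet, ' ')) t as word
group by word
order by cnt desc
limit 20;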

Output: [screenshot]

Second, I chose the word “Trump”, which occurred 1,124 times, to calculate the average polarity of all the tweets that contain “Trump”.

Query:
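A sketch, assuming the tweet text column is named tweet:

select avg(polarity) as avg_polarity
from twitter_hive
where tweet like '%Trump%';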

Output: [screenshot]

Note: the tweets were used as-is, though ideally they should be cleaned before being used for analysis, for example by removing stop words or mentions.

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, click here to see the learning path. To read more posts from Amany, check out her Medium posts here.
