Preprocessing Criteo Dataset for Prediction of Click Through Rate on Ads

The blog is posted by WeCloudData’s student Amany Abdelhalim. In this post, I will be taking you through the steps that I performed to preprocess the Criteo dataset. Some Aspects to Consider when Preprocessing the Data: The Criteo dataset is an online advertising dataset released by Criteo Labs. It contains feature values and click feedback […]

An Introduction To Spark and Its Behavior.

The blog is posted by WeCloudData’s Big Data course student Abhilash Mohapatra. Checklist Followed: MapReduce, Hadoop and Spark. Spark Architecture. Spark in Cluster. Predicate Pushdown, Broadcasting and Accumulators. 1. MapReduce, Hadoop and Spark For this section, let the table below represent data stored in S3 which is to be processed. The table below represents the Map and Shuffle […]

Looking to Upskill During the Pandemic? Here’s What Bootcamp Grads Have to Say on COVID-19 Experience

The newest article by Taylor Nichols on SwitchUp shows that the move to online was more popular than people thought it would be. Turns out change can bring new opportunities and be great! Last Updated: September 21, 2020 Click on the link below and check out the article for yourself! https://www.prweb.com/releases/switchups_new_coding_bootcamp_rankings_offer_chance_to_boost_skills_and_career_opportunities_during_pandemic/prweb17413105.htm Key Insights Remote tools and […]

Data Analysis on Twitter Data Using DynamoDB and Hive

The blog is posted by WeCloudData’s student Amany Abdelhalim. There are two steps that I followed to create this pipeline: 1) Collect Twitter Feeds and Ingest into DynamoDB 2) Copy the Twitter Data from DynamoDB to Hive First: Collect Twitter Feeds and Ingest into DynamoDB In order to create a pipeline where I collect tweets on a […]

Analyzing Kinesis Data Streams of Tweets Using Kinesis Data Analytics

The blog is posted by WeCloudData’s student Amany Abdelhalim. In this article, I am illustrating how to collect tweets into a Kinesis data stream and then analyze the tweets using Kinesis Data Analytics. The steps that I followed: Create a Kinesis data stream. I created a Kinesis data stream which I called “twitter” with […]

Embarrassingly Parallel Model Training on Spark — Pandas UDF

The blog is posted by WeCloudData’s Big Data course student Udayan Maurya. Spark is one of the most popular tools for performing map-reduce tasks efficiently on large-scale distributed datasets. Additionally, Spark comes with the MLlib package to perform Machine Learning on distributed data. On the flip side, Python has very mature libraries: NumPy, Pandas, Scikit-Learn, […]

Let’s Read Customer Reviews (actually-make machines do it!)

The blog is posted by WeCloudData’s Big Data course student Udayan Maurya. Customer reviews are invaluable information for understanding the gaps in your product-market fit. If you sell your products on e-platforms such as Amazon, eBay, Appstore, Playstore, YouTube, etc., then you are in luck. You have direct access to your customers’ minds. However, to leverage customer’s […]

Live Twitter Sentiment Analysis

The blog is posted by WeCloudData’s Big Data course student Udayan Maurya. This Live Twitter Sentiment Analyzer helps track the current sentiment for a given track word. In this document, I will describe the workflow I followed to develop this SaaS app. Contents Data Pipeline Map Data Collection Preparing Data for Data Analysis Training the […]

From Web Scraping to Useful Data Frames — How to Scrape a Website

The blog is posted by WeCloudData’s Big Data course student Laurent Risser. Toronto is known for its crazy housing market. It’s getting harder and harder to find an affordable and convenient place. Searching for “How to find an apartment in Toronto” on Google leads to dozens of pages of advice, which is a pretty good indicator […]

An Introduction to Big Data & ML Pipeline in AWS

The blog is posted by WeCloudData’s Big Data course student Abhilash Mohapatra. This story represents an easy path for the items below in AWS: Build a Big Data Pipeline for both Static and Streaming Data. Process Data in Apache Hadoop using Hive. Load processed data to a Data Warehouse solution like Redshift and RDS like MySQL. […]

An Introduction to Data Pipeline with Spark in AWS

The blog is posted by WeCloudData’s Big Data course student Abhilash Mohapatra. This story represents an easy path to Transform Data using PySpark. Along with Transformation, Spark Memory Management is also taken care of. Here, Freddie Mac Acquisition and Performance Data from 1999–2018 is used to create a single output file which can further be used for Data Analysis or Building Machine […]

Eric’s Career Switch Journey from Civil to Data

It has been approximately one year since I decided to make a career switch from Civil Engineering to Data Science. After working as a Data Analyst at Slalom for 3 months, I think now would be a good time to share my experience. I will try to present this blog as 3 distinct parts: […]

Kijiji House Price Analysis using Python

This is the first project that I have done for WeCloudData. The purpose of this project is to find the relationship between housing prices in Toronto (GTA) and location, house size, number of bedrooms and number of bathrooms. We start by scraping data from Kijiji through URL requests. Then we parse our data source […]

Predictive Churn Modeling Using Python

This blog is posted by WeCloudData’s Data Science Bootcamp student Austin Jung. Customer churn is a common business problem in many industries. Losing customers is costly for any business, so identifying unhappy customers early on gives you a chance to offer them incentives to stay. In this post, I am going to talk about machine […]

Building Superset Dashboard and Pipeline using Apache Airflow and Google Cloud SQL

The blog is posted by WeCloudData’s Data Science Bootcamp student Ryan Kang. Like Amazon AWS, Google Cloud is a popular cloud platform used by data analytics companies. Google Cloud allows continuous automation of workflows and big data computation. In this blog, I will briefly introduce how I set up Google Cloud for my workflow. Each Google Cloud account […]

Web Scraping – Fishing Ontario

The blog is posted by WeCloudData’s Data Science Bootcamp student Weichen Lu. Once, I was talking with my colleague about outdoor activities, and he told me that he is a fishing enthusiast. It didn’t catch my attention at first since I am not a fishing guy. However, he proposed an idea to use Google […]

Visualizing New York City Taxi Data

[Student Project] This blog is created by WeCloudData’s Data Science Bootcamp alum Yaoyu Cui. Please find the complete dashboard at https://goo.gl/gXGTEw Tableau has been one of the most popular visualization tools in the Data Science community. Besides its data preprocessing and programming capabilities, it also provides powerful mapping […]

Credit Scoring Using Machine Learning

The credit score is a numeric expression measuring people’s creditworthiness. Banks usually use it to support decision-making on credit applications. In this blog, I will talk about how to develop a standard scorecard with Python (Pandas, Sklearn), which is the most popular and simplest form for credit scoring, to measure […]

Fraud Analytics: ML Tutorial on Dealing with an Imbalanced Dataset

This blog is posted by WeCloudData’s Immersive Bootcamp student Anthony Chen. Fraud analytics poses a challenge that people may gloss over at first: the problem of the imbalanced dataset. How do we approach it? What angle should we start from? What kind of performance measures do we use? The goal of this article is […]