Let’s Read Customer Reviews (actually-make machines do it!)

May 28, 2020

The blog is posted by WeCloudData’s Big Data course student Udayan Maurya.

Customer reviews are invaluable for understanding the gaps in your product-market fit. If you sell your products on e-platforms such as Amazon, eBay, Appstore, Playstore, or YouTube, then you are in luck: you have direct access to your customers’ minds. However, to leverage customer input in your product development you need more than just “Star-Ratings”. Star-Ratings are useful for gauging, at a high level, how your product is being received in the market, but they do not provide actionable insights into what you should do to improve your product offerings.

Reading the review text is definitely a better way to extract customer insights: you get to understand which features help customers solve their problems and which features cause them problems. Additionally, customers often mention their expectations in reviews. However, reading reviews manually causes the following problems:

  • Reviews are voluminous. For popular products, reviews can range from a few thousand to millions, making it infeasible for a human reader to read them all.
  • Human readers have difficulty organizing information in an unbiased fashion. They may get caught up in the impression left by a single review, which may not be representative or the most useful one to act on.
  • Human emotion and language biases can also hamper the extraction of actionable insights from reviews.

Enter AI

Natural language processing equips us with tools that can help us read and understand customer reviews. We can develop insights into our product features and learn which features are working, which need improvement, and what customers expect from our product.

In the following sections I describe the NLP topic analysis I applied to the Amazon Customer Reviews data. The analysis helps identify customer pain points with products.

Contents

  1. Preparing Data for Data Analysis
  2. Training the Machine Learning Model
  3. Developing the Dashboard
  4. Analysis of Results
  5. Further Scope

Tools/Software Used

  • AWS: EC2, and S3
  • Databricks Platform for developing ML model
  • Python Libraries:
      • NLP: Gensim, Spacy
      • Numerical Packages: Numpy, Pandas, PySpark
      • Model Development: Spark (Pandas UDF), Gensim (LDA Model)
      • Dashboard Development: Panel, Bokeh

1. Preparing Data for Data Analysis

The data used in the study is publicly available data provided by Amazon:

Amazon Customer Reviews Dataset

Amazon Customer Reviews (a.k.a. Product Reviews) is one of Amazon’s iconic products. In a period of over two decades…

s3.amazonaws.com

For the analysis, the Amazon S3 bucket (the original location of the data) has been mounted on a Databricks instance. The following is a schematic flow of the entire process:

The Amazon Review data-set is a massive collection of 130+ million customer reviews spanning the years 1995–2015. The data-set contains “.tsv.gz” (gzipped tab-separated) files. There are 46 files with English reviews (total size ~32GB), and each file comprises reviews for a different product segment. Due to the non-trivial size of the input data, Spark on the Databricks Platform has been used to train the machine learning model. The data-set has the following variables:

In our analysis we have performed topic modeling on “review_body”.

The following steps are taken to prepare the data:

  • Email addresses and URLs are removed using regular expressions.
  • Documents are tokenized using Gensim’s “simple_preprocess”.
  • Documents are lemmatized using the Spacy “en_core_web_lg” model.
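
The three preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code used in the project: the two regular expressions and the function names are assumptions, and the tokenize/lemmatize helper expects a Spacy pipeline that has already been loaded with spacy.load("en_core_web_lg").

```python
import re

# Hypothetical patterns: the post does not show the exact regular expressions used.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\S+@\S+")


def clean_text(text):
    """Strip URLs and email addresses, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    return " ".join(text.split())


def tokenize_and_lemmatize(text, nlp):
    """Tokenize with Gensim, then lemmatize with a loaded Spacy pipeline.

    `nlp` is assumed to be the object returned by spacy.load("en_core_web_lg").
    """
    # Imported lazily so the cleaning step can be used without Gensim installed.
    from gensim.utils import simple_preprocess

    tokens = simple_preprocess(clean_text(text), deacc=True)
    return [token.lemma_ for token in nlp(" ".join(tokens))]
```

Keeping the cleaning step separate from tokenization makes it easy to check the regular expressions on a handful of raw reviews before running the heavier Spacy pipeline over the whole data-set.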

2. Training the Machine Learning Model

LDA models are very popular for the task of topic modeling. Spark MLlib comes packaged with LDA models. Additionally, we have Gensim’s LdaMallet models, which are very good at finding useful topics in a text corpus. When trained on a smaller sample of the data, the Gensim model produced much more comprehensive results than Spark’s LDA model.

The technical bottleneck with Gensim is that it runs on a single machine and cannot be trained on partitioned data, which is a problem given the non-trivial size of the data to be consumed. Utilizing Spark cluster capabilities provides huge performance benefits in terms of parallel training and reduced overall training time.

To get the best of both worlds, Gensim LDA models are trained on the Spark cluster in an embarrassingly parallel fashion. Please review my other blog post for technical details on Training Embarrassingly Parallel Models on Spark using Pandas UDF. As our objective in this use case is to obtain one model for each product segment, running embarrassingly parallel training on single partitions of Spark RDDs lets us train the models for all product segments in parallel, significantly reducing the total training time.
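
A rough sketch of this pattern is shown below, under several assumptions that are not taken from the project itself: the Spark DataFrame has a "product_category" column and a "tokens" column holding the preprocessed words of each review, Gensim's plain LdaModel stands in for LdaMallet, NUM_TOPICS is a placeholder, and the grouped-map function is registered with applyInPandas (available from Spark 3.0; earlier versions expressed the same idea with a grouped-map pandas_udf).

```python
import pickle

# Assumed output schema: one row per product segment with its pickled model.
RESULT_SCHEMA = "product_category string, n_docs long, model binary"
NUM_TOPICS = 8  # placeholder; in practice chosen per segment


def train_segment(pdf):
    """Train one Gensim LDA model on all reviews of a single product segment.

    `pdf` is the pandas DataFrame Spark passes in for one group.
    """
    # Imported inside the function so they are resolved on the Spark workers.
    import pandas as pd
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    docs = [list(tokens) for tokens in pdf["tokens"]]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=NUM_TOPICS, random_state=0)
    return pd.DataFrame({
        "product_category": [pdf["product_category"].iloc[0]],
        "n_docs": [len(docs)],
        "model": [pickle.dumps(lda)],
    })


def train_all_segments(reviews_df):
    """Run train_segment once per segment, in parallel across the cluster."""
    return (reviews_df
            .groupBy("product_category")
            .applyInPandas(train_segment, schema=RESULT_SCHEMA))
```

Each group is processed on a single worker, so every segment must fit in that worker's memory; this is the trade-off for being able to use a single-machine library such as Gensim.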

To determine the number of topics to use, we have used the Coherence Score to compare LDA models.

Coherence Score of Electronics Segment

The coherence score is a numerical way to compare topic models: the higher the score, the better the model. In the above case, the model with 8 topics performs best.
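
The selection can be sketched as below. This is an illustration under assumptions: it uses Gensim's LdaModel and CoherenceModel with the "c_v" measure (the post does not say which coherence measure was used), and the function names are hypothetical.

```python
def coherence_by_num_topics(docs, candidate_ks):
    """Train one LDA model per candidate topic count and score each one.

    `docs` is a list of token lists; returns {num_topics: coherence}.
    """
    # Imported lazily so the selection helper below works without Gensim.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    scores = {}
    for k in candidate_ks:
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=k, random_state=0)
        cm = CoherenceModel(model=lda, texts=docs,
                            dictionary=dictionary, coherence="c_v")
        scores[k] = cm.get_coherence()
    return scores


def pick_best_num_topics(scores):
    """Return the topic count with the highest coherence score."""
    return max(scores, key=scores.get)
```

In practice it is worth looking at the whole curve rather than only the maximum, since neighbouring topic counts with nearly identical scores may give more interpretable topics.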

3. Developing the Dashboard

The best models for four segments (Health-Personal-Care, Luggage, Apparel, and Electronics) are deployed on an AWS EC2 instance. The model dashboard has been developed using the Bokeh and Panel libraries. One product from each segment is selected for analysis: a yoga mat, a bag, a coat, and a Micro-SD card.

4. Analysis of Results

Below is an image of the application in action:

  1. The app lets us select a product from a drop-down menu.
  2. Select the star-ratings you want to see topics for. You can select multiple ratings at a time. This allows us to see topics for the lowest-rated reviews, which can help product developers improve products.
  3. Indicator of all ratings selected for analysis.
  4. Read the most discussed topics for the ratings you have selected. This part is more art than science and requires business acumen. From the topic words we can guess what is bothering our customers. For example, in the 1-star reviews here one can reasonably say that customers are unhappy with the yoga mat tearing or wearing out (“Teared”, “Wear”) in a short time, or with the material “Stretching”, “Slipping”, and “Sliding”. Therefore, as product developers we can focus on reviewing the material quality of our product, and we can test or improve the material's manufacturing or supplier to mitigate this most pressing issue.
  5. The Topic Distribution histogram shows how many reviews correspond to each topic (each review is mapped to only its single most representative topic).
  6. Displays how many of the total reviews qualify for the rating selection.
  7. This is the most familiar view: the review distribution by ratings.
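
The counts behind the Topic Distribution histogram (item 5) and the rating filter (items 2 and 6) can be sketched in plain Python. This is an assumed data layout rather than the dashboard's actual code: each review is a pair of its star rating and a list of (topic_id, probability) pairs, which is the shape Gensim returns for a document's topics.

```python
from collections import Counter


def dominant_topic(topic_probs):
    """Return the id of the most probable topic for one review."""
    return max(topic_probs, key=lambda pair: pair[1])[0]


def topic_histogram(reviews, selected_ratings):
    """Count reviews per dominant topic, keeping only the selected ratings.

    `reviews` is an iterable of (star_rating, topic_probs) pairs.
    """
    counts = Counter()
    for star_rating, topic_probs in reviews:
        if star_rating in selected_ratings:
            counts[dominant_topic(topic_probs)] += 1
    return counts
```

The sum of the returned counts is the number of reviews that qualify for the rating selection, which is the figure shown in item 6.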

5. Further Scope

The model can be extended to include bi-gram and tri-gram tokens to discover more characteristic topics. Additionally, more common (trivial) words can be removed from the corpus by segment to make useful, meaning-carrying words more prominent.

The model and methodology are very transferable to other kinds of reviews, for example reviews on Appstore, Playstore, YouTube, etc.
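
One possible way to remove overly common words within a segment is to drop any token that appears in more than a given fraction of that segment's reviews. The sketch below is an assumption about how this could be done, not part of the current model; the function name and the 0.5 default threshold are hypothetical. Gensim's Dictionary.filter_extremes (its no_above parameter) offers the same kind of filter, and gensim.models.Phrases is one option for detecting bi-grams.

```python
from collections import Counter


def remove_common_tokens(docs, max_doc_fraction=0.5):
    """Drop tokens that occur in more than `max_doc_fraction` of the documents.

    `docs` is a list of token lists belonging to one product segment.
    """
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each token once per document

    limit = max_doc_fraction * len(docs)
    too_common = {token for token, n in doc_freq.items() if n > limit}
    return [[token for token in doc if token not in too_common] for doc in docs]
```

Applying the filter per segment matters because a word such as the product name is trivial inside its own segment but may be informative elsewhere.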

To find out more about the courses our students have taken to complete these projects and what you can learn from WeCloudData, view the learning path. To read more posts from Udayan, check out her Medium posts here.
