Consulting Case Study: Topic Modelling on Technician Notes

October 19, 2021

Client Info

Our client is one of Canada’s largest construction vehicle suppliers. The company employs thousands of skilled technicians across multiple provinces to support its customers and is known for excellent customer service and quality of work. The technicians handle repair and maintenance services for the vehicles those customers have purchased.

For each job a technician performs, the technician writes a note (often 30 words or fewer) outlining the customer’s complaint, the cause of the issue, and the solution implemented to resolve it. These notes are categorized by a high-level job type (e.g. repair) and a component category (e.g. electrical); however, these categories say very little about the actual work performed.

Over the past several years, well over a million notes have been written, and the client has yet to make use of them.

Problem Statement

Before the client can use the note data effectively, the notes should be further clustered by topic similarity. This additional layer of categorization will simplify the downstream analytical steps and support more descriptive analysis of vehicle work order data.

Using data from 2018 onward (to limit the number of notes analyzed), the WeCloudData team was tasked with:

  1. Assigning the technical notes found in the work order tables to N topics through topic modelling
  2. Using the technical notes to add descriptive context to the analysis of work order data, alongside the job type and component category assigned to each note

Methodology

Topic Model Approaches

Two topic modelling approaches were proposed:

  • Latent Dirichlet Allocation (LDA)
    • This assumes multiple topics per document and has been used for a variety of use cases involving short, medium, and long notes.
    • LDA is one of the most common topic modelling approaches and has several supporting tools for visualizing its cluster-word outputs. For instance, pyLDAvis can plot the cluster-word outputs on a 2D plane.

This figure gives a high-level overview of the steps involved in LDA. The text input can be preprocessed in a number of ways, and the main input to LDA is the number of topics you would like the model to create. Manual inference is then used to determine whether the model outputs make sense.
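
To make that flow concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling in plain Python. It is not the project's code: the post does not say which implementation was used, and library implementations may rely on other inference methods. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy notes are all assumptions made for illustration.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Fit LDA by collapsed Gibbs sampling (illustrative, not optimized).

    docs is a list of token lists. Returns (topic_word, doc_topic): per-topic
    word counts and per-document topic counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Start by giving every word occurrence a random topic.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    def update(d, word, topic, delta):
        """Add or remove one word occurrence from the count tables."""
        doc_topic[d][topic] += delta
        topic_word[topic][word] += delta
        topic_total[topic] += delta

    for d, doc in enumerate(docs):
        for word, topic in zip(doc, assignments[d]):
            update(d, word, topic, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                update(d, word, assignments[d][i], -1)  # forget current topic
                # Weight is (how much the document uses the topic) times
                # (how much the topic uses the word).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                new_topic = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new_topic
                update(d, word, new_topic, +1)
    return topic_word, doc_topic

# Toy, made-up notes (already tokenized); not client data.
notes = [
    ["hydraulic", "hose", "leak", "replaced", "hose"],
    ["battery", "dead", "replaced", "battery"],
    ["hose", "leak", "fitting", "tightened"],
    ["battery", "charge", "low", "starter"],
]
topic_word, doc_topic = lda_gibbs(notes, num_topics=2)
for k, counts in enumerate(topic_word):
    print(k, [w for w, c in counts.most_common(3) if c > 0])
```

Each row of `doc_topic` can hold counts for more than one topic, which reflects the assumption that a document mixes several topics. The printed top words per topic are what a reviewer would read during manual inference.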

  • Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM)
    • This assumes one topic per document and is suited only to short text (<= 30 words).
    • Packages such as pyLDAvis, which are built for LDA, can also be adapted to work with GSDMM outputs.

This figure gives a high-level overview of the steps involved in GSDMM. As with LDA, the text input can be preprocessed in a number of ways, and the main input to GSDMM is the number of topics you would like the model to create. Manual inference is then used to determine whether the model outputs make sense.
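
The one-topic-per-document assumption can be sketched in a few lines of plain Python. The following is a minimal, unoptimized collapsed Gibbs sampler for the Dirichlet Multinomial Mixture model, not the code used on the project. The function name `gsdmm`, the hyperparameters `alpha` and `beta`, and the toy notes are assumptions for illustration only.

```python
import random
from collections import Counter

def gsdmm(docs, num_topics, alpha=0.1, beta=0.1, iters=30, seed=0):
    """Cluster short token lists with collapsed Gibbs sampling for DMM.

    Each document gets exactly one topic. num_topics is an upper bound:
    some clusters may end up empty. Returns (labels, word_counts).
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    labels = [rng.randrange(num_topics) for _ in docs]
    docs_in = [0] * num_topics                       # documents per cluster
    words_in = [0] * num_topics                      # tokens per cluster
    word_counts = [Counter() for _ in range(num_topics)]

    def move(doc, k, sign):
        """Add (sign=+1) or remove (sign=-1) a whole document from cluster k."""
        docs_in[k] += sign
        words_in[k] += sign * len(doc)
        for w in doc:
            word_counts[k][w] += sign

    for doc, k in zip(docs, labels):
        move(doc, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            move(doc, labels[d], -1)
            weights = []
            for k in range(num_topics):
                # Favour clusters that are large and share words with the doc.
                p = docs_in[k] + alpha
                seen = Counter()
                for i, w in enumerate(doc):
                    p *= (word_counts[k][w] + beta + seen[w]) / (
                        words_in[k] + vocab_size * beta + i
                    )
                    seen[w] += 1
                weights.append(p)
            labels[d] = rng.choices(range(num_topics), weights=weights)[0]
            move(doc, labels[d], +1)
    return labels, word_counts

# Toy, made-up notes (already tokenized); not client data.
notes = [
    ["hydraulic", "hose", "leak", "replaced"],
    ["battery", "dead", "replaced"],
    ["hose", "leak", "fitting", "tightened"],
    ["battery", "charge", "low"],
]
labels, word_counts = gsdmm(notes, num_topics=3)
print(labels)  # one topic index per note
```

The plain product of probabilities is acceptable for notes of around 30 words; for longer documents a sampler would normally work in log space to avoid underflow.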

Text Preprocessing Pipeline

This architecture outlines the eight text preprocessing steps used for both LDA and GSDMM topic modelling. Working through these steps required help from our client’s business analyst so that we could better understand the data we were working with.
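
The eight steps themselves are not listed in the text, so the following is only a sketch of the kind of cleaning such a pipeline commonly performs: lowercasing, stripping punctuation and digits, expanding abbreviations, and removing stop words. The function name `preprocess_note`, the stop-word list, and the abbreviation map are hypothetical; in a real project they would be agreed with someone who knows the data, such as the client's business analyst.

```python
import re

# Hypothetical stop words and shop-floor abbreviations, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "was", "is", "for", "on", "in", "at"}
ABBREVIATIONS = {"repl": "replaced", "hyd": "hydraulic", "eng": "engine"}

def preprocess_note(note: str) -> list:
    """Turn one raw technician note into a list of clean tokens."""
    text = note.lower()
    text = re.sub(r"[^a-z\s]", " ", text)               # drop digits and punctuation
    tokens = text.split()
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]  # expand abbreviations
    # Remove stop words and single-letter leftovers.
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

# A made-up note, not client data.
print(preprocess_note("Repl. the HYD hose on boom; leak was at fitting #2"))
```

The resulting token lists are the form of input that both LDA and GSDMM expect.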

Key Findings

The figure above shows that a cluster size of 8 gives the highest coherence score (C_v score) for the LDA model. GSDMM produced a similar result. Other metrics we were able to produce include:

  • an intertopic distance map (via pyLDAvis)
  • the cluster word frequency and importance by topic.
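
As a rough illustration of how a coherence score rates a topic, the sketch below computes the UMass coherence of a topic's top words. This is a simpler measure swapped in for readability, not the C_v score reported above, which is more involved to compute. The function name `umass_coherence` and the toy notes are assumptions for illustration.

```python
import math

def umass_coherence(top_words, docs):
    """UMass coherence of one topic; closer to zero or above means more coherent.

    top_words should be ordered from most to least important, and every word
    must occur in at least one document (otherwise the ratio divides by zero).
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        """Number of documents that contain all of the given words."""
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            # How often the lower-ranked word appears alongside the higher-ranked one.
            joint = doc_freq(top_words[m], top_words[l])
            score += math.log((joint + 1) / doc_freq(top_words[l]))
    return score

# Toy, made-up notes (already tokenized); not client data.
notes = [
    ["hose", "leak", "hydraulic"],
    ["hose", "leak", "replace"],
    ["battery", "dead", "replace"],
    ["battery", "charge", "dead"],
]
coherent = umass_coherence(["hose", "leak"], notes)
mixed = umass_coherence(["hose", "battery"], notes)
print(coherent > mixed)
```

To choose a number of topics, a score like this would be averaged over all topics of each candidate model, and the topic count with the highest average kept.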

A second key finding was that the 8 cluster outputs from GSDMM overlapped significantly with those from LDA. Given this overlap, and because LDA was both easier to use and supported by more tooling, the WeCloudData team focused only on LDA’s outputs for the manual inference step, which was to be performed with our client’s business analyst.
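
The post does not say how the overlap was measured. One simple way to quantify it, shown here purely as an assumption, is to compare the top words of each topic from one model against every topic from the other using Jaccard similarity. The function names and the word lists below are illustrative placeholders, not the project's outputs.

```python
def jaccard(a, b):
    """Share of words two topics have in common (0 = none, 1 = identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_matches(topics_a, topics_b):
    """For each topic in A, the index and score of its closest topic in B."""
    matches = []
    for words_a in topics_a:
        scores = [jaccard(words_a, words_b) for words_b in topics_b]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append((best, scores[best]))
    return matches

# Placeholder top-word lists, invented for this example.
lda_topics = [["hose", "leak", "hydraulic"], ["battery", "dead", "charge"]]
gsdmm_topics = [["battery", "charge", "starter"], ["hose", "leak", "fitting"]]
print(best_matches(lda_topics, gsdmm_topics))
```

High best-match scores across most topics would support treating the two models' outputs as largely interchangeable.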

Conclusion

Over the course of this project, the WeCloudData team was able to:

  1. Build a comprehensive text preprocessing pipeline to clean the technical notes
  2. Compare and contrast the cluster-word representations from LDA and GSDMM
  3. Find that the GSDMM outputs overlapped with LDA’s and, as a result, present only the LDA outcomes to the client for manual inference