Consulting Case Study: Topic Modelling on Technician Notes

October 19, 2021

Client Info

Our client is one of Canada’s largest construction vehicle suppliers. They employ thousands of skilled technicians across multiple provinces to support their clients and are known for their excellent customer service and quality of work. The technicians handle repair and maintenance services for the vehicles their clients have purchased.

For each job a technician performs, the technician writes a short note (typically 30 words or fewer) outlining the customer’s complaint, the cause of the issue, and the solution implemented. These notes are categorized into a high-level job type (e.g., repair) and a component category (e.g., electrical); however, these categories offer very little insight into the actual work performed.

Over the past several years, well over a million such notes have accumulated, and the client has yet to put them to use.

Problem Statement

Before the note data can be used effectively, the notes must first be clustered by topic similarity. This additional layer of categorization simplifies the downstream analytical steps and supports more descriptive analysis of vehicle work-order data.

Using data from 2018 onward (to limit the number of notes analyzed), the WeCloudData team was tasked with:

  1. Labelling the technical notes found in the work-order tables into N topics through topic modelling
  2. Using the technical notes, alongside the job type and component category assigned to each note, to provide more descriptive (contextual) analysis of the work-order data


Topic Model Approaches

Two topic modelling approaches were proposed:

  • Latent Dirichlet Allocation (LDA)
    • This assumes multiple topics per document and has been used in a variety of use cases involving short, medium, and long notes.
    • LDA is the most common topic modelling approach and has several supporting tools that can better visualize the cluster-word outputs. For instance, pyLDAvis can visualize the cluster-word outputs on a 2D plane.

This figure gives a high-level overview of the steps involved in LDA. The text input can be preprocessed in a number of ways, and the main input to LDA is the number of topics you would like the model to create. Manual inference determines whether the model outputs make sense.

  • Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM)
    • This assumes one topic per document and is intended only for short text (<= 30 words).
    • Packages like pyLDAvis, built to support LDA, can also be adapted to visualize GSDMM outputs.

This figure gives a high-level overview of the steps involved in GSDMM. As with LDA, the text input can be preprocessed in a number of ways, and the main input to GSDMM is the number of topics you would like the model to create. Manual inference determines whether the model outputs make sense.
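For illustration, the core Gibbs-sampling loop of GSDMM (the Movie Group Process) can be sketched in plain Python. The hyperparameters `alpha`, `beta` and the toy notes are assumptions; in practice a library implementation would be used:

```python
import random
from collections import defaultdict

def gsdmm(docs, K=4, alpha=0.1, beta=0.1, iters=20, seed=42):
    """Assign each short document to exactly ONE cluster via Gibbs sampling."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})        # vocabulary size
    D = len(docs)
    z = [rng.randrange(K) for _ in docs]         # initial cluster per doc
    m = [0] * K                                  # docs per cluster
    n = [0] * K                                  # total words per cluster
    nw = [defaultdict(int) for _ in range(K)]    # word counts per cluster
    for d, k in zip(docs, z):
        m[k] += 1
        n[k] += len(d)
        for w in d:
            nw[k][w] += 1

    for _ in range(iters):
        for i, d in enumerate(docs):
            old = z[i]                           # take doc out of its cluster
            m[old] -= 1
            n[old] -= len(d)
            for w in d:
                nw[old][w] -= 1
            weights = []
            for k in range(K):                   # score each candidate cluster
                p = (m[k] + alpha) / (D - 1 + K * alpha)
                seen = defaultdict(int)
                for pos, w in enumerate(d):
                    p *= (nw[k][w] + beta + seen[w]) / (n[k] + V * beta + pos)
                    seen[w] += 1
                weights.append(p)
            new = rng.choices(range(K), weights=weights)[0]
            z[i] = new                           # put doc into sampled cluster
            m[new] += 1
            n[new] += len(d)
            for w in d:
                nw[new][w] += 1
    return z

notes = [["hydraulic", "hose", "leak"], ["hydraulic", "leak", "hose"],
         ["battery", "dead", "replaced"], ["battery", "replaced", "terminal"]]
labels = gsdmm(notes, K=3)
print(labels)
```

Unlike LDA, each note ends up with a single cluster label, which matches the one-topic-per-document assumption for short technician notes.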

Text Preprocessing Pipeline

This architecture outlines the eight text-preprocessing steps used for both LDA and GSDMM topic modelling. Designing these steps required the help of our client’s business analyst to better understand the data we were working with.
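The post does not list the eight steps themselves, but a hypothetical subset of such a pipeline might look like the following (the stopword list and filters are illustrative assumptions):

```python
import re

# Illustrative preprocessing only; the client's actual 8-step pipeline
# is not specified in this post.
STOPWORDS = {"the", "a", "and", "to", "of", "was", "is", "on", "in"}

def preprocess(note: str) -> list[str]:
    text = note.lower()                                  # 1. lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # 2. strip punctuation
    tokens = text.split()                                # 3. tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]   # 4. remove stopwords
    tokens = [t for t in tokens if len(t) > 2]           # 5. drop short tokens
    return tokens

print(preprocess("Replaced the hydraulic hose; leak was fixed."))
# -> ['replaced', 'hydraulic', 'hose', 'leak', 'fixed']
```

Real pipelines for technician notes would typically add domain-specific steps such as expanding shop abbreviations and lemmatizing part names, which is where the business analyst’s input matters.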

Key Findings

The figure above shows that a cluster size of 8 yields the highest coherence score (C_v) for the LDA model. A similar result was observed with GSDMM. Other metrics we were able to produce include:

  • an intertopic distance map (via pyLDAvis)
  • cluster word frequency and importance by topic

Another key finding was that the 8-cluster outputs from GSDMM overlapped significantly with those from LDA. Given this overlap, and since LDA was both easier to use and better supported by tooling, the WeCloudData team focused only on LDA’s outputs for the manual inference step, which was performed with our client’s business analyst.


Over the course of this project, the WeCloudData team was able to:

  1. Build a comprehensive text preprocessing pipeline to clean the technical notes
  2. Compare and contrast the cluster-word representations from LDA and GSDMM
  3. Find that GSDMM’s outputs overlapped with LDA’s and, as a result, retain only the LDA outputs to present to the client for manual inference