Consulting Case Study: Topic Modelling on Technician Notes

October 19, 2021

Client Info

Our client is one of Canada’s largest construction vehicle suppliers. They employ thousands of skilled technicians across multiple provinces to support their customers and are known for their excellent customer service and quality of work. The technicians handle repair and maintenance services for the vehicles those customers have purchased.

For each job a technician performs, the technician writes a note (often 30 words or fewer) outlining the customer's complaint, the cause of the issue, and the solution implemented. These notes are categorized into a high-level job type (e.g. repair) and component category (e.g. electrical); however, these categories say very little about the actual work performed.

Over the past several years, technicians have written well over a million notes, which the client has yet to put to use.

Problem Statement

Before the client can use the note data effectively, the notes need to be further clustered by topic similarity. This additional layer of categorization will simplify the downstream analytical steps and support more descriptive analysis of vehicle work order data.

Using data from 2018 onward (to limit the number of notes analyzed), the WeCloudData team was tasked with:

- Labelling the technical note data found in work order tables into N topics through topic modelling
- Using the technical notes to support more descriptive analysis (context) of work order data alongside the job type and component category assigned to each note

Methodology

Topic Model Approaches

Two topic modelling approaches were proposed:

Latent Dirichlet Allocation (LDA)
- This assumes multiple topics per document and has been applied to a variety of use cases involving short, medium and long documents.
- LDA is the most common topic modelling approach and has several supporting tools for visualizing the cluster-word outputs. For instance, pyLDAvis can be used to visualize the cluster-word outputs on a 2D plane.

This figure gives a high-level overview of the steps involved in LDA. The text input can be preprocessed in a number of ways, and the main input for LDA is the number of topics you would like the model to create. Manual inference is then used to determine whether the model outputs make sense.
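To make the "number of topics in, word clusters out" idea concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling in plain Python. It is a toy illustration only, not the implementation used in the project (the post does not say which library was used; in practice a package such as gensim or scikit-learn would be the usual choice). The function name `lda_gibbs`, the hyperparameter defaults, and the sample notes are all assumptions made for this example.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # total tokens in topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample: P(topic | doc) * P(word | topic), up to a constant.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

# Made-up, already-preprocessed notes (illustrative only).
notes = [
    ["battery", "dead", "replaced", "battery"],
    ["alternator", "battery", "charging", "fault"],
    ["hydraulic", "hose", "leak", "replaced", "hose"],
    ["hydraulic", "cylinder", "leak", "seal"],
]
doc_topic, topic_word = lda_gibbs(notes, num_topics=2)
for k, counts in enumerate(topic_word):
    print(k, [w for w, _ in counts.most_common(3)])
```

Because every token gets its own topic, a single note can be split across several topics, which is the "multiple topics per document" assumption described above.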

Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM)
- This assumes one topic per document and is designed for short text (<= 30 words).
- Packages such as pyLDAvis, which are built for LDA, can also be adapted to work with GSDMM outputs.

This figure gives a high-level overview of the steps involved in GSDMM. As with LDA, the text input can be preprocessed in a number of ways, and the main input for GSDMM is the number of topics you would like the model to create. Manual inference is then used to determine whether the model outputs make sense.
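The one-topic-per-document assumption can be illustrated with a small plain-Python sketch of the GSDMM sampler: each note is repeatedly removed from its cluster and reassigned as a whole, with a probability that favours larger clusters whose words match the note. This is a toy version written for this post, not the code used in the project; the function name `gsdmm`, the `alpha` and `beta` defaults, and the sample notes are assumptions.

```python
import math
import random
from collections import Counter

def gsdmm(docs, num_clusters, alpha=0.1, beta=0.1, iters=30, seed=0):
    """Toy GSDMM sampler: returns one cluster label per tokenized document."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    docs_in = [0] * num_clusters                          # documents per cluster
    words_in = [0] * num_clusters                         # tokens per cluster
    word_counts = [Counter() for _ in range(num_clusters)]
    labels = []

    def move(doc, k, sign):
        """Add (sign=+1) or remove (sign=-1) a whole document from cluster k."""
        docs_in[k] += sign
        words_in[k] += sign * len(doc)
        for w in doc:
            word_counts[k][w] += sign

    for doc in docs:
        k = rng.randrange(num_clusters)
        labels.append(k)
        move(doc, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            move(doc, labels[d], -1)
            counts = Counter(doc)
            log_weights = []
            for k in range(num_clusters):
                # Popularity of the cluster ...
                lp = math.log(docs_in[k] + alpha)
                # ... times how well its word counts match this document.
                for w, c in counts.items():
                    for j in range(c):
                        lp += math.log(word_counts[k][w] + beta + j)
                for i in range(len(doc)):
                    lp -= math.log(words_in[k] + vocab_size * beta + i)
                log_weights.append(lp)
            top = max(log_weights)                        # avoid underflow in exp
            weights = [math.exp(x - top) for x in log_weights]
            labels[d] = rng.choices(range(num_clusters), weights=weights)[0]
            move(doc, labels[d], +1)
    return labels

# Made-up, already-preprocessed notes (illustrative only).
notes = [
    ["battery", "dead", "replaced", "battery"],
    ["alternator", "battery", "charging", "fault"],
    ["hydraulic", "hose", "leak", "replaced", "hose"],
    ["hydraulic", "cylinder", "leak", "seal"],
]
labels = gsdmm(notes, num_clusters=4)
print(labels)
```

Note that `num_clusters` acts as an upper bound: clusters can end up empty, so the number of topics actually used may be smaller than the number requested.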

Text Preprocessing Pipeline

This architecture outlines the 8 text preprocessing steps used for both LDA and GSDMM topic modelling. Designing these steps required the help of our client’s business analyst so that we could better understand the data we were working with.
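The eight steps themselves appear only in the figure and are not reproduced here. As a rough illustration of what such a pipeline can look like for short technician notes, the sketch below chains a few commonly used steps: lowercasing, stripping punctuation and digits, tokenizing, expanding shorthand, and dropping stop words and very short tokens. These steps, the `preprocess` function, the stop-word list, and the abbreviation map are all assumptions for the example, not the project's actual pipeline; a real abbreviation map would come from domain experts such as the business analyst.

```python
import re

# Hypothetical stop words and shorthand expansions, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "was", "is", "on", "in", "for", "with"}
ABBREVIATIONS = {"repl": "replaced", "hyd": "hydraulic", "eng": "engine"}

def preprocess(note):
    """Turn a raw note into a list of cleaned tokens."""
    text = note.lower()                                   # normalize case
    text = re.sub(r"[^a-z\s]", " ", text)                 # drop punctuation and digits
    tokens = text.split()                                 # whitespace tokenization
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]    # expand known shorthand
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

print(preprocess("Repl. HYD hose on boom; leak was at fitting #2"))
# ['replaced', 'hydraulic', 'hose', 'boom', 'leak', 'fitting']
```

The same token lists can then be fed to either model, which keeps the LDA and GSDMM results comparable.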

Key Findings

The figure above shows that using 8 clusters gives the highest coherence score (C_v) for the LDA model. GSDMM produced a similar result. Other outputs we were able to produce include:

- an intertopic distance map (via pyLDAvis)
- the cluster word frequency and importance by topic
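Choosing the number of clusters by coherence amounts to fitting one model per candidate size, scoring each model's topics, and keeping the best-scoring size. The C_v measure is normally computed by a library (for example gensim's `CoherenceModel` with `coherence="c_v"`), so the sketch below swaps in the simpler UMass coherence, which only needs document co-occurrence counts. The function names, the averaging over word pairs, and the toy inputs are assumptions for this example.

```python
import math
from itertools import combinations

def umass_coherence(topic_words, docs):
    """Average UMass coherence of one topic's top words (most probable first)."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    score, pairs = 0.0, 0
    for i, j in combinations(range(len(topic_words)), 2):
        higher, lower = topic_words[i], topic_words[j]
        if doc_freq(higher) == 0:
            continue                                      # word never seen; skip the pair
        score += math.log((doc_freq(higher, lower) + 1) / doc_freq(higher))
        pairs += 1
    return score / pairs if pairs else 0.0

def pick_num_topics(candidates, docs):
    """candidates maps a cluster count to that model's topics (lists of top words)."""
    scores = {
        k: sum(umass_coherence(t, docs) for t in topics) / len(topics)
        for k, topics in candidates.items()
    }
    return max(scores, key=scores.get), scores

# Toy corpus and two hypothetical candidate models.
docs = [["battery", "dead"], ["battery", "dead"], ["hose", "leak"], ["hose", "leak"]]
candidates = {
    1: [["battery", "hose"]],
    2: [["battery", "dead"], ["hose", "leak"]],
}
best_k, scores = pick_num_topics(candidates, docs)
print(best_k)
```

Topics whose top words actually appear together in the same notes score higher, which is why the two-topic candidate wins on this toy corpus.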

A second key finding was that the 8 cluster outputs from GSDMM overlapped significantly with those from LDA. Given this overlap, and because LDA was both easier to use and better supported by tooling, the WeCloudData team focused only on LDA’s outputs for the manual inference step, which was to be performed with our client’s business analyst.
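The post does not say how the overlap between the two models was measured. One simple way to quantify it, shown here purely as an assumption, is to compare the top words of each LDA topic with the top words of each GSDMM cluster using Jaccard similarity and report the best match for every topic. The function names and word lists are hypothetical.

```python
def jaccard(a, b):
    """Share of words two lists have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_matches(lda_topics, gsdmm_clusters):
    """For each LDA topic, find the most similar GSDMM cluster by top words."""
    matches = []
    for i, words in enumerate(lda_topics):
        scores = [jaccard(words, other) for other in gsdmm_clusters]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, j, scores[j]))
    return matches

# Hypothetical top words from each model.
lda_topics = [["battery", "alternator", "charging"], ["hose", "leak", "hydraulic"]]
gsdmm_clusters = [["hydraulic", "leak", "seal"], ["battery", "charging", "starter"]]

for lda_id, gsdmm_id, score in best_matches(lda_topics, gsdmm_clusters):
    print(f"LDA topic {lda_id} ~ GSDMM cluster {gsdmm_id} (Jaccard {score:.2f})")
```

Consistently high best-match scores across all topics would support treating the two sets of clusters as interchangeable for the manual inference step.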

Conclusion

Over the course of this project, the WeCloudData team was able to:

- Build a comprehensive text preprocessing pipeline to clean the technical notes
- Compare and contrast the cluster-word representations from LDA and GSDMM
- Find that the GSDMM outputs overlapped with those from LDA and, as a result, keep only the LDA outcomes to present to the client for manual inference


