Consulting Case Study: Topic Modelling on Technician Notes

October 19, 2021

Client Info

Our client is one of Canada’s largest construction vehicle suppliers. They employ thousands of skilled technicians across multiple provinces to support their customers and are known for excellent customer service and quality of work. The technicians handle repair and maintenance services for the vehicles their customers have purchased.

For each job, the technician writes a note (often 30 words or fewer) outlining the customer's complaint, the cause of the issue, and the solution implemented to resolve it. These notes are categorized by a high-level job type (e.g. repair) and a component category (e.g. electrical); however, these categories say very little about the actual work performed.

Over the past several years, technicians have written well over a million notes, and the client has yet to make use of them.

Problem Statement

Before the client can use the note data effectively, the notes need to be further clustered by topic similarity. This additional layer of categorization will simplify downstream analysis and support more descriptive analysis of vehicle work order data.

Using data from 2018 onward (to limit the number of notes analyzed), the WeCloudData team was tasked with:

  1. Labelling the technical note data found in work order tables with one of N topics through topic modelling
  2. Utilizing technical notes to aid in more descriptive analysis (context) of work order data alongside the job type and component category assigned to a note

Methodology

Topic Model Approaches

Two topic modelling approaches were proposed:

  • Latent Dirichlet Allocation (LDA)
    • This assumes multiple topics per document and has been used for a variety of use cases involving short, medium, and long texts.
    • LDA is one of the most common topic modelling approaches and has several supporting tools for visualizing its cluster-word outputs. For instance, pyLDAvis can plot the cluster-word outputs on a 2D plane.

This figure gives a high-level overview of the steps involved in LDA. The text input can be preprocessed in a number of ways, and the main input for LDA is the number of topics you would like the model to create. Manual inference is then used to determine whether the model outputs make sense.
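To make the LDA steps more concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python. It is not the implementation used in this project (a library would normally be used in practice); the function name `lda_gibbs`, the hyperparameter values, and the toy notes are all assumptions made for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling on tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]  # count of word w in topic k
    topic_total = [0] * num_topics                       # total tokens in topic k
    assignments = []

    # Start with a random topic for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight each topic by how much the document and the word favour it.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


# Hypothetical, already-preprocessed technician notes.
docs = [
    ["battery", "dead", "replace", "battery"],
    ["starter", "battery", "cable", "replace"],
    ["hydraulic", "hose", "leak", "replace"],
    ["hose", "leak", "seal", "hydraulic"],
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word):
    print(k, [w for w, c in counts.most_common(3) if c > 0])
```

Because each token gets its own topic, a single note can be spread across several topics, which is the "multiple topics per document" assumption described above. The printed top words per topic are what a reviewer would inspect during manual inference.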

  • Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM)
    • This assumes one topic per document and is designed for short text (<= 30 words).
    • Packages such as pyLDAvis, which are built for LDA, can also be adapted to work with GSDMM outputs.

This figure gives a high-level overview of the steps involved in GSDMM. As with LDA, the text input can be preprocessed in a number of ways, and the main input for GSDMM is the number of topics you would like the model to create. Manual inference is then used to determine whether the model outputs make sense.
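The one-topic-per-document assumption changes the sampling step: GSDMM reassigns a whole note to a cluster at a time rather than reassigning individual words. The following plain-Python sketch illustrates that idea under stated assumptions; the function name `gsdmm`, the hyperparameter values, and the toy notes are hypothetical and do not come from the project.

```python
import math
import random
from collections import Counter


def gsdmm(docs, num_clusters, iterations=30, alpha=0.1, beta=0.1, seed=0):
    """Cluster short tokenized documents with collapsed Gibbs sampling (GSDMM)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    docs_in = [0] * num_clusters                            # documents per cluster
    words_in = [0] * num_clusters                           # tokens per cluster
    word_counts = [Counter() for _ in range(num_clusters)]  # word counts per cluster

    def move(doc, k, sign):
        """Add (sign=+1) or remove (sign=-1) a whole document from cluster k."""
        docs_in[k] += sign
        words_in[k] += sign * len(doc)
        for w in doc:
            word_counts[k][w] += sign

    # Start with a random cluster for every document.
    labels = []
    for doc in docs:
        k = rng.randrange(num_clusters)
        labels.append(k)
        move(doc, k, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            move(doc, labels[d], -1)
            log_weights = []
            for k in range(num_clusters):
                # Favour clusters that are popular and share words with the document.
                lp = math.log(docs_in[k] + alpha)
                seen = Counter()
                for i, w in enumerate(doc):
                    lp += math.log(word_counts[k][w] + beta + seen[w])
                    lp -= math.log(words_in[k] + vocab_size * beta + i)
                    seen[w] += 1
                log_weights.append(lp)
            top = max(log_weights)
            weights = [math.exp(lp - top) for lp in log_weights]
            labels[d] = rng.choices(range(num_clusters), weights=weights)[0]
            move(doc, labels[d], +1)
    return labels, word_counts


# Hypothetical, already-preprocessed technician notes.
notes = [
    ["battery", "dead", "replace", "battery"],
    ["starter", "battery", "cable", "replace"],
    ["hydraulic", "hose", "leak", "replace"],
    ["hose", "leak", "seal", "hydraulic"],
]
labels, word_counts = gsdmm(notes, num_clusters=3)
print(labels)
```

The number of clusters passed in acts as an upper bound in this formulation: some clusters can end up empty, so the number of non-empty clusters may be smaller than `num_clusters`.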

Text Preprocessing Pipeline

This architecture outlines the 8 text preprocessing steps used for both LDA and GSDMM topic modelling. Designing these steps required input from our client’s business analyst so that we could better understand the data we were working with.
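The individual steps are not listed in the text, so the sketch below only illustrates the kind of cleaning such a pipeline typically performs on short technician notes. Every step, the stopword list, and the abbreviation map are assumptions made for illustration; the abbreviation map stands in for the sort of domain knowledge a business analyst could supply.

```python
import re

# Hypothetical stopword list and abbreviation map, not the project's actual ones.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "was", "is", "on", "in", "for", "with"}
ABBREVIATIONS = {"cust": "customer", "hyd": "hydraulic", "repl": "replaced"}


def preprocess(note):
    """Turn a raw note into a list of cleaned tokens."""
    text = note.lower()                                  # 1. normalize case
    text = re.sub(r"[^a-z\s]", " ", text)                # 2. drop digits and punctuation
    tokens = text.split()                                # 3. tokenize on whitespace
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]   # 4. expand known abbreviations
    # 5. remove stopwords and very short tokens
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]


cleaned = preprocess("Cust reports hyd leak on boom; repl hose #2 and tested OK.")
print(cleaned)
# ['customer', 'reports', 'hydraulic', 'leak', 'boom', 'replaced', 'hose', 'tested']
```

The token lists produced this way are the input format assumed by both topic models.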

Key Findings

The figure above shows that using 8 clusters results in the highest coherence score (C_v score) for the LDA model. GSDMM produced a similar result. We also produced the following additional outputs:

  • an intertopic distance map (via pyLDAvis)
  • the cluster word frequency and importance by topic
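Choosing the number of clusters by coherence means scoring each candidate model's top words by how often they appear together in the notes. The C_v measure is fairly involved, so the sketch below uses the simpler UMass coherence as a stand-in to show the idea. The function name and the toy notes are assumptions, and the numbers it produces are not comparable to C_v scores.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """Average UMass coherence of one topic's top words over tokenized docs."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score, pairs = 0.0, 0
    # top_words is ordered from most to least important; pair each word
    # with every word ranked above it.
    for j, i in combinations(range(len(top_words)), 2):
        w_j, w_i = top_words[j], top_words[i]
        denominator = doc_freq(w_j)
        if denominator == 0:
            continue  # skip words that never occur in the corpus
        score += math.log((doc_freq(w_i, w_j) + 1) / denominator)
        pairs += 1
    return score / pairs if pairs else 0.0


# Hypothetical, already-preprocessed technician notes.
corpus = [
    ["battery", "starter", "dead"],
    ["battery", "starter", "replace"],
    ["hose", "leak"],
    ["hose", "hydraulic", "leak"],
]
coherent = umass_coherence(["battery", "starter"], corpus)
incoherent = umass_coherence(["battery", "hose"], corpus)
print(coherent > incoherent)
```

To pick the number of clusters, one would fit a model for each candidate value, average this score over the model's topics, and keep the value with the highest average.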

Another key finding was that the 8 cluster outputs from GSDMM overlapped significantly with those from LDA. Given this overlap, and because LDA was both easier to use and supported by more tools, the WeCloudData team focused only on LDA’s outputs for the manual inference step, which was to be performed with our client’s business analyst.
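One simple way to quantify this kind of overlap is to compare the top words of each LDA cluster against the top words of each GSDMM cluster. The sketch below uses Jaccard similarity for that comparison; it is an illustrative assumption rather than the method the team used, and the topic word lists are made up.

```python
def jaccard(a, b):
    """Share of words two word lists have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(lda_topics, gsdmm_topics):
    """For each LDA topic, find the GSDMM topic with the most similar top words."""
    matches = []
    for i, lda_words in enumerate(lda_topics):
        scores = [jaccard(lda_words, words) for words in gsdmm_topics]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches.append((i, j, scores[j]))
    return matches


# Hypothetical top words per cluster from each model.
lda_topics = [["battery", "starter", "alternator"], ["hose", "leak", "hydraulic"]]
gsdmm_topics = [["leak", "hose", "seal"], ["battery", "starter", "cable"]]
matches = best_matches(lda_topics, gsdmm_topics)
print(matches)
# [(0, 1, 0.5), (1, 0, 0.5)]
```

High best-match scores across all clusters would indicate that the two models are recovering largely the same themes, so reviewing one set of outputs is enough.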

Conclusion

Over the course of this project, the WeCloudData team was able to:

  1. Build a comprehensive text preprocessing pipeline to clean the technical notes
  2. Compare and contrast the cluster-word representations from LDA and GSDMM
  3. Find that the GSDMM outputs overlapped with those of LDA and, as a result, present only the LDA outputs to the client for manual inference