Our client is one of Canada’s largest construction vehicle suppliers. They employ thousands of skilled technicians across multiple provinces to support their customers and are known for their excellent customer service and quality of work. The technicians handle repair and maintenance services for the vehicles those customers have purchased.
For each job, the technician writes a note (often 30 words or fewer) outlining the customer's complaint, the cause of the issue, and the solution implemented to resolve it. These notes are categorized into a high-level job type (e.g. repair) and component category (e.g. electrical); however, these categories say very little about the actual work performed.
Over the past several years, technicians have written well over a million notes, none of which have yet been used by the client.
Before the client can make effective use of the note data, the notes should be clustered further based on topic similarity. This additional layer of categorization will simplify the downstream analytical steps and support a more descriptive analysis of vehicle work order data.
Using data from 2018 onward (to limit the number of notes analyzed), the WeCloudData team was tasked with:
- Labelling the technical note data found in work order tables into N topics through topic modelling
- Utilizing technical notes to aid in more descriptive analysis (context) of work order data alongside the job type and component category assigned to a note
Topic Model Approaches
Two topic modelling approaches were proposed:
- Latent Dirichlet Allocation (LDA)
- This assumes multiple topics per document and has been used for a variety of use cases involving short, medium and long notes.
- LDA is the most common topic modelling approach and has several supporting tools for visualizing the cluster-word outputs. For instance, pyLDAvis can plot the cluster-word outputs on a 2D plane.
This figure gives a high-level overview of the steps involved in LDA. The text input can be preprocessed in a number of ways, and the main input for LDA is the number of topics you would like the model to create. Manual inference is used to determine whether the model outputs make sense.
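To make the LDA step more concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling, written in plain Python. It is a toy illustration rather than the implementation used on the project: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the sample notes are all assumptions made for this example. It does show the property described above, namely that each word in a note receives its own topic, so a single note can mix several topics.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA. `docs` is a list of token lists.

    Returns (doc_topic, topic_word): per-document topic counts and
    per-topic word counts. All names and defaults are illustrative.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_words(topic_word, k, n=5):
    """Most frequent words currently assigned to topic k."""
    return [w for w, c in topic_word[k].most_common(n) if c > 0]


# Hypothetical, already-preprocessed notes.
notes = [
    ["hydraulic", "hose", "leak", "replaced", "hose"],
    ["battery", "dead", "replaced", "battery"],
    ["hose", "leak", "seal", "hydraulic"],
]
doc_topic, topic_word = lda_gibbs(notes, num_topics=2)
for k in range(2):
    print(k, top_words(topic_word, k))
```

In practice a library implementation would normally be preferred; the sketch is only meant to show what the "number of topics" input controls.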
- Gibbs Sampling Dirichlet Multinomial Mixture (GSDMM)
- This assumes one topic per document and is intended for short text only (<= 30 words).
- Packages built for LDA, such as pyLDAvis, can also be adapted to work with GSDMM outputs.
This figure gives a high-level overview of the steps involved in GSDMM. As with LDA, the text input can be preprocessed in a number of ways, and the main input for GSDMM is the number of topics you would like the model to create. Manual inference is used to determine whether the model outputs make sense.
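The one-topic-per-document assumption can be illustrated with a small plain-Python sketch of the GSDMM sampling loop. This is a simplified illustration under assumed hyperparameters (`alpha`, `beta`) and made-up notes, not the code used on the project. Unlike LDA, the whole note is moved between clusters as a unit, so each note ends up with exactly one label.

```python
import math
import random
from collections import Counter


def gsdmm(docs, num_clusters, iters=30, alpha=0.1, beta=0.1, seed=0):
    """Toy GSDMM sampler. `docs` is a list of token lists.

    Returns one cluster label per document. Names and defaults are
    illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_counts = [Counter(doc) for doc in docs]
    docs_in = [0] * num_clusters            # documents per cluster
    words_in = [0] * num_clusters           # total words per cluster
    word_counts = [Counter() for _ in range(num_clusters)]

    def move(d, k, sign):
        """Add (sign=1) or remove (sign=-1) document d from cluster k."""
        docs_in[k] += sign
        words_in[k] += sign * len(docs[d])
        for w, c in doc_counts[d].items():
            word_counts[k][w] += sign * c

    labels = []
    for d in range(len(docs)):
        k = rng.randrange(num_clusters)
        labels.append(k)
        move(d, k, 1)

    for _ in range(iters):
        for d in range(len(docs)):
            move(d, labels[d], -1)
            # Log-probability of each cluster for the whole document.
            log_p = []
            for k in range(num_clusters):
                lp = math.log(docs_in[k] + alpha)
                for w, c in doc_counts[d].items():
                    for j in range(c):
                        lp += math.log(word_counts[k][w] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(words_in[k] + vocab_size * beta + i)
                log_p.append(lp)
            # Normalize in log space before sampling to avoid underflow.
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            k = rng.choices(range(num_clusters), weights=weights)[0]
            labels[d] = k
            move(d, k, 1)
    return labels


# Hypothetical, already-preprocessed notes.
notes = [
    ["hydraulic", "hose", "leak"],
    ["hose", "leak", "seal"],
    ["battery", "dead", "starter"],
    ["battery", "charge", "starter"],
]
print(gsdmm(notes, num_clusters=3))
```

The number of clusters passed in acts as an upper bound here: some clusters may end up empty after sampling.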
Text Preprocessing Pipeline
This architecture outlines the 8 steps involved in text preprocessing used in both LDA and GSDMM topic modelling. These steps required the aid of our client’s business analyst in order to better understand the data we were working with.
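The eight steps themselves appear only in the figure, so the sketch below is not a reconstruction of that pipeline. It shows, under assumed steps, what a typical note-cleaning function might look like: lowercasing, stripping punctuation and digits, tokenizing, expanding shorthand, and removing stop words and very short tokens. The stop word list and the abbreviation table are hypothetical placeholders; in a real project the abbreviation table is the kind of domain knowledge a business analyst would supply.

```python
import re

# Hypothetical stop words and shorthand expansions, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "was", "is", "in", "on", "for", "with"}
ABBREVIATIONS = {"repl": "replaced", "hyd": "hydraulic", "eng": "engine"}


def preprocess(note):
    """Turn a raw technician note into a list of cleaned tokens."""
    text = note.lower()                        # normalize case
    text = re.sub(r"[^a-z\s]", " ", text)      # drop punctuation and digits
    tokens = text.split()                      # whitespace tokenization
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]  # expand shorthand
    # Remove stop words and tokens of one or two characters.
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]


print(preprocess("Repl. the HYD hose, leak found on 2 fittings!"))
```

The same cleaned token lists can then be fed to either topic model, which keeps the comparison between the two approaches fair.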
The figure above shows that 8 clusters give the highest coherence score (C_v score) for the LDA model. GSDMM produced a similar result. We were also able to produce the following outputs:
- an intertopic distance map (via pyLDAvis)
- the cluster word frequency and importance by topic.
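Choosing the number of clusters by coherence can be sketched as follows. Note the substitution: the C_v score is more involved and is usually computed with a library (gensim's `CoherenceModel`, for example, supports `coherence='c_v'`), so this sketch uses the simpler UMass coherence, which rewards topics whose top words tend to appear in the same notes. The scores passed to `pick_num_topics` are made-up placeholders, not results from this project.

```python
import math


def umass_coherence(topic_words, docs):
    """UMass coherence for one topic's top words (ordered most frequent first).

    Sums log((D(w_m, w_l) + 1) / D(w_l)) over word pairs, where D counts
    the documents containing the given word(s).
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(topic_words)):
        for l in range(m):
            d_l = doc_freq(topic_words[l])
            if d_l:  # skip words that never occur in the corpus
                co = doc_freq(topic_words[m], topic_words[l])
                score += math.log((co + 1) / d_l)
    return score


def pick_num_topics(score_by_k):
    """Return the number of topics with the highest mean coherence."""
    return max(score_by_k, key=score_by_k.get)


# Hypothetical notes and scores, for illustration only.
notes = [["hose", "leak"], ["hose", "leak"], ["battery", "dead"]]
print(umass_coherence(["hose", "leak"], notes))
print(umass_coherence(["hose", "battery"], notes))
print(pick_num_topics({4: -3.0, 8: -1.0, 12: -2.0}))
```

In a full run, each candidate number of topics would get one model, the coherence of its topics would be averaged, and the averages would be compared in the same way.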
Another key finding was that the 8 cluster outputs from GSDMM overlapped significantly with those from LDA. Given this overlap, and given that LDA was both easier to use and better supported by tooling, the WeCloudData team focused only on LDA’s outputs for the manual inference step, which was to be performed with our client’s business analyst.
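The write-up does not say how the overlap between the two models was measured. One simple way to quantify it, shown here purely as an assumption, is to compare the top words of each LDA cluster with the top words of each GSDMM cluster using Jaccard similarity and report the best match. The word lists below are invented for the example.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of words."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matches(lda_topics, gsdmm_topics):
    """For each LDA topic, find the most similar GSDMM topic.

    Returns {lda_index: (gsdmm_index, similarity)}.
    """
    matches = {}
    for i, lda_words in enumerate(lda_topics):
        scores = [jaccard(lda_words, g_words) for g_words in gsdmm_topics]
        j = max(range(len(scores)), key=scores.__getitem__)
        matches[i] = (j, scores[j])
    return matches


# Hypothetical top words per cluster, for illustration only.
lda_topics = [["hose", "leak", "hydraulic"], ["battery", "dead", "starter"]]
gsdmm_topics = [["battery", "starter", "charge"], ["hose", "leak", "seal"]]
print(best_matches(lda_topics, gsdmm_topics))
```

High best-match similarities across all clusters would support treating the two models as interchangeable for the manual inference step.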
Over the course of this project, the WeCloudData team was able to:
- Build a comprehensive text preprocessing pipeline to clean the technical notes
- Compare and contrast the cluster-word representations from LDA and GSDMM
- Find that the GSDMM outputs overlapped with those from LDA and, as a result, present only the LDA outputs to the client for manual inference