Online Shopping Redefined: Predicting Shopper Behavior with Machine Learning

February 7, 2024


Imagine you’re walking through your favorite shopping district, your smartphone in hand, casually browsing through a mix of online suggestions and storefront displays. Each choice you make, from lingering in front of a window display to clicking through a web page, feeds into an invisible network of data, analyzed by algorithms to predict your next move. This is no longer science fiction, but a reality powered by machine learning.

In the world of retail, understanding shopper behavior has become paramount. Businesses are turning to machine learning to sift through vast amounts of data generated by every click, purchase, and even passive browsing. By analyzing these patterns, companies can predict future buying behaviors, tailor recommendations, and personalize shopping experiences to an unprecedented degree. This technological evolution is reshaping the retail landscape, offering insights that were once beyond reach.

The application of machine learning in predicting shopper behavior exemplifies a significant shift in how data can be used to enhance the customer experience. As McKinsey & Company highlights, this digital transformation is driven by the need to understand the nuanced preferences of consumers, including the millennial and GenZ generations’ inclination towards innovative brands and conscious consumption​​ [1]

Google’s chief economist, Hal Varian, refers to this data-driven optimization as “computer kaizen,” emphasizing the continuous improvement and experimentation that machine learning brings to traditional business processes​ [2]. Machine learning’s role extends beyond analyzing consumer trends. It also impacts operational aspects such as product sales, customer retention, and even treasury pricing, offering a holistic approach to understanding and serving the modern shopper​ [3]

As businesses embrace machine learning, they are not just adopting a set of algorithms but integrating a strategic vision that places data at the core of decision-making processes. This integration challenges companies to rethink their strategies, ensuring that technology investments are aligned with broader business objectives to foster sustainable growth​​.

In addition to the insights provided by McKinsey, the application of machine learning (ML) and artificial intelligence (AI) in the retail sector is further explored by other industry leaders, offering a broader perspective on how these technologies are driving innovation. 

Dataiku discusses the high-value use cases for AI in retail and consumer packaged goods (CPG), emphasizing the transformative impact on marketing, sales, and supply-chain management. These applications range from hyper-targeted marketing campaigns and customer journey optimization to revolutionizing warehouse and store management. The ability of AI to personalize product recommendations by analyzing aggregated user data is highlighted as a key advantage, enabling businesses to deliver unique, one-on-one customer experiences. This approach not only enhances customer satisfaction but also optimizes inventory management and product assortment in both digital and physical retail environments [4]​​.

Furthermore, a white paper by Devoteam G Cloud on Machine Learning & MLOps in Google Cloud elaborates on the technical infrastructure and methodologies required to implement ML at scale. The paper emphasizes the importance of a systematic approach to ML, from data exploration and experimentation to model training, hosting, monitoring, and ensuring model transparency and fairness. This structured approach enables retailers to effectively apply ML to improve various aspects of their operations, including demand forecasting, customer segmentation, and operational efficiency, thereby driving growth and innovation​​ [5].

These insights from Dataiku and Devoteam G Cloud illustrate the dynamic nature of ML and AI in retail, underscoring the vast potential for businesses willing to invest in these technologies. By adopting a strategic and well-structured approach to ML, retailers can unlock new levels of efficiency, customer engagement, and market competitiveness.

As you walk through that shopping district, behind the scenes, machine learning algorithms are at work, predicting, personalizing, and enhancing every aspect of your shopping experience. This is the power and promise of machine learning in retail, a testament to how far technology has come in understanding the subtleties and nuances of human behavior.

Please note that the aim of this article is not to delve into the complexities of machine learning, but rather to illustrate its practical utility in everyday life. We aim to offer a glimpse into the practical applications of machine learning, sparking an interest that may encourage you to explore the field of data science and machine learning further. For the scope of this article, we will focus on the application of machine learning techniques to predict online shopping behavior, demonstrating how data-driven insights can transform the retail industry by personalizing the customer experience and optimizing operational efficiency.

Now, let’s take a look at some code snippets to see machine learning in action for online shopper buying behavior prediction.

The code snippets in this article are sourced from a Kaggle notebook by Tanisha Hardei, titled “E-commerce Shopper Purchase Prediction and Analysis”. Kaggle is a platform that provides a community for data scientists and machine learning practitioners to explore, analyze, and share quality data sets and build machine learning models. It offers a collaborative environment where users can participate in competitions, access a broad range of datasets, and gain insights from other professionals in the field.

Continuing from our previous discussion on Kaggle, this code snippet below sets up the foundation for a machine learning project focused on consumer buying prediction. It includes importing essential libraries, such as numpy and pandas for data manipulation, matplotlib and seaborn for data visualization, and ydata_profiling for generating comprehensive reports on datasets. This setup is crucial for exploring, cleaning, and understanding data before applying any machine learning or statistical models, making it a foundational step in any data science project.

The code below loads a dataset named ‘online_shoppers_intention.csv’ from a specified path on Kaggle into a Pandas DataFrame, which is a structure used for data manipulation in Python.

The .head() function is used to display the first five rows of this dataset.

Online shoppers intention dataset overview – top five rows:

The dataset offers insights into online shopping habits and interactions on a website, detailing various column names (also known as features or attributes) that capture information such as page visit types and durations, bounce and exit rates, and demographic details. These columns help in analyzing what influences customers’ online shopping decisions and behaviors:

  • Administrative: The number of visits to administrative pages by the user.
  • Administrative_Duration: The total amount of time (in seconds) spent by the user on administrative pages.
  • Informational: The number of visits to informational pages by the user.
  • Informational_Duration: The total amount of time (in seconds) spent by the user on informational pages.
  • ProductRelated: The number of visits to product-related pages by the user.
  • ProductRelated_Duration: The total amount of time (in seconds) spent by the user on product-related pages.
  • BounceRates: The average bounce rate of the pages visited by the user. The bounce rate is the percentage of visitors who enter the site and then leave (“bounce”) rather than continuing to view other pages within the same site.
  • ExitRates: The average exit rate of the pages visited by the user. The exit rate is the percentage at which users exit from a page or set of pages.
  • PageValues: The average value of the pages visited by the user. This metric is used to measure the effectiveness of a page in contributing to the final goal (e.g., completing a purchase).
  • SpecialDay: The closeness of the site visiting time to a special day (e.g., Mother’s Day, Valentine’s Day) in which the sessions are more likely to be finalized with a transaction.
  • Month: The month of the visit.
  • OperatingSystems: The operating system used by the user.
  • Browser: The browser used by the user.
  • Region: The region from which the user is visiting.
  • TrafficType: The type of traffic that led the user to the website.
  • VisitorType: The type of visitor, categorized as “Returning Visitor”, “New Visitor”, and possibly other types.
  • Weekend: A boolean indicating whether the visit took place on a weekend.
  • Revenue: A boolean indicating whether the visit resulted in revenue (transaction).

Examining summary statistics is beneficial because it provides a quick overview of key characteristics of the numerical data. The code below generates summary statistics for the numerical variables within the 

dataset and transposes the result to display it in a more readable format.

These statistics include measures like mean (average), standard deviation (a measure of data spread), minimum, maximum, and quartiles. They help data analysts and scientists understand the central tendency, variability, and distribution of the data, which is crucial for making informed decisions about data preprocessing, feature selection, and modeling. Summary statistics also aid in identifying potential outliers or unusual patterns in the data.

Here’s what the .describe() method does:

  • Count: Shows the number of non-missing entries in each column.
  • Mean: Provides the average value for each column.
  • Std (Standard Deviation): Indicates the amount of variation or dispersion in each column.
  • Min: The smallest value in each column.
  • 25% (First Quartile): The value below which 25% of the data falls.
  • 50% (Median): The middle value of the dataset.
  • 75% (Third Quartile): The value below which 75% of the data falls.
  • Max: The largest value in each column.

Overview of dataset characteristics with the .describe() method:

Having highlighted the utility of summary statistics through the .describe() method to grasp the essential tendencies and variations within our data, we transition to a broader perspective by exploring metadata. This step will provide us with more insights into the data’s structure and characteristics, enhancing our understanding of the dataset’s nuances.


Metadata is an overview of the data itself. It’s the data about the data. Metadata provides a high-level summary about the characteristics of the dataset. It includes details such as the column names, the types of data contained in each column (numerical, textual, etc.), and the count of non-null entries in each column, among other aspects. The code snippet below is typically utilized to retrieve the metadata of a dataset.

Upon executing the code, a succinct summary is displayed, offering valuable insights into the various data types present and the memory usage in the dataset. This information is crucial for deeper analytical efforts. The output generated might begin as follows:

  • Range Index: The dataset is indexed from 0 to 12329, providing a unique identifier for each row.
  • Columns: There are a total of 18 columns, each representing a different attribute related to online shopping behavior and website interaction.
  • Column Details: Each column’s non-null count is 12330, indicating there are no missing values in any of the columns.
  • Data Types: The dataset primarily consists of integer (int64) and floating-point (float64) values, with two columns (‘Month’ and ‘VisitorType’) having object (string) data types, and two columns (‘Weekend’ and ‘Revenue’) are of boolean type, indicating true or false values.
  • Memory Usage: The dataset consumes approximately 1.5 MB of memory.

This summary provides a quick overview of the dataset’s structure, size, and the types of data it contains, which is useful for preliminary data analysis and serves as a foundation for subsequent model building.

One of the primary responsibilities of a data scientist or machine learning practitioner involves communicating findings to stakeholders. Therefore, an integral part of their role is to create visualizations that convey insights derived from data analysis, ensuring that the information is accessible and understandable.

To illustrate this point, let’s take a look into specific visualizations.

The bar chart below illustrates how special days might affect customer engagement on a website, as measured by average revenue. Each bar represents a different level of “specialness” of the day, ranging from a regular day (0.0) to a highly special day (1.0), such as a holiday or sale event. The height of each bar shows the average revenue generated on these days. The chart suggests that regular days (0.0) have the highest average revenue, with a noticeable decrease on days that are somewhat special (0.2 to 0.8), and a slight increase again on the most special days (1.0).

The second visualization is a bar chart that presents the total revenue generated from different types of web traffic. Each bar represents a traffic type, labeled from 1 to 20, which could refer to various sources such as direct visits, search engines, referrals, etc. The height of the bar indicates the total revenue associated with each traffic type. This visualization helps identify which traffic types are most valuable in terms of revenue generation. For instance, types 2, 3 and 4 appear to bring in the most revenue compared to others.

In summary, these visualizations are instrumental for comprehending the influence of website engagement and traffic sources on online shopping revenue. The data demonstrates how factors like special days and the type of traffic can impact customer behavior and sales outcomes. Such insights are pivotal for optimizing marketing strategies and enhancing the user experience to boost online transactions.

So far, we have explored how data visualization helps in understanding some key insights in relation to online shopping behavior. Building on these insights, we now turn our attention to the process of feature engineering, where we refine and transform data to further enhance model performance.

Feature Engineering

In the realm of data analytics for online shopping behavior, a crucial preliminary step before training machine learning models is feature engineering. This process entails transforming the raw dataset into a format that’s more amenable to analysis by machine learning algorithms, enhancing their capability to forecast outcomes such as customer purchases.

For the online shopping dataset, feature engineering typically includes several essential actions. These are the extraction of significant attributes from user interaction data, meticulous handling of any missing values to preserve data quality, encoding of categorical variables like month and visitor type, and standardizing or scaling of numerical variables such as durations and rates.

Furthermore, it may involve creating new features through methods like combining related attributes or breaking down existing ones into more detailed components, as well as pinpointing the most impactful features that drive the accuracy of the model’s predictions. By rigorously refining and curating the dataset with these feature engineering techniques, we can substantially enhance the model’s insight into consumer behavior patterns and the factors influencing online shopping decisions, leading to more precise and actionable predictions.

Let’s delve into some key elements of feature engineering for predicting online shopping behavior:

  • Data Extraction and Refinement: Sifting through user data to isolate meaningful attributes and carefully managing any incomplete records to ensure the robustness of the dataset.
  • Feature Transformation: Transforming categorical data into a numerical format that’s interpretable by machine learning models and adjusting numerical variables to a standard scale to prevent any one feature from disproportionately influencing the prediction due to its magnitude.
  • Feature Creation and Selection: Generating new features by aggregating similar data points or decomposing complex attributes into simpler ones, and choosing the most pertinent features that bolster the predictive precision of the model.
  • Optimization for Predictive Accuracy: The overarching goal is to fine-tune the dataset, augmenting the model’s proficiency in recognizing patterns and determinants that affect online shopping outcomes, thereby refining the reliability and utility of its predictions.

After exploring feature engineering, the next pivotal step is Model Training. In this phase, we use the processed data, now enriched with engineered features, to instruct the machine learning model on distinguishing between the nuances of online shopper behavior. With this foundational work in place, we are ready to advance to the topic of training the model.

Training the Model

In this phase, the machine “learns” to distinguish between behaviors that indicate a likelihood of making a purchase and those that do not. The dataset is segmented into training and testing groups, a critical step to ensure the model is honed on a specific subset of data while its performance is evaluated on another set of unseen data. This division is fundamental for an accurate assessment of how well the model can predict online shopping outcomes.

As emphasized earlier, this article does not aim to serve as an exhaustive tutorial on machine learning. Rather, its purpose is to showcase how machine learning techniques can be applied to analyze and predict online shopping behaviors.

The following code snippets are designed to lay the groundwork for the various steps involved in training a model to predict online shopping intentions. While the provided code does not constitute a complete program, it gives a glimpse into the type of programming required to build and train a model tailored for analyzing online shopper engagement and predicting purchase decisions.

Moving forward, we’ll explore the Model Interpretation phase, focusing on how our model predicts online shopping behavior. Here, the emphasis isn’t on the complexity of the model but on its accuracy and reliability. It’s crucial to understand that the model does more than just churn out predictions; it provides insights into customer behaviors and preferences, translating raw data into actionable intelligence. In simpler terms, the goal is not only to assess if the model is effective but to understand the mechanics behind its predictions and the reasons we can trust its guidance. This comprehension is key to confidently relying on the model for strategic decision-making processes.

Model Interpretation

Model Interpretation in the context of predicting online shopping intentions using machine learning delves into understanding how algorithms analyze customer interaction data—like page visits, session durations, and transaction histories—to differentiate between visitors likely to make a purchase and those just browsing. This phase is crucial for ensuring that the insights provided by the model are not only accurate but also meaningful, explainable and actionable, fostering a level of transparency that builds trust with stakeholders. Explainability can still be helpful even when regulators or customers aren’t asking for it [6]. Explainability is not merely a regulatory checkbox but a critical component of ethical machine learning practices that contribute to more trustworthy and reliable systems.

Interpreting these “black box” models can be challenging, as it requires simplifying complex algorithms without losing essential details. It extends beyond mere prediction accuracy; it involves unraveling the factors that influence shopping behaviors and identifying patterns that might not be immediately apparent. For example, understanding how seasonal trends or special promotions impact purchase likelihood can offer valuable insights for tailoring marketing strategies and optimizing user experiences.

Evaluating model performance involves analyzing key metrics, including conversion rates, average order value, and customer lifetime value, to assess the effectiveness of the model. Model evaluation plays a pivotal role in this process, offering a quantitative measure of the model’s effectiveness. Beyond looking at business-centric KPIs like conversion rates, average order value, and customer lifetime value, a suite of machine learning metrics provides a deeper understanding of the model’s predictive capabilities:

  • Precision: Imagine you send out 100 promotional emails to potential customers, hoping they’ll make a purchase. If 80 of those who received the email actually make a purchase, your precision is high. This means you’re really good at picking out who to send these emails to — targeting people likely to buy, rather than those just browsing.
  • Recall: Suppose there are 200 people who would buy something from your store if they received a promotional email, but you only sent emails to 100 of them. If all 100 you emailed made a purchase, your recall shows you’re not reaching everyone who might buy. You caught some potential buyers, but missed out on others, indicating you could extend your marketing to more people to increase sales.
  • F1 Score: This is like finding the sweet spot between sending too many emails (and bothering people who won’t buy) and sending too few (and missing out on sales). The F1 Score helps you balance getting your promotion to most of the people who will buy without wasting effort on those who won’t.
  • ROC-AUC: Think of this as the overall skill of your store’s system in distinguishing between two groups of visitors: those who will end up buying something and those who will not. A higher score means your system is better at predicting who will buy and who is just browsing, similar to knowing exactly which customers to greet personally to make a sale.

These metrics, alongside sensitivity analysis, aid in demystifying the model’s decision-making process. They allow businesses to refine marketing strategies, personalize customer experiences, and optimize conversion paths effectively.

Furthermore, the application of advanced interpretability and explainability techniques, such as feature importance rankings and partial dependence plots, clarifies why certain predictions are made. This not only enhances transparency but empowers business users to leverage predictive insights confidently.

In conclusion, the integration of Model Interpretation and Evaluation into the analytics process for online shopping behavior is important. It transforms predictive modeling from a mere computational task into a strategic asset that aids decision-making. By employing a holistic approach that combines actionable insights with rigorous performance metrics, businesses can align their e-commerce strategies with data-driven intelligence, thus improving customer engagement, optimizing marketing efforts, and driving significant revenue growth in the competitive online marketplace.

Key Takeaways

This article highlights the impact of machine learning in retail, showcasing its role in analyzing consumer interactions to predict purchasing behaviors and customize the shopping experience. By featuring insights from McKinsey & Company and Google, along with practical examples and code snippets from Kaggle, it demonstrates how data-driven strategies enhance customer engagement and operational efficiency. Advanced techniques such as feature importance rankings and partial dependence plots illustrate the transparent and actionable insights provided by machine learning models. Ultimately, the article positions machine learning as a strategic asset in retail, and while the article aims not to delve into technical details it showcases machine learning’s role in enhancing the quality of making data-driven business decisions, with a hope to inspire readers to explore the fields of data science and machine learning further. 

Next Steps

  • AI and Machine Learning Specialists top the list of fast-growing jobs, followed by Sustainability Specialists and Business Intelligence Analysts [7]. This insight from the World Economic Forum’s 2023 report highlights the growing demand in these fields. To position yourself advantageously in the job market, consider WeCloudData’s bootcamps. With real client projects, one-on-one mentorship and hands-on experience in our immersive programs, it’s an opportunity to develop expertise in AI and machine learning, aligning your skills with the current and future needs of the job market. Set yourself up for success in a dynamic and rewarding career, take action now! 


[1] McKinsey & Company, “The new model for consumer goods,” on McKinsey & Company website, accessed February 5, 2024,

[2] McKinsey & Company, “An executive’s guide to machine learning,” on McKinsey & Company website, accessed February 5, 2024,

[3] McKinsey & Company. “Beyond the buzz: Harnessing machine learning in payments.” on McKinsey & Company website, accessed February 5, 2024,

[4] Dataiku, “Top High-Value AI Use Cases in Retail & CPG,” on Dataiku Blog, November 14, 2019,

[5] Devoteam G Cloud, “Machine Learning & MLOps in Google Cloud,” on Devoteam G Cloud website, accessed February 5, 2024,

[6] Harvard Business Review, “How AI Is Transforming Retail and E-Commerce,” Harvard Business Review, accessed February 5, 2024,

[7] World Economic Forum (2023). “Future of Jobs Report 2023.” Retrieved from

Join our programs and advance your career in Machine Learning Engineering

"*" indicates required fields

This field is for validation purposes and should be left unchanged.
Other blogs you might like
Career Guide
It has been approximately one year since I decided to make a career switch from Civil Engineering to the…
by Student WeCloudData
February 10, 2020
Learning Guide, WeCloud Faculty
This blog post was written by WeCloudData’s Data Science Instructor, Vinny Nguyen. Hi! I’m Vinny and I’m a data…
by WeCloudData
September 27, 2021
WeCloud Faculty
The blog is posted by WeCloudData’s AI instructor Rhys Williams. In this two-parter I’ll bounce from the conception of…
by WeCloudData Faculty
July 2, 2020

Kick start your career transformation