Market Basket Analysis

When talking to salespeople about increasing revenue, they tend to always say: “There are fundamentally two strategies to increase revenue: either encourage customers to purchase higher-priced items or to buy a greater quantity of products.” This essentially boils down to two primary strategies: Upselling and Cross-selling.

Upselling: This strategy involves encouraging customers to buy more expensive products. It’s about increasing the average order value by persuading the customer to purchase a higher-end version of the product (pricing ladder plan) they are interested in.

Cross-selling: In contrast, cross-selling focuses on convincing customers to buy additional products. Unlike upselling, cross-selling targets the quantity of purchases rather than swapping out the initial choice with something pricier.

Imagine a small drug store owner who notices that some products seem to sell out faster when placed near each other. Curious about this phenomenon, the owner decides to use Market Basket Analysis to explore purchasing patterns and identify items that customers frequently buy together.

The analysis uncovers that customers often purchase cold medicine and tissues simultaneously and those who buy vitamins usually also pick up health supplements. However, it also reveals that placing certain items like candy next to dietary supplements leads to decreased sales of the latter, as the tempting candy distracts from health-focused purchases.

Armed with these insights, the drug store owner strategically places complementary items next to each other – cold medicine next to tissues and vitamins next to health supplements – while intentionally separating contrasting categories such as candy and dietary supplements.

Market Basket Analysis is not just about increasing immediate sales; it’s also about understanding the customer deeper to foster a more engaging and personalized shopping experience. So what exactly is MBA?

Defintion

Market Basket Analysis explores purchasing patterns by identifying items that customers frequently buy together, providing invaluable insights into potential-prefered product combinations by customers. This analysis supports businesses in:

– Build and improve product recommendations for an e-commerce store.

– Cross-sell products in a retail store.

– Improve inventory management.

– Upsell products.

Introducing the data

The Online Retail II dataset consists B2B retail data from a UK-based online store, capturing over one million transactions from 2009 to 2011. This data is distinct from B2C data and offers unique insights into purchasing patterns in business-to-business environments.

For more details, visit the Online Retail II dataset page.

Here are the first five rows of the dataset after renaming the columns for better understanding:

	invoice_number	product_id	product_description	quantity	invoice_date	price	customer_id	country
0	489434	85048	15CM CHRISTMAS GLASS BALL 20 LIGHTS	12	01/12/2009 07:45	6.95	13085.0	United Kingdom
1	489434	79323P	PINK CHERRY LIGHTS	12	01/12/2009 07:45	6.75	13085.0	United Kingdom
2	489434	79323W	WHITE CHERRY LIGHTS	12	01/12/2009 07:45	6.75	13085.0	United Kingdom
3	489434	22041	RECORD FRAME 7" SINGLE SIZE	48	01/12/2009 07:45	2.10	13085.0	United Kingdom
4	489434	21232	STRAWBERRY CERAMIC TRINKET BOX	24	01/12/2009 07:45	1.25	13085.0	United Kingdom

Data Preperation

In preparing the B2B retail dataset for analysis, the following steps were undertaken:

Initial Data and Concatenation: Combined two CSV files into a single DataFrame, resulting in 1,067,371 records and 8 columns.
Renaming Columns: Columns were renamed for clarity, such as ‘Invoice’ to ‘invoice_number’ and ‘StockCode’ to ‘product_id’.
Removing Duplicates: Duplicate rows were dropped, reducing the dataset to 1,033,036 entries.
Handling Missing Values: Identified significant missing data in ‘product_description’ and ‘customer_id’. Observed that missing descriptions often had zero prices and negative quantities, indicating non-sales transactions.
Data Cleaning: Excluded transactions with missing descriptions and negative quantities, resulting in a cleaned dataset of 1,008,898 entries.

Extracting Seasonal Insights with Market Basket Analysis

Market Basket Analysis (MBA) is an excellent tool for uncovering product affinities and customer buying patterns. However, applying MBA to the entire dataset from December 2009 to December 2011 might not yield the most actionable insights. Instead, I focus on a specific timeframe: March 2010 to July 2010. This period typically represents a seasonal change in consumer purchasing behavior, making it ideal for analysis.

By extracting transactions within this window using the following code snippet, I aim to gain insights that can inform targeted promotional strategies for late summer (August/September) 2010:

				
					spring_summer_2010 = trans_data[(trans_data['invoice_date'] >= '2010-03-01') & (trans_data['invoice_date'] <= '2010-07-31')]

In total, the dataset contains 40,302 transactions, with 7,843 occurring during the spring and summer period of 2010 (which still translates to +170.000 rows of transactional data).

Basic retail stats and distribution

Between March 2010 and July 2010, 3,580 unique products were sold. Among these, 353 products performed exceptionally well. The middle 50% of the 3,551 products sold between 5 to 57 times, indicating a significant spread in sales frequencies. The analysis also revealed an average basket size of 22.54 items per transaction and an average transaction value of £442.21, underscoring the purchasing behavior during this period.

Key Metrics to consider

Antecedents: Items that appear in a transaction before others. For example, in the rule {cold medicine} → {tissues}, "cold medicine" is the antecedent.

Consequents: Items that appear in a transaction as a result of the antecedents. In the rule {cold medicine} → {tissues}, "tissues" is the consequent.

Support: Measures how frequently the items in a rule appear together in the dataset. Higher support indicates the rule is applicable to a larger number of transactions. The formula is: \[ \text{Support}(A \rightarrow B) = \frac{\text{Transactions containing both A and B}}{\text{Total transactions}} \]

Confidence: Indicates how often the items in the consequent appear in transactions that contain the antecedents. Higher confidence means a stronger association. The formula is: \[ \text{Confidence}(A \rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)} \]

Lift: Evaluates the performance of a rule compared to random chance. A lift greater than 1 indicates a positive association; the higher, the better. It is calculated as: \[ \text{Lift}(A \rightarrow B) = \frac{\text{Confidence}(A \rightarrow B)}{\text{Support}(B)} \]

Leverage: Measures the difference in the occurrence of the items together versus what would be expected if they were independent. A higher leverage value indicates a stronger association. The formula is: \[ \text{Leverage}(A \rightarrow B) = \text{Support}(A \cup B) - (\text{Support}(A) \times \text{Support}(B)) \]

Conviction: Indicates the dependency between the antecedents and consequents, showing how much more often the antecedents predict the consequents than expected by chance. Higher conviction values indicate stronger associations. The formula is: \[ \text{Conviction}(A \rightarrow B) = \frac{1 - \text{Support}(B)}{1 - \text{Confidence}(A \rightarrow B)} \]

Zhang's Metric: A measure that combines support, confidence and the size of the dataset to evaluate the strength of association rules. A higher Zhang's metric value indicates a stronger and more interesting association.