Instacart Market Basket Analysis
Which of the previously ordered products will a user reorder in their next order?
Problem Description:
Instacart is a grocery ordering and delivery app: customers select products on the app or website, and a personal shopper handpicks those products in-store and delivers the order. In this Kaggle competition, Instacart released anonymized data and challenged Machine Learning practitioners to build models that analyze customer reorder patterns and predict which products a customer will reorder based on their previous shopping data.
Groceries include daily essentials, food, skin care products and many other items that are likely to be reordered. Some of this reordering behavior is common to almost everyone (daily essentials, food products), while the rest differs from person to person. Even when everyone needs the same type of product, the choice differs by brand and other factors: everyone needs a shampoo, but which shampoo they buy depends on their interests. The products a customer chooses depend on their taste, cultural background, lifestyle and more, yet these ordering patterns can be analyzed from the many orders customers place over time.
The business requirement is to predict which previously purchased products will be in a customer's next order, using the anonymized data (millions of records) open-sourced by Instacart. Such a model improves both the user experience and Instacart's business. Note that this differs from a regular recommendation problem: there is no cold-start issue, since the aim is only to predict whether a user will reorder a particular previously purchased product, not to suggest new products.
Dataset Analysis:
The dataset is a relational set of files describing customer orders over time.
- aisles.csv: This csv file has two fields, aisle_id and aisle name.
- departments.csv: A department has a number of aisles. This csv file has two fields, department_id and department_name.
- orders.csv: This csv file has information about customers' orders over time. It has seven fields: order_id (unique id of an order), user_id (unique id of the user who placed the order), eval_set (whether the record belongs to the prior, train or test set), order_number, order_dow (day of week), order_hour_of_day, days_since_prior_order. For each user, Instacart provided between 4 and 100 of their orders.
- products.csv: This csv file has four fields. product_id(unique id of a product), product_name, aisle_id(aisle id in which this product is available), department_id(department id in which this product is available).
- order_products__prior.csv: This csv file has records of users' previous orders (all except the most recent). eval_set=prior in the orders file means the order belongs to this file. It has four fields: order_id, product_id, add_to_cart_order, reordered (0 = newly ordered product, 1 = previously ordered product).
- order_products__train.csv: This csv file has records of users' most recent orders. eval_set=train in the orders file means the order belongs to this file. The fields are the same as in order_products__prior.csv.
The dataset has no information about the quantity of each product ordered.
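A minimal pandas sketch (assuming the file names and fields listed above) of how these relational files can be joined into a single view of the prior orders; the merged prior_full frame is reused in later sketches:

```python
import pandas as pd

orders = pd.read_csv("orders.csv")
prior = pd.read_csv("order_products__prior.csv")
products = pd.read_csv("products.csv")
aisles = pd.read_csv("aisles.csv")
departments = pd.read_csv("departments.csv")

# Attach user and order metadata, then product/aisle/department info,
# to every line of every prior order.
prior_full = (
    prior.merge(orders[orders.eval_set == "prior"], on="order_id")
         .merge(products, on="product_id")
         .merge(aisles, on="aisle_id")
         .merge(departments, on="department_id")
)
print(prior_full.shape)
```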
Problem Formulation for Machine Learning:
In order_products__train we have the record of each user's most recent order. Using the relational tables we can fetch the user_id that placed each order_id, so the data can be arranged as (user_id, product_id, reordered) triples.
Here reordered=1 means that it’s a product which the user ordered before and reordered=0 means that it’s a product that the user is ordering for the first time.
But the task of this competition is to predict which of the previously purchased products will be in the customer's next order, so newly ordered products should not be considered. We can therefore remove records with reordered=0 and, for each user, take the products from order_products__prior that they did not order in their most recent order (order_products__train) and label them 0.
The above picture illustrates how I formulated the data for ML modelling, using the most recent order of user_id=1. This user ordered 18 unique products in the prior data, out of which 10 were reordered in the most recent order. Here reordered=1 means the user repurchased that previously ordered product in the most recent order, and reordered=0 means they did not. From the prior data we can now extract user features, product features, user-product interaction features, time-related features and many more, and build ML models with reordered as the class label.
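Below is a minimal pandas sketch of this labelling step, reusing the raw files loaded earlier; only users who have a train order are kept, since test users carry no label:

```python
import pandas as pd

orders = pd.read_csv("orders.csv")
prior = pd.read_csv("order_products__prior.csv")
train = pd.read_csv("order_products__train.csv")

# Candidate set: every (user, product) pair seen in the prior history,
# restricted to users whose most recent order is in the train set.
train_users = orders.loc[orders.eval_set == "train", "user_id"].unique()
user_prior = (prior.merge(orders[orders.eval_set == "prior"], on="order_id")
                   [["user_id", "product_id"]]
                   .drop_duplicates())
user_prior = user_prior[user_prior.user_id.isin(train_users)]

# Positives: previously purchased products that reappear (reordered=1)
# in the user's most recent (train) order.
train_pos = (train[train.reordered == 1]
             .merge(orders[orders.eval_set == "train"], on="order_id")
             [["user_id", "product_id"]]
             .assign(reordered=1))

# Label every candidate pair: 1 if reordered in the train order, else 0.
labelled = (user_prior.merge(train_pos, on=["user_id", "product_id"], how="left")
            .fillna({"reordered": 0})
            .astype({"reordered": int}))
```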
This is a Binary Classification problem and data is highly imbalanced.
Performance Metric:
Mean F1-score is the performance metric as per the Kaggle competition, but considering the real-world scenario, the F1-score on the ‘1 = will reorder’ class is the metric on which models should be evaluated in particular. Further, the recall of ‘will reorder’ can be monitored.
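A minimal sketch of how the mean F1-score can be computed; order_f1 and mean_f1 are hypothetical helpers, not Kaggle's own implementation, and an order with no reordered products ('None') is treated here as an empty set:

```python
def order_f1(predicted, actual):
    """F1 between the predicted and actual sets of reordered products of one order."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:   # both say "None"
        return 1.0
    tp = len(predicted & actual)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(actual)
    return 2 * precision * recall / (precision + recall)

def mean_f1(predictions, truths):
    """predictions, truths: dicts mapping order_id -> iterable of product_ids."""
    return sum(order_f1(predictions[o], truths[o]) for o in truths) / len(truths)
```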
Exploratory Data Analysis:
It’s all about questioning.
1. How many orders have no reordered products?
We can observe that most orders contain products previously purchased by the user (a pandas sketch of the first two questions follows this list).
2. Which departments have higher reorder percentage?
3. What are top-20 most reordered products?
Fruits and veggies, especially ‘Organic’ ones, showed complete dominance.
4. On which days of the week were the most products ordered and reordered?
Is there anything different with meat and alcohol?
More orders were placed on days 0 and 1 of the week, and the percentage of reorders followed the same pattern. But alcohol products were ordered most on days 4 and 5 of the week, possibly because the weekend is nearing.
5. How many unique products each user ordered?
This looks like it exhibits log-normal behavior. Does it?
Yes, to some extent.
6. What’s the gap between orders for orders with no reordered products(None orders)?
A significant number of orders consisting only of freshly ordered products (no reorders) were placed a month after the user's last order.
How about orders with reordered products then?
Weekly orders gained significance followed by monthly orders.
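The sketch below, reusing the merged prior_full frame from the data-loading example, shows how the first two questions can be answered in pandas (the department name column is assumed to be named as in the dataset description):

```python
# 1. What fraction of orders contain no reordered products?
no_reorder_frac = (prior_full.groupby("order_id")["reordered"].max() == 0).mean()
print(f"orders with no reordered products: {no_reorder_frac:.1%}")

# 2. Reorder percentage per department.
dept_reorder_pct = (prior_full.groupby("department_name")["reordered"]
                    .mean()
                    .sort_values(ascending=False) * 100)
print(dept_reorder_pct.head(10))
```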
An in-depth EDA is available in my Github repository. I further explored the relation between day of week and hour of day for some departments, the relation between departments and monthly/weekly orders, and a lot more.
Featurization:
I implemented some of the most important features from the 2nd place winner's blog, and of course they worked well.
You can check: https://medium.com/kaggle-blog/instacart-market-basket-analysis-feda2700cded
Here I will showcase only the features I developed additionally.
1. Features based on Cosine Similarity using vector representations from Singular Value Decomposition (SVD):
Using SVD, I obtained a vector representation of every user and every product, using the number of times a user purchased a product as the interaction value in the user-product matrix. After converting every vector into a unit vector (division by its magnitude, for cosine similarity), I calculated the cosine similarity of each previously purchased product of a user with the top-3 most ordered products of that same user. From this I developed three features: the maximum, the sum and the product of the top-3 similarities.
Here we can observe that data points with maximum similarity = 1 (i.e., the product is among the user's top-3 most ordered products) were reordered more.
The number of reorders peaked at a sum of top-3 similarities equal to 1, and the dominance continued for higher values.
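A sketch of how these three features can be computed, reusing the prior_full frame from the data-loading sketch; the number of SVD components (50) is an illustrative choice, not necessarily the value used in the project:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

# User x product matrix with purchase counts as the interaction values.
counts = (prior_full.groupby(["user_id", "product_id"])
          .size().reset_index(name="n"))
u_idx = {u: i for i, u in enumerate(counts.user_id.unique())}
p_idx = {p: i for i, p in enumerate(counts.product_id.unique())}
mat = csr_matrix((counts.n,
                  (counts.user_id.map(u_idx), counts.product_id.map(p_idx))))

svd = TruncatedSVD(n_components=50, random_state=0)
user_vecs = normalize(svd.fit_transform(mat))   # unit vectors per user
prod_vecs = normalize(svd.components_.T)        # unit vectors per product

def top3_similarity_features(user_id, product_id):
    """Cosine similarity of a candidate product with the user's top-3
    most ordered products: (max, sum, product) of the three similarities."""
    top3 = counts[counts.user_id == user_id].nlargest(3, "n").product_id
    cand = prod_vecs[p_idx[product_id]]
    sims = [float(cand @ prod_vecs[p_idx[p]]) for p in top3]
    return max(sims), sum(sims), float(np.prod(sims))
```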
2. Average number of times a product ordered:
The average number of times a product is ordered is higher when the product is ordered many times by a small number of unique users. This helps identify products that are ordered only by a particular set of people (for example, pet-related products will not be ordered by everyone).
The dominance of reorders increased as this average increased.
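A short sketch of this feature under the same assumptions as the earlier sketches:

```python
# For each product: total number of purchases divided by the number of
# distinct users who bought it.
prod_stats = (prior_full.groupby("product_id")
              .agg(total_purchases=("order_id", "size"),
                   unique_users=("user_id", "nunique")))
prod_stats["avg_times_ordered"] = (prod_stats.total_purchases
                                   / prod_stats.unique_users)
```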
3. User-product order ratio after purchasing that product for the first time:
Suppose User-A placed 100 orders so far and, starting from the 51st order, purchased Product-A 30 times. Taking the ratio of the number of times Product-A was ordered over all of the user's orders gives 30/100 = 0.3, yet User-A clearly purchased Product-A frequently in recent orders. So this feature captures the ratio of how many times User-A purchased Product-A over the orders placed after purchasing it for the first time: 30/50 = 0.6. I developed this both for the most recent 5 orders and for all orders of every user.
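A sketch of the all-orders version of this feature (the recent-5-orders variant follows the same pattern on a filtered frame); column names follow the earlier sketches:

```python
import pandas as pd

up = prior_full[["user_id", "product_id", "order_number"]]

first_order = (up.groupby(["user_id", "product_id"])["order_number"]
               .min().rename("first_order"))
times_bought = (up.groupby(["user_id", "product_id"])
                .size().rename("times_bought"))
total_orders = (prior_full.groupby("user_id")["order_number"]
                .max().rename("total_orders").reset_index())

ratio = (pd.concat([first_order, times_bought], axis=1)
         .reset_index()
         .merge(total_orders, on="user_id"))
# Orders from the first purchase onwards = total_orders - first_order + 1
# (e.g. 30 purchases starting at order 51 of 100 -> 30 / 50 = 0.6).
ratio["ratio_since_first_purchase"] = (ratio.times_bought
                                       / (ratio.total_orders
                                          - ratio.first_order + 1))
```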
Go through my Github repository to explore the other features I developed; every feature is explained in detail, along with documented instructions on how to build it. I developed 30 features in total, based on statistical measures of the gaps between a user's orders of a particular product, user-related features, product-related features and more user-product interaction features.
Best Models:
1. XGBClassifier:
Hyper-parameters: n_estimators=1000 and max_depth=3
XGBClassifier tends to overfit as the number of base learners increases. With n_estimators (number of base learners) = 1000 and max_depth (maximum depth of each base learner) = 3, no overfitting was observed.
The probability threshold used to decide the class label was tuned.
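A sketch of this model and the threshold search, assuming a feature matrix X and label vector y produced by the featurization step; the split and search grid here are illustrative:

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# X, y: features and reordered labels from the featurization step (assumed).
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = XGBClassifier(n_estimators=1000, max_depth=3, n_jobs=-1)
model.fit(X_tr, y_tr)

# Tune the probability threshold for F1 on the positive (will reorder) class.
val_probs = model.predict_proba(X_val)[:, 1]
thresholds = np.arange(0.10, 0.90, 0.01)
f1_scores = [f1_score(y_val, (val_probs >= t).astype(int)) for t in thresholds]
best_t = thresholds[int(np.argmax(f1_scores))]
print(f"best threshold: {best_t:.2f}, validation F1: {max(f1_scores):.4f}")
```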
Kaggle score with XGBClassifier:
2. Random Forest Classifier with balanced class weights:
Hyper-parameters: n_estimators=1000 and max_depth=12
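A corresponding sketch for this model, reusing the assumed X_tr/y_tr split from the XGBoost sketch:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=1000, max_depth=12,
                            class_weight="balanced",
                            n_jobs=-1, random_state=0)
rf.fit(X_tr, y_tr)

# The same probability-threshold search as above can be reused; here the
# best threshold for Random Forest came out around 0.69.
rf_val_probs = rf.predict_proba(X_val)[:, 1]
```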
Kaggle Score using Random Forest with balanced class weights:
The scores of XGBoost and Random Forest are almost similar, but Random Forest predicted with a higher probability threshold (0.69).
Kaggle Standings with all the models:
All the hyperparameter tuning of the ML models was done with only around 20% of the data because of limited computational capacity and the huge size of the dataset. With more data and tuning of other hyperparameters such as learning rate and regularization strength, the score can be improved further.
Go through my Github repository to explore the models I tuned and used. I also developed Deep Neural Network models.
Future Work:
- ‘Market Basket Analysis’ is a concept from Data Mining. Features based on this concept can be developed and evaluated.
- As mentioned above, further hyperparameter tuning can be carried out with a larger amount of data if computational capacity allows.
- Support Vector Machines (SVM) can be tried if the right kernel is known.
- Deeper Neural Networks can be developed using different optimizers; so far I have used only Adam in the Deep Learning models.
References:
- Winner’s Interview: 2nd place, Kazuki Onodera: https://medium.com/kaggle-blog/instacart-market-basket-analysis-feda2700cded
- https://www.appliedaicourse.com/
- https://maielld1.github.io/posts/2017/08/Instacart-Kaggle
- ML — Instacart F1 0.38-Part Two; XGBoost + F1 Max: https://www.kaggle.com/kokovidis/ml-instacart-f1-0-38-part-two-xgboost-f1-max
- A Gentle Introduction on Market Basket Analysis — Association Rules: https://towardsdatascience.com/a-gentle-introduction-on-market-basket-analysis-association-rules-fa4b986a40ce
Github repository: https://github.com/Chaitanya-Boyalla/Instacart-Market-Basket-Analysis
LinkedIn: www.linkedin.com/in/chaitanya-boyalla-23051998
email: chaitanya.boyalla@gmail.com