Shopee Product Matching

Chaitanya Boyalla
Jul 20, 2021

Finding products similar to a given product using the products' images and titles.


Problem Description:

Shopee is a leading e-commerce platform that enables users to buy and sell products online. It operates mainly in Southeast Asian countries, where customers appreciate its easy, secure, and fast online shopping experience tailored to their region. The company also provides strong payment and logistical support, along with a ‘Lowest Price Guaranteed’ feature on thousands of Shopee’s listed products. In this competition, Shopee open-sourced product images with titles/descriptions and asked Machine Learning practitioners to build models that identify similar products based on images and descriptions.

Data Exploration:

train.csv

posting_id: unique ID for each posting/product

image: file name of the product image

image_phash: perceptual hash of the image

title: title/description of the product

label_group: ID of the group of postings that describe the same product

There are 34,250 postings in the train data, and some postings share image files, phashes, or titles. There are 11,014 label groups; postings belonging to the same label_group are similar. For example:

Postings belonging to a label group

These three different postings belong to the same label group, and their images make it evident that they are similar products.

Can label_group be considered the ground truth? No.

Let’s consider image_phash. The data points below have the same image phash but belong to different label groups.

Sample of postings having same Image Phash

When I checked the images, they turned out to be similar products.

We can clearly see that the above three products are similar, yet they belong to different label groups. So can image phash be considered ground truth? No.

From this discussion on Kaggle it is evident that dissimilar products can share the same image phash. There are also postings with the same or similar images or descriptions that nevertheless belong to different label groups or have different image phashes.

Coming to the titles of the postings, let’s see the most frequent words across the train titles using a word cloud.

Word Cloud of most used words in titles of train data

We can clearly observe that many of these words are not English. This discussion on Kaggle explores the titles using Google Translate and finds that English, Indonesian, Malay, and German are the most used languages, which is expected since Shopee operates mainly in Southeast Asian countries. We can also see numbers among the most frequent words. Numbers shouldn’t be removed from titles during pre-processing, because they may significantly describe and differentiate products.

How many words are present in titles?

Histogram of number of words in titles

How many unique words are present in titles?

Histogram of number of unique words in titles

The histograms of word counts and unique-word counts show that titles contain few repeated words. That is logical: product titles usually contain neither highly repeated words nor stop words, and I did not notice any significant stop words in the word cloud either.
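The per-title counts behind the two histograms above can be computed in a couple of lines with pandas. The sample titles below are hypothetical stand-ins for the `title` column of train.csv:

```python
import pandas as pd

# Hypothetical sample standing in for train.csv's "title" column.
df = pd.DataFrame({"title": [
    "paper bag victoria secret",
    "double tape 3m vhb 12 mm x 4.5 m original",
    "maling tts canned pork luncheon meat 397 gr",
]})

# Number of words and number of unique words per title --
# the two quantities plotted in the histograms.
df["n_words"] = df["title"].str.split().str.len()
df["n_unique"] = df["title"].str.split().apply(lambda ws: len(set(ws)))

print(df[["n_words", "n_unique"]])
```

Feeding `df["n_words"]` and `df["n_unique"]` into any histogram plotter reproduces the figures.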

Coming to the images:

Around 96% of the train images are square, and more than 56% are of dimensions 640x640 or 1024x1024.

ML problem formulation:

This problem doesn’t come with a usable ground truth, so it should be solved with an unsupervised approach. We can vectorize the titles/descriptions using basic approaches like TF-IDF or Word2Vec/GloVe, or use BERT and its variants (DistilBERT, Sentence-BERT, etc.) to get embeddings. After obtaining the embeddings, we normalize them to unit vectors and compute the dot product (cosine similarity) against the other embeddings. Normalizing to unit length guarantees that the dot product lies between +1 and −1, since it then depends only on the angle between the vectors. A cosine similarity of +1 means the angle between the vectors is 0° (cos 0° = +1), i.e. the vectors are identical in direction; a cosine similarity of −1 means the angle is 180° (cos 180° = −1), i.e. the vectors are maximally dissimilar.
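The unit-vector trick above can be sketched in a few lines of NumPy. The embedding values here are made up purely to illustrate the two extremes of cosine similarity:

```python
import numpy as np

# Toy embeddings standing in for TF-IDF / BERT vectors (hypothetical values).
emb = np.array([
    [3.0, 4.0, 0.0],
    [6.0, 8.0, 0.0],    # same direction as the first -> cosine similarity +1
    [-3.0, -4.0, 0.0],  # opposite direction -> cosine similarity -1
])

# L2-normalize to unit vectors so the dot product equals cosine similarity.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Pairwise cosine similarity matrix in a single matrix multiply.
sim = unit @ unit.T
print(np.round(sim, 3))
```

Because every row of `unit` has length 1, the entries of `sim` are exactly cos(θ) for each pair of embeddings.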

Similarly, image embeddings can be extracted using convolutional neural networks, and similarity is computed the same way as above. Euclidean distance could also be used as a similarity measure, but since the embeddings are already unit vectors, Euclidean distance is a monotonic function of cosine distance and ranks pairs identically. We need to set a decision threshold on the cosine similarity/dot product to decide which postings count as matches for a product; the optimal threshold can be tuned while working on the problem.

Performance Metric:

The performance metric of this problem is the F1-score averaged over postings: the F1-score is computed for each posting against its true matches, and the average across all postings is taken.
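A minimal sketch of this metric, with hypothetical predictions and ground-truth match lists for two postings:

```python
def row_f1(pred, truth):
    """F1 between one posting's predicted matches and its true matches."""
    tp = len(set(pred) & set(truth))
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

# Hypothetical predictions vs ground-truth matches for two postings.
preds  = [["a", "b", "c"], ["d"]]
truths = [["a", "b"],      ["d", "e"]]

# F1 per posting, then the mean over postings -- the competition metric.
scores = [row_f1(p, t) for p, t in zip(preds, truths)]
mean_f1 = sum(scores) / len(scores)
print(round(mean_f1, 3))  # -> 0.733
```

Note that over-predicting matches hurts precision while under-predicting hurts recall, which is exactly why the decision threshold on cosine similarity matters.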

ArcFace Loss/Arc Margin Product:

Reference Paper: https://arxiv.org/pdf/1801.07698.pdf

  1. ArcFace is a modification of softmax that makes the loss depend only on the cosine similarity (dot product) between the feature vector and the weight vector. Take the sigmoid as an example. The sigmoid function is 1/(1+exp(−(W·X+b))), where X is the feature vector, W the weight vector, and b the bias term. For simplicity, assume the separating plane passes through the origin, i.e. b = 0; now sigmoid = 1/(1+exp(−W·X)).
  2. We know that dot(W, X) = ||W||·||X||·cos(θ), where θ is the angle between them. This value is used as the logit from which the probability estimate and the loss are computed. What if we make W and X unit vectors? Then the loss depends only on the angle between them: the smaller the angle (the more cosine-similar the vectors), the smaller the loss, and during training the model updates its weights so as to decrease the angle between W and X for the correct class label.
  3. In addition, what if we add a fixed margin to the actual angle? Even when the angle is already small, adding the margin makes the pair less cosine-similar, so the model keeps reducing the actual angle through backpropagation. We are simply making training harder, which results in better learning than training on the actual angle alone.
  4. This process of making the loss depend on the angle (cosine similarity) and adding an extra angle on top of it to improve training is called ArcFace, the Additive Angular Margin Loss.
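The steps above can be sketched as a small NumPy function. This is a simplified illustration, not the author's training code; the margin and scale values follow common ArcFace defaults, and the feature/weight values are made up:

```python
import numpy as np

def arcface_logits(features, weights, labels, margin=0.5, scale=30.0):
    """Additive angular margin (ArcFace) logits -- NumPy sketch.

    features: (n, d) embeddings; weights: (c, d) class weight vectors;
    labels: (n,) integer class ids. margin is in radians.
    """
    # Unit-normalize both sides so the dot product is cos(theta).
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = f @ w.T                                   # (n, c) cosine similarities
    theta = np.arccos(np.clip(cos, -1.0, 1.0))      # recover the angles
    # Add the margin only to the angle of the true class (harder training).
    theta[np.arange(len(labels)), labels] += margin
    # Re-scale; these logits then feed ordinary softmax cross-entropy.
    return scale * np.cos(theta)

# Hypothetical features and class weights for a 2-class toy example.
feats = np.array([[1.0, 0.1], [0.1, 1.0]])
wts = np.array([[1.0, 0.0], [0.0, 1.0]])
logits = arcface_logits(feats, wts, labels=np.array([0, 1]))
```

Even with the margin applied, each feature’s logit for its own class stays the largest here, but it is strictly smaller than the plain scaled cosine would be, which is exactly the extra pressure that drives the angles down during training.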

Best Solution:

IDF + BERT (ArcFace) + EfficientNet B3 (ArcFace): Submission

The best private score of 0.72 came from combining text-based and image-based models/embeddings. EfficientNet B3 (image-based) was trained with ArcFace using a variable margin: a margin is added in every case, but its size varies with the angle between features and weights, larger when the angle is small and vice versa.

On its own, EfficientNet B3 (image-based) trained this way achieved a private score of 0.69 with a cosine-similarity threshold of 0.6. I then added the product matches predicted by the text-based embeddings at stricter thresholds (IDF at 0.75 and ArcFace-trained BERT at 0.7), ensuring that only precise predictions were added to EfficientNet B3’s matches. This combination resulted in a private score of 0.72.
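The combination strategy above can be sketched as a per-posting union of threshold-filtered candidate sets. The similarity values and posting IDs here are hypothetical; the thresholds are the ones from the text:

```python
def matches_above(sim_row, posting_ids, threshold):
    """Posting ids whose similarity to this posting meets the threshold."""
    return {pid for pid, s in zip(posting_ids, sim_row) if s >= threshold}

# Hypothetical similarity rows for one posting under three models.
ids = ["p1", "p2", "p3", "p4"]
image_sims = [1.00, 0.70, 0.55, 0.10]  # EfficientNet B3 embeddings
idf_sims   = [1.00, 0.80, 0.40, 0.05]  # IDF text embeddings
bert_sims  = [1.00, 0.60, 0.72, 0.20]  # ArcFace-trained BERT embeddings

# The image model uses a looser threshold (0.6); the text models contribute
# only high-confidence matches (0.75 and 0.7); everything is then unioned.
final = (matches_above(image_sims, ids, 0.6)
         | matches_above(idf_sims, ids, 0.75)
         | matches_above(bert_sims, ids, 0.7))
print(sorted(final))  # -> ['p1', 'p2', 'p3']
```

Here p3 is missed by the image model but rescued by the BERT embeddings at the stricter threshold, which is how the text models add precise extra matches.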

Future work:

  1. As we observed, titles are in multiple languages, so they can be translated using resources like Google Translate before being vectorized.
  2. We can also use language-specific or multilingual BERT models, like IndonesianBERT, etc.
  3. For images, Siamese-style training can be used with triplet or contrastive loss as the optimization objective. SBERT, which is pre-trained in a Siamese fashion, can also be further trained on the competition data and used.
  4. Other similarity searches using Faiss can be tried, and a more effective way of combining the matches found by different models, rather than simply concatenating them, could be explored.
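For a sense of what a Faiss-based search would do, here is the equivalent brute-force inner-product search in plain NumPy (Faiss's `IndexFlatIP` performs this same operation, just much faster at scale). The embeddings are random placeholders:

```python
import numpy as np

# Random unit vectors standing in for real product embeddings.
rng = np.random.default_rng(42)
emb = rng.normal(size=(100, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

k = 5
# Full cosine-similarity matrix (inner product of unit vectors).
sims = emb @ emb.T
# Top-k neighbors per posting; each posting's own id appears at rank 0
# because its self-similarity is exactly 1.
topk = np.argsort(-sims, axis=1)[:, :k]
print(topk[0])
```

On the real 34k-posting dataset, an approximate Faiss index would avoid materializing the full similarity matrix.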

References:

  1. https://www.appliedaicourse.com/
  2. https://arxiv.org/pdf/1801.07698.pdf
  3. https://github.com/lyakaap/Landmark2019-1st-and-3rd-Place-Solution/blob/master/src/modeling/metric_learning.py
  4. https://www.kaggle.com/ragnar123/unsupervised-baseline-arcface

Github repository: https://github.com/Chaitanya-Boyalla/Shopee-Product-Matching

LinkedIn: www.linkedin.com/in/chaitanya-boyalla-23051998

email: chaitanya.boyalla@gmail.com


Chaitanya Boyalla

I am a passionate Machine Learning and Deep Learning practitioner. In particular, I like exploring data related to human behavior.