How to build a Recommender System from scratch in 2 easy steps: E-commerce (3)

Understanding the approach to build a Powerful Recommendation Engine in 2 Easy Steps for recommending similar items

Rhydham Gupta
6 min read · Jun 3, 2023

Imagine you are a data scientist at an e-commerce company. Currently, the company uses a popularity-based recommendation engine on its website that recommends the top-selling items to users. Recently, the marketing team has been receiving negative feedback on customer experience, and one of the prominent reasons is the lack of personalization. Your manager comes to you and says, “Rhydham, the business wants a better recommendation system at the earliest, and you are responsible for building it.”

You have more conversations with the business to understand the requirements and realize that they want the recommended items to include different variants of, and other items similar to, what the customer is browsing. What approach would you recommend to your manager? Don’t worry if you don’t know: after this article, you will be able to confidently propose an efficient solution to your manager.

Let’s start with some real recommendation results. You may have noticed that when you select a product on an e-commerce website, a list of recommended products appears below it, and these are often very useful, something we really want.

Below is a glimpse of what we are trying to create. These are two recommendation snapshots from Flipkart (owned by Walmart), a popular e-commerce platform in India:

When the user has selected a Samsung smartphone:

Selected Item : SAMSUNG Galaxy F13 (Waterfall Blue, 128 GB) (4 GB RAM)

Selected Item: WellManStore LED Temperature Display with Double Wall Insulated Water Bottle Stainless Steel 500 ml Bottle (Pack of 1, Black, Steel)

Did you notice something in the above examples?

  1. Similar products belong to the same category as the selected item, e.g. if the selected product is a smartphone, then the similar products are smartphones as well. Moreover, all recommended smartphones are in a comparable price range.
  2. In the 2nd example, the selected and similar products are both bottles in a similar price range.

Now let’s dive into the methodology for creating the above recommendations. We will try to understand it with a smartphone example.

Our recommendation framework will be a two-step approach:

  1. Subsetting the relevant candidates from the pool of all items
  2. Ranking the candidates

Subsetting the relevant candidates from the pool of all items

The objective of this step is to shortlist the few most relevant items from the full catalog. Based on the recommendations above, what do you think the most obvious candidates for the selected item are? A very simple condition is to filter on the same item category and a comparable price range. For example:

Category = Smartphones

Price Range = ±20% of the current price of the selected product

These conditions will give us a list of n items. Please note that the conditions above are just an illustration; in practice, they are defined based on the business requirements.
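The filtering step above can be sketched in a few lines of pandas. This is only a minimal illustration: the catalog, its column names (`item_id`, `category`, `price`) and the helper `get_candidates` are all hypothetical, and the 20% tolerance is the illustrative value from the text, not a recommended setting.

```python
import pandas as pd

# Hypothetical mini-catalog; column names and values are illustrative only
catalog = pd.DataFrame({
    "item_id": [101, 102, 103, 104, 105, 106],
    "category": ["Smartphones", "Smartphones", "Smartphones",
                 "Smartphones", "Bottles", "Smartphones"],
    "price": [1000, 1150, 790, 1300, 20, 1190],
})


def get_candidates(catalog, selected_id, price_tolerance=0.20):
    """Return items in the same category whose price lies within
    +/- price_tolerance of the selected item's price."""
    selected = catalog.loc[catalog["item_id"] == selected_id].iloc[0]
    lower = selected["price"] * (1 - price_tolerance)
    upper = selected["price"] * (1 + price_tolerance)
    mask = (
        (catalog["category"] == selected["category"])
        & catalog["price"].between(lower, upper)
        & (catalog["item_id"] != selected_id)  # never recommend the item itself
    )
    return catalog.loc[mask]


candidates = get_candidates(catalog, selected_id=101)
print(candidates)
```

In a real catalog the business rules would likely be richer (stock availability, seller, sub-category), but the shape of the step stays the same: a cheap filter that shrinks the pool before any similarity is computed.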

Ranking the candidates

We have now reduced the set of items that can be recommended, but there can still be hundreds of smartphones, and an average customer will only look at the top 10–20 recommendations. To serve them, we will rank the shortlisted items.

First, let’s define the most important features on which two items can be called similar. Some of these features can be:

  1. Brand
  2. Operating system
  3. RAM
  4. Storage capacity
  5. Camera quality
  6. Display size
  7. Battery capacity
  8. etc.

The basic idea is to compute the similarity between the selected item and every other item in the candidate list, based on the above features.

If you look carefully at the recommendation example from Flipkart, the first two similar items are just different variants of SAMSUNG Galaxy F13 (Waterfall Blue, 128 GB) (4 GB RAM). This is because variants share the most features with the selected item and therefore have the highest similarity scores.

Consider an example in which we have the features of the selected item and of the relevant candidates (in practice, there would be hundreds of candidates). The objective is to compute the similarity between Smartphone 1 and all other smartphones and decide which should be recommended to the customer.

If you are thinking in the right direction, you must be asking yourself these questions:

  1. Which similarity metric should I use?
  2. How do I deal with the Brand and Operating system columns? These are strings, so how will I define their similarity?
  3. Do I need to standardize the variables?

Here’s what we will do:

  1. We will use cosine similarity (reason: it is simple to explain and gives a bounded score, between -1 and 1 in general and between 0 and 1 when all feature values are non-negative).
  2. We will one-hot encode the categorical variables.
  3. Yes, standardization is required; otherwise, variables with a larger magnitude will get higher weights, which is not desirable.
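To see why standardization matters, here is a small sketch that compares the cosine similarity of two phones before and after scaling. The numbers are the numerical columns of the sample data used later in this article; the variable names are only illustrative.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Columns: RAM, Storage, Camera, Display, Battery, Price
raw = np.array([
    [8, 128, 12, 6.2, 4000, 1000],  # Smartphone1
    [4, 64, 16, 5.8, 3200, 2200],   # Smartphone2
    [6, 128, 12, 6.5, 4300, 900],   # Smartphone3
    [8, 256, 20, 6.1, 4000, 1100],  # Smartphone4
    [6, 128, 16, 6.4, 4500, 800],   # Smartphone5
])

# Similarity between Smartphone1 and Smartphone2 on the raw values:
# Battery and Price are in the thousands, so they dominate the score
raw_sim = cosine_similarity(raw)[0, 1]

# Same comparison after every column is rescaled to mean 0 and unit variance
scaled_sim = cosine_similarity(StandardScaler().fit_transform(raw))[0, 1]

print(f"raw: {raw_sim:.3f}, standardized: {scaled_sim:.3f}")
```

On the raw values the two phones look almost identical, because the large-magnitude columns swamp everything else. After scaling, the score for this pair turns negative, which also shows that cosine similarity on standardized features is not confined to the 0 to 1 range.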

Below is the formula for computing the cosine similarity
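The original formula image is not reproduced here, so the following is the standard textbook definition, written for two feature vectors A and B with n components each (the symbols A, B, n and θ are notation chosen for this sketch):

```latex
\text{similarity}(A, B) = \cos(\theta)
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
```

The numerator is the dot product of the two vectors, and the denominator is the product of their lengths, so the score depends only on the angle between the vectors and not on their magnitudes.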

Let’s understand it in detail with the example:

  1. Let’s create a data frame with the sample data from the above example:

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Dummy data for 5 products
data = {
    'Product': ['Smartphone1', 'Smartphone2', 'Smartphone3', 'Smartphone4', 'Smartphone5'],
    'Brand': ['Samsung', 'Apple', 'OnePlus', 'Samsung', 'Xiaomi'],
    'OS': ['Android', 'iOS', 'Android', 'Android', 'Android'],
    'RAM': [8, 4, 6, 8, 6],
    'Storage': [128, 64, 128, 256, 128],
    'Camera': [12, 16, 12, 20, 16],
    'Display': [6.2, 5.8, 6.5, 6.1, 6.4],
    'Battery': [4000, 3200, 4300, 4000, 4500],
    'Price': [1000, 2200, 900, 1100, 800],
}

df = pd.DataFrame(data)

  2. Standardize the numerical features and one-hot encode the categorical features

# Standardize numerical features
numerical_features = ['RAM', 'Storage', 'Camera', 'Display', 'Battery', 'Price']
scaler = StandardScaler()
df[numerical_features] = scaler.fit_transform(df[numerical_features])

# One-hot encode categorical features (dtype=float keeps the new columns numeric rather than boolean)
df_encoded = pd.get_dummies(df[['Brand', 'OS']], dtype=float)

# Combine encoded features with standardized numerical features
df_final = pd.concat([df_encoded, df[numerical_features]], axis=1)

  3. Let’s compute the cosine similarity scores of Smartphone1 with all other smartphones

# Calculate cosine similarity matrix
cos_sim = cosine_similarity(df_final)

# Define the selected product index (Product 1 in this case)
selected_product_index = 0

# Get the similarity scores for the selected product
similarity_scores = cos_sim[selected_product_index]
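The scores still need to be turned into an ordered list of recommendations. Below is a minimal sketch of that last step; it rebuilds the same dummy data so that it runs on its own, and the helper `rank_similar` and its `top_n` parameter are hypothetical names introduced only for this illustration.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import StandardScaler

# Same dummy data as in the walkthrough, rebuilt so this block is self-contained
df = pd.DataFrame({
    'Product': ['Smartphone1', 'Smartphone2', 'Smartphone3', 'Smartphone4', 'Smartphone5'],
    'Brand': ['Samsung', 'Apple', 'OnePlus', 'Samsung', 'Xiaomi'],
    'OS': ['Android', 'iOS', 'Android', 'Android', 'Android'],
    'RAM': [8, 4, 6, 8, 6],
    'Storage': [128, 64, 128, 256, 128],
    'Camera': [12, 16, 12, 20, 16],
    'Display': [6.2, 5.8, 6.5, 6.1, 6.4],
    'Battery': [4000, 3200, 4300, 4000, 4500],
    'Price': [1000, 2200, 900, 1100, 800],
})
numerical_features = ['RAM', 'Storage', 'Camera', 'Display', 'Battery', 'Price']

# Standardize the numerical columns and one-hot encode the categorical ones
scaled = pd.DataFrame(
    StandardScaler().fit_transform(df[numerical_features]),
    columns=numerical_features,
)
encoded = pd.get_dummies(df[['Brand', 'OS']], dtype=float)
features = pd.concat([encoded, scaled], axis=1)


def rank_similar(features, names, selected_index, top_n=3):
    """Return the top_n products most similar to the selected one,
    highest score first, excluding the selected product itself."""
    scores = cosine_similarity(features)[selected_index]
    ranked = pd.Series(scores, index=names).drop(names[selected_index])
    return ranked.sort_values(ascending=False).head(top_n)


ranking = rank_similar(features, df['Product'].tolist(), selected_index=0, top_n=4)
print(ranking)
```

Dropping the selected product before sorting matters, because every item has a similarity of 1 with itself and would otherwise always rank first.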

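Sorting these scores in descending order, after dropping the selected product itself, ranks the other smartphones by similarity to Smartphone1.

With the sample data and preprocessing above, Smartphone3 gets the highest score (about 0.43), ahead of Smartphone4 (about 0.18), so it is the first item to recommend to the customer. Smartphone4 shares the brand and RAM of Smartphone1, but its much larger storage and higher camera value pull its standardized feature vector in a different direction.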

And that’s it. We have finished designing the recommendation engine for similar items. As a bonus, you have also just learned the concept behind the content-based filtering recommendation algorithm.

I hope you liked this simple and effective approach for populating the similar-items section. This is the 3rd article in this series on creating a recommender system from scratch. If you are interested, you can read the 1st article in this series, where I have listed all the building blocks of an effective recommender system; you can find it here. Moreover, you can try to think through the approach you would apply in each section yourself.



