Creating a Simple Recommendation Engine Using KNN
Nowadays we are surrounded by recommendation engines. YouTube, social media apps, blogging platforms, e-commerce companies, streaming services, and even sales teams all use recommendations to target their customers more effectively. And with every passing day, the demand for recommendation engines keeps growing.
In this article, I will discuss creating a simple recommendation engine. The only disclaimer is that this approach works only if you have a limited number of products, preferably fewer than 100. For many business requirements, though, this simple approach is also the best one.
To develop a better intuition, we will take the example of Huldiram, a bhujia manufacturer that sells its products to multiple stores across the country. They manufacture 50 different types of bhujia, and each store carries only 5–10 bhujias on average. They want a recommendation engine for their sales force, so that it can recommend to each store owner the bhujia combinations that will bring maximum sales.
Let us first discuss our data sources. We have two major sources:
- Transactional Data (All the past shipment details to each store)
- Demographic Data around each store (Age, Ethnicity, Income, etc.)
Approach —
- Data Pre-Processing
- Clustering
- Modelling
- Evaluation
Intuition: Before diving into the details, it is important to understand the intuition behind the recommendation engine we are going to build. The fundamental hypothesis is that similar groups of people tend to have similar tastes, so if one item is very popular within a group, there is a high chance that most people in that group will like it.
In our case, for a given store A, we try to find the stores most similar to it with respect to the sales pattern of the different bhujias. If those stores have some bhujias among their top sellers that store A does not carry, we want to recommend those products to store A.
Now let’s dive into the details —
Data Pre-Processing
Firstly, we need to process the data into a format the model can ingest. In any recommendation engine, we have to rank different items for a particular user. In this case, our user is the store owner and the items are the different types of bhujia.
So we will aggregate the last one year's data at the store level and calculate the overall sales of each bhujia. Why one year? Because we want to account for any seasonal impact on sales. Special occasions that trigger bumper sales fall on different dates in different parts of the country, so aggregating a full year of sales smooths out that effect.
In short, for each store we aggregate the sales data at a yearly level.
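As a concrete sketch, the aggregation step might look like this in pandas. All column names (store_id, bhujia_type, units_sold, ship_date) and the numbers are made up for illustration:

```python
import pandas as pd

# Hypothetical transactional data: one row per shipment to a store.
transactions = pd.DataFrame({
    "store_id":    ["A1", "A1", "A2", "A1", "A2"],
    "bhujia_type": ["B1", "B2", "B1", "B1", "B2"],
    "units_sold":  [100, 40, 80, 60, 20],
    "ship_date":   pd.to_datetime(["2023-01-10", "2023-04-05",
                                   "2023-06-20", "2023-11-15", "2023-12-01"]),
})

# Keep only the last 12 months of shipments.
cutoff = transactions["ship_date"].max() - pd.DateOffset(years=1)
last_year = transactions[transactions["ship_date"] > cutoff]

# One row per store, one column per bhujia, yearly sales as values.
store_sales = last_year.pivot_table(index="store_id", columns="bhujia_type",
                                    values="units_sold", aggfunc="sum",
                                    fill_value=0)
print(store_sales)
```

The `fill_value=0` matters: a bhujia a store has never sold should show up as zero sales, not as a missing value, so that every store ends up with the same feature columns.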
Clustering
If you have understood the intuition of our recommendation engine, you will easily grasp the need for clustering. We are trying to find similar stores, but we also need to restrict the search space for each store. Why?
Take India as an example: the eating habits of people living in cities differ from those of people in the rural parts of the country. Even if two stores have very similar sales patterns, if one lies in a metro city and the other in a rural area, we may not want to recommend products across them. A low-oil bhujia that sells well in the metro-city store may well fail in the rural store because of the different eating habits.
So, depending on the business requirements, we can select the variables for clustering. For our case, we will use demographic variables (ethnicity, age group, store-area tier, income level). Since these are all categorical variables, we can take the cross-product of the categories of each variable to create the clusters. For more details, you can read my article on the new approach to clustering.
Finally, each store will be tagged with a cluster that represents demographic similarity.
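Since every clustering variable is categorical, the cross-product is just a concatenation of category labels. A minimal sketch, assuming hypothetical tier and income columns:

```python
import pandas as pd

# Hypothetical store demographics; the variables and values are illustrative.
stores = pd.DataFrame({
    "store_id": ["A1", "A2", "A3", "A4"],
    "tier":     ["metro", "rural", "metro", "metro"],
    "income":   ["high", "low", "high", "mid"],
})

# Cross-product the categories: every unique combination of labels
# becomes its own cluster id.
stores["cluster"] = (stores[["tier", "income"]]
                     .agg("|".join, axis=1)
                     .astype("category").cat.codes)
print(stores)
```

Here A1 and A3 share the combination "metro|high" and therefore land in the same cluster, while A2 and A4 each get their own.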
Modelling
Now let's come to the heart of the solution: we will use KNN (K-Nearest Neighbours).
Our pre-processed data has one row per store and one column per bhujia, holding the yearly sales volume.
The first step is scaling the data. KNN is a distance-based algorithm, so if we don't standardise, it will give more weight to the bhujias with higher sales, which we don't want.
We have used min-max scaling for each column.
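The min-max scaling itself is a one-liner per column: subtract the column minimum and divide by the range, mapping each bhujia's sales into [0, 1]. The numbers below are illustrative:

```python
import pandas as pd

# Yearly sales per store (rows) and per bhujia (columns); made-up numbers.
store_sales = pd.DataFrame({"B1": [160, 80, 0], "B2": [40, 20, 100]},
                           index=["A1", "A2", "A3"])

# Min-max scale each column to [0, 1] so that high-volume bhujias
# do not dominate the distance computation.
scaled = (store_sales - store_sales.min()) / (store_sales.max() - store_sales.min())
print(scaled)
```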
Now we will apply KNN within each cluster. Let's understand with an example for store A1:
- Since store A1 belongs to cluster 2, we subset all the stores in cluster 2.
- We compute the Euclidean distance between store A1 and every other store in cluster 2.
- The 5 stores with the smallest distances are tagged as the neighbours of store A1.
Similarly, we can find the nearest neighbours of every store within its cluster.
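The neighbour search can be sketched with scikit-learn's NearestNeighbors. The store ids and scaled values below are made up; note that we request k + 1 neighbours because each store's nearest neighbour is itself:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Scaled sales for the stores of one cluster; ids and values are illustrative.
cluster_sales = pd.DataFrame(
    {"B1": [1.0, 0.9, 0.1, 0.8, 0.2],
     "B2": [0.2, 0.3, 1.0, 0.1, 0.9]},
    index=["A1", "A4", "A9", "A15", "A34"])

k = 2  # the article uses k = 5; 2 keeps this toy example readable
knn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(cluster_sales)
_, idx = knn.kneighbors(cluster_sales.loc[["A1"]])

# Drop the first hit (the store itself, at distance 0).
neighbours = cluster_sales.index[idx[0][1:]].tolist()
print(neighbours)
```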
Ranking the bhujias based on the neighbours' average volume:
- For store A1, let's say we identify the following nearest neighbours: A4, A9, A15, A34, A90.
- Across these stores, we take the average sales of every bhujia.
- Store A1 then gets a recommended volume for each bhujia, namely its average sales across all the neighbour stores.
- Finally, we rank the bhujias by recommended volume in descending order.
Based on this ranking, we select however many products the store wants to stock, picking the combination expected to produce the maximum sales.
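The ranking step can be sketched as follows; the sales numbers and the neighbour list are made up and continue the A1 example:

```python
import pandas as pd

# Yearly sales per store; B3 is a bhujia that A1 has never stocked.
store_sales = pd.DataFrame(
    {"B1": [160, 150, 140, 120, 130],
     "B2": [40, 60, 50, 70, 80],
     "B3": [0, 90, 100, 85, 95]},
    index=["A1", "A4", "A9", "A15", "A34"])

neighbours = ["A4", "A9", "A15", "A34"]  # A1's nearest neighbours

# Recommended volume for A1 = average sales across its neighbours,
# ranked in descending order.
recommended = store_sales.loc[neighbours].mean().sort_values(ascending=False)
print(recommended)

# Bhujias that rank well among the neighbours but A1 does not sell yet.
never_sold = store_sales.columns[store_sales.loc["A1"] == 0]
new_products = recommended[recommended.index.isin(never_sold)]
print(new_products.index.tolist())
```

Here B3 averages 92.5 units across A1's neighbours while A1 sells none of it, so B3 is exactly the kind of product the engine surfaces.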
Evaluation
Now the final step. In the modelling, we used the 5 nearest neighbours to produce the recommended volumes.
You may ask: why 5? A valid question.
Here comes the bias-variance tradeoff.
More neighbours mean more new products will be recommended to a store; fewer neighbours keep the recommendations closer to the store's existing portfolio.
So this question is best answered by looking at the data and applying business knowledge.
Statistical validation is tricky here because we don't know the ground truth. In most machine-learning problems, we do a train-test split and validate the model on the test dataset.
In this case, the recommended products haven't even been tried yet, so how do you build a test dataset? One way is to identify the intersection cases where a recommended product matches a product the store actually introduced. For example, we can keep the last 3 months of data as a test dataset: if we recommend a new bhujia H to store A and the store introduced it during those 3 months, we can check whether its actual sales were good.
This may not be the best method, though, because if our engine also recommends bhujia Y to store A, we cannot know whether it would produce even more sales until the store actually introduces it.
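As a sketch of this intersection check, suppose we hold out the last 3 months of sales and compare them against the engine's recommendations. All store ids, bhujia names, and numbers are hypothetical:

```python
import pandas as pd

# Hypothetical recommendations produced by the engine for each store.
recommendations = {"A1": ["B3", "B7"], "A2": ["B2"]}

# Actual sales of newly introduced bhujias during the 3-month test window.
test_sales = pd.DataFrame({
    "store_id": ["A1", "A1", "A2"],
    "bhujia":   ["B3", "B9", "B5"],
    "units":    [300, 20, 150],
})

# Intersection cases: products that were both recommended and introduced.
hits = test_sales[test_sales.apply(
    lambda row: row["bhujia"] in recommendations.get(row["store_id"], []),
    axis=1)]
print(hits)
```

Only the matched rows can be evaluated this way; the recommendations that were never introduced stay unvalidated, which is precisely the limitation discussed above.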
Ultimately, we can run A/B testing on a subset of stores to measure the uplift from our recommendations.
Congratulate yourself, you have created your first recommendation engine!