Clustering — Profiling & Cluster Movement Diagram

Rhydham Gupta
Oct 13, 2022


K-means clustering is one of the techniques data scientists use most. It is simple, easy to interpret, and delivers clear business value. Still, when I ask people how they select the optimal number of clusters, the usual answer is the elbow curve. My next question is: what else? Most often they look at me in surprise and have nothing more to add.

There’s nothing wrong with the elbow curve, but on its own it is incomplete, because any technique is only worthwhile when it delivers value to the business. In this article, I am going to share a practical approach to selecting the optimal number of clusters in a real project.

Broadly speaking, there are four major steps:

  1. Defining the clustering objective
  2. Using statistical measures to select the optimal range of clusters
  3. Profiling of the clusters
  4. Cluster movement diagram

Defining the clustering objective

Clustering is not magic; it simply groups similar data points together. It is the business need that determines what counts as similar. Let’s take a simple example.

Let’s say an e-commerce company (Flipkart, Amazon) wants to send discount coupons to selected customers and has a fixed budget of $100K. As a data scientist, you are asked for the list of customers who should be targeted and the level of discount each should be offered.

One practical solution is to segment the customers. One objective could be to offer these discounts to high-value customers and to target the younger customer base. Having defined the objective, let’s translate it into variables:

High-value customers: average monthly sales, average ticket value, payment method, number of unique categories purchased, frequency of purchase

Younger customer base: age, location (city tier)

Avoid putting irrelevant variables into the clustering, because they will only pull the clusters away from your objective.
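To make the variable selection concrete, here is a minimal sketch. The table, the column names, and the `signup_channel` column are all hypothetical and invented for illustration. Payment method is left out only because it is categorical and would first need some encoding. The standardization step is my own assumption rather than something stated above: K-means is distance-based, so variables measured in large units would otherwise dominate.

```python
import pandas as pd

# Hypothetical customer table; names and values are made up for illustration.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "avg_monthly_sales": [1200.0, 300.0, 4500.0, 800.0],
    "avg_ticket_value": [60.0, 25.0, 150.0, 40.0],
    "purchase_frequency": [20, 12, 30, 18],
    "unique_categories": [5, 2, 8, 3],
    "age": [24, 41, 35, 29],
    "city_tier": [1, 3, 1, 2],
    "signup_channel": ["ads", "organic", "ads", "referral"],  # unrelated to the objective
})

# Keep only the variables that follow from the stated objective.
objective_columns = [
    "avg_monthly_sales",
    "avg_ticket_value",
    "purchase_frequency",
    "unique_categories",
    "age",
    "city_tier",
]
features = customers[objective_columns]

# Standardize each column (zero mean, unit variance) so no variable
# dominates the distance purely because of its units.
scaled = (features - features.mean()) / features.std(ddof=0)
print(scaled.round(2))
```

The `scaled` frame is what would then be passed to the clustering step; `customer_id` and `signup_channel` never enter it.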

Using statistical measures to select the optimal range of clusters

Now that we have selected the variables, the next question is how many clusters to use. Statistical measures are a good way to obtain a candidate range for the number of clusters; the elbow curve and the silhouette score are two popular ones. Use them only to get a range such as 5–7 or 4–6. Choosing K (the number of clusters) from these measures alone is not always a good idea, as it can lead us to overlook important information.
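As a rough sketch of how such a range can be obtained, the snippet below fits K-means for several values of K with scikit-learn and records the inertia (the quantity plotted in an elbow curve) and the silhouette score. The data is a synthetic stand-in generated with `make_blobs`, and the range 2 to 10 is an arbitrary choice for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the customer features (an assumption, not real data).
X, _ = make_blobs(n_samples=600, centers=5, n_features=4, random_state=42)

inertia = {}
silhouette = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertia[k] = model.inertia_  # within-cluster sum of squares, for the elbow curve
    silhouette[k] = silhouette_score(X, model.labels_)

for k in inertia:
    print(f"k={k:2d}  inertia={inertia[k]:12.1f}  silhouette={silhouette[k]:.3f}")
```

Reading the two columns side by side gives a shortlist of candidate values of K rather than a single answer, which is all this step is meant to produce.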

Let’s extend the example above. Based on the elbow curve, we select 5 as the optimal number of clusters. Then in profiling (covered in the next section), we realize that one of the groups has high average sales but includes customers from all age groups. Remember that, while defining the objective, we planned to target young customers because they can deliver high lifetime value. Because we chose the 5-cluster solution, we never looked at the 6-cluster solution, where that cluster splits further by age. That is why the remaining steps are so important.

Profiling of the clusters

Profiling the clusters is a very important step in any real clustering project. Below is a snapshot of what we mean by profiling.

Profiling amounts to summarizing the variables for each cluster. Every cluster has some characteristics that set it apart from the other groups.

Take Cluster-5: although it accounts for only 5% of the customers, its average monthly sales and ticket value are high. So it could be one of the groups to offer discounts to.

If we look at Cluster-4, we will see that it belongs to customers who are relatively young and come from Tier-2 and Tier-3 cities. In addition, their average sales are good but they are buying fewer categories of products on average. So we can strategize to give discounts to these customers for the categories they are not currently buying.
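A profiling table of this kind can be sketched with a pandas `groupby`. The data frame below is hypothetical (invented values, three clusters only) and the column names are assumptions; the point is just the shape of the summary.

```python
import pandas as pd

# Hypothetical customers with an assigned cluster; values are invented.
df = pd.DataFrame({
    "cluster": [1, 1, 2, 2, 2, 3],
    "avg_monthly_sales": [200.0, 300.0, 900.0, 1100.0, 1000.0, 5000.0],
    "avg_ticket_value": [20.0, 30.0, 45.0, 55.0, 50.0, 250.0],
    "unique_categories": [4, 6, 2, 2, 2, 7],
    "age": [45, 50, 24, 26, 25, 38],
})

# One row per cluster: its size plus the mean of each variable.
profile = df.groupby("cluster").agg(
    customers=("age", "size"),
    avg_monthly_sales=("avg_monthly_sales", "mean"),
    avg_ticket_value=("avg_ticket_value", "mean"),
    avg_unique_categories=("unique_categories", "mean"),
    avg_age=("age", "mean"),
)
profile["share_of_customers_pct"] = (
    100 * profile["customers"] / profile["customers"].sum()
)
print(profile.round(1))
```

In this toy data, cluster 2 is young with decent sales but few categories, and cluster 3 is small but high value, which is the kind of reading the profile is meant to support.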

I think you get the idea.

Cluster movement diagram

Last but not least, this step is crucial for checking the robustness of the existing clusters. It shows how the existing clusters split when we increase the number of clusters.

The numbers represent the count of customers in each cluster before (5-cluster) and after (6-cluster).

Clearly, we can see that if we increase the number of clusters from 5 to 6, it is mainly Cluster-5 that breaks down further.
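The counts behind such a diagram can be sketched as a cross-tabulation of the two label sets. The labels below are hypothetical, hand-written for twelve customers, and are assumed to be numbered consistently across the two solutions already.

```python
import pandas as pd

# Hypothetical cluster labels for the same 12 customers under K=5 and K=6.
labels = pd.DataFrame({
    "k5": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5],
    "k6": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
})

# Rows: cluster in the 5-cluster solution; columns: cluster in the 6-cluster solution.
movement = pd.crosstab(labels["k5"], labels["k6"])
print(movement)

# Number of new clusters each old cluster spreads over.
print((movement > 0).sum(axis=1))
```

A row with a single non-zero cell is a cluster that stayed intact; a row with several non-zero cells is one that split, here old cluster 5.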

Implementation: a very valid question is how K-means knows that cluster 1 in the 5-cluster solution and cluster 1 in the 6-cluster solution are the same. It doesn’t, because the label numbers are arbitrary in each run. But we can count how the customers of an existing cluster, say cluster 1, are spread across clusters 1 to 6 of the new solution. Suppose the maximum count is for the combination 1–6; then we simply swap labels 1 and 6 in the 6-cluster solution. Repeat the same exercise for the rest of the clusters.
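One way to automate this relabeling is sketched below. It is a greedy variant of the max-count idea and not necessarily the exact procedure used here: pairs of (old, new) labels are visited from the largest overlap down, and each new cluster takes the id of the old cluster it overlaps most, provided that id is still free. The function name `align_labels` and the toy labels are hypothetical.

```python
from collections import Counter

def align_labels(old_labels, new_labels):
    """Renumber new_labels so clusters keep the id of the old cluster they overlap most."""
    overlap = Counter(zip(old_labels, new_labels))
    mapping = {}      # new label -> aligned label
    used_old = set()  # old ids already handed out
    # Visit (old, new) pairs from the largest overlap to the smallest.
    for (old, new), _count in overlap.most_common():
        if new not in mapping and old not in used_old:
            mapping[new] = old
            used_old.add(old)
    # New clusters left without a partner (the split-off ones) get fresh ids.
    next_id = max(old_labels) + 1
    for new in sorted(set(new_labels)):
        if new not in mapping:
            mapping[new] = next_id
            next_id += 1
    return [mapping[n] for n in new_labels], mapping

# Toy example: the second solution uses different numbers for the same groups,
# and old cluster 3 has split into two.
old = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
new = [3, 3, 3, 1, 1, 2, 2, 2, 4, 4]
aligned, mapping = align_labels(old, new)
print(mapping)
print(aligned)
```

After alignment, a cross-tabulation of `old` against `aligned` has the stable clusters on the diagonal, so any off-diagonal counts show real movement rather than arbitrary renumbering.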

Hope you liked this article :)

You can visit my blog for more such articles.


Rhydham Gupta

I am a data scientist. I believe that observing and decoding data is an art. Same data, different eyes, different stories.