Twitter Recommendation Algorithm — Simplified
Decoding the data science in the Twitter recommendation algorithm
Do you know how the Twitter recommendation algorithm works? You might be wondering why you should care. Here's why: understanding it will help you come up with better ideas in your future projects and make you look more knowledgeable in your team. On a lighter note, why not grab the $44 billion algorithm when it is available for free? :)
One of the challenges in understanding the Twitter recommendation algorithm is that most of the available content sits at one of two extremes: Twitter's own blog posts are too business-focused to relate to the actual implementation, while the research papers behind the algorithms are a struggle to ever finish. In this article, the intent is to strike a balance between the two so that you walk away with a solid, confident understanding of the algorithm.
There are three major components of the Twitter recommendation algorithm:
- Candidate sourcing — selecting the top 1500 from millions of tweets that should be recommended to a user
- Ranking candidates — ranking the selected 1500 tweets based on the probability that the user will interact with each of them
- Applying post rules — if you have blocked a user, you should not still see their tweets in your feed. Twitter would not want you to uninstall the app over that, so this step makes sure such tweets are filtered out.
Suppose there is a user named Rohan. In this article, we will deep dive into how the algorithm selects which tweets to recommend to Rohan, and in what order. Let's begin!
Candidate sourcing
This is a crucial step because selecting only 1500 tweets out of millions is a hard job. The Twitter algorithm sources them from two places:
- In-network tweets (50%)
- Out-of-network tweets (50%)
In-network tweets:
These are the easy ones because they are sourced from the people you follow. If the number of tweets from the accounts you follow exceeds the 1500-tweet threshold, tweets are prioritized based on their recency and popularity.
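To make the idea concrete, here is a minimal, illustrative Python sketch of such prioritization. The exponential recency decay, the half-life, and the engagement weights are my own assumptions for illustration, not Twitter's actual scoring.

```python
import time

def in_network_score(tweet, now=None, half_life_hours=6.0):
    """Toy priority: engagement damped by an exponential recency decay.
    The half-life and the engagement weights are illustrative assumptions."""
    now = now or time.time()
    age_hours = (now - tweet["created_at"]) / 3600.0
    recency = 0.5 ** (age_hours / half_life_hours)        # 1.0 for a brand-new tweet
    popularity = tweet["likes"] + 2 * tweet["retweets"]   # crude engagement proxy
    return popularity * recency

tweets_from_followed = [
    {"id": 1, "created_at": time.time() - 1 * 3600, "likes": 120, "retweets": 10},
    {"id": 2, "created_at": time.time() - 24 * 3600, "likes": 900, "retweets": 40},
]
# Keep the top 1500 tweets from followed accounts by this priority score
in_network = sorted(tweets_from_followed, key=in_network_score, reverse=True)[:1500]
```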
Out-of-network tweets:
This is the hard part. Remember that a user can both create tweets (as a producer) and read/interact with tweets (as a consumer). In this article, we will refer to users with at least one tweet as producers, and every user is a consumer by default.
To understand it, we will start with the final output and then we will go into more details regarding its implementation.
We will discuss SimClusters, which is a clustering algorithm. Yes, don't be shocked: we will use clustering to choose the right set of tweets for the user. Here's how it works:
1. Cluster the producers into communities — these communities group producers that share a common topic of interest, e.g. politics, sports, AI, movies, etc.
2. Create the community vector for Rohan — Rohan will have interacted with tweets from producers in the past. For ease of explanation, we only consider the number of likes as the interaction.
Now, a little bit of mathematics. Take the dot product of the Rohan-producer matrix with the producer-community matrix.
What does the community vector for Rohan represent? He likes 33% of the tweets from Community 1 and 66% of the tweets from Community 2 (see the numeric sketch after this list).
3. Select the tweets from each community based on Rohan's community vector — to put it simply, if the algorithm has to source 1500 tweets for Rohan, it can pick 500 tweets from Community 1 and 1000 tweets from Community 2. Within each community, tweets are selected based on their popularity and recency, e.g. a recent tweet that is being liked by a lot of users. Hurray, we have selected the 1500 tweet candidates for Rohan.
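Here is a small numeric sketch of steps 2 and 3, assuming made-up like counts and a hard producer-to-community assignment; the numbers are illustrative, not real Twitter data.

```python
import numpy as np

# Rohan's historical likes on tweets from three producers (illustrative counts)
rohan_producer = np.array([2, 1, 3])            # likes given to P1, P2, P3

# Which community each producer belongs to (rows: producers, cols: communities)
producer_community = np.array([
    [1, 0],   # P1 -> Community 1
    [0, 1],   # P2 -> Community 2
    [0, 1],   # P3 -> Community 2
])

# Dot product gives Rohan's raw affinity to each community, then normalize
affinity = rohan_producer @ producer_community   # -> [2, 4]
community_vector = affinity / affinity.sum()     # -> [0.33, 0.67]

# Allocate the 1500-tweet budget across communities in proportion to the vector
budget = 1500
per_community = np.round(community_vector * budget).astype(int)
print(community_vector, per_community)           # ~[0.33 0.67] [ 500 1000]
```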
Did you notice something in the above process? Once you assign the producers to communities, the rest of the steps simply flow from the historical data. Twitter uses the SimClusters algorithm for clustering the producers into communities.
The basic idea is that users who belong to the same community are more likely to follow each other and to tweet about similar topics. So there are two crucial elements of a community. Users within a community:
- tweet about similar topics and
- tend to follow more people within the community than in other communities.
We will first try to capture both of the above elements in the data and then use an algorithm to assign the users to communities.
1. Create a user-topic similarity matrix — determine how similar each user's tweets are to each topic.
Now, you might be wondering how we know that P1 has 4 tweets about politics or P2 has 2 tweets on politics and 2 tweets about sports.
This is done using a text classification model, or simply based on similarity with a corpus of words per topic. Based on exploratory data analysis, we can define a corpus of words for each topic, e.g. sports can include words like cricket, basketball, ball, match, etc. Any given tweet is then compared with every topic corpus and assigned the topic with the highest match.
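Below is a toy Python sketch of the corpus-matching approach. The TOPIC_CORPUS dictionary and the overlap-count scoring are assumptions made for illustration; a production system would use a trained text classifier.

```python
import re

# Hypothetical keyword corpora per topic; a real system would learn these from data
TOPIC_CORPUS = {
    "politics": {"election", "vote", "parliament", "policy"},
    "sports":   {"cricket", "basketball", "ball", "match", "score"},
    "ai":       {"model", "neural", "training", "embedding"},
}

def assign_topic(tweet_text: str) -> str:
    """Assign the topic whose word corpus overlaps most with the tweet's words."""
    words = set(re.findall(r"[a-z]+", tweet_text.lower()))
    overlap = {topic: len(words & corpus) for topic, corpus in TOPIC_CORPUS.items()}
    return max(overlap, key=overlap.get)

print(assign_topic("What a match! That last ball was unreal"))   # -> sports
```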
2. Network structure between users — if two users follow each other and belong to the same community, it adds to the score, but if they follow each other while being tagged to different communities, that is treated as an undesirable state. Let's look at the network structure between users below:
Great! We have defined both elements. Now how do we cluster the producers into communities so that producers in each community tweet about similar topics and tend to follow more users within the community?
SimClusters uses a probabilistic method called the Metropolis-Hastings algorithm to sample from the posterior distribution of the community assignments given the tweet data and the network structure. Here is how it works:
- It starts by randomly assigning the users to the different communities and calls this the initial state
- It then generates a new state from a proposal distribution; to put it simply, we reassign some users from one community to another so that the new state differs from the current state
- Then we calculate the score (probability) for both the current state and the new state. If the new state's score is higher, we replace the current state with the new state; otherwise we retain the current state (in Metropolis-Hastings, a worse state can also be accepted with a probability that depends on the ratio of the two scores, which helps the sampler escape local optima). This process repeats for a defined number of iterations. The score is higher when the users within a community tweet about similar topics and follow most of their connections within the same community.
The most important part of the above algorithm is how we define the score for each state.
The above snapshot explains how the score is computed. If you are not able to follow it fully, don't worry; I will write a separate article explaining this process in detail along with the code. For now, just know that using the above algorithm we cluster the producers into communities, and each community ends up with users who tweet about similar topics and follow each other more. Moreover, in practice, we generally compute the log probability instead of the probability for numerical stability. A simplified sketch of the whole loop is below.
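To make the sampling loop concrete, here is a simplified, self-contained sketch of a Metropolis-Hastings-style community assignment. The log-score (a crude mix of in-community follow edges and topic agreement), the one-user-at-a-time proposal, and all the weights are my own simplifications for illustration, not Twitter's actual SimClusters implementation.

```python
import math
import random
from collections import Counter

def log_score(assignment, follows, user_topics):
    """Toy log-score of a community assignment: rewards follow edges that stay
    inside a community and users whose topic matches their community's majority
    topic. The weights here are illustrative assumptions."""
    score = 0.0
    for u, v in follows:
        score += 1.0 if assignment[u] == assignment[v] else -0.5
    for c in set(assignment.values()):
        member_topics = [user_topics[u] for u, comm in assignment.items() if comm == c]
        if member_topics:
            _, count = Counter(member_topics).most_common(1)[0]
            score += count  # members agreeing with the community's dominant topic
    return score

def cluster_producers(users, follows, user_topics, n_communities=2, n_iters=2000, seed=0):
    rng = random.Random(seed)
    state = {u: rng.randrange(n_communities) for u in users}        # random initial state
    current = log_score(state, follows, user_topics)
    for _ in range(n_iters):
        proposal = dict(state)
        proposal[rng.choice(users)] = rng.randrange(n_communities)  # reassign one user
        new = log_score(proposal, follows, user_topics)
        # accept a better state outright; accept a worse one with prob exp(new - current)
        if new >= current or rng.random() < math.exp(new - current):
            state, current = proposal, new
    return state

producers = ["P1", "P2", "P3", "P4"]
follows = [("P1", "P2"), ("P3", "P4")]
topics = {"P1": "politics", "P2": "politics", "P3": "sports", "P4": "sports"}
print(cluster_producers(producers, follows, topics))
```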
Hurray! We have completed the first part and have already shortlisted 1500 tweets for Rohan. Now let's score them in the next step.
Ranking Candidates
The idea is that we compute the probability of the user interacting with each of the selected tweets and rank them in descending order of that probability. Different interactions could be:
- Like Probability
- Reply Probability
- Retweet Probability
- View Probability
Intuition:
In the past, which types of users have had which types of interactions with which types of tweets? Each "type" is described by features on which we can train our model. The features are mainly:
- User Features — user location, preferred language, age, total followers, etc.
- Tweet Features — tweet text features, media features, likes, replies, retweets, etc.
Now the question is: how can we convert this into a classification model? This involves preparing the modeling data, which means defining the dependent variable, the independent variables, and the granularity of the model.
Granularity — User * tweet
Dependent Variable — In reality, we create a multiclass classification model. But for ease of understanding, we consider only one interaction scenario, the like: if the user has liked a given tweet, mark it as 1, else 0.
Independent Variables — All the user features and tweet features become independent variables.
For model training, we create a dataset like the one above from past data; the model can be trained on millions of user*tweet combinations. Since some of the variables are text or categorical, we can create embeddings or one-hot encodings for them before training. We can use any classification model.
Now let's predict the probability of Rohan liking each of the 1500 tweets that were selected in step 1. The higher the probability, the higher the tweet is ranked.
User features will be based on Rohan, tweet features will be based on the selected 1500 tweets, and we will use the trained model to predict the probability of Rohan liking each tweet.
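Here is a minimal sketch of how such a ranking model could be trained and applied, using a scikit-learn logistic regression as a stand-in for the real ranker. The feature names, the toy training rows, and the two candidate tweets are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical historical user*tweet rows: did the user like the tweet?
history = pd.DataFrame({
    "user_location": ["IN", "US", "IN", "UK"],
    "user_followers": [120, 5000, 80, 300],
    "tweet_has_media": [1, 0, 1, 0],
    "tweet_likes": [10, 900, 3, 45],
    "liked": [1, 1, 0, 0],                       # dependent variable
})

features = ["user_location", "user_followers", "tweet_has_media", "tweet_likes"]
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["user_location"])],
    remainder="passthrough",
)
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
model.fit(history[features], history["liked"])

# Score the candidates: Rohan's user features paired with each candidate tweet's features
candidates = pd.DataFrame({
    "user_location": ["IN", "IN"],               # Rohan's features repeated per tweet
    "user_followers": [150, 150],
    "tweet_has_media": [1, 0],
    "tweet_likes": [500, 12],
})
like_prob = model.predict_proba(candidates[features])[:, 1]
ranking = np.argsort(-like_prob)                 # highest probability first
print(like_prob, ranking)
```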
Please note that in our example we used only one interaction variable for prediction, but in practice we predict the probability of all the interactions and create a combined score for ranking. For example:
- Like probability: 0.8
- Reply probability: 0.2
- Retweet probability: 0.1
- View probability: 0.05
Score: 0.8 * 0.4 + 0.2 * 0.3 + 0.1 * 0.2 + 0.05 * 0.1 = 0.405
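As a quick sketch, the combined score is just a weighted sum of the predicted interaction probabilities; the weights below are illustrative, not Twitter's actual values.

```python
# Illustrative weights for combining interaction probabilities; real weights are not public
weights = {"like": 0.4, "reply": 0.3, "retweet": 0.2, "view": 0.1}
probs   = {"like": 0.8, "reply": 0.2, "retweet": 0.1, "view": 0.05}

score = sum(weights[k] * probs[k] for k in weights)
print(round(score, 3))   # 0.405
```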
Interesting Fact:
Elon Musk mentioned in one of his tweets that the logistic regression model used for ranking was trained many years ago. Isn't it amazing that a model trained so long ago still gives such relevant predictions?
Applying post rules
This step is mostly rule-based (a rough sketch follows the list below), where:
- Certain tweets blocked by users are filtered out from the ranked list
- Advertisements are mixed in with the ranked tweets
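Here is a rough sketch of what this filtering and ad-mixing could look like; the blocked-author rule and the ad cadence are assumptions for illustration, not Twitter's actual rules.

```python
def apply_post_rules(ranked_tweets, blocked_authors, ads, ad_every=5):
    """Drop tweets from blocked authors, then interleave ads into the ranked feed.
    Both the blocking rule and the ad cadence here are illustrative assumptions."""
    visible = [t for t in ranked_tweets if t["author"] not in blocked_authors]
    feed = []
    for i, tweet in enumerate(visible, start=1):
        feed.append(tweet)
        if ads and i % ad_every == 0:
            feed.append(ads.pop(0))          # insert the next ad after every few tweets
    return feed
```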
This is the overall process that Twitter goes through to provide you with the recommendations on the "For You" page. Please note that in the tweet sourcing step they also use some other models, but I have discussed SimClusters, one of the most popular techniques they use in the recommendation. Moreover, for ranking, although I referred to a logistic regression model, in practice it is trained as a neural network whose last layer is a sigmoid giving the probability of each interaction. I will share a separate article on the implementation of the SimClusters algorithm and the neural network model for ranking.
I hope you now have a conceptual understanding of the Twitter recommendation algorithm and can leverage it in your future projects.
Please note that the above explanation might not be 100% the same as what Twitter actually uses, but the intent of this article was to convey the conceptual details of models that you might also use in other projects. I hope you liked this article.
Happy Learning!