Nearest neighbour based method for collaborative filtering
Nearest-neighbour methods are one way of performing collaborative filtering. If collaborative filtering is new to you, don’t forget to read this article for a brief overview. The objective is to predict a user’s ratings for the products they have not yet rated. The user whose ratings we are trying to predict is called the active user. The method finds the k-nearest neighbours of the active user and recommends products those neighbours liked. It works because users who have liked the same products in the past are probably similar, and similar users tend to like the same kinds of products. The steps to recommend products using nearest neighbours are:
- Find the k-nearest neighbours of the active user.
- Compute the similarity-weighted average of the neighbours’ ratings for each product. These become the active user’s predicted ratings for the products s(he) has not rated yet.
So in short: we find similar users, take the weighted average of the ratings they have given to each product, and use that as the active user’s predicted rating.
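The steps above can be sketched in a few lines of numpy. The ratings matrix and the choice of cosine similarity over co-rated items are just illustrative assumptions; any similarity metric from the sections below would work.

```python
import numpy as np

# Toy ratings matrix (rows = users, cols = items); 0 marks "not rated".
# All numbers here are made up for illustration.
ratings = np.array([
    [5, 3, 0, 4],
    [4, 0, 4, 5],
    [5, 3, 4, 4],
    [1, 5, 2, 0],
], dtype=float)

def predict(ratings, active, item, k=2):
    """Predict the active user's rating for `item` from its k most similar users."""
    # Only users who have actually rated `item` can be neighbours.
    rated = [u for u in range(len(ratings)) if u != active and ratings[u, item] > 0]

    def sim(u, v):
        # Cosine similarity over the items both users have rated.
        mask = (ratings[u] > 0) & (ratings[v] > 0)
        a, b = ratings[u, mask], ratings[v, mask]
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    neighbours = sorted(rated, key=lambda u: sim(active, u), reverse=True)[:k]
    sims = np.array([sim(active, u) for u in neighbours])
    nbr_ratings = np.array([ratings[u, item] for u in neighbours])
    # Similarity-weighted average of the neighbours' ratings.
    return sims @ nbr_ratings / sims.sum()

print(predict(ratings, active=0, item=2))  # -> 4.0
```

Here user 0’s two most similar neighbours (users 2 and 1) both rate item 2 a 4, so the weighted average is 4.0.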
But how do we find the neighbours of the active user? How do we measure similarity between users? Intuitively, similar users give similar ratings to the same products. For example, if users A and B both rate product P a 5, then A and B are similar.
To measure similarity we can build a user-item ratings matrix:
In the user-item matrix above there are D items and N users, and each row holds one user’s ratings of the items. Our goal is to predict how User1 would rate ItemD. The first step is to find users similar to User1. Looking at the matrix, UserN is similar to User1 because they both rate Item1, Item2 and Item4 as 4, 5 and 3 respectively. The matrix is very sparse, which mirrors the real-world scenario, and our goal is to fill in those empty cells. Since User1 and UserN are very similar, it is highly probable that User1 would also rate ItemD a 4, because UserN rated ItemD a 4.
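A tiny matrix like the one described can be written down with `np.nan` marking missing ratings. The numbers below are illustrative, chosen to match the User1/UserN example in the text:

```python
import numpy as np

# Rows are users, columns are items; np.nan marks a missing rating.
ratings = np.array([
    [4, 5, np.nan, 3, np.nan],  # User1 -- ItemD (last column) is unknown
    [2, np.nan, 1, np.nan, 5],  # User2
    [4, 5, np.nan, 3, 4],       # UserN -- agrees with User1 where both rated
])

# Compare User1 and UserN on the items both have rated.
both = ~np.isnan(ratings[0]) & ~np.isnan(ratings[2])
print(ratings[0][both])  # [4. 5. 3.]
print(ratings[2][both])  # [4. 5. 3.]
# Identical co-ratings -> very similar users, so UserN's 4 for ItemD
# is a reasonable guess for User1's missing ItemD rating.
```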
In reality, finding the similarity between the active user and other users is more involved: we need a similarity or distance metric. There are several common metrics for measuring similarity between users:
1. Euclidean distance
This metric is very simple: it is just the distance between two points in space. If each user is a point in an N-dimensional space, User1 = (x1, x2, x3, …xn) and User2 = (y1, y2, y3, …yn), the distance between them is:
from scipy.spatial import distance
# Euclidean distance between two users' rating vectors
euclidean_distance = distance.euclidean(User1, User2)
Or mathematically, d(User1, User2) = sqrt((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²).
2. Cosine similarity
This metric calculates the cosine of the angle between two vectors. As two vectors point in more similar directions, the angle theta between them decreases and the cosine grows.
If the two vectors are perfectly aligned then the similarity is one, and if they are orthogonal it is zero. To compute cosine similarity, first take the dot product of the two vectors, then divide it by the product of their magnitudes.
3. Pearson correlation
Correlation is another similarity measure between two vectors. It is similar to cosine similarity, but it takes each user’s bias into account: some people habitually rate products higher, while others rate them lower. How can we account for these biases? One way is to normalize each user’s ratings by subtracting their average rating, and this is exactly what Pearson correlation does.
r(x, y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / ( sqrt(Σᵢ (xᵢ − x̄)²) · sqrt(Σᵢ (yᵢ − ȳ)²) )
Here x̄ and ȳ are the means of all the ratings users x and y have given, respectively. Each user’s ratings are first shifted by x̄ and ȳ (to remove the bias), and then the angle, i.e. the similarity, between the centred vectors is calculated.
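In code, Pearson correlation is just cosine similarity after mean-centring each vector. The two example users below are made up: a tough rater and a generous rater with identical taste.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation = cosine similarity of mean-centred vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()  # remove each user's rating bias
    return xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

a = [1, 2, 3]  # rates everything low
b = [3, 4, 5]  # rates everything high, but with the same preferences
print(pearson(a, b))                # -> 1.0: same taste once bias is removed
print(np.corrcoef(a, b)[0, 1])      # numpy's built-in gives the same answer
```

Notice that plain cosine similarity would also score these users highly, but Pearson gives exactly 1.0 because after centring, their rating vectors are identical.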
I will write more about recommendation system algorithms in my coming articles. If you liked this post, don’t forget to clap and follow me on Medium and on Twitter.