Recommender systems have become ubiquitous and very important in recent years. They are present in a variety of areas, such as for recommending movies, music, news, books. One of the most popular algorithms used to produce a list of recommendations is collaborative filtering. In this blog post we will talk about user-based collaborative filtering algorithm and how to efficiently calculate similarities used by this algorithm. The basic idea behind user-based collaborative filtering is as follows: If we want to predict how user U will like item I, we can check how other users who are similar to user U have rated that item. Then, we can calculate the prediction - it is probable that the user will rate items like others who have similar taste.
Here is a simple ratings matrix of 4 users and 5 items (scores are between one and five). It seems that users U1 and U3 are similar because they gave similar ratings to items rated by both of them (items I4 and I5). Also, users U2 and U4 are similar.
Now, when you want to implement a recommender system and evaluate it, you typically need to find similarities between many pairs of users. If the number of users that should be mutually compared is N then the number of comparisons, i.e. combinations of users is C(n, 2), which has an order of magnitude of N2. When N is large (which is often the case), we want to find a solution that will calculate all similarities efficiently. It is clear that loop on N2 iterations would not be a good solution in that case. Since all ratings are stored in a matrix where rows represent users and columns represent items, matrix operations can help us deal with this complexity.
Ratings matrices are usually both large (there are a lot of users and items) and sparse (typically users rate only few of items, if any). In R, for example, there is a separate representation for large sparse matrices, such that missing values (ratings) are not stored into memory. Very often, over 90% of ratings are missing, and this kind of representation enables us to keep these large matrices in memory. When we need to calculate similarities between users or items, there is a number of measures that are commonly used, such as the Pearson correlation, Cosine, Euclidean similarity etc. In every language that supports matrix operations, there is often a package or library providing functions to calculate these similarities over the columns or rows of a matrix. However, many of those functions may not efficiently deal with large sparse matrices or may not cover some variants of formulas for calculating similarities. If we know how to utilize matrix operations in these scenarios, we will be able to:
- Efficiently calculate similarities on large ratings matrices.
- Achieve flexibility, i.e. we can manually tune or extend formulas in order to implement more formula variants.
Let’s see how we can use matrix operations in order to implement some of the most common similarity metrics - Pearson correlation and Euclidean similarity.
Pearson correlation is a common measure used to check how correlated two variables (vectors) are. This measure becomes handy for calculating similarities between users in recommendation systems since all users can be represented as vectors of ratings. The general formula for correlation between two users a and b is shown below. We will not go in too deep regarding the meaning of the formula, just to emphasize a couple of details: Ratings are marked with ‘r’, while normalized ratings are marked with ‘r̂’. For each user, normalized ratings are obtained by subtracting the user’s average (r̄) from his real ratings. Correlation is measured by comparing only items rated by both users (Iab).
Let’s take a look at our example matrix from the beginning, and check users U2 and U4. They can be represented as vectors, where missing values can be considered as zero:
We can see that both users rated items I1, I3 and I4 (green cells). We will not consider rating that user U4 gave to I5 (red cell) because user U2 did not rate I5, so we cannot compare their taste about that item. In order to calculate correlations, we use normalized ratings (Û2 and Û4). According to the formula we need to do summations over items rated by both users (green cells). This can be done by summing over vector Û2 only in places where Û4 is not missing, and vice-versa.
In R, there is a function called crossprod, which calculates the sums of products column-wise, between each pair of columns. More precisely, if we have rating matrices (or vectors) A and B, then: crossprod(A, B) = transpose(A) X B, where X is operator for the multiplication of matrices. We will use this function here, to calculate the correlation between U2 and U4. If the missing values are coded as zeros, the formula can be written as follows (we assume that logical matrices generated by comparison with zero can be used in multiplication, with value 0 for false, and 1 for true):
As we already mentioned, in case of a large number of users that we need to compare, the calculation of correlations in a loop will be very slow and may run for hours depending on the number of iterations. However, we can utilize matrix multiplication which is a fast operation in R and other languages that support matrix operations. By generalizing the above formula, we get the formula to calculate correlations between each pair of columns on matrices A and B, where columns represent users that we want to compare. availA and availB are logical or zero-one matrices, indicating which value is available and which one is missing in the original matrix.
When this formula is implemented in R, it takes a couple of seconds to execute on a large sparse matrix (e.g. 400,000 users and 25,000 items), which is by far better than a loop that would need hours to finish. The output is a matrix of correlations. In our example, if we need to find correlation for each pair of users, we will apply the formula using the same matrix as input for both arguments:
Ok, now we have correlations. But, was it really necessary to go deep into the matter and do manual implementation? Well, if you just need correlations calculated using the original formula and you have found a library that can calculate it quite fast, then you don’t need to do this by yourself. But, in case that the implementation you found is not applicable to large matrices, or if you want to implement some variant of the formula, then this approach is quite valuable.
There is a number of variants of the correlation similarity measure. For example, the constrained Pearson correlation is introduced in case we want to consider the impact of positive and negative ratings. It uses the rating scale median instead of user’s mean for normalizing ratings. Another important aspect is weighting. Intuitively, if two users have more items that both have rated, the similarity will be more reliable. Weighted Pearson correlation can be calculated in several ways. One approach is to take the number of common items(|I|) and use some reliability threshold (H) on this number:
For each pair of users, we can easily find the number of items rated by both users (|I|), simply by calculating crossprod on availA and availB matrices.
Euclidean distance is a common metric used to measure distance between vectors. It is calculated by directly comparing how each pair of ratings differ. More precisely, we first sum squared rating differences across all common items, and then calculate a root:
With Euclidean distance, we can find similarity using the appropriate scaling (inverting distance). In R, we were not able to find any implementation that could efficiently calculate the Euclidean distances on large ratings matrices, and decided to switch to manual implementation. Now, in the formula above, we have subtraction between two vectors. It seems that this subtraction cannot be calculated using matrix multiplication. However, we can apply one trick we all remember from algebra ;-) ( the square of a binomial ), in order to transform the expression under square:
Now, we have three operations of vector multiplication under square. Finally, this can be generalized to rating matrices A and B as follows (availA and availB matrices have the same meaning and purpose as explained in the example with correlation):
Again, this formula executes in seconds in R. After calculating distances, we want to transform them into similarities. Here again, manual implementation provides us an option to scale distances as we prefer. One approach is as follows: euclDist between two users takes the total distance between all of their common ratings. Average distance can be obtained by dividing the total distance (expression under square) by the number of common ratings (ratings for items that were rated by both users):
All average distances are in the interval [0,4], since ratings are in the interval [1,5]. By using an appropriate scaling factor, similarities can be easily calculated as euclSim = 1-(avgEuclDist/4):
Matrix multiplication is a very fast operation and can be useful whenever you need to operate on matrix-like data. In this blog, we briefly explained how it can be utilized in collaborative filtering setup, when you need to calculate similarities between a large number of users or items. I hope these examples gave you some good intuition into this approach and the great value it has.