This article will be a wrap-up of our series related to collaborative filtering techniques and how to apply them on large datasets in R. In the previous post we covered the nearest neighbours method. Here, the focus will be on the model-based algorithm, namely matrix factorization. In comparison to the neighbours method, it is often better in terms of prediction accuracy and time needed to train the model.

The idea behind matrix factorization is to capture patterns in rating data in order to learn certain characteristics, aka latent factors that describe users and items. In the picture below, these factors are stored in two matrices, P - user factors and Q - item factors. Let’s imagine that items are movies that users have rated. For movies, those factors might measure obvious dimensions such as the amount of action or comedy, orientation towards children, less well defined dimensions such as depth of character development or quirkiness; or completely uninterpretable dimensions. For users, each factor measures how much the user likes movies that score high on the corresponding movie factor.

For a given item i, the elements of (column) qi measure the extent to which the item possesses factors, positive or negative. For a given user u, the elements of pu measure the extent of interest the user has in items that are high on the corresponding factors, again, positive or negative. The resulting dot product, pTu* qi, captures the interaction between the user u and item i - the user’s overall interest in the item’s characteristics. This estimates the user u’s rating of item i, which is denoted by rui. Our goal is to find the best possible estimates for all existing ratings, i.e. to minimize the cost function:

where ||.|| is the Euclidean norm, and S is the set of all existing ratings. Note that the formula includes a regularization part (with lambda coefficient). The system learns the model by fitting the previously observed ratings, but the goal is to generalize those previous ratings in a way that predicts future, unknown ratings. Thus, the system should avoid overfitting the observed data by regularizing the learned parameters, whose magnitudes are penalized. The lambda constant controls the extent of regularization and is usually determined by cross-validation.

### The road to minimum

The above summation of rating-estimate differences is a non-convex function of P and Q elements and thus finding a minimum is a difficult optimization problem. There are several techniques that can be used, and we will give an overview of the most popular ones.

1. Use gradient of cost function

Let’s imagine that the cost function J is dependent on one variable (w) as shown in the image below. In order to find a minimum gradient descent algorithm can be used as follows: We can start with randomly chosen value for w - initial weight. Calculate the gradient of J for the current w value and update w using the gradient. This way, the value of cost function J for updated w is closer to minimum. Repeat these steps until convergence. (Of course, this procedure might get stuck in local minimum, if such exists.)

The above example can be generalized to the case when J depends on multiple variables, i.e. P and Q factors. First, initialize all factors to small random values. Then, find the minimum either by using:

- gradient descent, as explained above, or
- some more advanced algorithm that uses the gradient for minimization (e.g. L-BFGS)

However, in case when the number of ratings is large, all of these algorithms might be slow since calculating the gradient of J is expensive (it contains summation over all ratings). The exact formulas for calculating gradients and parameter updates are shown below: for each parameter, a corresponding gradient is multiplied by learn rate (alpha) and subtracted from the current parameter value. (Note that the gradients below are of J/2, so if we want to be precise this would be the optimization of J/2).

2. Stochastic gradient descent (SGD)

The basic idea of stochastic gradient descent is that, instead of expensively calculating the gradient of J, it randomly selects rating rui and calculates the corresponding gradient (based solely on that rating, not the whole sum):

The overall procedure of SGD is to iteratively select rui instances and apply updates according to given formulas. During a single iteration we use all ratings for updating parameters (each rating exactly once).This procedure is sequential, however it can converge much faster than classic gradient descent which needs to sum over all ratings in order to do one single step towards the minimum.

3. Alternating least squares (ALS)

Because both qi and pu are unknowns, the J function is not convex. However, if we fix one of the unknowns, the optimization problem becomes quadratic and can be solved optimally (using normal equation). Thus, ALS technique rotates between fixing qi ’s and pu’s. When all pu’s are fixed, the system recomputes the qi’s by solving a least-squares problem, and vice versa. Again, this is an iterative process, but suitable for parallelization. For example, in ALS the system computes each pu independently of the other user factors (so we can solve normal equations for different users in parallel). The same holds for calculating item factors.

### R package ‘recosystem’

In order to apply matrix factorization on a large dataset, it is important to achieve short convergence time when optimizing the cost function J. This is especially important in case we need to use cross-validation to optimize parameters (regularization, number of factors etc). ‘recosystem’ is an R package for fast matrix factorization using parallel stochastic gradient descent. It is the wrapper of an open source C++ library, LIMBF, and the fastest among packages for matrix factorization that we have tested in R. It supports both working with rating data in-memory and on disc. In this article the focus will be on use cases when ratings can fit into available RAM.

We have described the stochastic gradient descent as a sequential and iterative process, but it can be parallelized. Here are some ideas behind ‘recosystem’ and how it uses parallelization (all details can be found in the documentation posted here):

- We can randomly select multiple ratings, and apply parameter updates in parallel for all of them. Those ratings preferably belong to different users and items, so that we don’t have the overwriting problem where different threads access the same data or variables pu and qi at the same time. This is somewhat easier to achieve with sparse matrices (and rating matrices are usually sparse).
- Matrix can be divided into blocks that can be manipulated in parallel.

The usage of this package is really straightforward. Here is a code example, where the ratings_data variable represents a dataframe with the following columns: user_id, item_id, rating. Parameters are set arbitrarily: the number of factors (dim) is 30, regularization for P and Q factors (costp_l2, costq_l2) is set to 0.1, and convergence can be controlled by a number of iterations (niter) and learning rate (lrate). The user can also control the parallelization using the nthread parameter:

When the code is applied to Movielens datasets (90% train, 10% test), on a machine with 8 cores and 16 GB RAM, we get this execution time:

- ~3 sec for MovieLens 1M
- ~30 sec for MovieLens 10M

This performance allows us to execute parameter optimization within a reasonable amount of time. For instance, here are the results for RMSE using 10-fold cross validation on MovieLens 10M, with a different number of factors (10, 20, 30) and iterations (100, 300, 500), for fixed lambda = 0.1, and 6 threads:

In total, we had 9 combinations (num factors - num iterations). For each combination of those parameters 10-fold cross validation was performed, and the execution was done in ~120 mins.

### Conclusion

Matrix factorization is a powerful method, one of the most popular for calculating recommendations based on past ratings. Among many implementations and approaches for matrix factorization, it is important to find an efficient one, especially in case we need to deal with big datasets. As long as the dataset can fit and be processed within the available RAM on one machine, the ‘recosystem’ is a fairly good choice.