The (often) forgotten child of Machine Learning
Everyone with an internet connection has been subjected to a recommender system (RS).
Almost all media services have a particular section where the system recommends things to you, being things a movie in Netflix, a product to buy in Amazon, a playlist in Spotify, a page to like in Facebook and so many others. We all have seen the “You might like this” section at least once in almost all services we daily use; the algorithm behind that is a Recommender System.
Some might consider a RS as a system that tries to get to know you better than yourself and give you exactly what you need. Sounds kind of (maybe a lot) creepy if you put it this way but its goal is to make your experience the best possible by learning your likes and dislikes.
Like all applications of Machine Learning (ML) the prospect of a RS is that of learning from a huge amount of data and then predict on new incoming data where prediction in the RS can be seen as a classification problem of whether the user will like the item recommended or not. Even though there are quite noticeable differences starting from the way data is stored (sparse matrices) to the approaches and models specific for RS only, the steps followed for all ML applications remain the same:
- Data Visualization
- Data Pre-processing
- Choosing a model
- Hyper-Parameter Tuning (HPT)
While it’s all fine and dandy for the first 5 steps the wall I ran into while working on my thesis was the last step: Hyper-Parameter Tuning. The lack of material and practical examples for the possible approaches to do HPT in RS, what works best in general and whatnot, is quite prominent and it’s what sparked the idea to write this post.
Gentle Recommender System introduction
The setting of a RS algorithm is the same of any other ML application with the exception of two key elements:
- Data storing: in RS data is stored in sparse matrices, more precisely in two, URM (User Rating Matrix) and ICM (Item Content Matrix). A URM, a UserxItem matrix, contains all the information about the user preferences that is, in a row we have ratings (explicit or implicit) done by a user to the items that the service provides. ICM instead is a matrix that holds metadata about all the items, where each row contains all the information about an item.
- Approaches: can be categorized in three types.
- Content-based: we use the features (metadata) of the items to find other items similar among them and then recommend to users items similar to those the user already showed interest to. In more practical words: For a user who liked/saw/rated the movie “The Great Beauty” you recommend a movie like “Youth” because of things they have in common like the director (Paolo Sorrentino), same genres (Drama, Comedy), keywords and so on.
- Collaborative filtering: exploits the fact that as different as people may be there exist patterns when it come preferences so two users with similar preferences we recommend to one items that the others likes which are unknown to such user.
- Hybrid: a combination of the previous two through the old ways off ensembling
Once you have your data aka matrices ready and chosen the method to use it’s all a matter of matrix multiplication.
What we do in model tuning is shape our model with the goal of achieving the best possible score in unseen data (Test set). Before jumping to the ways HPT can be done, first we need to know what hyper-parameters (HP) are.
In ML a hyper-parameter is a parameter that the model doesn’t learn by itself but we have to provide to the model before it starts learning, e.g number of hidden units in a Neural Network, number of trees in a Random Forest, the K number of nearest neighbors in KNN etc.
The key is that HP themselves need to be learnt: we don’t know beforehand if 10 hidden units in a Neural Network are the best choice for our problem or if 1000 is better, if 50 trees in a Random Forest will yield better performance than 100 trees and so on. Hence the need to do HPT.
HP are not to be confused with parameters: while HP are provided by us to the model, the parameters are automatically learnt inside the black-box that the model usually is.
A quick search with show you that the most essential and used approaches for HPT are:
- Manually: we could go back to building a fire with just two stones while we’re at it
- Grid Search (GS): it’s getting slightly better
- Random Search: you’re starting to see the light
- Bayesian Optimization (BO): now we’re talking
The main reason behind such consideration for the HPT approach is due to one fact only: time. The time it takes to find the optimal HP for your problem is crucial, especially in a field where recommendations have to be made at real-time, updated with information that you insert at the moment when doing it online.
To prove the hassle behind all this I’ve made a fairly simple example that proves the point, confronting two of these approaches:
- GS: a simple brute-force approach
- BO: an automatic approach that has proven to outperform other state of the art optimization algorithms in many problems.
To not enter too much into details of the algorithm used for the scope of the post is not that, the basic facts are these:
Dataset: MovieLens dataset with 10 Million interactions
RS method: KNN, a content-based approach, where for each movie we find K similar movies (nearest neighbours) to it, where the similarity is Cosine Similarity
Train — Test: an 80%-20% split with approximately 8 Million of interactions in Train set and 2 Million interactions in Test set
Validation set: to make HPT possible we need also a validation set which I generated by further splitting the Train set in Train and Validation set with a ratio of 80–20%, so now we have about 7.4 Million interactions in Train and 1.6 interactions in Validation set
Metric: MAP@5 (Mean Average Precision) treats the recommendation as a ranking task where the aim is that of recommending to users first the items that are most likely for the user to fancy. The 5 stands for evaluating the top 5 five items recommended. The value of MAP goes from 0 to 1. The higher the MAP value, the better.
Hyper-parameters to tune: NN (Nearest Neighbours) and Shrinkage factor
The idea behind GS is that of creating a grid with the possible values for your hyper-parameters and then start testing out the combinations of such possible values.
The possible combinations in our case would be 81 which means 81 times of training and evaluating our algorithm with each parameter combination. Imagine having more than 2 parameters to tune though, image how that the number of combinations would increase and consequently the time as well. Yeah, it would take quite a lot more.
Good: by fixing one parameter at the best value found by GS, we vary the other visualizing the change in performance. The sensitivity of the model to hyper-parameters is clear and shows the importance of such step in the process of creating a RS.
Not so good: this is the time that it took to try all the combinations!
Needless to say, I went looking in a frenzy for another method to do this last step of my model training and my research lead me to BO.
Bayesian Optimization Approach
The thing about GS, Manual and Random search also, is that the next set of hyper-parameters to evaluate is made in an uninformed fashion. This is where the BO approach stands out: it uses the past choices made in order to make a smart pick for the next set of values to test out, reducing thus the number of evaluations done at the cost of searching for the next parameters to test in a more intelligent way.
As optimization goes, in BO we are aim to find the maximum (or minimum) of a function f (x) on a bounded set X which translated to this specific problem means that the function being considered is one that in synthesis given an input x (a configuration of the HP) from X (set of all possible values of the HP) it returns the performance of the model generated with such HP. For the BO process the function f is unknown and what it does is associate a prior to it. The purpose of the prior is to express assumptions about the function being optimized and as evaluations are done a posterior distribution is formed. The posterior distribution then itself is used to drive the acquisition function of the next input (HP) to evaluate. It’s a sequential strategy and thus with the increasing number of iterations, so does the quality of the posterior which in turn means that the certainty of the goodness of the next region to be explored is higher.
As for many other methods concerning the world of DS you can either use this as a black-box to reach your goal or you could try to do some unwrapping and try to understand how such optimization is done. The full explanation of this approach can be read here, which is where the implementation of BO used for this post was based on . If you’d rather read something lighter instead, this article was very helpful for me.
The implementation of BO used for this example is this Python library.
In practice, there are three basic elements that you need to define in order to do BO for HPT:
- The range of parameter values, so the domain of the function:
- The function to optimize, where you can adapt to whichever metric you want hence the facility of the approach:
- Call the function
The only sticky aspect of BO is that it chooses continuous values to try out and you need to adapt in case your hyper-parameters are integer numbers; nothing a simple casting can’t solve though. Define these three elements and your job is done.
The difference in time to reach the same results you might wonder:
The best parameter combination found by using GS:
The best parameter combination found using BO:
There is a slight change in those parameters but not to worry. A simple test can remove just any doubt.
The difference in time performance is noticeable, almost 10 times less, yet the goal is reached by both methods. By no mean a simple example like this can be used as proof that Bayesian Optimization is the way to go for all Recommender System problems when doing hyper-parameter tuning but it is interesting to see that RS is no exception in the use of BO. After all, if there is a certainty in Data Science it’s that there is no certainty that an approach is going to work in your dataset.