A maths-free explanation of an underappreciated algorithm
Gaussian processes are a powerful algorithm for both regression and classification. Their greatest practical advantage is that they can give a reliable estimate of their own uncertainty. By the end of this maths-free, high-level post I aim to have given you an intuitive idea for what a Gaussian process is and what makes them unique among other algorithms.
- Recap on machine learning
- How to deal with uncertainty
- Bayesian inference in a nutshell
- Gaussian processes
What is machine learning?
Machine learning is linear regression on steroids.
Machine learning is using data we have (known as training data) to learn a function that we can use to make predictions about data we don’t have yet. The simplest example of this is linear regression, where we learn the slope and intercept of a line so we can predict the vertical position of points from their horizontal position. This is shown below, the training data are the blue points and the learnt function is the red line.
Machine learning is an extension of linear regression in a few ways. Firstly is that modern ML deals with much more complicated data, instead of learning a function to calculate a single number from another number like in linear regression we might be dealing with different inputs and outputs such as:
- The price of a house (output) dependent on its location, number of rooms, etc… (inputs)
- The content of an image (output) based on the intensities and colours of pixels in the image (inputs)
- The best move (output) based on the state of a board of Go (input)
- A higher resolution image (output) based on a low resolution image (input)
Secondly, modern ML uses much more powerful methods for extracting patterns of which deep learning is only one of many. Gaussian processes are another of these methods and their primary distinction is their relation to uncertainty.
Thinking about uncertainty
Uncertainty can be represented as a set of possible outcomes and their respective likelihood —called a probability distribution
The world around us is filled with uncertainty — we do not know exactly how long our commute will take or precisely what the weather will be at noon tomorrow. Some uncertainty is due to our lack of knowledge is intrinsic to the world no matter how much knowledge we have. Since we are unable to completely remove uncertainty from the universe we best have a good way of dealing with it. Probability distributions are exactly that and it turns out that these are the key to understanding Gaussian processes.
The most obvious example of a probability distribution is that of the outcome of rolling a fair 6-sided dice i.e. a one in six chance of any particular face.
This is an example of a discrete probability distributions as there are a finite number of possible outcomes. In the discrete case a probability distribution is just a list of possible outcomes and the chance of them occurring. In many real world scenarios a continuous probability distribution is more appropriate as the outcome could be any real number and example of one is explored in the next section.
Another key concept that will be useful later is sampling from a probability distribution. This means going from a set of possible outcomes to just one real outcome — rolling the dice in this example.
Bayesian inference might be an intimidating phrase but it boils down to just a method for updating our beliefs about the world based on evidence that we observe. In Bayesian inference our beliefs about the world are typically represented as probability distributions and Bayes’ rule tells us how to update these probability distributions.
Bayesian statistics provides us the tools to update our beliefs (represented as probability distributions) based on new data
Let’s run through an illustrative example of Bayesian inference — we are going to adjust our beliefs about the height of Barack Obama based on some evidence.
Let’s consider that we’ve never heard of Barack Obama (bear with me), or at least we have no idea what his height is. However we do know he’s a male human being resident in the USA. Hence our belief about Obama’s height before seeing any evidence (in Bayesian terms this is our prior belief) should just be the distribution of heights of American males.
Now let’s pretend that Wikipedia doesn’t exist so we can’t just look up Obama’s height and instead observe some evidence in the form of a photo.
Our updated belief (posterior in Bayesian terms) looks something like this.
We can see that Obama is definitely taller than average, coming slightly above several other world leaders, however we can’t be quite sure how tall exactly. The probability distribution shown still reflects the small chance that Obama is average height and everyone else in the photo is unusually short.
What is a Gaussian process?
Now that we know how to represent uncertainty over numeric values such as height or the outcome of a dice roll we are ready to learn what a Gaussian process is.
A Gaussian process is a probability distribution over possible functions.
Since Gaussian processes let us describe probability distributions over functions we can use Bayes’ rule to update our distribution of functions by observing training data.
To reinforce this intuition I’ll run through an example of Bayesian inference with Gaussian processes which is exactly analogous to the example in the previous section. Instead of updating our belief about Obama’s height based on photos we’ll update our belief about an unknown function given some samples from that function.
Our prior belief about the the unknown function is visualized below. On the right is the mean and standard deviation of our Gaussian process — we don’t have any knowledge about the function so the best guess for our mean is in the middle of the real numbers i.e. 0.
On the left each line is a sample from the distribution of functions and our lack of knowledge is reflected in the wide range of possible functions and diverse function shapes on display. Sampling from a Gaussian process is like rolling a dice but each time you get a different function, and there are an infinite number of possible functions that could result.
Instead of observing some photos of Obama we will instead observe some outputs of the unknown function at various points. For Gaussian processes our evidence is the training data.
Now that we’ve seen some evidence let’s use Bayes’ rule to update our belief about the function to get the posterior Gaussian process AKA our updated belief about the function we’re trying to fit.
Similarly to the narrowed distribution of possible heights of Obama what you can see is a narrower distribution of functions. The updated Gaussian process is constrained to the possible functions that fit our training data —the mean of our function intercepts all training points and so does every sampled function. We can also see that the standard deviation is higher away from our training data which reflects our lack of knowledge about these areas.
Advantages and disadvantages of GPs
Gaussian processes know what they don’t know.
This sounds simple but many, if not most ML methods don’t share this. A key benefit is that the uncertainty of a fitted GP increases away from the training data — this is a direct consequence of GPs roots in probability and Bayesian inference.
Above we can see the classification functions learned by different methods on a simple task of separating blue and red dots. Note that two commonly used and powerful methods maintain high certainty of their predictions far from the training data — this could be linked to the phenomenon of adversarial examples where powerful classifiers give very wrong predictions for strange reasons. This characteristic of Gaussian processes is particularly relevant for identity verification and security critical uses as you want to be completely certain your models output is for a good reason.
Gaussian processes let you incorporate expert knowledge.
When you’re using a GP to model your problem you can shape your prior belief via the choice of kernel (a full explanation of these is beyond the scope of this post).
This lets you shape your fitted function in many different ways. The observant among you may have been wondering how Gaussian processes are ever supposed to generalize beyond their training data given the uncertainty property discussed above. Well the answer is that the generalization properties of GPs rest almost entirely within the choice of kernel.
Gaussian processes are computationally expensive.
Gaussian processes are a non-parametric method. Parametric approaches distill knowledge about the training data into a set of numbers. For linear regression this is just two numbers, the slope and the intercept, whereas other approaches like neural networks may have 10s of millions. This means that after they are trained the cost of making predictions is dependent only on the number of parameters.
However as Gaussian processes are non-parametric (although kernel hyperparameters blur the picture) they need to take into account the whole training data each time they make a prediction. This means not only that the training data has to be kept at inference time but also means that the computational cost of predictions scales (cubically!) with the number of training samples.
The future of Gaussian processes
The world of Gaussian processes will remain exciting for the foreseeable as research is being done to bring their probabilistic benefits to problems currently dominated by deep learning — sparse and minibatch Gaussian processes increase their scalability to large datasets while deep and convolutional Gaussian processes put high-dimensional and image data within reach. Watch this space.