Software 2.0 — Playing with Neural Networks (Part 1)
The Internet has revolutionised how we interact with each other, how we build businesses, and even how we lead our daily lives. It has allowed humans to move from doing manual labour to writing software that gets things done. Today, we are about to witness a whole new revolution, where we do not program the machine with specific instructions, but rather, we feed it with data, so that the machine itself will write the code.
This article is the third in my series titled “Machine Learning 2019.” If you are new to this field, or would like to refresh your knowledge of basic regression and classification, then I advise you to go through my previous articles before diving into neural networks.
In this article we are going to discuss neural networks (from scratch), a concept which has taken the world by storm. I will assume that the reader is already familiar with the following concepts:
- Cost function (MSE and Cross Entropy)
- Gradient Descent
- Logistic regression
- Activation Function
- Binary Classification
Particularly, this article will try to address the following questions:
- What are neural networks?
- Cost function used in neural networks
- The back propagation algorithm to compute the gradients
- Different types of activation functions available
- Implementing Neural Network from scratch using Numpy
- Analysing the mislabelled examples (Bonus! :D)
As a running example, we will use the MNIST (Modified National Institute of Standards and Technology) dataset to perform multi-class classification using neural networks.
Why Neural Networks?
The idea behind neural networks is simple: we try to extend the intuition behind logistic regression. Consider, for a moment, that you are solving the problem of house price prediction. What are the features you can come up with?
- Length of the house
- Width of the house
- Area of the house (which is even better than the first two features)
- Number of bed rooms
- Average wealth of people living in the neighbourhood
- Locality (as latitude and longitude coordinates)
- Year of renovation
These are all probably good features. However, there are a few subtle features which will definitely impact the price of a house but which we do not know how to represent numerically, such as:
- Number of people who live in the house
- Facilities available near to the house
- Cleanliness of the house, etc.
How do we incorporate these features into our model? Can we say:
Facilities available near to the house = (Average wealth of people living in the neighbourhood + Locality) * Year of renovation?
Probably yes and probably no. So the problem here is that, even though we know, perhaps abstractly, that a few features impact the price, we are unable to represent them appropriately. The way around this is to let the model itself figure out the new features.
Notations used in Neural Networks
In the above diagram, we can see that a neural network is simply an extension of logistic regression. Instead of making the output a linear combination of input features (passed through an activation function), we introduce a new layer, called a hidden layer, which holds the activations computed from the input features. Notice that in the above figure we have two hidden layers with four neurons each. The way the output is calculated in the above network is as follows:
- A linear combination of the input features is passed through the activation function; the result is the set of activations of the first hidden layer.
- These activations are the neurons of hidden-layer-1.
- The same operation is repeated, i.e., a linear combination of the neurons in hidden-layer-1 is computed and passed through an activation function to get the hidden-layer-2 neurons.
- A linear combination of the neurons in hidden-layer-2 is computed once again and passed through an activation function to get the single output neuron.
More formally, if you were to look at the first neuron in the first hidden layer, the value which it will have is:
a10 = g(w11x1 + w12x2 + w13x3 + b1); (g(x) being the activation of x)
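where ‘aij’ is the activation value at the jth node of layer ‘i’ (nodes are counted from zero here), and ‘Wi’ is the weight matrix of layer ‘i’, whose size is (number of neurons in layer i, number of neurons in layer i-1).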
Example of W1 in the above figure is:
[w11 w12 w13
w21 w22 w23
w31 w32 w33
w41 w42 w43]
Furthermore, for the sake of clarity and completeness, the other activations in the hidden-layer-1 can be written as follows:
a11 = g(w21x1 + w22x2 + w23x3 + b2)
a12 = g(w31x1 + w32x2 + w33x3 + b3)
a13 = g(w41x1 + w42x2 + w43x3 + b4)
Note: If you are still having trouble understanding the precise notation, i.e., whether it uses zero-based indexing or whether we should use ‘W’ or ‘w’, then do not worry. There are multiple notations in use and you will eventually get a clear picture once you see the code (later in this article).
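To make the notation concrete, here is a minimal sketch of hidden-layer-1 for a single example. The numbers in W1, b and x are made up for illustration, and I am assuming sigmoid as the activation g; the point is only that the matrix form W1 @ x + b produces the same value as the hand-written formula for a10.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation g(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values: 3 input features, 4 neurons in hidden-layer-1.
x = np.array([0.5, -1.0, 2.0])            # x1, x2, x3
W1 = np.array([[0.1, 0.2, 0.3],           # w11 w12 w13
               [0.4, 0.5, 0.6],           # w21 w22 w23
               [0.7, 0.8, 0.9],           # w31 w32 w33
               [1.0, 1.1, 1.2]])          # w41 w42 w43
b = np.array([0.01, 0.02, 0.03, 0.04])    # b1 .. b4

# All four activations at once: a1 = g(W1 @ x + b)
a1 = sigmoid(W1 @ x + b)

# The first neuron written out by hand, matching the formula for a10.
a10 = sigmoid(W1[0, 0] * x[0] + W1[0, 1] * x[1] + W1[0, 2] * x[2] + b[0])

print(a1.shape)                  # (4,)
print(np.isclose(a1[0], a10))    # True
```

Each row of W1 holds the weights of one hidden neuron, which is why the matrix has one row per neuron in the layer and one column per input feature.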
Cost Function of Neural Networks
The cost of a neural network is calculated by a process called “forward propagation”. Until now, we have seen regression and classification techniques where coming up with the cost function was pretty straightforward, such as:
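For reference, the two costs listed as prerequisites at the start of this article are commonly written as below. This is the standard textbook form, not a reproduction of the original figure; the symbols are my assumptions: m is the number of examples, y is the true label and ŷ is the prediction. Some texts scale the MSE by 1/(2m) instead of 1/m.

```latex
J_{\text{MSE}} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right)^2
\qquad
J_{\text{CE}} = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]
```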
However, in the case of neural networks, we have several layers sandwiched between the input and the output layer. If you want to know the output of the above neural network, you have to compute the values of all the intermediate (hidden) neurons. Hence, we need to forward propagate the computations from the input layer to the output layer to evaluate how the network is performing. This is why the process is called “forward propagation”. The way to do this in Python is as follows:
As is evident from the above code snippet, we loop over all the layers, and for each layer, we compute the activations of all its neurons. Note that the code is vectorised for efficient processing: each layer is computed with a single matrix operation instead of a loop over neurons.
Note: Ignore the ‘caches’ variable for now; we will come back to it later.
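A minimal sketch of forward propagation is given below. It is an illustration under my own assumptions rather than the exact code from the repository: the helper names (init_params, forward_propagation) are hypothetical, each W is stored with shape (neurons in this layer, neurons in previous layer) so no transpose is needed, and the input X holds one example per column.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def init_params(layer_sizes, seed=0):
    """Small random weights and zero biases (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    params = []
    for n_prev, n_curr in zip(layer_sizes[:-1], layer_sizes[1:]):
        W = rng.standard_normal((n_curr, n_prev)) * 0.01
        b = np.zeros((n_curr, 1))
        params.append((W, b))
    return params

def forward_propagation(X, params, activations):
    """X has shape (n_features, m). Returns output activations and caches."""
    A = X
    caches = []
    for (W, b), g in zip(params, activations):
        A_prev = A
        Z = W @ A_prev + b             # linear step for all neurons and examples
        A = g(Z)                       # element-wise activation
        caches.append((A_prev, W, Z))  # kept for the backward pass
    return A, caches

X = np.random.default_rng(1).random((784, 5))   # 5 fake "images"
params = init_params([784, 10, 10])
Y_hat, caches = forward_propagation(X, params, [relu, sigmoid])
print(Y_hat.shape)   # (10, 5)
```

The sketch uses the same layer sizes and activations as the network evaluated later in this article (784 inputs, 10 hidden ReLU neurons, 10 sigmoid outputs), but the random inputs are placeholders, not MNIST data.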
Back propagation Algorithm
Now that we have evaluated how the model performs, we need to think about how to compute the gradients of the parameters present in the model so that gradient descent will be able to learn something useful.
Note: Again, if you are not familiar with gradient descent, or you do not know what is meant by “gradient descent will learn” then please refer to my previous articles where I have intuitively and mathematically described how learning happens.
Unlike logistic regression, we will not be able to compute the derivatives in a single step, since the weights (or parameters) of the neural network are distributed across layers. The way we go about calculating the derivatives is similar to the way we performed forward propagation. The only difference is that (as the name implies) we do it backwards. The following diagram illustrates how we go about performing back propagation:
As you can see, while performing the backward pass, we need to compute the following quantities (among others):
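- Z = W.T @ A_prev + b (where W.T is the transpose of the matrix W, A_prev is the previous layer’s activations, and b is the bias). The symbol ‘@’ represents the matrix multiplication operator in Numpy. The transpose applies when W is stored with shape (neurons in previous layer, neurons in this layer); with the layout of the W1 example earlier, where each row belongs to one neuron of the current layer, the product is simply W @ A_prev.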
- A = g(Z) (where g(x) is the activation of ‘x’)
Recall the ‘caches’ variable which I asked you to ignore during forward propagation. As an optimisation, instead of computing the intermediate values A and Z again (see the two points above), we reuse the results already computed during forward propagation, which are stored in the ‘caches’ variable. Therefore, the reason for storing and returning ‘caches’ from forward propagation is to aid the backward propagation algorithm: it avoids redundant computations at the expense of memory. Programmatically we can implement back-propagation as follows:
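Here is a minimal sketch of the backward pass, again under my own assumptions and not necessarily identical to the repository code: hidden layers use ReLU, the output layer uses sigmoid with a cross-entropy cost, and W is stored as (neurons in this layer, neurons in previous layer). The function names are hypothetical. The last few lines compare one analytic gradient against a numerical estimate, which is a handy sanity check when writing back propagation by hand.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """ReLU hidden layers, sigmoid output. Caches (A_prev, W, Z) per layer."""
    A, caches = X, []
    for i, (W, b) in enumerate(params):
        Z = W @ A + b
        caches.append((A, W, Z))
        A = sigmoid(Z) if i == len(params) - 1 else np.maximum(0.0, Z)
    return A, caches

def cost(Y_hat, Y):
    """Cross-entropy summed over output units, averaged over m examples."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat)) / m

def backward(Y_hat, Y, caches):
    """Returns a list of (dW, db), one pair per layer, in forward order."""
    m = Y.shape[1]
    grads = []
    dZ = Y_hat - Y                      # sigmoid output + cross-entropy cost
    for layer in reversed(range(len(caches))):
        A_prev, W, _ = caches[layer]
        dW = dZ @ A_prev.T / m
        db = np.sum(dZ, axis=1, keepdims=True) / m
        grads.append((dW, db))
        if layer > 0:
            Z_prev = caches[layer - 1][2]
            dZ = (W.T @ dZ) * (Z_prev > 0)   # chain rule through ReLU
    return grads[::-1]

# Tiny made-up problem: 4 features, 3 hidden neurons, 2 outputs, 6 examples.
rng = np.random.default_rng(0)
X = rng.random((4, 6))
Y = (rng.random((2, 6)) > 0.5).astype(float)
params = [(rng.standard_normal((3, 4)), np.zeros((3, 1))),
          (rng.standard_normal((2, 3)), np.zeros((2, 1)))]
Y_hat, caches = forward(X, params)
grads = backward(Y_hat, Y, caches)

# Numerical estimate of dJ/dW1[0, 0] via central differences.
eps = 1e-6
W1, b1 = params[0]
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
J_plus = cost(forward(X, [(W1_plus, b1), params[1]])[0], Y)
J_minus = cost(forward(X, [(W1_minus, b1), params[1]])[0], Y)
numeric = (J_plus - J_minus) / (2 * eps)
print(numeric, grads[0][0][0, 0])   # the two values should agree closely
```

Notice that backward() never recomputes Z or A: everything it needs comes out of the caches built during the forward pass.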
Choice of Activation Functions
In the Introduction to Classification article I talked about the sigmoid function (i.e. the logistic function) and the way it transforms the linear combination of inputs to provide non-linearity to the model. It turns out there are several other activation functions available in practice:
- Hyperbolic tangent function. As you can see from the below figure, the hyperbolic tangent function approaches +1 as x → ∞ and approaches -1 as x → -∞.
Hyperbolic tangent usually performs better than the sigmoid function as a hidden-layer activation because its outputs are centred around 0, whereas the outputs of a sigmoid are centred around 0.5. The reasoning is similar to input normalisation: just as we centre the input data around its mean so that gradient descent converges easily, gradient descent generally performs well when the activations are also zero-centred. The main place where sigmoid remains helpful is the output layer of a binary classification problem. In this situation, the output label is either 0 or 1. If we were to use the tanh() function for the output layer, the prediction would lie between -1 and 1 instead of between 0 and 1. Therefore, in this scenario, it is better to use the sigmoid function in the last layer alone.
- ReLU function (Rectified Linear Unit)
Arguably the most popular activation function to date, and widely used in deep neural networks. The advantage of the ReLU function is that its derivative is exactly 1 for every positive ‘z’ (and 0 for negative ‘z’), whereas the derivatives of sigmoid and tanh() approach zero for large positive or negative ‘z’, which is the vanishing gradient problem. Consequently, gradient descent typically converges faster, since learning does not slow down the way it can with tanh() or sigmoid.
Note: You might wonder how ReLU can be used as an activation, since it looks more like a linear function, whereas tanh() and sigmoid look non-linear. Guess what? You are probably mistaken! ReLU is capable of constructing highly non-linear boundaries. I will be writing about ReLU soon, but you might want to read one of my answers on Quora on the same topic.
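The points above about saturation and zero-centring can be checked numerically. Below is a small sketch with the three activations and their derivatives; the function names and sample inputs are my own choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Derivative is 1 for positive inputs and 0 otherwise.
    return (z > 0).astype(float)

z = np.array([-10.0, -1.0, 0.5, 10.0])
print(sigmoid_grad(z))   # tiny at both ends: the sigmoid saturates
print(tanh_grad(z))      # tiny at both ends: tanh saturates too
print(relu_grad(z))      # [0. 0. 1. 1.]

# Zero-centring: over inputs symmetric around 0, tanh averages to 0
# while sigmoid averages to 0.5.
zs = np.linspace(-3.0, 3.0, 7)
mean_tanh = np.tanh(zs).mean()
mean_sigmoid = sigmoid(zs).mean()
print(mean_tanh, mean_sigmoid)
```

For z = 10 both the sigmoid and tanh derivatives are practically zero, while the ReLU derivative is still 1, which is the behaviour described above.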
With all that theory in mind, let us see how to implement a neural network which classifies the digits present in MNIST dataset.
Note: You can refer to the entire codebase to implement neural networks from scratch using Numpy, on my Github repository here.
Walkthrough of the above code:
- The MNIST dataset (present in the form of a ‘csv’ file) is read using the Pandas library and split into train, dev and test sets. Note: For those who are not familiar with reading an image (since the MNIST dataset contains images of hand-written digits) in Python, please refer to this link for a tutorial.
- The function model() contains the crux of the code, which is (i) forward propagation, (ii) backward propagation, and (iii) gradient descent to update the parameters.
- As we can see, a neural network is characterised by a few variables. “Number of layers”, as is apparent from the name, denotes the number of layers present in the network. The more layers there are, the deeper the network becomes.
- “Activation functions” is a list which contains indices into another list, which in turn contains the activation functions. It is coded this way to keep it customisable: to switch to different activation functions, simply change the indices in the array.
- “Alpha” denotes the learning rate and “iterations” denotes the number of iterations for which the network is trained.
- After the model trains with the specified network parameters, we print out the results and perform a few predictions on the dev and test sets.
- Finally, the model is persisted on disk. Note: It is highly recommended to store the model after training, because as the network becomes deeper and deeper, the time taken to train it and analyse the results becomes prohibitively large.
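A few of the steps in this walkthrough can be sketched in isolation. The snippet below is an illustration under my own assumptions, not the repository code: the names ACTIVATIONS, config and update_parameters are hypothetical, the values of alpha and iterations are placeholders, and I am assuming pickle for persistence since any serialisation format would do. It shows the index-based choice of activations, a single gradient descent update, and saving and restoring the model.

```python
import os
import pickle
import tempfile

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical registry: the network config refers to activations by index.
ACTIVATIONS = [sigmoid, np.tanh, relu]
config = {
    "layers": [784, 10, 10],
    "activation_idx": [2, 0],   # relu for the hidden layer, sigmoid for output
    "alpha": 0.1,               # learning rate (illustrative value)
    "iterations": 1000,         # illustrative value
}
layer_activations = [ACTIVATIONS[i] for i in config["activation_idx"]]

def update_parameters(params, grads, alpha):
    """One gradient descent step: parameter <- parameter - alpha * gradient."""
    return [(W - alpha * dW, b - alpha * db)
            for (W, b), (dW, db) in zip(params, grads)]

# Toy parameters and gradients, just to exercise the update and the save/load.
params = [(np.ones((10, 784)), np.zeros((10, 1))),
          (np.ones((10, 10)), np.zeros((10, 1)))]
grads = [(np.full_like(W, 0.5), np.full_like(b, 0.5)) for W, b in params]
params = update_parameters(params, grads, config["alpha"])

# Persist the trained model so it does not have to be retrained later.
path = os.path.join(tempfile.gettempdir(), "mnist_model.pkl")
with open(path, "wb") as f:
    pickle.dump({"config": config, "params": params}, f)
with open(path, "rb") as f:
    restored = pickle.load(f)
print(np.allclose(restored["params"][0][0], 0.95))   # True
```

Storing indices rather than the functions themselves also keeps the saved configuration plain data, which makes it easy to inspect and reload.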
The test-set accuracy which the model achieves for the network parameters:
Layers: [784, 10, 10]
Activation functions: relu → sigmoid
Test set accuracy: 92.47%!
Analysing mislabelled examples
Preparing the dataset, coming up with features, computing the cost, calculating gradients, learning parameters through gradients, and finally, predicting results on the test set: all this is just half of the story. The remaining part is when we actually look at which examples were mislabelled. This part of the process is also sometimes referred to as error analysis. Too many professionals feel that error analysis is a manual process and is not worth tackling. However, if you are able to accurately judge why your model is failing, then you will be able to tune it so that its accuracy increases! See the below picture:
The model predicted it as ‘7’, whereas the actual value is ‘2’. What can we learn from this? A few things:
- The model is unable to properly identify the lower part of the ‘2’, i.e., if the lower part of the ‘2’, which is the horizontal line, is small enough, then our model fails. One possible solution is to train the model for longer. But before taking this route we need to investigate whether the model is overfitting the training set. This investigation can be carried out by plotting the cost function curve for the train and dev sets as follows:
Note: The above picture is just an example of how the plot might look. It does not correspond to our example of MNIST dataset.
If the loss on the dev set is still decreasing and stays close to the loss on the train set, then we are in the safe zone, i.e., we can continue training for longer without running much risk of overfitting. However, if the loss on the dev set starts increasing while the loss on the train set keeps decreasing, then the model is signalling that it is beginning to overfit the training set.
- Deepen the network. By increasing the number of layers we allow the network to learn more and more complex features, at the risk of overfitting. Recall the intuition for neural networks given at the beginning of this article: each neuron in a hidden layer can be treated as a learned feature which helps the model better classify the input data. In our example, the hidden layers might represent edges and circles in the input image. Therefore, if we increase the number of layers, or for that matter the number of neurons per layer, we are indirectly allowing the network to learn more complex features.
- Last resort: Bring in more data, especially examples which help separate 2 from 7. This works tremendously well provided that our model is not suffering from high bias. Note: High bias is a case where our model is unable to learn anything useful from the dataset even though we have a huge amount of data. This scenario occurs when we have a network which is neither deep nor wide, which is rare for neural networks (it is more commonly observed in linear models, which fail to capture complex features). The reason why bringing in more data works is that neural networks are considered to be data-hungry models. That is just a fancy way of saying “throw in more data” and you get “more accuracy.” This is predominantly because of the number of hidden layers and the number of hidden neurons: neural networks are able to search for the optimal result in a very high-dimensional space. This is the reason why neural networks (today) are omnipresent and omnipotent!
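Error analysis does not have to be fully manual. As a minimal sketch (the function name and the labels below are made up for illustration, they are not results from the MNIST model), we can list the indices of the mislabelled examples and count which (actual, predicted) pairs occur most often. That tells us which images to look at first and which digit pairs would benefit most from extra data.

```python
from collections import Counter

import numpy as np

def mislabelled(y_true, y_pred):
    """Indices of wrong predictions and a count of each (actual, predicted) pair."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    wrong = np.flatnonzero(y_true != y_pred)
    pairs = Counter(zip(y_true[wrong].tolist(), y_pred[wrong].tolist()))
    return wrong, pairs

# Made-up labels for illustration only.
y_true = [2, 7, 2, 1, 2, 9, 4]
y_pred = [7, 7, 7, 1, 2, 4, 4]
wrong, pairs = mislabelled(y_true, y_pred)
print(wrong)                  # [0 2 5]
print(pairs.most_common(1))   # [((2, 7), 2)]
```

In this toy example the most frequent confusion is an actual ‘2’ predicted as ‘7’, the same kind of mistake discussed above.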
In this article we have covered what neural networks are, which activation functions are available, and how to implement a neural network from scratch using Numpy to classify the MNIST dataset. Implementing a basic neural network from scratch is generally considered to be a good idea for those who are venturing into the field of machine learning, because it gives them an opportunity to understand how things work under the hood. In the subsequent articles, my emphasis will be more on the usage of high-level frameworks and APIs such as TensorFlow and Keras.
I would like to conclude by quoting one of the famous tweets by “Andrej Karpathy” on the power of neural networks (specifically gradient descent.)
This article was originally published in Towards Data Science on Medium.