Ever since non-linear functions that work recursively (i.e. Artificial Neural Networks) were introduced to the world of Machine Learning, their applications have boomed noticeably. In this context, proper training of a Neural Network is the most important aspect of making a reliable model. This training is usually associated with the term “Back-propagation”, which remains vague to most people getting into Deep Learning. Heck, most people in the industry don’t even know how it works; they just know it does!

**Back-propagation** is the essence of neural net training. It is the practice of fine-tuning the weights of a neural net based on the error rate (i.e. loss) obtained in the previous epoch (i.e. iteration). Proper tuning of the weights ensures lower error rates, making the model reliable by increasing its generalization.

So how does this process work, with the vast simultaneous mini-executions involved? Let’s learn by example!

In order to keep this example as focused as possible, we’re just going to touch on related concepts (e.g. loss functions, optimization functions, etc.) without explaining them, as these topics deserve their own series.

### First off, let’s set the model components

Imagine that we have a deep neural network that we need to train. The purpose of training is to build a model that performs the **XOR** (exclusive OR) functionality with two inputs and three hidden units, such that the **training set** (truth table) looks something like the following:

| X1 | X2 | Y |
|----|----|---|
| 0  | 0  | 0 |
| 0  | 1  | 1 |
| 1  | 0  | 1 |
| 1  | 1  | 0 |
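To make the truth table concrete, here is a minimal sketch of the same training set in Python. The name `TRAINING_SET` and the tuple layout are assumptions made for illustration, not something the walkthrough prescribes.

```python
# The XOR truth table as (x1, x2, y) samples.
# TRAINING_SET is an illustrative name, not part of the original example.
TRAINING_SET = [
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0),
]

for x1, x2, y in TRAINING_SET:
    # Python's ^ operator is bitwise XOR, so it reproduces the Y column.
    assert (x1 ^ x2) == y
```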

Moreover, we need an **activation function** that determines the activation value at every node in the neural net. For simplicity, let’s choose an identity activation function:

f(a) = a

We also need a **hypothesis function** that determines what the input to the activation function is. This function is going to be the typical, ever-famous:

h(X) = W0.X0 + W1.X1 + W2.X2

or

h(X) = sum(Wi.Xi) over all pairs (Wi, Xi)
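Both functions are small enough to sketch directly. The snippet below is one possible way to write them, assuming the weights and inputs are passed as plain Python lists of equal length; the function names simply mirror the f and h used above.

```python
def f(a):
    """Identity activation: the activation value equals its input."""
    return a


def h(weights, inputs):
    """Hypothesis function: the weighted sum W0.X0 + W1.X1 + ..."""
    return sum(w * x for w, x in zip(weights, inputs))


# All weights set to 1, bias input X0 = 1, features X1 = X2 = 0.
a = h([1, 1, 1], [1, 0, 0])
print(f(a))  # prints 1
```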

Let’s also choose the **loss function** to be the usual cost function of logistic regression, which looks a bit complicated but is actually fairly simple:

J(W) = -(1/m) . sum over all m samples of [ y . log(h(x)) + (1 - y) . log(1 - h(x)) ]

Furthermore, we’re going to use the Batch Gradient Descent **optimization function** to determine in what direction we should adjust the weights to get a lower loss than the one we currently have. Finally, the **learning rate** will be 0.1 and all the weights will be initialized to 1.

### Our Neural Network

Let’s finally draw a diagram of our long-awaited neural net. It should look something like this:

The leftmost layer is the input layer, which takes X0 as the bias term of value 1, and X1 and X2 as input features. The layer in the middle is the first hidden layer, which also takes a bias term Z0 of value 1. Finally, the output layer has only one output unit D0 whose activation value is the actual output of the model (i.e. h(x)).

### Now we forward-propagate

It is now the time to *feed-forward the information from one layer to the next*. This goes through two steps that happen at every node/unit in the network:

1- Getting the weighted sum of inputs of a particular unit using the h(x) function we defined earlier.

2- Plugging the value we get from step 1 into the activation function we have (f(a)=a in this example) and using the activation value we get (i.e. the output of the activation function) as the input feature for the connected nodes in the next layer.

Note that units X0, X1, X2 and Z0 have no incoming connections, so no other units provide inputs to them. Therefore, the steps mentioned above do not occur in those nodes. However, for the rest of the nodes/units, this is how it all happens throughout the neural net for the first input sample in the training set:

Unit Z1:

h(x) = W0.X0 + W1.X1 + W2.X2

= 1 . 1 + 1 . 0 + 1 . 0

= 1 = a

z = f(a) = a => z = f(1) = 1

The same goes for the rest of the units:

Unit Z2:

h(x) = W0.X0 + W1.X1 + W2.X2

= 1 . 1 + 1 . 0 + 1 . 0

= 1 = a

z = f(a) = a => z = f(1) = 1

Unit Z3:

h(x) = W0.X0 + W1.X1 + W2.X2

= 1 . 1 + 1 . 0 + 1 . 0

= 1 = a

z = f(a) = a => z = f(1) = 1

Unit D0:

h(x) = W0.Z0 + W1.Z1 + W2.Z2 + W3.Z3

= 1 . 1 + 1 . 1 + 1 . 1 + 1 . 1

= 4 = a

z = f(a) = a => z = f(4) = 4
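The whole forward pass for this sample can be sketched in a few lines of Python. This is only an illustration under some assumptions: `W` holds one row of three weights per hidden unit (Z1 to Z3), `V` holds the four weights feeding D0, and `forward` is a hypothetical helper name.

```python
def f(a):
    """Identity activation."""
    return a


def h(weights, inputs):
    """Weighted sum of the inputs."""
    return sum(w * x for w, x in zip(weights, inputs))


def forward(x1, x2, W, V):
    """Feed one sample through the 2-3-1 network and return (z, d0)."""
    x = [1, x1, x2]                         # X0 is the bias unit
    z = [1] + [f(h(row, x)) for row in W]   # Z0 is the bias unit, then Z1..Z3
    d0 = f(h(V, z))                         # activation of the output unit D0
    return z, d0


W = [[1, 1, 1] for _ in range(3)]  # all input-to-hidden weights start at 1
V = [1, 1, 1, 1]                   # all hidden-to-output weights start at 1

z, d0 = forward(0, 0, W, V)
print(z, d0)  # prints [1, 1, 1, 1] 4
```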

As we mentioned earlier, the activation value (z) of the final unit (D0) is that of the whole model. Therefore, our model predicted an output of 4 for the set of inputs {0, 0}. Calculating the loss/cost of the current iteration would follow:

Loss = actual_y - predicted_y

= 0 - 4

= -4

The actual_y value comes from the training set, while the predicted_y value is what our model yielded. So the cost at this iteration is equal to -4.
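As a quick check, the same arithmetic in Python looks like this. Note that the sketch follows the simple difference used in this walkthrough rather than the logistic cost mentioned earlier; the function name `loss` is an assumption.

```python
def loss(actual_y, predicted_y):
    """Signed error used in this walkthrough: actual minus predicted."""
    return actual_y - predicted_y


# The sample {0, 0} has a target of 0, while the untrained model output 4.
print(loss(0, 4))  # prints -4
```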

### So where is Back-propagation?

According to our example, we now have a model that does not give accurate predictions (it gave us the value 4 instead of 0), and that is attributed to the fact that its weights have not been tuned yet (they are all equal to 1). We also have the loss, which is equal to -4. **Back-propagation** is all about feeding this loss backwards in such a way that we can fine-tune the weights based on it. The optimization function (Gradient Descent in our example) will help us find the weights that will, hopefully, yield a smaller loss in the next iteration. So let’s get to it!

If feeding forward happened using the following functions:

f(a) = a

Then feeding backward will happen through the partial derivatives of the functions we defined. There is no need to walk through how these derivatives are obtained. All we need to know is that they are:

f'(a) = 1

J'(w) = Z . delta

where Z is just the z value we obtained from the activation function calculations in the feed-forward step, while delta is the loss of the unit on the other end of the weighted link.

**I know it’s a lot of information to absorb in one sitting, but I suggest you take your time and really understand what is going on at every step before going further.**

### Calculating the deltas

Now we need to find the loss at every unit/node in the neural net. Why is that? Well, think about it this way: every loss the deep learning model arrives at is actually the mess caused by all the nodes, accumulated into one number. Therefore, we need to find out which node is responsible for most of the loss in every layer, so that we can penalize it, in a sense, by giving it a smaller weight value and thus lessening the total loss of the model.

Calculating the delta of every unit can be problematic. However, thanks to Mr. Andrew Ng, we have a shortcut formula for the whole thing:

delta_0 = w . delta_1 . f'(z)

where delta_0, w and f’(z) belong to the same unit, while delta_1 is the loss of the unit on the other side of the weighted link. For example, in order to get the loss of a node (e.g. Z0), we multiply the value of its corresponding f’(z) by the loss of the node it is connected to in the next layer (delta_1) and by the weight of the link connecting both nodes.

This is exactly how back-propagation works. We do the delta calculation step at every unit, back-propagating the loss into the neural net, and finding out what loss every node/unit is responsible for.

Let’s calculate those deltas and get it over with!

delta_D0 = total_loss = -4

delta_Z0 = W . delta_D0 . f'(Z0) = 1 . (-4) . 1 = -4

delta_Z1 = W . delta_D0 . f'(Z1) = 1 . (-4) . 1 = -4

delta_Z2 = W . delta_D0 . f'(Z2) = 1 . (-4) . 1 = -4

delta_Z3 = W . delta_D0 . f'(Z3) = 1 . (-4) . 1 = -4

There are a few things to notice here:

- The loss of the final unit (i.e. D0) is equal to the loss of the whole model. This is because it is the output unit, and its loss is the accumulated loss of all the units together, like we said earlier.
- The function f’(z) will always give the value 1, no matter what the input (i.e. z) is equal to. This is because the partial derivative, as we said earlier, follows: f’(a) = 1
- The input nodes/units (X0, X1 and X2) do not have delta values, as there is nothing those nodes control in the neural net. They are only there as a link between the data set and the neural net. This is also why the input layer is usually not included in the layer count.
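The delta calculations above can be sketched as follows. This is a minimal illustration, assuming the four hidden-to-output weights live in a list `V` and the hidden units' values in a list `z`; `hidden_deltas` is a hypothetical helper, not a library function.

```python
def f_prime(a):
    """Derivative of the identity activation: always 1."""
    return 1


def hidden_deltas(V, delta_d0, z):
    """Apply delta_0 = w . delta_1 . f'(z) to every hidden unit Z0..Z3."""
    return [w * delta_d0 * f_prime(z_i) for w, z_i in zip(V, z)]


V = [1, 1, 1, 1]   # weights on the links Z0..Z3 -> D0
z = [1, 1, 1, 1]   # hidden values from the forward pass
delta_d0 = 0 - 4   # loss at the output unit: actual_y - predicted_y

deltas = hidden_deltas(V, delta_d0, z)
print(deltas)  # prints [-4, -4, -4, -4]
```

Because f’ is constant here, each hidden delta depends only on the link weight and on delta_D0; with a different activation function, `f_prime` would be the only piece that needs to change.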

### Updating the weights

All that is left now is to update all the weights we have in the neural net. This follows the Batch Gradient Descent formula:

W := W - alpha . J'(W)

Where W is the weight at hand, alpha is the learning rate (i.e. 0.1 in our example) and J’(W) is the partial derivative of the cost function J(W) with respect to W. Again, there’s no need for us to get into the math. Therefore, let’s use Mr. Andrew Ng’s partial derivative of the function:

J'(W) = Z . delta

Where Z is the Z value obtained through forward-propagation, and delta is the loss at the unit on the other end of the weighted link:

Now we use the Batch Gradient Descent weight update on all the weights, utilizing the partial derivative values that we obtain at every step. It is worth emphasizing that the Z values of the input nodes (X0, X1, and X2) are equal to 1, 0 and 0, respectively. The 1 is the value of the bias unit, while the zeroes are the feature input values coming from the data set. One last note is that there is no particular order to updating the weights. You can update them in any order you want, as long as you don’t make the mistake of updating any weight twice in the same iteration.

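In order to calculate the new weights, let’s give the links in our neural net names: Wij is the weight of the link from input unit Xj to hidden unit Zi (e.g. W10 links X0 to Z1), and V0j is the weight of the link from hidden unit Zj to the output unit D0.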

New weight calculations will happen as follows:

W10 := W10 - alpha . Z_X0 . delta_Z1

= 1 - 0.1 . 1 . (-4) = 1.4

W20 := W20 - alpha . Z_X0 . delta_Z2

= 1 - 0.1 . 1 . (-4) = 1.4

. . . . .

. . . . .

. . . . .

W30 := 1.4

W11 := W11 - alpha . Z_X1 . delta_Z1

= 1 - 0.1 . 0 . (-4) = 1

W21 := 1

W31 := 1

W12 := 1

W22 := 1

W32 := 1

V00 := V00 - alpha . Z_Z0 . delta_D0

= 1 - 0.1 . 1 . (-4) = 1.4

V01 := 1.4

V02 := 1.4

V03 := 1.4

It is important to note here that the model is not trained properly yet, as we only back-propagated through one sample from the training set. Doing all we did all over again for all the samples will yield a model with better accuracy as we go, trying to get closer to the minimum loss/cost at every step.

It might not make sense to you that so many of the weights still share the same value. Notice, though, that the weights on the links coming from X1 and X2 did not move at all, because those inputs are 0 for this sample. Training the model on different samples over and over again will result in nodes having different weights based on their contributions to the total loss.
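The update rule W := W - alpha . Z . delta can be applied to every link with a short sketch like the one below. The layout is an assumption made for illustration: `W[i][j]` is the weight from input Xj to hidden unit Z(i+1), `V[j]` is the weight from Zj to D0, and `update` is a hypothetical helper. Under this formula, a link whose source value is 0 gets a zero gradient, so its weight does not change for this sample.

```python
ALPHA = 0.1  # learning rate from the example


def update(weight, z_source, delta_target, alpha=ALPHA):
    """One gradient descent step: W := W - alpha . Z . delta."""
    return weight - alpha * z_source * delta_target


x = [1, 0, 0]               # X0 (bias), X1, X2 for the sample {0, 0}
z = [1, 1, 1, 1]            # Z0 (bias), Z1, Z2, Z3 from the forward pass
delta_z = [-4, -4, -4, -4]  # deltas of Z0..Z3
delta_d0 = -4               # delta of the output unit D0

W = [[1.0, 1.0, 1.0] for _ in range(3)]  # input-to-hidden weights
V = [1.0, 1.0, 1.0, 1.0]                 # hidden-to-output weights

# W[i] feeds hidden unit Z(i+1), hence delta_z[i + 1].
new_W = [[update(W[i][j], x[j], delta_z[i + 1]) for j in range(3)]
         for i in range(3)]
new_V = [update(V[j], z[j], delta_d0) for j in range(4)]

for row in new_W:
    print([round(w, 2) for w in row])
print([round(v, 2) for v in new_V])
```

Building `new_W` and `new_V` from the old values, instead of updating in place, keeps every weight from being touched twice in the same iteration.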

The theory behind Machine Learning can be really difficult to grasp if not tackled the right way. One example of this is Back-propagation, whose effectiveness is visible in most real-world Deep Learning applications, yet it is rarely examined. Back-propagation is just a way of propagating the total loss back into the neural network to know how much of the loss every node is responsible for, and subsequently updating the weights in such a way that minimizes the loss by giving the nodes with higher error rates lower weights and vice versa.

How Does Back-Propagation in Artificial Neural Networks Work? was originally published in Towards Data Science on Medium.