Backpropagation is an efficient method of computing gradients in directed graphs of computations, such as neural networks. It is not a learning method in itself, but rather a computational technique that is often used inside learning methods. At its core, it is simply a repeated application of the chain rule of derivatives.
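Concretely, if \(f\) depends on \(x\) only through an intermediate quantity \(q\), the chain rule multiplies the local derivatives along the path from \(x\) to \(f\):
\[\frac{\partial f}{\partial x} = \frac{\partial f}{\partial q} \cdot \frac{\partial q}{\partial x}\]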
We can visualize backpropagation using computational graphs.
Backpropagation is a local process, i.e. each gate in a circuit diagram receives some inputs and can immediately compute two things: (1) its output value, and (2) the local gradient of its output with respect to its inputs. The gates can do this completely independently, without being aware of any of the details of the full circuit.
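As a minimal sketch of such a gate (the class and method names here are illustrative, not taken from any particular library), a multiply gate can cache its inputs during the forward pass and use them to compute its local gradients during the backward pass:

```python
class MultiplyGate:
    """A single gate in the computational graph: z = x * y."""

    def forward(self, x, y):
        # Cache the inputs; they are needed for the local gradients later.
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # Chain rule: multiply the upstream gradient dz by the local gradients.
        # For z = x * y, dz/dx = y and dz/dy = x.
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy


# Example: z = x * y with x = 3, y = -4
gate = MultiplyGate()
z = gate.forward(3.0, -4.0)   # forward pass: z = -12.0
dx, dy = gate.backward(1.0)   # backward pass with upstream gradient 1.0
print(z, dx, dy)              # -12.0, -4.0, 3.0
```

Note that the gate only ever sees its own inputs and the gradient arriving from above; it never needs to know what the rest of the circuit looks like.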
Let’s take an example:
\[f(w,x) = \frac{1}{1+e^{-(w_0x_0 + w_1x_1 + w_2)}}\]This expression describes a 2-dimensional neuron (with inputs \(x\) and weights \(w\)) that uses the sigmoid activation function. The function consists of multiple gates, which can be represented by a computational graph as follows:
The forward pass computes values from inputs to output (shown in green). The backward pass then performs backpropagation, which starts at the end and recursively applies the chain rule to compute the gradients (shown in red) all the way to the inputs of the circuit. The gradients can be thought of as flowing backwards through the circuit.
Note that the above function can be written in terms of \(\sigma(x) = \frac{1}{1+e^{-x}}\), which is known as the sigmoid function. The derivative of the sigmoid function \(\sigma(x)\) is \(\sigma(x)(1 - \sigma(x))\). For example, if the sigmoid expression receives the input 1.0 and computes the output 0.73 during the forward pass, this derivative tells us that the local gradient is simply (1 - 0.73) * 0.73 ≈ 0.2.
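Putting this together, here is a sketch of the forward and backward pass for the neuron above. The particular weights and inputs are assumed example values, chosen so that the dot product is 1.0 and the output is roughly 0.73, matching the numbers mentioned above:

```python
import math

# Assumed example values (not specified in the text above)
w = [2.0, -3.0, -3.0]
x = [-1.0, -2.0]

# Forward pass
dot = w[0] * x[0] + w[1] * x[1] + w[2]   # 1.0
f = 1.0 / (1.0 + math.exp(-dot))         # sigmoid output, ~0.73

# Backward pass, using d(sigmoid)/d(dot) = (1 - f) * f ~= 0.2
ddot = (1.0 - f) * f                         # local gradient on the dot product
dx = [w[0] * ddot, w[1] * ddot]              # gradients on the inputs
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]  # gradients on the weights
print(f, ddot, dw, dx)
```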
We can break up our function into modules for which we can easily derive local gradients, and then use the chain rule. Hence, we should always decompose an expression into stages such that we can differentiate every stage independently, and then backprop through the variables one step at a time. The use of computational graphs facilitates these gradient computations.
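For instance (a sketch, reusing the assumed example values from above), the same neuron can be decomposed into primitive stages whose local derivatives are all trivial, with backpropagation applying the chain rule through each stage in reverse order:

```python
import math

w = [2.0, -3.0, -3.0]   # assumed example weights
x = [-1.0, -2.0]        # assumed example inputs

# Forward pass, decomposed into simple stages
dot = w[0] * x[0] + w[1] * x[1] + w[2]   # stage 1: dot product
neg = -dot                               # stage 2: negate
e = math.exp(neg)                        # stage 3: exponentiate
denom = 1.0 + e                          # stage 4: add one
f = 1.0 / denom                          # stage 5: reciprocal (sigmoid output)

# Backward pass: chain rule through each stage in reverse
ddenom = -1.0 / (denom ** 2)             # d(1/denom)/d(denom)
de = 1.0 * ddenom                        # d(denom)/d(e) = 1
dneg = e * de                            # d(exp(neg))/d(neg) = exp(neg)
ddot = -1.0 * dneg                       # d(-dot)/d(dot) = -1
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]
dx = [w[0] * ddot, w[1] * ddot]
print(ddot)   # ~0.2, matching the shortcut (1 - f) * f
```

The staged version and the shortcut using the sigmoid derivative produce the same gradient; the decomposition simply makes every local derivative easy to write down.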