Computational graphs: Backpropagation

Friday, March 9, 2018
5 mins read

Backpropagation is an efficient method of computing gradients in directed graphs of computations, such as neural networks. This is not a learning method, but rather a nice computational trick which is often used in learning methods. It is actually a simple implementation of chain rule of derivatives.

We can visually use backpropagation using computational graphs.

Backpropagation is a local process i.e. each gate in a circuit diagram gets some input and compute two things:

  1. its output value and
  2. the local gradient of its input w.r.t its output values.

The gates can do it independently without being aware of any of the details of the full circuit.

Let’s take an example,

\[f(w,x) = \frac{1}{1+e^{-(w_0x_0 + w_1x_1 + w_2)}}\]

This expression describes a 2-dimensional neuron (with input x and weights w) that uses the sigmoid activation function. The function consists of many gates which can be represented by computational graph as:


The forward pass computes values from inputs to output (shown in green). The backward pass then performs backpropagation which starts at the end and recursively applies the chain rule to compute the gradients (shown in red) all the way to the inputs of the circuit. The gradients can be thought of as flowing backwards through the circuit.

Note that the above function can be represented as \(\sigma(x) = \frac{1}{1+e^{-x}}\), which is known as sigmoid function. The derivative of sigmoid function \(\sigma(x)\) is \(\sigma(x)(1 - \sigma(x))\). For example, the sigmoid expression receives the input 1.0 and computes the output 0.73 during the forward pass. The derivation above shows that the local gradient would simply be (1 - 0.73) * 0.73 ~= 0.2.

w = [2, -3, -3]  # assume random weights and data
x = [-1, 2]

# forward pass
dot = w[0]*x[0] + w[1]*x[1] + w[2]
f = 1.0/(1 + math.exp(-dot))  # sigmoid function

# backward pass
ddot = (1 - f) * f  # gradient on dot variable, using the sigmoid gradient derivation
dx = [w[0] * ddot, w[1] * ddot]  # backprop into x
dw = [x[0] * ddot, x[1] * ddot, 1.0 * ddot]  # backprop into w
# we're done! we have the gradients on the inputs to the circuit

We can break up our function into modules for which we can easily derive local gradients, and then use chain rule. Hence, always decompose our expression into stages such that we can differentiate every stage independently and then backprop through the variables one step at a time. The use of computaitonal graphs facilitates the gradient compuations.

This page is open source. Improve its content!

You May Also Like

comments powered by Disqus