In neural networks, activation functions introduce non-linearity into the model. There are several to choose from, and traditionally the sigmoid has been the default choice. But you shouldn’t use it. Why? Let’s see.
It is defined as
\[\sigma(x) = \frac{1}{1 + e^{-x}}\]It returns a value between 0 and 1. The two major problems with the sigmoid activation are:
1. It saturates and kills gradients: when the output is close to 0 or 1, the local gradient is nearly zero, so almost no signal flows back through the neuron during backpropagation.
2. Its outputs are not zero-centered: the sigmoid always outputs positive values, which is inconvenient for gradient-based updates in later layers.
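To make the saturation concrete, here is a minimal NumPy sketch (the function names and sample points are just illustrative):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Local gradient: sigma(x) * (1 - sigma(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid={sigmoid(x):.4f}  grad={sigmoid_grad(x):.6f}")
# The gradient shrinks toward 0 as |x| grows (saturation), and every output
# is positive, so the activations are never zero-centered.
```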
There are other activation functions, such as:
Tanh: It squashes a real number into the range [-1, 1]. Like the sigmoid, its output saturates, but unlike the sigmoid it is zero-centered.
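A quick sketch of the zero-centering difference, using made-up random pre-activations (the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)         # made-up pre-activations

sig_out = 1.0 / (1.0 + np.exp(-x))  # sigmoid outputs lie in (0, 1)
tanh_out = np.tanh(x)               # tanh outputs lie in (-1, 1)

# Every sigmoid output is positive, so the mean sits near 0.5;
# tanh outputs are roughly centered around 0.
print("mean sigmoid output:", sig_out.mean())
print("mean tanh output:   ", tanh_out.mean())
```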
ReLU: Rectified Linear Unit is another popular choice. It is defined as
\[f(x) = \max(0, x)\]It solves the problem of saturation (though only in the positive region). In the negative region the gradient is zero, so once again the weights do not update. ReLU units can also die during training: a large gradient flowing through a ReLU neuron can push its weights in such a way that the neuron never activates on any datapoint again. If the learning rate is set too high, as much as 40% of your network can end up dead (i.e. neurons that never activate across the entire training dataset). With a properly set learning rate this is rarely a problem.
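Here is a small sketch of a dead ReLU unit. It assumes the unit's inputs come from a previous ReLU layer (so they are non-negative) and that one huge update has pushed its weights negative; all names and numbers are made up:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient of ReLU w.r.t. its input: 1 where x > 0, 0 elsewhere.
    return (x > 0).astype(float)

rng = np.random.default_rng(1)
inputs = np.abs(rng.normal(size=(100, 3)))  # non-negative inputs, e.g. outputs of a previous ReLU layer
w = np.array([-5.0, -5.0, -5.0])            # weights knocked far negative by one huge update
b = -1.0

pre_act = inputs @ w + b                    # negative for every datapoint here
print("fraction of inputs that activate:", (relu(pre_act) > 0).mean())
print("max gradient over the dataset:   ", relu_grad(pre_act).max())
# The gradient is 0 everywhere, so the weights can never recover: a "dead" unit.
```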
Leaky ReLU: Leaky ReLU is an attempt to fix the dying ReLU problem. Instead of being zero for x < 0, it has a small slope in the negative region, i.e.
\[f(x) = \mathbb{1}(x < 0)\,(\alpha x) + \mathbb{1}(x \geq 0)\,(x)\]where α is a small constant (e.g. 0.01). Note that it doesn’t always help; the reported benefits of Leaky ReLU are not consistent.
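A minimal sketch of Leaky ReLU and its gradient, assuming the common (but arbitrary) choice α = 0.01:

```python
import numpy as np

ALPHA = 0.01  # slope in the negative region; 0.01 is a common (illustrative) choice

def leaky_relu(x):
    return np.where(x < 0, ALPHA * x, x)

def leaky_relu_grad(x):
    # The gradient is alpha instead of 0 for x < 0, so a unit that ends up
    # in the negative region still receives a (small) update signal.
    return np.where(x < 0, ALPHA, 1.0)

x = np.array([-3.0, -0.5, 0.0, 2.0])
print("leaky_relu(x):", leaky_relu(x))
print("gradient:     ", leaky_relu_grad(x))
```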
Which activation function should one use?
Use ReLU, but set the learning rate carefully to avoid the dying ReLU problem. You can also try Leaky ReLU. Don’t use the sigmoid.
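As a concrete illustration (assuming PyTorch; the layer sizes and learning rate here are arbitrary), swapping ReLU for Leaky ReLU is a one-line change, and a modest learning rate reduces the risk of dead units:

```python
import torch
import torch.nn as nn

# The same small network with ReLU and with Leaky ReLU (layer sizes are arbitrary).
relu_net = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))
leaky_net = nn.Sequential(nn.Linear(64, 32), nn.LeakyReLU(negative_slope=0.01), nn.Linear(32, 10))

# A modest learning rate reduces the risk of one large update pushing
# ReLU units into the dead regime.
optimizer = torch.optim.SGD(relu_net.parameters(), lr=1e-2)
```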