One Time Setup
Activation functions
- Non-linear functions applied between the layers, such as ReLU, Sigmoid, etc.
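As a minimal sketch of where an activation sits (the layer sizes and the choice of ReLU here are illustrative assumptions, not from the notes), it is just an elementwise non-linearity applied between two affine layers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 100))          # batch of 32 inputs with 100 features (arbitrary)
W1 = 0.01 * rng.normal(size=(100, 64))  # first affine layer
W2 = 0.01 * rng.normal(size=(64, 10))   # second affine layer

h = np.maximum(0.0, x @ W1)             # non-linearity (ReLU) applied elementwise between the layers
scores = h @ W2                         # output scores, no squashing applied here
print(scores.shape)                     # (32, 10)
```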
Sigmoid Function
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
- Squashes numbers to range [0, 1]
- Historically popular since it has a nice interpretation as a saturating “firing rate” of a neuron
3 Problems
- Saturated neurons “kill” the gradients
	- When x is far from zero (e.g., x = -10 or x = +10), the local gradient is nearly zero, so the gradient flowing backward is "killed" and learning becomes very slow (see the numeric sketch after this list)
- Sigmoid outputs are not zero-centered
	- Since all outputs are positive, the gradients on the weights of the following layer are all positive or all negative (for a single example), which leads to inefficient, zig-zagging updates
- exp() is a bit computationally expensive
	- Less of an issue on GPUs
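A small numeric sketch of the saturation problem (plain NumPy, hypothetical inputs): the local gradient $\sigma(x)(1 - \sigma(x))$ peaks at x = 0 and essentially vanishes once |x| is large.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Local gradient: d(sigmoid)/dx = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, 0.0, 10.0]:
    print(f"x = {x:6.1f}   sigmoid = {sigmoid(x):.5f}   grad = {sigmoid_grad(x):.2e}")
# At x = ±10 the local gradient is ~4.5e-05: the neuron is saturated and passes
# almost no learning signal backward. At x = 0 the gradient peaks at 0.25.
```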
Tanh
$$
f(x) = \tanh(x)
$$
- Essentially a scaled and shifted sigmoid: $\tanh(x) = 2\sigma(2x) - 1$, squashing numbers to the range [-1, 1]
- Solves the zero-centering issue, but still kills the gradients when saturated
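A quick check (a sketch, not from the notes) that tanh is a rescaled sigmoid and that it still saturates: its local gradient $1 - \tanh^2(x)$ goes to zero for large |x|.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)

# tanh(x) = 2*sigmoid(2x) - 1: outputs are zero-centered in (-1, 1)
assert np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

# Local gradient 1 - tanh(x)^2 still vanishes at the tails (saturation)
print(np.round(1.0 - np.tanh(x) ** 2, 4))
```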
ReLU
$$
f(x) = \max(0, x)
$$
- Does not saturate in the positive region
- Very computationally efficient
- Converges much faster than Sigmoid/tanh in practice
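A minimal sketch of the ReLU forward pass and its local gradient (hypothetical inputs), illustrating why it is cheap to compute and why it does not saturate for positive inputs:

```python
import numpy as np

def relu(x):
    """ReLU: f(x) = max(0, x); just a thresholding at zero."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Local gradient: 1 where x > 0, 0 where x <= 0 (a common convention at x = 0)."""
    return (x > 0).astype(x.dtype)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(relu(x))       # zero for negative inputs, identity for positive inputs
print(relu_grad(x))  # gradient is 0 or 1: no saturation for x > 0, no exp() needed
```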