One Time Setup
Activation functions
- Non-linear functions applied between the layers, such as ReLU, Sigmoid, etc.
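As a minimal sketch of where an activation sits (the layer sizes and the choice of ReLU here are illustrative assumptions, not from the notes), it is just an elementwise non-linearity applied between two affine layers:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 100))          # batch of 32 inputs with 100 features (arbitrary)
W1 = 0.01 * rng.normal(size=(100, 64))  # first affine layer
W2 = 0.01 * rng.normal(size=(64, 10))   # second affine layer

h = np.maximum(0.0, x @ W1)             # non-linearity (ReLU) applied elementwise between the layers
scores = h @ W2                         # output scores, no squashing applied here
print(scores.shape)                     # (32, 10)
```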
Sigmoid Function
$$
\sigma(x) = \frac{1}{1 + e^{-x}}
$$
- Squashes numbers to range [0, 1]
- Historically popular since it has a nice interpretation as a saturating “firing rate” of a neuron
3 Problems
- Saturated neurons “kill” the gradients
	- When x is far from zero (e.g., x = -10 or x = +10), the local gradient is nearly zero, so the gradient flowing backward is "killed" and learning becomes very slow (see the numeric sketch after this list)
- Sigmoid outputs are not zero-centered
	- Since all outputs are positive, the gradients on the weights of the following layer are all positive or all negative (for a single example), which leads to inefficient, zig-zagging updates
- exp() is a bit computationally expensive
	- Less of an issue on GPUs
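A small numeric sketch of the saturation problem (plain NumPy, hypothetical inputs): the local gradient $\sigma(x)(1 - \sigma(x))$ peaks at x = 0 and essentially vanishes once |x| is large.

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Local gradient: d(sigmoid)/dx = sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, 0.0, 10.0]:
    print(f"x = {x:6.1f}   sigmoid = {sigmoid(x):.5f}   grad = {sigmoid_grad(x):.2e}")
# At x = ±10 the local gradient is ~4.5e-05: the neuron is saturated and passes
# almost no learning signal backward. At x = 0 the gradient peaks at 0.25.
```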
Tanh
$$
f(x) = \tanh(x)
$$
- Essentially a scaled and shifted sigmoid: $\tanh(x) = 2\sigma(2x) - 1$, squashing numbers to the range [-1, 1]
- Solves the zero-centering issue, but still kills the gradients when saturated
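A quick check (a sketch, not from the notes) that tanh is a rescaled sigmoid and that it still saturates: its local gradient $1 - \tanh^2(x)$ goes to zero for large |x|.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 11)

# tanh(x) = 2*sigmoid(2x) - 1: outputs are zero-centered in (-1, 1)
assert np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

# Local gradient 1 - tanh(x)^2 still vanishes at the tails (saturation)
print(np.round(1.0 - np.tanh(x) ** 2, 4))
```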
ReLU
$$
f(x) = \max(0, x)
$$
- Does not saturate in the positive region
- Very computationally efficient
- Converges much faster than Sigmoid/tanh in practice
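A minimal sketch of the ReLU forward pass and its local gradient (hypothetical inputs), illustrating why it is cheap to compute and why it does not saturate for positive inputs:

```python
import numpy as np

def relu(x):
    """ReLU: f(x) = max(0, x); just a thresholding at zero."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Local gradient: 1 where x > 0, 0 where x <= 0 (a common convention at x = 0)."""
    return (x > 0).astype(x.dtype)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(relu(x))       # zero for negative inputs, identity for positive inputs
print(relu_grad(x))  # gradient is 0 or 1: no saturation for x > 0, no exp() needed
```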