- Given weights $w$ and a loss function $L(w)$, our goal is to find the optimal weights $w^*$ where the loss is minimized:
$$
w^* = \arg\min_w L(w)
$$
Gradient Descent
- In one dimension, the derivative of a function gives the slope:
$$
\frac{df(x)}{dx}=\lim_{h\rightarrow0}\frac{f(x+h)-f(x)}{h}
$$
- In multiple dimensions, the gradient is the vector of partial derivatives along each dimension
- The slope in any direction is the dot product of the direction with the gradient
- The direction of steepest descent is the negative gradient
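The limit definition above also gives a simple way to approximate gradients numerically and to check an analytic gradient. A minimal sketch (the helper `numeric_gradient` is illustrative and uses a centered difference rather than the one-sided limit):

```python
import numpy as np

def numeric_gradient(f, x, h=1e-5):
    """Approximate the gradient of f at x with centered finite differences."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = h
        grad.flat[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

# Example: f(x) = sum(x^2) has gradient 2x.
x = np.array([1.0, -2.0, 3.0])
print(numeric_gradient(lambda v: np.sum(v ** 2), x))  # approx. [ 2. -4.  6.]
```

In practice the analytic gradient is used for training because it is exact and fast, while a numeric gradient like this is kept as a debugging check.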
We can compute the analytic gradient of the loss, $\nabla_W L$, in closed form:
$$
\begin{aligned}
L &= \frac{1}{N}\sum^N_{i=1}L_i + \sum_k W_k^2 \\
L_i &= \sum_{j\neq y_i}\max(0,\; s_j - s_{y_i} + 1) \\
s &= f(x;W) = Wx
\end{aligned}
$$
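As a concrete illustration, here is a minimal NumPy sketch of this multiclass SVM (hinge) loss and its analytic gradient; the function name `svm_loss_and_grad`, the `(C, D)` layout of $W$, and the optional `reg` coefficient (the $\lambda$ of the regularized loss further below) are assumptions for the example:

```python
import numpy as np

def svm_loss_and_grad(W, X, y, reg=0.0):
    """Multiclass SVM loss L and its analytic gradient dL/dW.

    W: (C, D) weights, X: (N, D) samples, y: (N,) integer class labels.
    """
    N = X.shape[0]
    scores = X @ W.T                               # s = Wx for every sample, (N, C)
    correct = scores[np.arange(N), y][:, None]     # s_{y_i}, (N, 1)
    margins = np.maximum(0, scores - correct + 1)  # max(0, s_j - s_{y_i} + 1)
    margins[np.arange(N), y] = 0                   # exclude j == y_i
    loss = margins.sum() / N + reg * np.sum(W * W)

    # Per-sample gradient: +x_i for each violated margin j != y_i,
    # and -(number of violated margins) * x_i for the correct class row.
    mask = (margins > 0).astype(W.dtype)
    mask[np.arange(N), y] = -mask.sum(axis=1)
    dW = mask.T @ X / N + 2 * reg * W
    return loss, dW
```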
Batch Gradient Descent
Once we have the gradient, we can train the model using Gradient Descent
- Iteratively step in the direction of the negative gradient
- Hyper-parameters: Weight Initialization method, num steps, learning rate
w = initialize_weights()
for t in range(num_steps):
    dw = compute_gradient(loss_fn, data, w)
    w -= learning_rate * dw
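To make the loop concrete, here is a runnable toy version on a least-squares loss; the data, the `loss_and_grad` helper, and the hyper-parameter values are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # toy dataset
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

def loss_and_grad(w, X, y):
    """Mean squared error and its gradient for a linear model."""
    err = X @ w - y
    return (err ** 2).mean(), 2 * X.T @ err / len(y)

w = np.zeros(5)                                # weight initialization
learning_rate, num_steps = 1e-1, 200           # hyper-parameters
for t in range(num_steps):
    loss, dw = loss_and_grad(w, X, y)
    w -= learning_rate * dw                    # step along the negative gradient
```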
$$
L(W)=\frac{1}{N}\sum^N_{i=1}L_i(x_i,y_i,W)+\lambda R(W) \\ \nabla_WL(W) = \frac{1}{N}\sum^N_{i=1}\nabla_WL_i(x_i,y_i,W)+\lambda\nabla_WR(W)
$$
- The loss function is a sum over the losses of all samples in the dataset
- The gradient is likewise a sum over the per-sample gradients
- The problem is that computing this full sum becomes very expensive when $N$ is large
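As a quick sanity check of this decomposition, the full-batch gradient equals the mean of the per-sample gradients; a sketch reusing the illustrative `svm_loss_and_grad` from above (regularization omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))                    # C = 3 classes, D = 4 features
X = rng.normal(size=(8, 4))                    # N = 8 samples
y = rng.integers(0, 3, size=8)

_, full_grad = svm_loss_and_grad(W, X, y)
per_sample = [svm_loss_and_grad(W, X[i:i+1], y[i:i+1])[1] for i in range(len(y))]
assert np.allclose(full_grad, np.mean(per_sample, axis=0))
```

Each gradient evaluation touches every sample, so one full-batch update costs $O(N)$ per-sample gradient computations, which motivates the stochastic variant below.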
Stochastic Gradient Descent (SGD)