- Given weights $w$ and a loss function $L(w)$, our goal is to find the optimal weights $w^*$ where the loss is minimized:
$$
w^* = \arg\min_w L(w)
$$
Gradient Descent
- In one dimension, the derivative of a function gives the slope:
$$
\frac{df(x)}{dx}=\lim_{h\rightarrow0}\frac{f(x+h)-f(x)}{h}
$$
- In multiple dimensions, the gradient is the vector of partial derivatives along each dimension
- The slope in any direction is the dot product of the direction with the gradient
- The direction of steepest descent is the negative gradient
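The limit definition above also gives a simple way to approximate gradients numerically and to check an analytic gradient. A minimal sketch (the helper `numeric_gradient` is illustrative and uses a centered difference rather than the one-sided limit):

```python
import numpy as np

def numeric_gradient(f, x, h=1e-5):
    """Approximate the gradient of f at x with centered finite differences."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = h
        grad.flat[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

# Example: f(x) = sum(x^2) has gradient 2x.
x = np.array([1.0, -2.0, 3.0])
print(numeric_gradient(lambda v: np.sum(v ** 2), x))  # approx. [ 2. -4.  6.]
```

In practice the analytic gradient is used for training because it is exact and fast, while a numeric gradient like this is kept as a debugging check.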
We can compute the analytic gradient of the loss, $\nabla_W L$, in closed form:
$$
\begin{aligned}
L &= \frac{1}{N}\sum^N_{i=1}L_i + \sum_k W_k^2 \\
L_i &= \sum_{j\neq y_i}\max(0,\; s_j - s_{y_i} + 1) \\
s &= f(x;W) = Wx
\end{aligned}
$$
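As a concrete illustration, here is a minimal NumPy sketch of this multiclass SVM (hinge) loss and its analytic gradient; the function name `svm_loss_and_grad`, the `(C, D)` layout of $W$, and the optional `reg` coefficient (the $\lambda$ of the regularized loss further below) are assumptions for the example:

```python
import numpy as np

def svm_loss_and_grad(W, X, y, reg=0.0):
    """Multiclass SVM loss L and its analytic gradient dL/dW.

    W: (C, D) weights, X: (N, D) samples, y: (N,) integer class labels.
    """
    N = X.shape[0]
    scores = X @ W.T                               # s = Wx for every sample, (N, C)
    correct = scores[np.arange(N), y][:, None]     # s_{y_i}, (N, 1)
    margins = np.maximum(0, scores - correct + 1)  # max(0, s_j - s_{y_i} + 1)
    margins[np.arange(N), y] = 0                   # exclude j == y_i
    loss = margins.sum() / N + reg * np.sum(W * W)

    # Per-sample gradient: +x_i for each violated margin j != y_i,
    # and -(number of violated margins) * x_i for the correct class row.
    mask = (margins > 0).astype(W.dtype)
    mask[np.arange(N), y] = -mask.sum(axis=1)
    dW = mask.T @ X / N + 2 * reg * W
    return loss, dW
```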
Batch Gradient Descent
Once we have the gradient, we can train the model using Gradient Descent
- Iteratively step in the direction of the negative gradient
- Hyper-parameters: Weight Initialization method, num steps, learning rate
w = initialize_weights()
for t in range(num_steps):
    dw = compute_gradient(loss_fn, data, w)
    w -= learning_rate * dw
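To make the loop concrete, here is a runnable toy version on a least-squares loss; the data, the `loss_and_grad` helper, and the hyper-parameter values are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # toy dataset
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)

def loss_and_grad(w, X, y):
    """Mean squared error and its gradient for a linear model."""
    err = X @ w - y
    return (err ** 2).mean(), 2 * X.T @ err / len(y)

w = np.zeros(5)                                # weight initialization
learning_rate, num_steps = 1e-1, 200           # hyper-parameters
for t in range(num_steps):
    loss, dw = loss_and_grad(w, X, y)
    w -= learning_rate * dw                    # step along the negative gradient
```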
$$
L(W)=\frac{1}{N}\sum^N_{i=1}L_i(x_i,y_i,W)+\lambda R(W) \\ \nabla_WL(W) = \frac{1}{N}\sum^N_{i=1}\nabla_WL_i(x_i,y_i,W)+\lambda\nabla_WR(W)
$$
- The loss function is a sum over the losses of all samples in the dataset
- The gradient is likewise a sum over the per-sample gradients
- The problem is that computing this full sum becomes very expensive when $N$ is large
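As a quick sanity check of this decomposition, the full-batch gradient equals the mean of the per-sample gradients; a sketch reusing the illustrative `svm_loss_and_grad` from above (regularization omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))                    # C = 3 classes, D = 4 features
X = rng.normal(size=(8, 4))                    # N = 8 samples
y = rng.integers(0, 3, size=8)

_, full_grad = svm_loss_and_grad(W, X, y)
per_sample = [svm_loss_and_grad(W, X[i:i+1], y[i:i+1])[1] for i in range(len(y))]
assert np.allclose(full_grad, np.mean(per_sample, axis=0))
```

Each gradient evaluation touches every sample, so one full-batch update costs $O(N)$ per-sample gradient computations, which motivates the stochastic variant below.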
Stochastic Gradient Descent (SGD)