Viewpoints of Linear Classifiers
- Algebraic Viewpoint
- Visual Viewpoint
- Geometric Viewpoint
Loss Functions
- How well does the classifier work?
Multi-Class SVM (Hinge Loss)
The score of the correct class should be higher than all the other scores by at least a fixed margin (1 in the formula below)
- Given example $(x_i, y_i)$, let $s = f(x_i, W)$ be scores, then SVM loss has the form:
$$
L_i=\sum_{j\neq y_i}\max(0,\, s_j - s_{y_i}+1)
$$
- When implementing a network, think about what loss value to expect when all scores are small and random: for this SVM loss with $C$ classes it should be about $C-1$ (see the sketch after this list).
- If the loss at initialization looks weird, there's probably a bug somewhere.
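A minimal NumPy sketch of the per-example SVM loss above (the function name is my own). With $C$ classes and small random scores, each of the $C-1$ margin terms is about 1, so the expected loss is roughly $C-1$, which makes a handy sanity check:

```python
import numpy as np

def svm_loss_single(scores, y):
    # scores: (C,) vector of class scores s = f(x_i, W); y: index of the correct class
    margins = np.maximum(0, scores - scores[y] + 1)  # hinge term for every class
    margins[y] = 0                                   # do not count j == y_i
    return margins.sum()

# Sanity check: with small random scores, the loss should be about C - 1
C = 10
print(svm_loss_single(0.01 * np.random.randn(C), y=3))  # approx 9
```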
Cross Entropy Loss
Want to interpret raw classifier scores as probabilities
- The only way to get 0 loss would be to assign probability 1 to the correct class, which requires infinitely large score gaps, so in practice the loss never reaches exactly 0.
- Use the Softmax function
$$
s = f(x_i;W),\quad P(Y=k\mid X=x_i)=\frac{e^{s_k}}{\sum_j e^{s_j}}
$$
| | Class A | Class B | Class C |
| --- | --- | --- | --- |
| Unnormalized log-probabilities (logits) | 3.2 | 5.1 | -1.7 |
| Unnormalized probabilities (exp) | 24.5 | 164.0 | 0.18 |
| Normalized probabilities (softmax) | 0.13 | 0.87 | 0.00 |
| Correct probabilities | 1 | 0 | 0 |
- Once we get the normalized probabilities, we can compute the loss (a code sketch follows below)
- If the correct class were Class A, the loss would be $L_i=-\log(0.13)=2.04$
$$
L_i=-\log P(Y=y_i\mid X=x_i)
$$
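A small sketch (helper name is my own) that reproduces the table above: exponentiate the logits, normalize, and take the negative log-probability of the correct class. Subtracting the max logit first is a standard numerical-stability trick and does not change the softmax output:

```python
import numpy as np

def cross_entropy_single(scores, y):
    # scores: (C,) unnormalized log-probabilities (logits); y: correct class index
    shifted = scores - scores.max()                    # stability; same softmax result
    probs = np.exp(shifted) / np.exp(shifted).sum()    # softmax
    return -np.log(probs[y])                           # L_i = -log P(Y = y_i | X = x_i)

scores = np.array([3.2, 5.1, -1.7])  # Class A, B, C from the table
print(cross_entropy_single(scores, y=0))  # approx 2.04 when Class A is correct
```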
Regularization
- A term added to the loss function to prevent the model from fitting the training data too closely
- Express preferences among models beyond “minimize training error”
- Avoid Overfitting: Prefer simple models that generalize better
- Improve optimization by adding curvature
$$
L(W)=\frac{1}{N}\sum^N_{i=1}L_i(f(x_i,W),y_i)+\lambda R(W)
$$
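A sketch of the full objective above, assuming a linear classifier $f(x,W)=xW$, the cross-entropy loss from the previous section, and L2 regularization $R(W)=\sum_{k,l} W_{k,l}^2$ (all names are illustrative):

```python
import numpy as np

def total_loss(W, X, y, lam):
    # X: (N, D) data, y: (N,) integer labels, W: (D, C) weights, lam: regularization strength
    scores = X @ W                                            # f(x_i, W) for all examples
    shifted = scores - scores.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)  # softmax per row
    data_loss = -np.log(probs[np.arange(len(y)), y]).mean()   # (1/N) sum of L_i
    reg_loss = lam * np.sum(W * W)                             # lambda * R(W), L2 penalty
    return data_loss + reg_loss

# Example usage with random data
N, D, C = 5, 4, 3
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)
W = 0.01 * np.random.randn(D, C)
print(total_loss(W, X, y, lam=1e-3))
```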