Support Vector Machine

See also Support Vector Machine

Support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression.

Hard-Margin SVM

Assumption: the data are linearly separable

Given w, the (signed) distance to the closest example is
$$\min_{i\in[n]} \frac{y^{(i)}\, w^T x^{(i)}}{\|w\|_2}$$

Derivation (distance from a point to a hyperplane)
IMG_02FA49C7CA67-1.jpeg|400
See also SVM>Margin
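
As a quick numeric illustration, here is a minimal numpy sketch (the toy data and candidate $w$ below are made up) that evaluates the margin expression above:

```python
import numpy as np

# hypothetical toy data: rows of X are x^(i), labels y in {-1, +1}
X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 1.0])  # any candidate weight vector

# signed distance of the closest example: min_i y^(i) w^T x^(i) / ||w||_2
margin = np.min(y * (X @ w)) / np.linalg.norm(w)
print(margin)
```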

Hence the max-min classifier that maximizes the margin (hard margin) is given by
$$\max_{w\in\mathbb{R}^d}\ \min_{i\in[n]} \frac{y^{(i)}\, w^T x^{(i)}}{\|w\|_2}$$

Because the data are linearly separable, we can find $w$ such that $\min_i y^{(i)} w^T x^{(i)} = 1$.
That is, we can fix the scale of $w$ so that $y^{(i)} w^T x^{(i)} = 1$ for support vectors and $y^{(i)} w^T x^{(i)} \ge 1$ for all other points.
Therefore it is equivalent to solve the following optimization problem:
$$\max_{w\in\mathbb{R}^d} \frac{1}{\|w\|_2} \quad\text{s.t.}\quad y^{(i)} w^T x^{(i)} \ge 1,\ \forall i\in[n]$$
or equivalently
$$\min_{w\in\mathbb{R}^d} \frac{1}{2}\|w\|_2^2 \quad\text{s.t.}\quad y^{(i)} w^T x^{(i)} \ge 1,\ \forall i\in[n]$$

This is also known as the Max-Margin Classifier
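
A minimal sketch of solving the quadratic program above directly, assuming cvxpy is available; the separable toy data is made up and, matching the formulation above, there is no bias term:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
# hard-margin primal: min (1/2)||w||_2^2  s.t.  y^(i) w^T x^(i) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w) >= 1])
prob.solve()

print(w.value)                          # max-margin weight vector
print(1.0 / np.linalg.norm(w.value))    # geometric margin = 1 / ||w||_2
```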

Soft-Margin SVM

If the data are not linearly separable, we introduce some relaxation through a slack variable $\xi_i \ge 0$ for each data point $(x^{(i)}, y^{(i)})$.

Pasted image 20240204103714.png|470

When we allow misclassifications, the distance between the observations and the threshold is the soft margin. Observations on the edge of and within the soft margin are support vectors.

Pasted image 20240204090609.png|300

Also known as Support Vector Classifier
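
A minimal sketch of fitting a soft-margin (linear-kernel) classifier, assuming scikit-learn; the overlapping 2-D data is synthetic, and SVC's C plays the inverse role of the regularization strength $\lambda$ introduced in the next section:

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical, slightly overlapping 2-D data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, size=(50, 2)),
               rng.normal(-1.5, 1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

# smaller C => more slack allowed (softer margin); larger C => closer to hard margin
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_.shape)   # points on or inside the margin
print(clf.coef_, clf.intercept_)
```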

Support Vector Machine

Support Vector Machine is Hinge Loss with L2 Regularization

See also Neural Net#Overfitting

Hinge Loss

$$\min_{w\in\mathbb{R}^d} \sum_i L_{\text{hinge}}\big(y^{(i)} w^T x^{(i)}\big) + \frac{\lambda}{2}\|w\|_2^2$$
where $L_{\text{hinge}}(t) = \max\{0, 1-t\}$

Derivation:
Further transformation into an unconstrained optimization problem
IMG_F77B7D893A0D-1.jpeg|600

The hyper-parameter $\lambda \ge 0$ controls the strength of regularization: larger $\lambda$ shrinks $\|w\|_2$ (a wider margin that tolerates more hinge violations), while smaller $\lambda$ behaves closer to the hard margin.
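
A minimal subgradient-descent sketch of the unconstrained objective above (toy data reused from the hard-margin example; the step size and iteration count are arbitrary choices):

```python
import numpy as np

X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

lam, lr, steps = 0.1, 0.05, 500
w = np.zeros(X.shape[1])
for _ in range(steps):
    margins = y * (X @ w)
    active = margins < 1                     # points with non-zero hinge loss
    # subgradient of sum_i max{0, 1 - y_i w^T x_i} + (lam/2)||w||_2^2
    grad = -(y[active, None] * X[active]).sum(axis=0) + lam * w
    w -= lr * grad

print(w)
```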

Dual SVM

In Hard-Margin SVM:
We introduce a Lagrange multiplier $\alpha_i \ge 0$ for each constraint in $\min_{w\in\mathbb{R}^d} \frac{1}{2}\|w\|_2^2 \ \text{s.t.}\ y^{(i)} w^T x^{(i)} \ge 1,\ \forall i\in[n]$:
$$\min_w \max_{\alpha\ge 0} L(w,\alpha), \qquad L(w,\alpha) = \frac{1}{2}\|w\|_2^2 + \sum_i \alpha_i\big(1 - y^{(i)} w^T x^{(i)}\big)$$
(the Lagrangian)

Then $\min_w P(w) = \max_{\alpha\ge 0} D(\alpha)$, where we define $P(w) = \max_{\alpha\ge 0} L(w,\alpha)$ (the primal) and $D(\alpha) = \min_w L(w,\alpha)$ (the dual).

This equality holds because of Strong Duality.

The Dual Problem

The dual problem $D(\alpha) = \min_w L(w,\alpha)$ can be solved in closed form because $L$ is convex (quadratic) in $w$:
$$\max_{\alpha\ge 0} D(\alpha) = \max_{\alpha\ge 0}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j\, x_i^T x_j = \max_{\alpha\ge 0}\ \mathbf{1}_n^T\alpha - \frac{1}{2}\alpha^T K\alpha$$
where $\mathbf{1}_n \in \mathbb{R}^n$ is the all-ones vector of dimension $n$ and $K\in\mathbb{R}^{n\times n}$ with $K_{ij} = (y_i x_i)^T (y_j x_j)$

Dual solutions and support vectors are not necessarily unique
(even if the primal solution is unique)

Derivation:
IMG_09031752CF56-1.jpeg|520
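
A minimal cvxpy sketch of the dual QP above on the same made-up toy data; because this formulation has no bias term, the only constraint is $\alpha \ge 0$, and the primal weights are recovered via the stationarity condition $w = \sum_i \alpha_i y_i x^{(i)}$:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

Z = y[:, None] * X                      # row i is y_i x^(i), so K = Z Z^T
alpha = cp.Variable(n)
# alpha^T K alpha = ||Z^T alpha||^2, which keeps the objective explicitly concave
dual = cp.Problem(cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Z.T @ alpha)),
                  [alpha >= 0])
dual.solve()

w = Z.T @ alpha.value                   # w = sum_i alpha_i y_i x^(i)
print(alpha.value)                      # nonzero alpha_i mark support vectors
print(w)
```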

The Dual Problem makes #Kernel Method easier

Kernel Method

Sometimes datasets are linearly inseparable, in which case we use the Kernel Method.
The key idea is feature mapping/lifting to construct new features:
$$\phi(\cdot): \mathbb{R}^d \to \mathbb{R}^p$$

Example:
In the XOR Problem, we can add a third dimension: $(x_1, x_2) \mapsto (x_1, x_2, x_3)$
where $x_3 = x_1 x_2$
Pasted image 20240204224618.png|300
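
A minimal numpy sketch of this lift (the XOR points and label encoding below are made up):

```python
import numpy as np

# XOR-style labels: +1 when x1 and x2 share a sign, -1 otherwise
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# lift (x1, x2) -> (x1, x2, x1 * x2); the third coordinate already separates the classes
X_lift = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])
print(X_lift[:, 2])   # [ 1.  1. -1. -1.] -> sign matches y, so a linear separator exists
```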

The primal optimization becomes $$\min_{w\in\mathbb{R}^p} \frac{1}{2}\|w\|_2^2 \quad\text{s.t.}\quad y^{(i)} w^T \phi(x^{(i)}) \ge 1,\ \forall i\in[n]$$
#The Dual Problem under the feature map $\phi$ is therefore
$$\max_{\alpha\ge 0} D(\alpha) = \max_{\alpha\ge 0}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, \phi(x_i)^T\phi(x_j) = \max_{\alpha\ge 0}\ \mathbf{1}_n^T\alpha - \frac{1}{2}\alpha^T K\alpha$$
where $\mathbf{1}_n \in \mathbb{R}^n$ is the all-ones vector of dimension $n$ and $K\in\mathbb{R}^{n\times n}$ with $K_{ij} = (y_i\,\phi(x_i))^T (y_j\,\phi(x_j))$

Example:
Affine features $\phi: \mathbb{R}^d \to \mathbb{R}^{d+1}$ with
$$\phi(x) = (1, x_1, \dots, x_d)$$
The kernel form is $\phi(x)^T\phi(x') = 1 + x^T x'$

Quadratic features $\phi: \mathbb{R}^d \to \mathbb{R}^p$; for $d = 2$,
$$\phi(x) = \big(x_1^2,\, x_2^2,\, \sqrt{2}x_1x_2,\, \sqrt{2}x_1,\, \sqrt{2}x_2,\, 1\big)$$
The kernel form is $\phi(x)^T\phi(x') = (1 + x^T x')^2$
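
A quick numerical check, with arbitrary 2-D vectors, that the explicit quadratic feature map agrees with the kernel form $(1 + x^T x')^2$:

```python
import numpy as np

def phi(x):
    # quadratic feature map for d = 2
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(phi(x) @ phi(xp))        # explicit inner product in R^6
print((1 + x @ xp) ** 2)       # kernel form; the two values agree
```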

RBF Kernel

Radial Basis Function (RBF/Gaussian) Kernel
For any $\sigma > 0$, there is an infinite-dimensional feature map $\phi: \mathbb{R}^d \to \mathbb{R}^\infty$ such that
$$k(x, x') = \phi(x)^T\phi(x') = \exp\!\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right)$$

MappingNonLinear.png|400
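
In practice $\phi$ is never formed explicitly; only $k(x, x')$ is evaluated. A minimal sketch, assuming scikit-learn, where SVC's gamma corresponds to $1/(2\sigma^2)$ and the ring-vs-center data is synthetic:

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(x, xp, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

# hypothetical ring-vs-center data that no linear separator handles
rng = np.random.default_rng(0)
inner = rng.normal(0.0, 0.5, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, size=50)
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] + rng.normal(0, 0.2, size=(50, 2))
X = np.vstack([inner, outer])
y = np.array([1] * 50 + [-1] * 50)

sigma = 1.0
print(rbf_kernel(X[0], X[1], sigma))                        # one kernel evaluation
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2)).fit(X, y)
print(clf.score(X, y))                                      # the RBF kernel separates the rings
```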