Support Vector Machine

See also Support Vector Machine

Support vector machines are supervised max-margin models with associated learning algorithms that analyze data for classification and regression.

Hard-Margin SVM

Assumption: the data are linearly separable

Given w, the (signed) distance to the closest example is
$$\min_{i\in[n]} \frac{y^{(i)}\, w^T x^{(i)}}{\|w\|_2}$$

Derivation (distance from a point to a hyperplane)
IMG_02FA49C7CA67-1.jpeg|400
See also SVM>Margin
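
As a quick numeric illustration, here is a minimal numpy sketch (the toy data and candidate $w$ below are made up) that evaluates the margin expression above:

```python
import numpy as np

# hypothetical toy data: rows of X are x^(i), labels y in {-1, +1}
X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 1.0])  # any candidate weight vector

# signed distance of the closest example: min_i y^(i) w^T x^(i) / ||w||_2
margin = np.min(y * (X @ w)) / np.linalg.norm(w)
print(margin)
```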

Hence the max-min classifier that maximizes the margin (hard margin) is given by
$$\max_{w\in\mathbb{R}^d}\ \min_{i\in[n]} \frac{y^{(i)}\, w^T x^{(i)}}{\|w\|_2}$$

Because the data are linearly separable, we can find $w$ such that $\min_i y^{(i)} w^T x^{(i)} = 1$.
That is, we can fix the scale of $w$ so that $y^{(i)} w^T x^{(i)} = 1$ for support vectors and $y^{(i)} w^T x^{(i)} \ge 1$ for all other points.
Therefore it is equivalent to solve the following optimization problem:
$$\max_{w\in\mathbb{R}^d} \frac{1}{\|w\|_2} \quad\text{s.t.}\quad y^{(i)} w^T x^{(i)} \ge 1,\ \forall i\in[n]$$
or equivalently
$$\min_{w\in\mathbb{R}^d} \frac{1}{2}\|w\|_2^2 \quad\text{s.t.}\quad y^{(i)} w^T x^{(i)} \ge 1,\ \forall i\in[n]$$

This is also known as the Max-Margin Classifier
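
A minimal sketch of solving the quadratic program above directly, assuming cvxpy is available; the separable toy data is made up and, matching the formulation above, there is no bias term:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
# hard-margin primal: min (1/2)||w||_2^2  s.t.  y^(i) w^T x^(i) >= 1
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w) >= 1])
prob.solve()

print(w.value)                          # max-margin weight vector
print(1.0 / np.linalg.norm(w.value))    # geometric margin = 1 / ||w||_2
```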

Soft-Margin SVM

If the data are not linearly separable, we introduce some relaxation through a slack variable $\xi_i \ge 0$ for each data point $(x^{(i)}, y^{(i)})$.

Pasted image 20240204103714.png|470

When we allow misclassifications, the distance between the observations and the threshold is the soft margin. Observations on the edge of and within the soft margin are support vectors.

Pasted image 20240204090609.png|300

Also known as Support Vector Classifier
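
A minimal sketch of fitting a soft-margin (linear-kernel) classifier, assuming scikit-learn; the overlapping 2-D data is synthetic, and SVC's C plays the inverse role of the regularization strength $\lambda$ introduced in the next section:

```python
import numpy as np
from sklearn.svm import SVC

# hypothetical, slightly overlapping 2-D data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, size=(50, 2)),
               rng.normal(-1.5, 1.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

# smaller C => more slack allowed (softer margin); larger C => closer to hard margin
clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_.shape)   # points on or inside the margin
print(clf.coef_, clf.intercept_)
```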

Support Vector Machine

Support Vector Machine is Hinge Loss with L2 Regularization

See also Neural Net#Overfitting

Hinge Loss

$$\min_{w\in\mathbb{R}^d} \sum_i L_{\text{hinge}}\big(y^{(i)} w^T x^{(i)}\big) + \frac{\lambda}{2}\|w\|_2^2$$
where $L_{\text{hinge}}(t) = \max\{0, 1-t\}$

Derivation:
Further transformation into an unconstrained optimization problem
IMG_F77B7D893A0D-1.jpeg|600

The hyper-parameter $\lambda \ge 0$ controls the strength of regularization: larger $\lambda$ shrinks $\|w\|_2$ (a wider margin that tolerates more hinge violations), while smaller $\lambda$ behaves closer to the hard margin.
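
A minimal subgradient-descent sketch of the unconstrained objective above (toy data reused from the hard-margin example; the step size and iteration count are arbitrary choices):

```python
import numpy as np

X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

lam, lr, steps = 0.1, 0.05, 500
w = np.zeros(X.shape[1])
for _ in range(steps):
    margins = y * (X @ w)
    active = margins < 1                     # points with non-zero hinge loss
    # subgradient of sum_i max{0, 1 - y_i w^T x_i} + (lam/2)||w||_2^2
    grad = -(y[active, None] * X[active]).sum(axis=0) + lam * w
    w -= lr * grad

print(w)
```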

Dual SVM

In Hard-Margin SVM:
We introduce a Lagrange multiplier $\alpha_i \ge 0$ for each constraint in $\min_{w\in\mathbb{R}^d} \frac{1}{2}\|w\|_2^2 \ \text{s.t.}\ y^{(i)} w^T x^{(i)} \ge 1,\ \forall i\in[n]$:
$$\min_w \max_{\alpha\ge 0} L(w,\alpha), \qquad L(w,\alpha) = \frac{1}{2}\|w\|_2^2 + \sum_i \alpha_i\big(1 - y^{(i)} w^T x^{(i)}\big)$$
(the Lagrangian)

Then $\min_w P(w) = \max_{\alpha\ge 0} D(\alpha)$, where we define $P(w) = \max_{\alpha\ge 0} L(w,\alpha)$ (the primal) and $D(\alpha) = \min_w L(w,\alpha)$ (the dual).

This equality holds because of Strong Duality.

The Dual Problem

The dual problem $D(\alpha) = \min_w L(w,\alpha)$ can be solved in closed form because $L$ is convex (quadratic) in $w$:
$$\max_{\alpha\ge 0} D(\alpha) = \max_{\alpha\ge 0}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j\, x_i^T x_j = \max_{\alpha\ge 0}\ \mathbf{1}_n^T\alpha - \frac{1}{2}\alpha^T K\alpha$$
where $\mathbf{1}_n \in \mathbb{R}^n$ is the all-ones vector of dimension $n$ and $K\in\mathbb{R}^{n\times n}$ with $K_{ij} = (y_i x_i)^T (y_j x_j)$

Dual solutions and support vectors are not necessarily unique
(even if the primal solution is unique)

Derivation:
IMG_09031752CF56-1.jpeg|520
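
A minimal cvxpy sketch of the dual QP above on the same made-up toy data; because this formulation has no bias term, the only constraint is $\alpha \ge 0$, and the primal weights are recovered via the stationarity condition $w = \sum_i \alpha_i y_i x^{(i)}$:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [2.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

Z = y[:, None] * X                      # row i is y_i x^(i), so K = Z Z^T
alpha = cp.Variable(n)
# alpha^T K alpha = ||Z^T alpha||^2, which keeps the objective explicitly concave
dual = cp.Problem(cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Z.T @ alpha)),
                  [alpha >= 0])
dual.solve()

w = Z.T @ alpha.value                   # w = sum_i alpha_i y_i x^(i)
print(alpha.value)                      # nonzero alpha_i mark support vectors
print(w)
```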

The Dual Problem makes #Kernel Method easier

Kernel Method

Sometimes datasets are linearly inseparable, in which case we use the Kernel Method.
The key idea is feature mapping/lifting to construct new features:
$$\phi(\cdot): \mathbb{R}^d \to \mathbb{R}^p$$

Example:
In the XOR Problem, we can add a third dimension: $(x_1, x_2) \mapsto (x_1, x_2, x_3)$
where $x_3 = x_1 x_2$
Pasted image 20240204224618.png|300
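
A minimal numpy sketch of this lift (the XOR points and label encoding below are made up):

```python
import numpy as np

# XOR-style labels: +1 when x1 and x2 share a sign, -1 otherwise
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# lift (x1, x2) -> (x1, x2, x1 * x2); the third coordinate already separates the classes
X_lift = np.hstack([X, (X[:, 0] * X[:, 1])[:, None]])
print(X_lift[:, 2])   # [ 1.  1. -1. -1.] -> sign matches y, so a linear separator exists
```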

The primal optimization becomes $$\min_{w\in\mathbb{R}^p} \frac{1}{2}\|w\|_2^2 \quad\text{s.t.}\quad y^{(i)} w^T \phi(x^{(i)}) \ge 1,\ \forall i\in[n]$$
#The Dual Problem under the feature map $\phi$ is therefore
$$\max_{\alpha\ge 0} D(\alpha) = \max_{\alpha\ge 0}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, \phi(x_i)^T\phi(x_j) = \max_{\alpha\ge 0}\ \mathbf{1}_n^T\alpha - \frac{1}{2}\alpha^T K\alpha$$
where $\mathbf{1}_n \in \mathbb{R}^n$ is the all-ones vector of dimension $n$ and $K\in\mathbb{R}^{n\times n}$ with $K_{ij} = (y_i\,\phi(x_i))^T (y_j\,\phi(x_j))$

Example:
Affine features $\phi: \mathbb{R}^d \to \mathbb{R}^{d+1}$ with
$$\phi(x) = (1, x_1, \dots, x_d)$$
The kernel form is $\phi(x)^T\phi(x') = 1 + x^T x'$

Quadratic features $\phi: \mathbb{R}^d \to \mathbb{R}^p$; for $d = 2$,
$$\phi(x) = \big(x_1^2,\, x_2^2,\, \sqrt{2}x_1x_2,\, \sqrt{2}x_1,\, \sqrt{2}x_2,\, 1\big)$$
The kernel form is $\phi(x)^T\phi(x') = (1 + x^T x')^2$
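
A quick numerical check, with arbitrary 2-D vectors, that the explicit quadratic feature map agrees with the kernel form $(1 + x^T x')^2$:

```python
import numpy as np

def phi(x):
    # quadratic feature map for d = 2
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2, np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(phi(x) @ phi(xp))        # explicit inner product in R^6
print((1 + x @ xp) ** 2)       # kernel form; the two values agree
```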

RBF Kernel

Radial Basis Function (RBF/Gaussian) Kernel
For any $\sigma > 0$, there is an infinite-dimensional feature map $\phi: \mathbb{R}^d \to \mathbb{R}^\infty$ such that
$$k(x, x') = \phi(x)^T\phi(x') = \exp\!\left(-\frac{\|x - x'\|_2^2}{2\sigma^2}\right)$$

MappingNonLinear.png|400
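
In practice $\phi$ is never formed explicitly; only $k(x, x')$ is evaluated. A minimal sketch, assuming scikit-learn, where SVC's gamma corresponds to $1/(2\sigma^2)$ and the ring-vs-center data is synthetic:

```python
import numpy as np
from sklearn.svm import SVC

def rbf_kernel(x, xp, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - xp) ** 2) / (2 * sigma ** 2))

# hypothetical ring-vs-center data that no linear separator handles
rng = np.random.default_rng(0)
inner = rng.normal(0.0, 0.5, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, size=50)
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] + rng.normal(0, 0.2, size=(50, 2))
X = np.vstack([inner, outer])
y = np.array([1] * 50 + [-1] * 50)

sigma = 1.0
print(rbf_kernel(X[0], X[1], sigma))                        # one kernel evaluation
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2)).fit(X, y)
print(clf.score(X, y))                                      # the RBF kernel separates the rings
```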