Good vs Bad Classifiers
Here’s an interesting question: both lines above separate the red and green clusters. Is there a good reason to choose one over another?
Remember that the worth of a classifier is not in how well it separates the training data. We eventually want it to classify yet-unseen data points (known as test data). Given that, we want to choose a line that captures the general pattern in the training data, so there is a good chance it does well on the test data.
The first line above seems a bit “skewed.” Near its lower half it seems to run too close to the red cluster, and in its upper half it runs too close to the green cluster. Sure, it separates the training data perfectly, but if it sees a test point that’s a little farther out from the clusters, there is a good chance it would get the label wrong.
The second line doesn’t have this problem. For example, look at the test points shown as squares and the labels assigned by the classifiers in the figure below.
The second line stays as far away as possible from both the clusters while getting the training data separation right. By being right in the middle of the two clusters, it is less “risky,” gives the data distributions for each class some wiggle room so to speak, and thus generalizes well on test data.
SVMs try to find the second kind of line. We selected the better classifier visually, but we need to define the underlying philosophy a bit more precisely to apply it in the general case. Here’s a simplified version of what SVMs do:
- Find lines that correctly classify the training data
- Among all such lines, pick the one that has the greatest distance to the points closest to it.
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space(N — the number of features) that distinctly classifies the data points.
Hyperplanes and Support Vectors
Support vectors are data points that are closer to the hyperplane and influence the position and orientation of the hyperplane. Using these support vectors, we maximize the margin of the classifier. Deleting the support vectors will change the position of the hyperplane. These are the points that help us build our SVM.
Large Margin Intuition
In logistic regression, we take the output of the linear function and squash the value within the range of [0,1] using the sigmoid function. If the squashed value is greater than a threshold value(0.5) we assign it a label 1, else we assign it a label 0. In SVM, we take the output of the linear function and if that output is greater than 1, we identify it with one class and if the output is -1, we identify is with another class. Since the threshold values are changed to 1 and -1 in SVM, we obtain this reinforcement range of values([-1,1]) which acts as margin. In other words svm predict a certain class if wt+b is positive and vice versa!
Finally, the secret sauce that makes SVMs tick. This is where we need to look at a bit of math.
Let’s take stock of what we have seen so far:
- For linearly separable data SVMs work amazingly well.
- For data that’s almost linearly separable, SVMs can still be made to work pretty well by using the right value of C.
- For data that’s not linearly separable, we can project data to a space where it is perfectly/almost linearly separable, which reduces the problem to 1 or 2 and we are back in business.
It looks like a big part of what makes SVMs universally applicable is projecting it to higher dimensions. And this is where kernels come in.
First, a slight digression.
A very surprising aspect of SVMs is that in all of the mathematical machinery it uses, the exact projection, or even the number of dimensions, doesn’t show up. You could write all of it in terms of the dot products between various data points (represented as vectors). For p-dimensional vectors i and j where the first subscript on a dimension identifies the point and the second indicates the dimension number:
The dot product is defined as:
If we have n points in our dataset, the SVM needs only the dot product of each pair of points to find a classifier. Just that. This is also true when we want to project data to higher dimensions. We don’t need to provide the SVM with exact projections; we need to give it the dot product between all pairs of points in the projected space.
This is relevant because this is exactly what kernels do. A kernel, short for kernel function, takes as input two points in the original space, and directly gives us the dot product in the projected space.
Let’s revisit the projection we did before, and see if we can come up with a corresponding kernel. We will also track the number of computations we need to perform for the projection and then finding the dot products — to see how using a kernel compares.
For a point i:
Our corresponding projected point was:
To compute this projection we need to perform the following operations:
- To get the new first dimension: 1 multiplication
- Second dimension: 1 multiplication
- Third dimension: 2 multiplications
In all, 1+1+2 = 4 multiplications.
The dot product in the new dimension is:
To compute this dot product for two points i and j, we need to compute their projections first. So that’s 4+4 = 8 multiplications, and then the dot product itself requires 3 multiplications and 2 additions.
In all, that’s:
- Multiplications: 8 (for the projections) + 3 (in the dot product) = 11 multiplications
- Additions: 2 (in the dot product)
Which is total of 11 + 2 = 13 operations.
I claim this kernel function gives me the same result:
We take the dot product of the vectors in the original space first, and then square the result.
Let expand it out and check whether my claim is indeed true:
It is. How many operations does this need? Look at step (2) above. To compute the dot product in two dimensions I need 2 multiplications and 1 addition. Squaring it is another multiplication.
So, in all:
- Multiplications: 2 (for the dot product in the original space) + 1 (for squaring the result) = 3 multiplications
- Additions: 1 (for the dot product in the original space)
A total of 3 + 1 = 4 operations. This is only 31% of the operations we needed before.
It looks like it is faster to use a kernel function to compute the dot products we need. It might not seem like a big deal here: we’re looking at 4 vs 13 operations, but with input points with a lot more dimensions, and with the projected space having an even higher number of dimensions, the computational savings for a large dataset add up incredibly fast. So that’s one huge advantage of using kernels.
Cost Function and Gradient Updates
In the SVM algorithm, we are looking to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is hinge loss.
so how do we compute it in real life, Stanford university has given us a wonderful example for computing hinge loss for the scores on a given images
so how did we infer than losses is 2.9 on the first image, here what actually got calculated
for each wrong class we compute max(the difference between its score aganist the right class and we added 1 as our margin we can change it’s value as we wish)
for an implementation of svm from scratch I found this amazing notebook.
I have borrowed from these great contents: