Support Vector Machines (SVM)

Introduction

The goal of this lab work is to use classification SVMs. Let us discover the use of these SVMs from a multi-class problem : the iris dataset. The goal here is not to tune parameters, but mainly to read the docs. Note that the multi-class is obtained from several bi-class SVMs, using a one-versus-one (ovo) strategy.

Learning and testing

First, create a directory, and download the two following tool files:

svm_tools.py: Utilities for inspecting SVM data.
plot_tools.py : Utilities for plotting SVC in 2D.

Then, try to run the intro-iris-001.py script.

Read the script, well as the code of the tool function that are called.

Inside libsvm

The multi-class classification requires to compute several bi-class classifiers. Retreiving each of them, knowing, for each, which samples are the supports, which are the coefficients, is not easy. This is what we illustrate next. Let us remind that a bi-class classifier is defined as :

\[\mathrm{class}(x) = \mathrm{sign}\left(b + \sum_{i=1}^{|\mathrm{dataset}|} \alpha_i y_i.K(x_i,x)\right)\]

where many \(\alpha_i\)s are null (for non support samples).

Run and read the intro-iris-002.py script. The SVM computation is the same as previously. You may be interested in reading the ovo methods in svm_tools.py, but you can postpone it for the moment. Indeed, this is quite technical since the data is not easy to retrieve.

Display decision functions and supports

One can be interested in plotting the internal decision functions. Run and read the intro-iris-003.py script. You can also read the code of the plotting functions invoked in the script.

A simple bi-class example

Playing with blocks

Let us consider here a artificial bi-class classification example. Run the block-001.py script and read it. The support for which a slack variable is used are marked with a black cross, the others are marked in red.

Is the empirical risk a good estimator of the real risk ? Why ?

Try different values for C and observe the supports (reduce C).

Go back to a big value for C (e.g. 1000), and use very small datasets (learning_size = 10, 8, 6, 4)… For each dataset size, run the experiment several times to see how the risks vary from one run to another. What do you notice about the risks ?

Use a nu-SVC on a 200-samples dataset in in order to involve 10% of the supports. How is the real risk ? Try to increase nu (20%, 30%, 50%…).

Restart the whole process with overlapping blocks (change the comment on the make_dataset definition). With 50 samples for example, you should see how reducing C (or increasing nu) enables a better generalization.

Playing with moons

Run moon-001.py and read it.

Once you have found (visually) suitable parameters, measure the real risk:

with a cross validation
by averaging the empirical risk on another large moon sets (use svm_tools.empirical_risk).

Try with a nu-SVC. Determine the parameters by a grid-search (using cross-validation). Plot the grid (risk = f(params)).

Digit recognition

First steps

Execute and read the following files in order to handle the digits dataset.

digits-001.py: This shows how to get and display the digits.
digits-002.py: The use of a multiclass SVM for digits classification.

Warning If you run these scripts and experience SSL related issues in MacOS, you may have to run, within a terminal “/Applications/Python 3.9/Install ertificates.command” and adapt the path as needed.

What is the input space dimension ?

When you run the second script, what happens in terms of risks ? Make the link with small-sets experiments in section Playing with blocks.

Implement the recognition

The goal of this section is to improve the performances. You can try several things.

Use a RBF kernel. Try to guess the order of magnitude of a suitable \(\sigma\).
Use a nu-SVC.
Use a suitable distance on the images (e.g a blur, use opencv for this).
Use a PCA on the input set in order to reduce the dimensionality.

This code shows how to blur an image:

import cv2

digit         = inputs[0]
img           = digit.reshape((28,28))/255.0
blurred_digit = cv2.GaussianBlur(digit.reshape((28,28)), (9,9), 0).reshape(28*28)
blurred_img   = blurred_digit.reshape((28,28))/255.0

When you are done

As everything is given in this labwork, there is no need for a correction. You have to experiment, generate and observe some plots, compute some risks, and conclude.

On the blocks, the idea is to see that a linear SVM does not overfit. Indeed, we have 1000 points in \(\mathbb{R}^2\). The real risk is therefore well represented by the empirical risk on the data used to learn (zero risk in this case). With small datasets, we end up with “slanted” separators, whose real risk is not good. But we have to use very small datasets for linear SVM to be misleading.

When we move on to digits, we end up with a linear separator too (at first), in dimension \(28\times 28=784\). We could expect that with more than 784 points, the linear SVM will not be able to overfit (because of the VC dimension argument). But it overfits a little, because we have 0% empirical risk and 10% real risk. So we find ourselves in the same situation as blocks with small datasets, i.e. 1000 samples in dimension 784, it’s a bit like 3 samples in dimension 2, it’s enough to overfill a linear separator. That’s what you have to observe.

In the section about the digits, then, we try things to do the best we can, following different tracks, but for that, there is not a single solution. The goal is to make you manipulate the SVM by understanding what you are doing, on a problem that is not necessarily easily solved with a SVM.

If you understand how to use sklearn SVMs from the examples given, the “knowing how to apply a SVM in practice” side is acquired, this is the purpose of this practical work. Afterwards, based on the course, you should expect results and check that they are consistent with your expectation when you experiment. That’s what practical work is all about. That linear SVM overfit over 1000 digits surprised me, I thought 1000 was big enough as 784 so that it doesn’t happen… this is the kind of thing that the reality of the experiment questions, when you have a theoretical a priori, i.e. when you have understood the course.

Hervé Frezza-Buet,