Preprocessing and dimensionality reduction

Jeremy Fix

February 15, 2024

Slides made with slidemaker

Introduction

You need datasets

You can use open datasets

Vision datasets

Image classification: facial expression recognition

Kaggle ICML 2013

  • 7 classes (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral)
  • 28K Train, 3K public test, 3K final test
  • \(48\times48\) grayscale images
Kaggle ICML 2013

Others: CIFAR-10, CIFAR-100, MNIST, Fashion-MNIST

Object localization/detection

Historical dataset: PascalVOC2012, 20 classes, 20K Train, 20K Val, 11K Test

ImageNet, ILSVRC2014

  • 1000 classes,
  • 1.2M Train, 50K Valid, 100K Test,
  • Avg image size \(482\times415\), RGB
ILSVRC2014
  • Bounding box coordinates (e.g. top-left, width, height)
  • Object identity

Text detection/recognition

Robust reading ICDAR 2019 challenge :
- \(10{,}166\) images : \(5{,}603\) training, \(4{,}563\) testing
- diversity of shapes, orientations, languages, fonts

Sample from the ICDAR 2019 dataset

Labels :
- polygon coordinates \((x_1, y_1), (x_2, y_2), \cdots\)
- transcription
- language

See also Street View House Numbers

Object segmentation

COCO 2017:

  • 80 classes,
  • \(200\)K images,
  • \(500\)K masks

Dense labeling : image of labels

COCO 2017

Object localization/detection/segmentation

Open Images (Kuznetsova et al., 2020)

Recommendation systems

Recommendation systems

MovieLens, Netflix Prize, Anime recommendations database, ..

MovieLens 20M

  • 27K movies by 138K users
  • 5 star ratings, \(.5\) increments
  • 20M ratings
  • metadata (e.g. genre , ..)
  • links to imdb to enrich metadata

Speech and text

Automatic speech recognition (Speech to text)

Examples: TIMIT, VoxForge, Mozilla Common Voice

TIMIT (DARPA, TI, MIT)

  • 630 speakers, eight American English dialects
  • time-aligned orthographic, phonetic and word transcriptions
  • 16 KHz speech waveform file for each utterance
  • some machine learning techniques (e.g. Connectionist Temporal Classification) do not need time aligned transcriptions

Transcript : 61748 She had your dark suit in greasy wash water all year.
Words : \((7470, 11362)\) she, \((11362, 16000)\) had, \((15420, 17503)\) your, \((17503, 23360)\) dark, \((23360, 28360)\) suit, \((28360, 30960)\) in, \((30960, 36971)\) greasy, \((36971, 42290)\) wash, \((43120, 47480)\) water, \((49021, 52184)\) all, \((52184, 58840)\) year
Phonemes: \((0, 7470)\) h#, \((7470, 9840)\) sh, \((9840, 11362)\) iy, \((11362, 12908)\) hv, …

Automatic music transcription

MusicNet dataset
- 330 classical music recordings
- 1 million annotated labels (verified by experts) of precise timing of each note

Tasks : composer classification, notes transcription, instrument detection, ..

Sentiment Analysis

Example: Large Movie Review dataset (IMDB), (Maas et al., 2011)

  • \(25\)K training, \(25\)K testing, \(50\)K unlabeled
  • movie reviews (text, and more), with a rating \(\in [1, 10]\)

Example of a positive review

Example of a negative review

Automatic translation

Europarl dataset, (Koehn, 2012)

  • single language datasets (for learning language models)
  • parallel corpora (for translation), e.g. French-English (2M sentences), Czech-English (650K sentences)
One sentence aligned sample of Europarl

You need datasets

You need datasets

You may have to collect data on your own

  • crawl the web ? (e.g. the Twitter API for scraping Twitter, sample code)
  • simulate/synthesize data (e.g. for training Automatic Speech Recognition systems with synthetic Text To Speech, see (Hu et al., 2022))
  • if supervised learning : assign labels (Mechanical Turk, domain experts (e.g. classifying tumors))
  • Ensure you collected sufficient features (and also not too many, e.g. features that make the problem uselessly simple)

Preprocessing

Preprocessing data

Data are not necessarily vectorial :

  • Ordinal or categorical : poor/fair/excellent; Male/female (see the encoding sketch after this list)

    Ordinal feature value     Numerical feature value
    poor                      -1
    fair                       0
    excellent                  1

    Categorical feature value     Numerical feature value
    English                       [1, 0, 0]
    Brazilian                     [0, 1, 0]
    Chinese                       [0, 0, 1]
  • Text documents: stemming, filtering, bag of words, embeddings, ..
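As a minimal sketch (assuming a recent scikit-learn; the feature values are the ones from the tables above), ordinal and categorical features can be encoded as follows. Note that OrdinalEncoder maps to \(0, 1, 2\) rather than \(-1, 0, 1\); shift and scale if you need the latter.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal feature: the order poor < fair < excellent must be preserved
quality = np.array([["poor"], ["excellent"], ["fair"]])
ord_enc = OrdinalEncoder(categories=[["poor", "fair", "excellent"]])
print(ord_enc.fit_transform(quality))            # [[0.], [2.], [1.]]

# Categorical feature: no order, one-hot encode
language = np.array([["English"], ["Brazilian"], ["Chinese"]])
oh_enc = OneHotEncoder()
print(oh_enc.fit_transform(language).toarray())  # one 0/1 indicator column per category
```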

Even if vectorial :

  • there might be missing values
  • the features may need to be scaled (e.g. for gradient based or metric based estimators, although not for decision trees)

Your data might not be vectorial data : Text documents

Vector representation : Bag Of Words

Vectorial representation of text documents Bag Of Words

  • define a vocabulary \(\mathcal{V}\), \(|\mathcal{V}|=n\)
  • for each document, build a vector \(x\) so that \(x_i\) is the frequency of the word \(\mathcal{V}_i\)

e.g. \(\mathcal{V}=\{I, in, love, metz, machine learning, study\}\)

I love machine learning and love metz too. \(\rightarrow x=[1, 0, 2, 1, 1, 0]\)

I love studying machine learning in Metz. \(\rightarrow x=[1,1,1,1,1,1]\)

Word order is not taken into account \(\rightarrow\) use N-grams, but this leads to sparser representations.

See also Term-Frequency-Inverse-Document-Frequency (TF-IDF).
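A possible sketch with scikit-learn (the two documents are the ones above; the exact vocabulary differs slightly since the default tokenizer lowercases and drops one-letter tokens such as “I”):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["I love machine learning and love metz too",
        "I love studying machine learning in Metz"]

# Bag of words: one column per vocabulary entry, word counts as values
bow = CountVectorizer()
X = bow.fit_transform(docs)                  # sparse matrix of shape (2, |V|)
print(bow.get_feature_names_out())
print(X.toarray())

# TF-IDF with unigrams and bigrams: counts reweighted by inverse document frequency
tfidf = TfidfVectorizer(ngram_range=(1, 2))
print(tfidf.fit_transform(docs).shape)
```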

Your data might not be vectorial data : Text documents

Filtering out useless variability

But in practice, the vocabulary can be huge

Suppose a problem of classifying text documents by their topic, e.g. 20 newsgroups dataset

From target ‘rec.autos’

Vocabulary size : unigrams (\(130107\)), bigrams (\(130107^2 \approx 10^{10}\)), trigrams (😄)

\(\Rightarrow\) Removing some useless variability(HaCohen-Kerner, 2020) by:

  • uppercase/lowercase,
  • HTML tags,
  • stemming (e.g. amazing, amazed, amazingly),
  • dropping stopwords (the, a, afterwards, along, …) <- be careful with the lists provided by scikit or others,
  • correction of misspellings,
  • removing repeated letters.

Your data might not be vectorial data : Text documents

Word/Sentence embeddings

Vectorial representation of text documents Word/Sentence embeddings:
- word2vec(Mikolov, Chen, Corrado, & Dean, 2013),
- GLoVe(Pennington, Socher, & Manning, 2014),
- fasttext (Bojanowski, Grave, Joulin, & Mikolov, 2016).
- *BERT (Devlin, Chang, Lee, & Toutanova, 2018)

Continuous Bag of Words (CBOW) : predict a word given its context

Continuous bag of words
  • Input and output coded with one-hot
  • predict a word given its context
  • hidden layer : word representation

Of fixed size, and captures some semantic information.

see also : Bayesian approaches (e.g. Latent Dirichlet Allocation)

Some features might be missing

Missing features

  • Completely drop out the samples with missing attributes, or the dimensions that have missing values
  • or try to impute, i.e. set a value in place of the missing attributes

For missing value imputation, there are plenty of methods  :

  • global : assign the mean, median, most frequent value of an attribute
  • local : based on k-nearest neighbors, decide which value to impute

The bias you may introduce by imputing a value may depend on the causes of the missing values, see (Silva & Zárate, 2014).
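A minimal sketch of both strategies with scikit-learn (the toy matrix is illustrative):

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 5.0]])

# Global imputation: replace NaNs by the column mean (or median, most_frequent)
print(SimpleImputer(strategy="mean").fit_transform(X))

# Local imputation: replace NaNs by averaging the k nearest neighbours
print(KNNImputer(n_neighbors=2).fit_transform(X))
```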

Some vectorial data might not be appropriately scaled

Feature scaling

  • dimensions with the quickest variations will dominate euclidean distances (e.g. nearest neighbors),
  • when gradient descent is involved, feature scaling makes convergence faster (the level sets of the loss become more circularly symmetric)
Without scaling
With scaling
  • when regularization is involved, we would like to use a single regularization coefficient \(\lambda\), independent on the scale of the features

\[\begin{align} \min_\theta J(\theta) = \frac{1}{N} \sum_i \|y_i - \theta^T x_i\|^2 + \lambda \|\theta\|^2 \end{align}\]

Some vectorial data might not be appropriately scaled

Feature scaling Given \(x_i \in \mathbb{R}^d\), you can normalize by :

  • min/max scaling :

\[\begin{align} \forall i, \forall j \in [0,d-1],\quad x'_{i,j}= \frac{x_{i,j} - \min_k x_{k,j}}{\max_k x_{k,j} - \min_k x_{k,j}} \end{align}\]

  • z-score normalization :

\[\begin{align} \forall i, \forall j \in [0,d-1],\quad x'_{i,j}&= \frac{x_{i,j} - \mu_j}{\sigma_j}\\ \mu_j &= \frac{1}{N} \sum_k x_{k,j}\\ \sigma_j &= \sqrt{\frac{1}{N} \sum_k (x_{k,j} - \mu_j)^2} \end{align}\]

Your statistics must be computed from the training set and applied also to test data.

Note : your data may also be long tailed, multi-modal, with outliers … See for example this scikit notebook
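A sketch of the “fit on train, apply to test” rule with scikit-learn (synthetic data, illustrative parameters):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
X_test = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

scaler = StandardScaler()                  # z-score; use MinMaxScaler() for min/max scaling
X_train_s = scaler.fit_transform(X_train)  # statistics computed on the training set only
X_test_s = scaler.transform(X_test)        # ... and reused as-is on the test set
print(X_train_s.mean(axis=0), X_train_s.std(axis=0))
```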

Dimensionality reduction : The what and the why

The what and the why

What Optimally transform \(x_i \in \mathbb{R}^n\) into \(z_i \in \mathbb{R}^d\) so that \(d \ll n\)

It remains to define what “optimally transform” means

Why

  • visualization of the data
  • interpretability of the predictor
  • speed up the algorithms whose complexity depends on \(n\)
  • data may occupy a manifold of lower dimensionality than \(n\)
  • curse of dimensionality : data get quickly sparse, models may overfit

Why : Visualization

Data analysis/Visualization How are your data distributed ? How are your classes intertwined ? Do we have discriminative features ?

t-SNE MNIST (Maaten & Hinton, 2008) \(\mathbb{R}^{784}\mapsto\mathbb{R}^2\)



Visualization of t-SNE projection of CIFAR-10 through mobilenet-v2 network (dimensionality \(\mathbb{R}^{20 480}\mapsto\mathbb{R}^2\)). See here

Why : Interpretability

Example Why does this predictor say the tumor is malignant ?

Fully grown decision tree
Depth-2 decision tree

Real risk estimated by 10-fold CV : \(0.92\) in both cases.

UCI ML Breast Cancer Wisconsin (Diagnostic) dataset: scikit loader
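A sketch of the comparison above with scikit-learn (the exact cross-validation scores may differ slightly from the slide):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for depth in [None, 2]:   # fully grown tree vs depth-2 tree
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores = cross_val_score(clf, X, y, cv=10)
    print(f"max_depth={depth}: 10-fold CV accuracy = {scores.mean():.2f}")
```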

Why : Speed up

Decreasing dimensionality decreases training/inference times.
For example :

  • Linear regression \(\hat{y} = \theta^T x + b\)

  • Logistic regression (classification) : \(P(y=1/x) = \frac{1}{1 + \exp(-\theta^T x)}\)

Both training and inference in \(O(n)\), \(x\in \mathbb{R}^n\)

Why : Intrinsic dimensionality

The data may occupy a lower dimensional manifold

Swiss roll

\(\rightarrow\) you do not necessarily lose information by reducing the number of dimensions

Why : Intrinsic dimensionality

The data may occupy a lower dimensional manifold You want to classify facial expressions of a single person, controlled illumination:

  • suppose a huge image resolution, e.g. \(1024\times1024\) RGB pixels, \(x \in \mathbb{R}^{1024\times1024\times3}\)

  • what is the dimensionality of the data manifold ?

Why : curse of dimensionality

You may even have better predictors. The data become (exponentially) quickly sparse with respect to the dimension. It becomes harder to prevent overfitting.

The curse of dimensionality (Goodfellow, Bengio, & Courville, 2016)

see also (Hastie, Tibshirani, & Friedman, 2017), Link

Dimensionality reduction : the How

The How

What Optimally transform* \(x_i \in \mathbb{R}^n\) into \(z_i \in \mathbb{R}^d\) so that \(d \ll n\)
It remains to define what “optimally transform” means

transform* : not necessarily project

How

  • select a subset of the original features : feature selection
  • compute new features from the original ones : feature extraction

Dimensionality reduction : Feature selection

Feature selection

Select a subset of the original features/attributes/dimensions

Feature selection
Feature extraction : Foundations and applications (Guyon, Gunn, Nikravesh, & Zadeh, 2006). Link

Feature selection : overview

  • Filters : dimensions are selected based on a heuristic
  • Wrappers : dimensions are selected based on an estimation of the real risk
  • Embedded : The ML algorithm is designed to select a subset of the features, e.g. linear regression with L1 penalty
Three principal approaches of feature selection (Guyon et al., 2006)

Feature selection: embedded

Feature selection : embedded

Examples with 1) LASSO (least-squares regression with an L1 penalty), 2) a decision tree classifier; see the sketch below.
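A possible sketch with scikit-learn's SelectFromModel (the regularization strength alpha and the dataset are illustrative; the LASSO here simply regresses the 0/1 labels):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)      # L1 penalties are scale sensitive

# 1) LASSO: the L1 penalty drives some coefficients exactly to zero
lasso = SelectFromModel(Lasso(alpha=0.01)).fit(X, y)
print("LASSO keeps", lasso.get_support().sum(), "features")

# 2) Decision tree: keep the features with a large impurity-based importance
tree = SelectFromModel(DecisionTreeClassifier(random_state=0)).fit(X, y)
print("Tree keeps", tree.get_support().sum(), "features")
```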

Feature selection : univariate filters

Feature selection : univariate filters


Univariate filters rank every feature independently of the others, based on its relationship with the target.


Some feature importance evaluations for feature/target types:

                            Target: Categorical           Target: Numerical
  Feature: Categorical      Chi-2, Mutual information     ANOVA
  Feature: Numerical        ANOVA                         Correlation (Pearson, Spearman)

Univariate filters are handy in large feature spaces where we cannot afford more elaborate multivariate filters/wrappers.

Univariate categorical -> categorical : chi-2 (1/3)

Statistical quantification (we will rank all the features) of the independence between the feature and the target : \(\chi^2\) independence test.

Assume one categorical feature (\(3\) values) and a categorical target (\(4\) values) with the following contingency table in the training set, with \(p_{ij} = P(feature=f_i, target=t_j)\):

Feature \ Target    \(t_0\)    \(t_1\)    \(t_2\)    \(t_3\)
\(f_0\)
\(f_1\)
\(f_2\)



Independence \(\iff P(F,T) = P(F) P(T)\), with \(P(F) = \sum_j P(F/T=t_j) P(T=t_j)\) and \(P(T) = \sum_i P(T/F=f_i)P(F=f_i)\)

We perform a hypothesis test :
- Null hypothesis \(H_0\) : the feature and the target are independent
- hypothesis \(H_1\): \(H_0\) is wrong.

Univariate categorical -> categorical : chi-2 (2/3)

If \(H_0\) is true, i.e. the feature and the target are independent, then the joint probability is the product of the marginals. How well do the observations confirm this ?

\[ \chi^2 = \sum_{i,j} \frac{(N \hat{p}_{i,j} - N p_{i,j})^2}{N p_{i,j}} \] with:
- N the total number of samples
- \(\hat{p}_{i,j}\) the observed frequency of the case \(F=f_i, T=t_j\)
- \(p_{i,j}\) the expected frequency of the case \(F=f_i, T=t_j\), were \(H_0\) true

The expected frequency, under \(H_0\), is given by :

\[ p_{i,j} = P(F=f_i) P(T=t_j) = \left[ \frac{\sum_k \#[F=f_i, T=t_k]}{N} \right]\left[ \frac{\sum_k \#[F=f_k, T=t_j]}{N} \right] \]

The observed frequency, is given by :

\[ \hat{p}_{i,j} = \frac{\#[F=f_i, T=t_j]}{N} \]

For filtering, we select the \(n_f\) features with the highest \(\chi^2\).
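A sketch of a \(\chi^2\) filter with scikit-learn, using the 20 newsgroups corpus mentioned earlier as a stand-in (it ships with a loader, which downloads the data on first use); binary n-gram presence features as above, and k is illustrative:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

data = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X = CountVectorizer(binary=True, ngram_range=(1, 2)).fit_transform(data.data)

# Keep the n_f n-grams with the highest chi2 statistic w.r.t. the topic;
# swap chi2 for mutual_info_classif or f_classif for the other univariate filters
selector = SelectKBest(chi2, k=1000).fit(X, data.target)
print(X.shape, "->", selector.transform(X).shape)
```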

Univariate categorical -> categorical : chi-2 (3/3)

Example: sentiment analysis on Stanford 50K reviews.

Preprocessing: removing HTML tags, linebreaks, links, punctuation, repeated letters, words containing numbers, ..

Features : n-grams presence/absence (\(432746\) (1,2)-grams)

\(H_0\) : the presence or absence of an n-gram is independent of the sentiment

Word \(\#(f=0,t=0)\) \(\#(f=0,t=1)\) \(\#(f=1, t=0)\) \(\#(f=1,t=1)\) \(\chi^2\)
bad 8094 11002 4406 1498 1875.17
the worst 10836 12336 1664 164 1327.95
the best 11588 10485 912 2015 470.77
recently still 12499 12499 1 1 0.0



Univariate categorical -> categorical : Mutual information

Measure the mutual information between one categorical feature and a categorical target.

Mutual information is:
- a measure of statistical dependence between two random variables
- defined as the Kullback-Leibler divergence between the joint distribution and the product of the marginals
- null iff \(X\) and \(Y\) are independent
- \(I(X,Y) \geq 0\) and upper bounded

\[ I(X, Y) = \sum P(X, Y) \log\left(\frac{P(X,Y)}{P(X)P(Y)}\right) \]

Word \(\#(f=0,t=0)\) \(\#(f=0,t=1)\) \(\#(f=1, t=0)\) \(\#(f=1,t=1)\) \(\chi^2\) \(I(X,Y)\)
bad 8094 11002 4406 1498 1875.17 0.039
worst 10458 12273 2042 227 1596.76 0.036
the worst 10836 12336 1664 164 1327.95 0.031
waste 11292 12404 1208 96 1000.45 0.023
awful 11208 12347 1292 153 952.88 0.022
great 10410 8276 2090 4224 964.96 0.020
excellent 12150 11070 350 1430 705.51 0.015

For filtering, we select the \(n_f\) features with the highest \(I(X,Y)\). Strongly correlated but not identical to the \(\chi^2\) statistics.

Feature selection : Univariate continuous -> categorical (1/2)

Analysis of VARiance (ANOVA) F-Test :
- Are the means of [X / y] significantly different between the \(k\) populations ?
- compute and rank the features w.r.t. the F-values

Hypothesis \(H_0\) : the means \(\mu_1 = E[X / y=1]\), \(\mu_2 = E[X / y=2]\), .. \(\mu_k = E[X / y=k]\) are all equal.

Reasonably valid \(H_0\) (Wikipedia source)
Reasonably invalid \(H_0\) (Wikipedia source)

Notes: it is better to have similar group sizes and normally distributed values.

The statistic of interest \(F\) is : \[ F = \frac{\frac{1}{k-1}\sum_{j=1}^{k} n_j (\mu_j - \mu)^2}{\frac{1}{N-k} \sum_{j=1}^{k} \sum_{i \in \mbox{group } j} (x_{i} - \mu_j)^2} = \frac{\mbox{between group variability}}{\mbox{within group variability}} \] It follows a Fisher distribution (under the conditions of the test). A high \(F\) invalidates \(H_0\).

For filtering, we select the features with the highest \(F\).

Feature selection : Univariate continuous -> categorical (2/2)

Feature selection : Univariate continuous -> continuous (1/2)

Rank the features w.r.t. their correlation with the target; e.g. Pearson correlation between \(x\in \mathbb{R}, y\in \mathbb{R}\): \[ \begin{align} r(x, y) &= \frac{\frac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y}\\ \sigma_x &= \sqrt{\frac{1}{n} \sum_i (x_i - \bar{x})^2}\\ \sigma_y &= \sqrt{\frac{1}{n} \sum_i (y_i - \bar{y})^2} \end{align} \]

For filtering, we select the features with the highest correlation magnitude \(|r|\).
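A small numpy sketch of the ranking (synthetic data where only features 0 and 2 drive the target):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                           # 5 continuous features
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=200)    # target uses features 0 and 2

# Pearson correlation of every feature with the target, ranked by |r|
r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
print(np.round(r, 2), np.argsort(-np.abs(r)))
```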

Feature selection : Univariate continuous -> continuous (2/2)

Feature selection : Beyond univariate filters

Univariate filtering is computationally efficient but not always optimal.

See (Guyon et al., 2006), Introduction. Link

Feature selection : multivariate filters and wrappers

Feature selection: multivariate filters and wrappers

Problem: Finding the best subset of features (John, Kohavi, & Pfleger, 1994)

Denote \(\chi\) a subset of the dimensions/attributes/features :
- suppose we are provided a measure of how good this subset is \(J(\chi)\)
- we optimize \(J(\chi)\) over the possible subsets : \(\chi^\star = \mathop{\mathrm{argmax}}_{\chi \in \chi_d} J(\chi)\)

The tree of possible subsets

If \(x \in \mathbb{R}^n\), we have \(2^n\) possible subsets :
\[ \chi_d = \{\emptyset, \{x_1\}, \{x_2\}, \cdots, \{x_1, x_2\}, \cdots\} \]

Feature selection: Multivariate filters

In multivariate filters, a feature subset is evaluated by a heuristic, e.g. statistics between the features, and between the features and the target (see (Duch, 2006),(Hall & Smith, 1997)).

e.g. Correlation based feature subset evaluation : select the features that are correlated with the target, yet not correlated with each other

\[ \begin{align} J(\chi) &= \frac{k r_{ky}}{\sqrt{k + k (k-1) r_{kk}}}\\ \mbox{Correlation : } r(w, z) &= \frac{\sum (w_i - \bar{w})(z_i - \bar{z}) }{\sqrt{\sum_i(w_i - \bar{w})^2\sum_j(z_j - \bar{z})^2}}\\ \mbox{Average pairwise correlation : } r_{kk}&= \frac{2}{k(k-1)} \sum_{j_1 < j_2} r(x_{j_1}, x_{j_2})\\ \mbox{Average feature-target correlation : } r_{ky} &= \frac{1}{k} \sum_j r(x_j, y) \\ \end{align} \] with \(k = |\chi|\) the size of the subset
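A small numpy sketch of this merit \(J(\chi)\) (absolute correlations are used, as is usual for this heuristic; the data are synthetic, with feature 3 redundant with feature 0):

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Correlation-based feature subset evaluation J(chi) as defined above."""
    k = len(subset)
    r_ky = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    pairs = [abs(np.corrcoef(X[:, j1], X[:, j2])[0, 1])
             for i, j1 in enumerate(subset) for j2 in subset[i + 1:]]
    r_kk = np.mean(pairs) if pairs else 0.0
    return k * r_ky / np.sqrt(k + k * (k - 1) * r_kk)

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=300)    # feature 3 ~ feature 0 (redundant)
y = X[:, 0] + X[:, 1]
# the non-redundant subset {0, 1} scores higher than the redundant one {0, 3}
print(cfs_merit(X, y, [0, 1]), cfs_merit(X, y, [0, 3]))
```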

Feature selection: multivariate wrappers

In multivariate wrappers, a feature subset is ranked by the real risk of a trained estimator, estimated e.g. by cross-validation.

Example (Somol, Novovicova, & Pudil, 2010) : Ionosphere data
- 351 samples, 34 real valued features, 2 classes (225/126),
- \(80\%\) Train, \(20\%\) Test split.
- \(10\)-fold cross validation
- wrapper : \(3\)-Nearest neighbor
- Sequential Forward Search / Sequential Forward Floating Search / Best Individual Feature / Oscillating Search

Multivariate FS performances
Perf. on independent test data

Careful: cross validation performances […] Used with computationally intensive search algorithms, these estimates may overfit and yield biased predictions (Reunanen, 2003). Need independent test data. Said differently, model selection is part of the algorithm to be cross validated.

Feature selection : multivariate wrappers in practice
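In practice, a wrapper such as Sequential Forward Search is available in scikit-learn as SequentialFeatureSelector. Below is a sketch wrapped around a 3-NN classifier as in the example above (the dataset, here Breast Cancer instead of Ionosphere, and the number of selected features are illustrative); the selection is cross-validated together with the final classifier, as advised above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Sequential Forward Search wrapped around a 3-NN classifier, scored by 10-fold CV
knn = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(knn, n_features_to_select=5, direction="forward", cv=10)
pipe = make_pipeline(StandardScaler(), sfs, knn)

# Outer CV: the whole pipeline (scaling + selection + classifier) is cross validated
print(cross_val_score(pipe, X, y, cv=5).mean())
```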

Feature selection : Exploiting the structure of the optimization problem (1/2)

Principle: the features are ranked by analyzing, locally, the sensitivity of the loss with respect to the features.
Algorithm (Recursive Feature Elimination, RFE):
- fit a predictor
- locally analyze the importance/relevance of the features
- discard the less relevant features
- loop

Example: Optimal Brain Damage (OBD, (LeCun, Denker, & Solla, 1990)):
Denote \(J : \mathbb{R}^n \mapsto \mathbb{R}\) a loss to be optimized, under the “Diagonal”, “Extremal”, “Quadratic” assumptions : \[ \begin{align} J(w) &\approx J(w_0) + (w - w_0)^T\frac{d J}{dw}(w_0) + \frac{1}{2}(w-w_0)^T H (w-w_0) + ... \\ \Delta J_i &\approx \frac{1}{2} H_{i,i} (\Delta w)_i^2 \end{align} \] For feature selection, we look for the dimensions that can be canceled out (so \(\Delta w_i = (w)_i\)) while affecting the loss the least.

How to compute \(H_{ii}\) ?

  • For multilayer neural networks, see (LeCun et al., 1990) for \(O(N)\) computation of \(H_{i,i}\),
  • For linear regression, \(J(w) = \|y - w^T x\|^2\), \(h_{ii} = \sum_j x_{ji}^2\)
  • For logistic regression, \(J(w) = CE(y, \sigma(w^Tx))\), \(h_{ii} = \sigma(w^T x) (1 - \sigma(w^Tx)) \sum_j x_{ji}^2\)

For the last 2 cases, if your input data are normalized (centered, reduced), \(h_{ii}\) does not depend on \(i\) and a criterion to select a feature is based on \(\Delta w_i^2 = w_i^2\) only : remove the feature \(i\) with the smallest \(|w_i|\)

Feature selection : Exploiting the structure of the optimization problem (2/2)

Principle: the features are ranked by analyzing, locally, the sensitivity of the loss with respect to the features.
Algorithm (Recursive Feature Elimination, RFE):
- fit a predictor
- locally analyze the importance/relevance of the features
- discard the less relevant features
- loop

Example: RFE-SVM (Guyon, Weston, Barnhill, & Vapnik, 2002). With a linear SVM, we maximize the margin, i.e. minimize \(J(w) = \frac{1}{2}\|w\|^2\) under constraints.

  • Fit a linear SVM \(w^T x + b\)
  • Remove the features that change the margin the least, i.e. those with the smallest \(|w_i|\)
  • Loop

You can drop more than one feature at a time. See also (Guyon et al., 2006), p145. See (Guyon et al., 2002) for the generalization to non-linear SVM.
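A sketch of RFE around a linear SVM with scikit-learn (dataset and number of kept features are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Fit a linear SVM, drop the feature(s) with the smallest |w_i|, and repeat
svm = LinearSVC(C=1.0, dual=False)
rfe = RFE(svm, n_features_to_select=5, step=1).fit(X, y)
print("Selected feature indices:", [i for i, kept in enumerate(rfe.support_) if kept])
```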

Dimensionality reduction : Feature extraction

Feature extraction

Given \(N\) samples \(x_i \in \mathbb{R}^d\), we compute \(r \ll d\) new features from the original \(d\) features.

Some feature extraction techniques
Non linear dimensionality reduction (Lee & Verleysen, 2010)

Principal component analysis

Problem statement (Pearson, 1901)

Objective Find an affine transformation of the data minimizing the reconstruction error.

Intuition and formalisation

PCA with \(d=2\), \(r=1\)

In 1D, we seek a line \((w_0, w_1)\) minimizing the sum of the squared length of the red segments.

Note: It is not unique !

Formalization

Statement : Find an affine transformation of the data minimizing the reconstruction error

Formally :

\[\begin{align} \label{pca:problem}\min_{\{w_0, w_1, ..w_r\} \in \mathbb{R}^d} \sum_{i=0}^{N-1} \left|x_i - (w_0 + \sum_{j=1}^r (w_j^T (x_i-w_0))w_j )\right|_2^2 \end{align}\]

subject to \(w_i^T w_j = \delta_{i,j}\)



Matrix form :

Denote \(\mathbf{W} = (w_1 | \dots | w_r) \in \mathcal{M}_{d \times r}(\mathbb{R})\)

\[\begin{align} \min_{\{w_0, w_1, ..w_r\} \in \mathbb{R}^d} \sum_{i=0}^{N-1} \left| (\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)(x_i - w_0)\right|_2^2 \end{align}\]

subject to \(\mathbf{W}^T \mathbf{W} = \mathbf{I}_r\)

Principal component analysis : derivation

Formalization: simplifications

Let us simplify the equation :

  • If \(\mathbf{M}\) is idempotent, so is \((\mathbf{I} - \mathbf{M})\)
  • \((\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)\) is symmetric and idempotent



\[\begin{align} \min_{\{w_0, w_1, ..w_r\} \in \mathbb{R}^d} \sum_{i=0}^{N-1} (x_i - w_0)^T (\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)(x_i - w_0) \end{align}\] subject to \(\mathbf{W}^T \mathbf{W} = \mathbf{I}_r\)

Finding \(w_0\)

Remember : For \(u:\mathbb{R}^n\mapsto\mathbb{R}^m\), \(v:\mathbb{R}^n\mapsto\mathbb{R}^m\), \(A\in\mathcal{M}_{m,m}(\mathbb{R})\): \[\frac{d u^T A v}{dx} = \frac{du}{dx}A v + \frac{dv}{dx}A^T u\]

Finding \(w_0\) :

\[\begin{align} J &= \sum_{i=0}^{N-1} (x_i - w_0)^T (\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)(x_i - w_0)\\ \frac{\partial J}{\partial w_0} &= -2 (\mathbf{I}_d - \mathbf{W} \mathbf{W}^T) \sum_{i=0}^{N-1}(x_i - w_0) \end{align}\]

Finding \(w_0\) cont.

Finding \(w_0\)

\[\begin{align} J &= \sum_{i=0}^{N-1} (x_i - w_0)^T (\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)(x_i - w_0)\\ \frac{\partial J}{\partial w_0} &= -2 (\mathbf{I}_d - \mathbf{W} \mathbf{W}^T) \sum_{i=0}^{N-1}(x_i - w_0)\\ \frac{\partial J}{\partial w_0} = 0 &\Leftrightarrow \exists h \in \mathop{\mathrm{span}}\{w_1, ..., w_r\}, w_0 = h + \frac{1}{N}\sum_i x_i \end{align}\]

\((\mathbf{I}_d - \mathbf{W} \mathbf{W}^T) h\) is the residual vector by the orthogonal projection on the column vectors of \(\mathbf{W}\).

Finding \(w_0\) : conclusion

Finding \(w_0\)

\[\begin{align} J &= \sum_{i=0}^{N-1} (x_i - w_0)^T (\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)(x_i - w_0)\\ \mathop{\mathrm{argmin}}_{w_0} J &\Rightarrow w_0 = h + \frac{1}{N}\sum_i x_i, h \in \mathop{\mathrm{span}}\{w_1, ..., w_r\} \end{align}\]

For example, \(h=0\) is one solution.

The offset \(w_0\) is the mean of the data points, up to a translation in the space spanned by the principal component vectors.

Step 1: Center the data

Denote:

\[\begin{align} \widetilde{x}_i &= x_i - \bar{x}\\ \bar{x} &= \frac{1}{N}\sum_i x_i\\ \widetilde{\mathbf{X}} &= (\widetilde{x}_0|\dots|\widetilde{x}_{N-1})\\ J &= \sum_{i=0}^{N-1} \widetilde{x}_i^T(\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)\widetilde{x}_i\\ \end{align}\]

Deriving the principal component vectors

We will derive the principal vectors with a (later justified) greedy approach.

\[\begin{align} J &= \sum_{i=0}^{N-1} \widetilde{x}_i^T(\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)\widetilde{x}_i\label{eq:loss_pv} \\ \mathop{\mathrm{argmin}}_{w_1,..w_r} J &= \mathop{\mathrm{argmax}}_{w_1,..w_r} \sum_{i=0}^{N-1} \widetilde{x}_i^T\mathbf{W} \mathbf{W}^T\widetilde{x}_i, \mbox{ s.t. } \mathbf{W}^T\mathbf{W} = \mathbf{I}_r\\ \label{eq:pca_maxvar} &= \mathop{\mathrm{argmax}}_{w_1,..w_r} \sum_{i=1}^{r} w_i^T \mathbf{\widetilde{X}} \mathbf{\widetilde{X}}^Tw_i, \mbox{ s.t. } \mathbf{W}^T\mathbf{W} = \mathbf{I}_r\\ \end{align}\]

Introduce the Lagrangian for this constrained optimization problem.

\[ \mathcal{L}(w_1, \dots, w_r, \lambda_1, \dots, \lambda_r) = \sum_{i=1}^r \left( w_i^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_i + \lambda_i (1 - w_i^T w_i) \right) \]

Deriving the first component vector \(w_1\)

Let us operate in a greedy approach, first looking for \(w_1\) only :

\[ \mathcal{L}(w_1, \lambda_1) = w_1^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_1 + \lambda_1 (1 - w_1^T w_1) \]

Then

\[ \frac{\partial\mathcal{L}}{\partial w_1} = 0 \Rightarrow \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_1 = \lambda_1 w_1 \]

Therefore, \(w_1\) is an eigenvector of \(\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T\), but for which eigenvalue \(\lambda_1\) ? \[ w_1^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_1 = \lambda_1 \]

\(\lambda_1\) is the largest eigenvalue of \(\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T\)

First principal component vector The first principal component vector is a normalized eigenvector associated with the largest eigenvalue of the (unnormalized) sample covariance matrix \(\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T\)

Deriving the second component vector \(w_2\)

Suppose \(w_1\) is a normalized eigenvector of \(\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T\) associated with its largest eigenvalue. Denote \(\lambda_1 \geq \lambda_2 \geq \dots \lambda_d \geq 0\) the eigenvalues.

We want to optimize :

\[\begin{align} \mathop{\mathrm{argmax}}_{w_2} w_1^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_1 + w_2^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_2 &= \mathop{\mathrm{argmax}}_{w_2} \lambda_1 + w_2^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_2 \\ &= \mathop{\mathrm{argmax}}_{w_2} w_2^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_2, \mbox{ s.t. } w_i^T w_j = \delta_{i,j}\\ &=\mathop{\mathrm{argmax}}_{w_2} w_2^T (\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T - \lambda_1 w_1 w_1^T)w_2, \mbox{ s.t. } w_i^T w_j = \delta_{i,j} \end{align}\]

Therefore,

\(w_2\) is a normalized eigenvector associated with the largest eigenvalue of \((\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T-\lambda_1 w_1w_1^T)\), i.e. \(\lambda_2\)

And so on. But is the greedy algorithm finding the optimum ?

Theorem For any symmetric positive semi-definite matrix \(\mathbf{M}\in\mathcal{M}_{d\times d}(\mathbb{R})\), denote \(\{\lambda_i\}_{i=1..d}\) its eigenvalues with \(\lambda_1 \geq \lambda_2\cdots\geq \lambda_d\geq 0\). For any set of \(r\in[|1,d|]\) orthogonal unit vectors, \(\{v_1,\dots,v_r\}\), we have : \[\sum_{j=1}^r v_j^T \mathbf{M} v_j \leq \sum_{j=1}^r \lambda_j\] And this upper bound is reached by eigenvectors associated with the largest eigenvalues of \(\mathbf{M}\)

Principal component analysis: algorithm and examples

The algorithm

Given \(\{x_0, \dots, x_{N-1}\} \in \mathbb{R}^d\), to compute the r principal component vectors :

  1. Center your data \(\widetilde{x}_i = x_i - \bar{x}\)
  2. Build the matrix \(\widetilde{\mathbf{X}} = [\widetilde{x}_0| \dots| \widetilde{x}_{N-1}]\)
  3. Compute \(r\) normalized eigenvectors associated with the \(r\) largest eigenvalues of \(\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T\)

See a simple (naive) implementation at https://github.com/rougier/ML-Recipes
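A naive numpy sketch of this algorithm (one sample per row here, whereas the slides stack one sample per column; the toy data reuse the Gaussian of the later toy example):

```python
import numpy as np

def pca(X, r):
    """Naive PCA. X has one sample per row, shape (N, d). Returns (W, x_bar)."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                            # 1. center the data
    C = Xc.T @ Xc                             # 2. (unnormalized) sample covariance, d x d
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh: symmetric matrix, ascending eigenvalues
    W = eigvecs[:, ::-1][:, :r]               # 3. eigenvectors of the r largest eigenvalues
    return W, x_bar

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.5, 1.0], [[0.43, 0.36], [0.36, 0.76]], size=500)
W, x_bar = pca(X, r=2)
Z = (X - x_bar) @ W                           # principal components of the training samples
print(W)
print(Z.var(axis=0))                          # variances along the principal axes
```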

PCA is a projection method

Given \(x \in \mathbb{R}^d\), its principal components are its coordinates in the selected eigenspace :

\[\begin{align*} \mathbb{R}^d & \mapsto \mathbb{R}^r\\ x&\rightarrow ((x-\bar{x})^T w_1, (x-\bar{x})^T w_2, \dots, (x-\bar{x})^T w_r) \end{align*}\]

The whole space \(\mathbb{R}^d\) is projected, not just the training samples \(x_i\).

If \(x \in \{x_0, \dots, x_{N-1}\}\), it is better to use the SVD, which directly gives you the principal components in addition to the principal component vectors.

Some dimensionality reduction methods (e.g. t-SNE) do not project the whole original space \(\mathbb{R}^d\) but only assign a representative in \(\mathbb{R}^r\) to some key points and hence cannot be called “projection methods”.

Quantifying the quality of the PCA : reconstruction error

What is the reconstruction error if we keep \(r\) components ?

From equation \(\eqref{eq:loss_pv}\), the square of the reconstruction error is :
\[\begin{align*} J &= \sum_{i=0}^{N-1} \| x_i - \hat{x_i}\|_2^2 = \sum_{i=0}^{N-1} \widetilde{x}_i^T(\mathbf{I}_d - \mathbf{W_r} \mathbf{W_r}^T)\widetilde{x}_i = \sum_{i=0}^{N-1} \widetilde{x}_i^T\widetilde{x}_i - \sum_{i=1}^{r} w_i^T \mathbf{\widetilde{X}} \mathbf{\widetilde{X}}^Tw_i \\ & = \mathop{Tr}(\mathbf{\widetilde{X}}^T\mathbf{\widetilde{X}}) - \sum_{i=1}^{r} \lambda_i = \sum_{i=1}^{d} \lambda_i - \sum_{i=1}^{r} \lambda_i = \sum_{i=1}^{d} \lambda_i ( 1 - \frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{d} \lambda_i}) \\ \end{align*}\]

The two matrices \(\mathbf{\widetilde{X}}^T\mathbf{\widetilde{X}}\) and \(\mathbf{\widetilde{X}}\mathbf{\widetilde{X}^T}\) have the same eigenvalues (see the section on the relationship between the Gram and sample Covariance matrices).

Quantifying the quality of the PCA : variance fraction

Denote \(z_i = \mathbf{W_r}^T \widetilde{x}_i\). We have \(\bar{z} = 0\) and

\[ \begin{align} \mathbf{\Sigma}_z &= \frac{1}{N-1} \sum_i z_i z_i^T = \frac{1}{N-1} \sum_i \mathbf{W_r}^T \widetilde{x}_i \widetilde{x_i}^T \mathbf{W_r}\\ Tr(\mathbf{\Sigma}_z) &= \sum_{j=1}^r (\mathbf{\Sigma}_z)_{j,j} = \frac{1}{N-1} \sum_j \sum_i (w_j^T \widetilde{x}_i)(\widetilde{x}_i^T w_j) = \frac{1}{N-1} \sum_{j=1}^r w_j^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_j = \frac{1}{N-1}\sum_{j=1}^r \lambda_j \end{align} \]

This is exactly what we maximized in equation \(\eqref{eq:pca_maxvar}\). PCA is also the solution to “finding an affine transformation which maximizes the variance of the projections”.

And :

\[ Tr(\mathbf{\Sigma}_x) = Tr(\frac{1}{N-1} \sum_i \widetilde{x}_i \widetilde{x}_i^T) = \frac{1}{N-1} Tr(\mathbf{P} \mathbf{\Lambda} \mathbf{P}^{-1}) = \frac{1}{N-1}Tr(\mathbf{\Lambda}) = \frac{1}{N-1}\sum_{j=1}^{d} \lambda_j \] Note: for any matrix \(\mathbf{M}\) and orthogonal matrix \(\mathbf{P}\), \(Tr(\mathbf{P}^{-1} \mathbf{M} \mathbf{P}) = Tr(\mathbf{P} \mathbf{P}^{-1} \mathbf{M})= Tr(\mathbf{M})\)

Therefore, the quality of the PCA can be evaluated in terms of the fraction of variance that we keep : \[ \frac{\sum_{j=1}^{r}\lambda_j}{\sum_{j=1}^{d} \lambda_j} \]

Toy example

Let us sample \(N=500\) samples from a normal distribution \(\mathcal{N}(\mu=\begin{pmatrix}0.5 \\ 1.0 \end{pmatrix}, \Sigma=\begin{pmatrix} 0.43 & 0.36 \\ 0.36 & 0.76 \end{pmatrix})\)

We perform a PCA from \(d=2\) to \(r=2\). Then \(w_0 = \begin{pmatrix} 0.49 \\ 1.0 \end{pmatrix}\), \(w_1 = \begin{pmatrix} -0.83 \\ 0.54\end{pmatrix}\), \(w_2 = \begin{pmatrix} 0.54 \\ 0.83\end{pmatrix}\)

Toy example: 1st principal axis (red) and 2nd principal axis (green)
Toy example: Transformed

Fraction of the variance over the principal axes : \(\lambda_1 = 0.95 (\lambda_1 + \lambda_2)\), \(\lambda_2 = 0.045 (\lambda_1 + \lambda_2)\)

Example from https://github.com/rougier/ML-Recipes

Images example

\(N=4000\) grayscale \(28\times28\) images of handwritten digits, \(d=784\).

We perform a PCA from \(d=784\) to \(r=2\),

Handwritten digits. Top: the projection of the data with the two first principal components. Bottom: The ten first principal vectors. The labels are just used for display, not for computing the unsupervised PCA!

Note \(\frac{\sum_{i=1}^{10} \lambda_i}{\sum_{i=1}^{784} \lambda_i} = 47.67\%\) of the variance is kept with the first \(10\) components.

Example from https://github.com/rougier/ML-Recipes

Principal component analysis : further comments

PCA can be computed with the Singular Value Decomposition

Theorem For any matrix \(\mathbf{M} \in \mathcal{M}_{d, N}(\mathbb{R})\), there exists an orthogonal matrix \(\mathbf{U} \in \mathcal{M}_{d,d}(\mathbb{R})\), a diagonal matrix \(\mathbf{D} \in \mathcal{M}_{d, N}(\mathbb{R})\) and an orthogonal matrix \(\mathbf{V} \in \mathcal{M}_{N,N}(\mathbb{R})\) such that

\[ \mathbf{M} = \mathbf{U} \mathbf{D} \mathbf{V}^T \]

Note: orthogonal matrices \(\equiv \mathbf{U}^T = \mathbf{U}^{-1}\).

For the PCA, given \(\mathbf{\widetilde{X}} = \mathbf{U} \mathbf{D} \mathbf{V}^T\), we have the diagonalization of \(\mathbf{\widetilde{X}} \mathbf{\widetilde{X}}^T\) :

\[ \mathbf{\widetilde{X}} \mathbf{\widetilde{X}}^T = \mathbf{U} \mathbf{D} \mathbf{D}^T \mathbf{U}^{-1} \]

The principal component vectors are the column vectors of \(\mathbf{U}\) and the principal components of the training set are the first \(r\) rows of :

\[ \mathbf{U}^T \mathbf{\widetilde{X}} = \mathbf{D} \mathbf{V}^T \]
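A quick numpy check of this equivalence (random data, one centered sample per column as in the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Xc = (X - X.mean(axis=0)).T                 # centered samples, one per column: (d, N)

U, D, Vt = np.linalg.svd(Xc, full_matrices=False)
r = 2
W = U[:, :r]                                # principal component vectors
Z = (np.diag(D) @ Vt)[:r, :]                # principal components of the training set (r x N)
print(np.allclose(Z, W.T @ Xc))             # identical to projecting explicitly
```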

Toward PCA extensions

Relationship between the sample covariance and Gram matrices (1/2)

Suppose your data are centered : \(\sum_i x_i = 0\).

The sample covariance matrix is : \[ \mathbf{\Sigma}_x = \frac{1}{N-1}\sum_i x_i x_i^T = \frac{1}{N-1}\mathbf{X}\mathbf{X}^T \] The Gram matrix is (the matrix of the dot-products):

\[\mathbf{G} = \mathbf{X}^T \mathbf{X}= \begin{bmatrix} x_0^T x_0 & x_0^T x_1 & \cdots & x_0^T x_{N-1} \\ \vdots & \vdots & \vdots & \vdots \\ x_{N-1}^T x_0 & x_{N-1}^T x_1 & \cdots & x_{N-1}^T x_{N-1} \end{bmatrix}\]

It turns out that the eigenvectors and eigenvalues of \(\mathbf{G}\) and \(\mathbf{\Sigma}\) are strongly related !

Relationship between the sample covariance and Gram matrices (2/2)

Lemma:\(\forall \mathbf{A} \in \mathbb{R}^{n \times m}, ker(\mathbf{A}) = ker(\mathbf{A}^T \mathbf{A})\)

Theorem (Rank-nullity) \(\forall \mathbf{A} \in \mathbb{R}^{n \times m}, rk({\mathbf{A}}) + dim(ker(\mathbf{A})) = m\).

Theorem \(\forall \mathbf{A} \in \mathbb{R}^{n \times m}, rk(\mathbf{A}^T\mathbf{A}) = rk(\mathbf{A}\mathbf{A}^T) \leq \mathrm{min}(n,m)\)

Lemma (Eigenvalues of the sample covariance and Gram matrices) The non-zero eigenvalues of the scaled covariance matrix \((N-1) \mathbf{\Sigma} = \mathbf{X} \mathbf{X}^T\) and Gram matrix \(\mathbf{G} = \mathbf{X}^T \mathbf{X}\) are the same : \[ \{\lambda \in \mathbb{R}^*, \exists v\neq 0, (N-1) \mathbf{\Sigma} v = \lambda v\} = \{ \lambda \in \mathbb{R}^*, \exists v\neq 0, \mathbf{G} v = \lambda v \} \] And, during the proof we show some relationships :

  • If \((\lambda, v)\) eigen of \(\mathbf{X}\mathbf{X}^T\), then \((\lambda, \mathbf{X}^Tv)\) eigen of \(\mathbf{X}^T\mathbf{X}\),
  • If \((\lambda, w)\) eigen of \(\mathbf{X}^T\mathbf{X}\), then \((\lambda, \mathbf{X}w)\) eigen of \(\mathbf{X}\mathbf{X}^T\),

We will see two applications of this property :

  • the eigenface algorithm, used when \(N \ll d\)
  • the non-linear PCA called Kernel PCA (Scholkopf, Smola, & Müller, 1999)

Principal component analysis in large dimensional spaces

What to do when \(N \ll d\)

If \(N\ll d\), it is much more efficient to “diagonalize” \(\mathbf{G} \in \mathbb{R}^{N \times N}\) than \(\mathbf{\Sigma} \in \mathbb{R}^{d \times d}\). The eigenface algorithm is the PCA done by “diagonalizing” the Gram matrix :

  1. Center your data \(\widetilde{x}_i = x_i - \bar{x}\)
  2. Build the matrix \(\widetilde{\mathbf{X}} = [\widetilde{x}_0| \dots| \widetilde{x}_{N-1}]\)
  3. Compute \(r\) normalized eigenvectors \(w_j\) associated with the \(r\) largest eigenvalues of \(\mathbf{G} = \widetilde{\mathbf{X}}^T \widetilde{\mathbf{X}}\)
  4. Project the data on the \(r\) normalized eigenvectors of \(\mathbf{\Sigma}\) given by : \[ \frac{\widetilde{\mathbf{X}} w_j}{|\widetilde{\mathbf{X}} w_j|} = \frac{1}{\sqrt{\lambda_j}}\widetilde{\mathbf{X}} w_j \]

See a simple (naive) implementation at https://github.com/rougier/ML-Recipes
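A naive numpy sketch of this Gram-matrix (eigenface) variant, on random data with \(d \gg N\) just to illustrate the shapes:

```python
import numpy as np

def pca_gram(X, r):
    """PCA via the Gram matrix, useful when N << d. X: one sample per column (d, N)."""
    x_bar = X.mean(axis=1, keepdims=True)
    Xc = X - x_bar
    G = Xc.T @ Xc                               # Gram matrix, N x N (instead of d x d)
    eigvals, eigvecs = np.linalg.eigh(G)
    idx = np.argsort(eigvals)[::-1][:r]
    lam, w = eigvals[idx], eigvecs[:, idx]
    U = (Xc @ w) / np.sqrt(lam)                 # normalized eigenvectors of Xc Xc^T
    return U, x_bar

rng = np.random.default_rng(0)
X = rng.normal(size=(10304, 40))                # d = 10304 "pixels", N = 40 "faces"
U, x_bar = pca_gram(X, r=10)
Z = U.T @ (X - x_bar)                           # principal components, 10 x 40
print(U.shape, Z.shape)
```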

Eigen-face example

AT&T dataset : 400 images of size \((112, 92)\), i.e. \(N=400, d=10304\).

PCA of the faces

\(59.95\%\) of the variance is kept with \(10\) components. Below we show some samples ordered by their first PC, going from hairy to shaved people.

Samples ordered by their first PC

Non-linear PCA

Reformulating PCA with only dot products

We can perform a PCA with only dot products :

  • the Gram matrix \(\mathbf{G} \in \mathbb{R}^{N \times N}\) is made of dot products
  • projecting a vector \(x\) on the vector \(\frac{1}{\sqrt{\lambda_j}}\widetilde{\mathbf{X}} w_j\) reads:
    \[ (\frac{1}{\sqrt{\lambda_j}}\widetilde{\mathbf{X}} w_j)^T x = \frac{1}{\sqrt{\lambda_j}} w_j^T \begin{bmatrix} <x_0, x> \\ <x_1, x> \\ \vdots \\ <x_{N-1}, x> \end{bmatrix} = \frac{1}{\sqrt{\lambda_j}} w_j^T \begin{bmatrix} k(x_0, x) \\ k(x_1, x) \\ \vdots \\ k(x_{N-1}, x)\end{bmatrix} \]

A linear algorithm involving only dot products can be rendered non-linear using the kernel trick (see SVM). The PCA is performed in the feature space !

But…. we need to ensure that the vectors in the feature space are centered.

Implicitly centering in the feature space

We can build the Gram matrix \(\widetilde{\mathbf{G}}\) of the vectors centered in the feature space from the Gram matrix \(\mathbf{G}\) of the original vectors in the feature space with the double centering transform : \[ \widetilde{\mathbf{G}} = (\mathbf{I}_N - \frac{1}{N}\mathbf{1})\mathbf{G}(\mathbf{I}_N - \frac{1}{N}\mathbf{1}) \]

Kernel PCA

We now have a non-linear PCA, the Kernel PCA (Scholkopf et al., 1999) :

  1. Build the matrix \(\mathbf{X} = [x_0| \dots| x_{N-1}]\)
  2. Build the Gram matrices \(\mathbf{G}_{i,j} = k(x_i, x_j)\) and \(\tilde{\mathbf{G}}=(\mathbf{I}_N - \frac{1}{N}\mathbf{1})\mathbf{G}(\mathbf{I}_N - \frac{1}{N}\mathbf{1})\)
  3. Compute \(r\) normalized eigenvectors \(w_j\) associated with the \(r\) largest eigenvalues of \(\widetilde{\mathbf{G}}\)
  4. Project the data on the \(r\) normalized eigen vectors of \(\mathbf{\Sigma}\) given by : \[ \forall j \in [1..r], pc_j(x) = \frac{1}{\sqrt{\lambda_j}}w_j^T\begin{bmatrix} k(x_0, x) \\ k(x_1, x) \\ \vdots \\ k(x_{N-1}, x)\end{bmatrix} \]
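A naive numpy sketch of Kernel PCA with an RBF kernel \(k(x, x') = \exp(-\gamma \|x - x'\|^2)\) (the kernel, gamma and the two-circles toy data are illustrative choices), computing the principal components of the training samples only:

```python
import numpy as np

def kernel_pca(X, r, gamma=1.0):
    """Kernel PCA with an RBF kernel. X: one sample per row (N, d)."""
    sq = np.sum(X**2, axis=1)
    G = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))  # G_ij = k(x_i, x_j)
    N = len(X)
    H = np.eye(N) - np.ones((N, N)) / N
    Gt = H @ G @ H                               # double centering in the feature space
    eigvals, eigvecs = np.linalg.eigh(Gt)
    idx = np.argsort(eigvals)[::-1][:r]
    lam, w = eigvals[idx], eigvecs[:, idx]
    return (Gt @ w) / np.sqrt(lam)               # principal components of the training samples

# Two concentric circles: a classic case where linear PCA cannot separate the structure
angles = np.linspace(0, 2 * np.pi, 100)
circle = np.c_[np.cos(angles), np.sin(angles)]
X = np.vstack([circle, 3 * circle])
print(kernel_pca(X, r=2).shape)
```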

Feature extraction : Manifold learning

Overview

Methods which seek low dimensional representations that are as similar as possible to the high dimensional data layout (note: PCA does not). The focus is on the local structure, preserving neighborhood relationships.

It usually involves four steps :

Given \(x_i \in \mathbb{R}^d\), \(y_i \in \mathbb{R}^r\), \(r \ll d\):

1- Quantify the similarity between pairs of points in \(\mathbb{R}^d\)
2- Quantify the similarity between pairs of points in \(\mathbb{R}^r\)
3- Quantify the discrepancy between these similarities
4- Minimize with respect to \(y_i\)

These are not mappings/projections: they do not transform the whole of \(\mathbb{R}^d\) down to \(\mathbb{R}^r\).

Dissimilarities, distances and Euclidean distances

A dissimilarity \(\delta : S \times S \mapsto \mathbb{R}\) satisfies :

  • \(\delta(a,a) = 0\)
  • non-negativity: \(\delta(a, b) \geq 0\)

A metric \(d : S \times S \mapsto \mathbb{R}\) further satisfies :

  • symmetry \(d(a, b) = d(b, a)\)
  • definiteness : \(d(a, b) = 0 \implies a = b\)
  • triangle inequality : \(d(a, b) \leq d(a, c) + d(c, b)\)

If \(S\) is a Euclidean space, the Euclidean distance is defined as :
\[ d(x, y) = \sqrt{\sum_i (x_i - y_i)^2} \]

Lots of examples of dissimilarities in (Cox & Cox, 2008)

Multidimensional scaling

Finding a set of points from their pairwise distances

Given the euclidean distances between pairs of points \(x_i \in \mathbb{R}^d\) stored in \(D_{i,j} = \|x_i - x_j\|\), can we recover \(d\) and the points \(x_i\) ?

This can be done up to an Euclidean transformation (which preserves distances) (Young & Householder, 1938), e.g. we can find the centered points \(x_i\) (\(\sum_i x_i = 0\)):

  • from the matrix \(G = -\frac{1}{2}HD^2H\), \(G_{i,j} = <x_i,x_j>\), \(H = I - \frac{1}{N}e e^T\)
  • \(d\) is the rank of \(G\)
  • \(G\) is positive semidefinite and symmetric, i.e. it only has non-negative eigenvalues
  • the vectors are given by a square root of \(G\), i.e. \(G = X^T X\)
Left) Recovering the locations of the cities from geodesic distances

Note: we can more generally consider dissimilarities in \(D\), but this does not necessarily guarantee that \(G\) is positive semidefinite (because the triangle inequality may not hold)

Classical Multidimensional scaling

Multidimensional scaling : “Find a configuration of points in a space, usually Euclidean, where each point represents one of the objects or individuals, and the distances between pairs of points in the configuration match as well as possible the original dissimilarities between the pairs of objects or individuals.” (Cox & Cox, 2008)

Note: 1) we do not necessarily have data points from which the dissimilarities are computed. 2) even if we do, we are interested in low dimensional representatives.

In (Torgerson, 1952) it is suggested to approximate the Gram matrix, i.e. find a matrix \(Y \in \mathcal{M}_{r,N}(\mathbb{R})\), so that \(y_i^Ty_j \approx (-\frac{1}{2}H\delta^2H)_{i,j}\). This can be solved in closed form !

And, if you preserve the Gram matrix built from the dissimilarities, the Euclidean distances between the \(y_i\) will exactly match the dissimilarities :

\[ \begin{align} \forall i,j,\quad \|x_i - x_j\|^2 &= \|x_i\|^2 + \|x_j\|^2 - 2 <x_i, x_j>\\ D_{i,j}^2 &= G_{i,i} + G_{j,j} -2 G_{i,j} \end{align} \]

Right) Positioning points from travel trip distances as dissimilarities
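A naive numpy sketch of classical MDS from a dissimilarity matrix (here the exact Euclidean distances of known 2D points, so the configuration is recovered up to a rigid transformation):

```python
import numpy as np

def classical_mds(D, r):
    """Classical MDS from an N x N matrix of pairwise dissimilarities D."""
    N = len(D)
    H = np.eye(N) - np.ones((N, N)) / N
    G = -0.5 * H @ (D**2) @ H                    # double-centered "Gram" matrix
    eigvals, eigvecs = np.linalg.eigh(G)
    idx = np.argsort(eigvals)[::-1][:r]
    lam = np.clip(eigvals[idx], 0, None)         # guard against tiny negative eigenvalues
    return eigvecs[:, idx] * np.sqrt(lam)        # N x r configuration Y, with Y Y^T ~ G

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Y = classical_mds(D, r=2)
# the recovered pairwise distances match the original dissimilarities
print(np.allclose(np.linalg.norm(Y[:, None] - Y[None, :], axis=-1), D))
```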

Sammon’s Non Linear Mapping (NLM)

Sammon’s Non linear mapping (NLM)

Sammon’s non linear mapping (Sammon, 1969) seeks to preserve the pairwise dissimilarities :

\[ E_{NLM} = \frac{1}{c} \sum_{i, j< i} \frac{(\|y_i - y_j\|_2 - \delta_{i,j})^2}{\delta_{i,j}} \]

With the \(\frac{1}{\delta_{i,j}}\) weighting, it is more important to approximate small distances well than large ones.

This is solved iteratively with a quasi-Newton method (Newton's method, assuming the Hessian is diagonal) :

\[ (y_i)_k(t+1) = (y_i)_k(t) - \epsilon_{MF} \frac{\frac{\partial E_{NLM}}{\partial (y_i)_k}}{ \left|\frac{\partial^2 E_{NLM}}{\partial (y_i)_k^2}\right| } \]

with \(\epsilon_{MF} \approx 0.3\) in (Sammon, 1969). See the paper (also (Lee & Verleysen, 2010), p82) for the equations.

Isomap

Isomap

Idea: If we are given datapoints on a non-linear manifold, the Euclidean distance may not be an appropriate dissimilarity measure; it is better to compute distances along the manifold.

Algorithm:

  • compute a graph in \(\mathbb{R}^d\), connecting the K closest points (K-isomap) or the points in a ball of radius \(\epsilon\) (\(\epsilon-\)isomap)
  • compute the shortest paths (Dijkstra) between the points in the graph
  • apply classical MDS
Isomap on the swiss roll. Illustration from (Tenenbaum, Silva, & Langford, 2000)

T-SNE

t-Stochastic Neighborhood Embedding (t-SNE)

t-SNE is introduced in (Maaten & Hinton, 2008). The idea is to focus on preserving local distances while allowing large distances in \(\mathbb{R}^d\) to be even larger in \(\mathbb{R}^r\).

Similarity in \(\mathbb{R}^d\) :

\[ \begin{align} \forall i,j, p_{i,j} &= \frac{p_{i/j} + p_{j/i}}{2N}\\ \forall i,j, p_{i/j} &= \frac{\exp(-\frac{|x_i - x_j|^2}{2\sigma_i^2})}{\sum_{k\neq i}\exp(-\frac{|x_i - x_k|^2}{2\sigma_i^2})} \end{align} \]

Similarity in \(\mathbb{R}^r\) :

\[ \begin{align} \forall i,j, q_{i,j} = \frac{(1+|y_i - y_j|_2^2)^{-1}}{\sum_{k\neq l}(1+|y_k - y_l|_2^2)^{-1}} \end{align} \]

Discrepancy measured with the Kullback-Leibler divergence between \(p_{i,j}\) and \(q_{i,j}\) :

\[ \begin{aligned} C = \sum_{i,j} p_{i,j} \log(\frac{p_{i,j}}{q_{i,j}}) \end{aligned} \]

Complexity \(O(N^2)\). Optimized with Barnes-Hut, complexity \(O(N\log N)\).

See also t-SNE on distill.pub
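A minimal sketch with scikit-learn's Barnes-Hut t-SNE, on the small digits dataset bundled with scikit-learn rather than the full MNIST (perplexity and init are illustrative choices):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)          # 1797 digits, d = 64 (8x8 images)

# Barnes-Hut t-SNE down to 2D; perplexity controls the effective neighborhood size
Z = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(X)
print(Z.shape)                               # scatter-plot Z colored by y to visualize
```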

t-Stochastic Neighborhood Embedding (t-SNE)

Example on MNIST, samples in \(\mathbb{R}^{784}\) down to \(\mathbb{R}^2\)

t-SNE on MNIST from (Maaten & Hinton, 2008)

t-Stochastic Neighborhood Embedding (t-SNE)

Example on visualizing the last convolution layer features of a MobileNet. Samples in \(\mathbb{R}^{20480}\) down to \(\mathbb{R}^2\)

Bibliography

References

Rather check the full online document references.pdf

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. Retrieved from http://arxiv.org/abs/1607.04606

Cox, M. A. A., & Cox, T. F. (2008). Multidimensional scaling. In Handbook of data visualization (pp. 315–347). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-33037-0_14

Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805. Retrieved from http://arxiv.org/abs/1810.04805

Duch, W. (2006). Filter methods. In I. Guyon, S. Gunn, M. Nikravesh, & L. Zadeh (Eds.), Feature extraction: Foundations and applications (studies in fuzziness and soft computing) (pp. 89–117). Springer-Verlag New York, Inc. Secaucus, NJ, USA.

Goodfellow, I. J., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA, USA: http://www.deeplearningbook.org; MIT Press.

Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. (2006). Feature extraction: Foundations and applications (studies in fuzziness and soft computing). Springer-Verlag New York, Inc. Secaucus, NJ, USA.

Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Mach. Learn., 46(1–3), 389–422. https://doi.org/10.1023/A:1012487302797

HaCohen-Kerner, Y., Miller, D., et al. (2020). The influence of preprocessing on text classification using a bag-of-words representation. PLOS ONE, 15(5), 1–22. https://doi.org/10.1371/journal.pone.0232525

Hall, M., & Smith, L. (1997). Feature subset selection: A correlation based filter approach. In Proc. Intl. Conf. Neural inform. Processing intell. Inform. Syst. (pp. 855–858).

Hastie, T., Tibshirani, R., & Friedman, J. (2017). The elements of statistical learning, 2nd edition. New York, NY, USA: https://web.stanford.edu/~hastie/ElemStatLearn/; Springer New York Inc.

Hu, T.-Y., Armandpour, M., Shrivastava, A., Chang, J.-H. R., Koppula, H., & Tuzel, O. (2022). SYNT++: Utilizing imperfect synthetic data to improve speech recognition. In ICASSP 2022 - 2022 ieee international conference on acoustics, speech and signal processing (icassp) (pp. 7682–7686). https://doi.org/10.1109/ICASSP43922.2022.9746217

John, G. H., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. In MACHINE learning: PROCEEDINGS of the eleventh international (pp. 121–129). Morgan Kaufmann.

Koehn, P. (2012). Europarl: A parallel corpus for statistical machine translation. https://www.statmt.org/europarl/.

Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., … al. (2020). The open images dataset v4. International Journal of Computer Vision, 128(7), 1956–1981. https://doi.org/10.1007/s11263-020-01316-z

LeCun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 598–605). Morgan-Kaufmann. Retrieved from http://papers.nips.cc/paper/250-optimal-brain-damage.pdf

Lee, J. A., & Verleysen, M. (2010). Nonlinear dimensionality reduction (1st ed.). Springer Publishing Company, Incorporated.

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for sentiment analysis.

Maaten, L. van der, & Hinton, G. (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Y. Bengio & Y. LeCun (Eds.), 1st international conference on learning representations, ICLR 2013, scottsdale, arizona, usa, may 2-4, 2013, workshop track proceedings. Retrieved from http://arxiv.org/abs/1301.3781

Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(6), 559–572.

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical methods in natural language processing (emnlp) (pp. 1532–1543). Retrieved from http://www.aclweb.org/anthology/D14-1162

Reunanen, J. (2003). Overfitting in making comparisons between variable selection methods. J. Mach. Learn. Res., 3, 1371–1382.

Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, 18(5), 401–409. Retrieved from http://scholar.google.com/scholar.bib?q=info:XPZzktNQEYUJ:scholar.google.com/&output=citation&hl=en&as_sdt=2000&scfhb=1&ct=citation&cd=0

Scholkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In ADVANCES in kernel methods - support vector learning (pp. 327–352). MIT Press.

Silva, L. O., & Zárate, L. E. (2014). A brief review of the main approaches for treatment of missing data. Intell. Data Anal., 18(6), 1177–1198.

Somol, P., Novovicova, J., & Pudil, P. (2010). Efficient feature subset selection and subset size optimization. In A. Herout (Ed.), Pattern recognition. Rijeka: IntechOpen. https://doi.org/10.5772/9356

Tenenbaum, J. B., Silva, V. de, & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319.

Torgerson, W. S. (1952). Multidimensional scaling: I. Theory and method. Psychometrika, 17, 401–419.

Young, G., & Householder, A. (1938). Discussion of a set of points in terms of their mutual distances. Psychometrika, 3, 19–22.