Jeremy Fix
February 15, 2024
Slides made with slidemaker
You can use open datasets
Others: CIFAR-10, CIFAR-100, MNIST, Fashion-MNIST
Historical dataset: PascalVOC2012, 20 classes, 20K Train, 11K Test
Robust reading ICDAR 2019 challenge :
- \(10{,}166\) images: \(5{,}603\) training, \(4{,}563\) testing
- diversity of shapes, orientations, languages, fonts
Labels:
- polygon coordinates \((x_1, y_1), (x_2, y_2), \cdots\)
- transcription
- language
See also Street View House Numbers
Dense labeling: an image of labels
Open Images (Kuznetsova et al., 2020)
MovieLens, Netflix Prize, Anime recommendations database, … e.g. MovieLens 20M
Examples: TIMIT, VoxForge, Mozilla Common Voice. TIMIT (DARPA, TI, MIT):
Transcript \((0, 61748)\): She had your dark suit in greasy wash water all year.
Words: \((7470, 11362)\) she, \((11362, 15420)\) had, \((15420, 17503)\) your, \((17503, 23360)\) dark, \((23360, 28360)\) suit, \((28360, 30960)\) in, \((30960, 36971)\) greasy, \((36971, 42290)\) wash, \((43120, 47480)\) water, \((49021, 52184)\) all, \((52184, 58840)\) year
Phonemes: \((0, 7470)\) h#, \((7470, 9840)\) sh, \((9840, 11362)\) iy, \((11362, 12908)\) hv, …
MusicNet dataset
- 330 classical music recordings
- 1 million annotated labels (verified by experts) of precise timing of each note
Tasks: composer classification, note transcription, instrument detection, …
Example: Large Movie Review dataset (IMDB), (Maas et al., 2011)
Example of a positive review
Is it worth seeing? If you can look past the technical
and budgetary limitations, and get into the story,
I think you will enjoy this, especially if you’ve actually read
the original H G Wells novel. If, however, you are easily put
off by cheap production values, you’d best pass on this (unless
you’re a MST3K fan). Be warned, however that the film runs a full
Example of a negative review
Europarl dataset, (Koehn, 2012)
You may have to collect data on your own
Data are not necessarily vectorial:
| Ordinal feature name | numerical feature value |
| --- | --- |
| poor | -1 |
| fair | 0 |
| excellent | 1 |

| Categorical feature name | numerical feature value |
| --- | --- |
| English | [1, 0, 0] |
| Brazilian | [0, 1, 0] |
| Chinese | [0, 0, 1] |
Even if vectorial:
Vectorial representation of text documents: Bag of Words
e.g. \(\mathcal{V}=\{I, in, love, metz, machine learning, study\}\)
I love machine learning and love metz too. \(\rightarrow x=[1, 0, 2, 1, 1, 0]\)
I love studying machine learning in Metz. \(\rightarrow x=[1,1,1,1,1,1]\)
Does not take the word order into account \(\rightarrow\) \(N\)-grams, but this leads to sparser representations.
See also Term Frequency-Inverse Document Frequency (TF-IDF).
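As an illustration, here is a minimal scikit-learn sketch of both representations; the vocabulary is learned from the corpus, so it will differ from the hand-built \(\mathcal{V}\) above (in particular, the default tokenizer drops one-character tokens such as "I"):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "I love machine learning and love metz too",
    "I love studying machine learning in Metz",
]

# Bag of words with raw term counts, on unigrams and bigrams
bow = CountVectorizer(ngram_range=(1, 2))
X_bow = bow.fit_transform(docs)        # sparse matrix, one row per document
print(bow.get_feature_names_out())     # the learned vocabulary

# TF-IDF downweights the terms that are frequent in the whole corpus
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_tfidf = tfidf.fit_transform(docs)
```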
But in practice, the vocabulary can be huge
Suppose a problem of classifying text documents by their topic, e.g. 20 newsgroups dataset
```python
from sklearn.datasets import fetch_20newsgroups

newsgroups_train = fetch_20newsgroups(subset='train')
print(newsgroups_train.target[0])                                 # class index
print(newsgroups_train.target_names[newsgroups_train.target[0]]) # class name
print(newsgroups_train.data[0])                                   # raw text
```
The target is ‘rec.autos’ and the raw document reads:
"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of
Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car,
looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of
the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info
you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"
Vocabulary size: unigrams (\(130{,}107\)), bigrams (\(130107^2 \approx 10^{10}\)), trigrams (😄)
\(\Rightarrow\) Removing some useless variability (HaCohen-Kerner, 2020), e.g. by lowercasing, stemming/lemmatization, or stop-word and punctuation removal.
Vectorial representation of text documents: Word/Sentence embeddings:
- word2vec(Mikolov, Chen, Corrado, & Dean, 2013),
- GLoVe(Pennington, Socher, & Manning, 2014),
- fasttext (Bojanowski, Grave, Joulin, & Mikolov, 2016).
- *BERT (Devlin, Chang, Lee, & Toutanova, 2018)
Continuous Bag of Words (CBOW) : predict a word given its context
The embedding is of fixed size and captures some semantic information.
See also: Bayesian approaches (e.g. Latent Dirichlet Allocation)
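For instance, a minimal sketch of training CBOW embeddings with gensim (assuming gensim is available; the toy corpus is illustrative only):

```python
from gensim.models import Word2Vec

corpus = [
    ["i", "love", "machine", "learning"],
    ["i", "study", "machine", "learning", "in", "metz"],
]

# sg=0 selects CBOW: predict the center word from its +/- window context
model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, sg=0, epochs=50)

vec = model.wv["learning"]             # fixed-size embedding, here in R^50
print(model.wv.most_similar("metz"))   # neighbors in the embedding space
```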
Missing features
For missing value imputation, there are plenty of methods:
The bias you may introduce by imputing a value may depend on the causes of the missing values, see (Silva & Zárate, 2014).
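A minimal sketch of a simple imputation strategy with scikit-learn (the toy matrices are illustrative only; as noted above, the choice of strategy should account for the cause of missingness):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
X_test = np.array([[np.nan, 5.0]])

imputer = SimpleImputer(strategy="mean")         # or "median", "most_frequent"
X_train_filled = imputer.fit_transform(X_train)  # statistics from train only
X_test_filled = imputer.transform(X_test)        # reused on the test data
```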
Feature scaling
\[\begin{align} \min_\theta J(\theta) = \frac{1}{N} \sum_i \|y_i - \theta^T x_i\|^2 + \lambda \|\theta\|^2 \end{align}\]
Feature scaling. Given \(x_i \in \mathbb{R}^d\), you can normalize by:

Min-max scaling:
\[\begin{align} \forall i, \forall j \in [0,d-1],\quad x'_{i,j}= \frac{x_{i,j} - \min_k x_{k,j}}{\max_k x_{k,j} - \min_k x_{k,j}} \end{align}\]

Standardization:
\[\begin{align} \forall i, \forall j \in [0,d-1],\quad x'_{i,j}&= \frac{x_{i,j} - \mu_j}{\sigma_j}\\ \mu_j &= \frac{1}{N} \sum_k x_{k,j}\\ \sigma_j &= \sqrt{\frac{1}{N} \sum_k (x_{k,j} - \mu_j)^2} \end{align}\]

Your statistics must be computed from the training set and then also applied to the test data.
Note: your data may also be long-tailed, multi-modal, with outliers… See for example this scikit notebook
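A minimal standardization sketch with scikit-learn, illustrating the point above: the scaler statistics come from the training set only (the random data are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=3.0, size=(100, 5))
X_test = rng.normal(loc=10.0, scale=3.0, size=(20, 5))

scaler = StandardScaler().fit(X_train)  # learns mu_j, sigma_j on the train set
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)   # the same mu_j, sigma_j are reused
```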
What: optimally transform \(x_i \in \mathbb{R}^n\) into \(z_i \in \mathbb{R}^d\) so that \(d \ll n\)
It remains to define what “optimally transform” means
Why
Data analysis/Visualization: How are your data distributed? How intertwined are your classes? Do we have discriminative features?
Visualization of t-SNE projection of CIFAR-10 through mobilenet-v2 network (dimensionality \(\mathbb{R}^{20 480}\mapsto\mathbb{R}^2\)). See here
Example: Why does this predictor say the tumor is malignant?
Real risk estimated by 10-fold CV: \(0.92\) in both cases.
UCI ML Breast Cancer Wisconsin (Diagnostic) dataset: scikit loader
Decreasing dimensionality decreases training/inference times.
For example :
Linear regression \(\hat{y} = \theta^T x + b\)
Logistic regression (classification): \(P(y=1/x) = \frac{1}{1 + \exp(-\theta^T x)}\)
Both training and inference are in \(O(n)\) for \(x\in \mathbb{R}^n\)
The data may occupy a lower dimensional manifold
\(\rightarrow\) you do not necessarily lose information by reducing the number of dimensions
The data may occupy a lower dimensional manifold. Say you want to classify facial expressions of a single person, under controlled illumination:
suppose a huge image resolution, e.g. \(1024\times1024\) RGB pixels, \(x \in \mathbb{R}^{1024\times1024\times3}\)
what is the dimensionality of the data manifold?
You may even obtain better predictors: the data become sparse (exponentially fast) as the dimension grows, so it becomes harder to prevent overfitting.
see also (Hastie, Tibshirani, & Friedman, 2017), Link
What: optimally transform* \(x_i \in \mathbb{R}^n\) into \(z_i \in \mathbb{R}^d\) so that \(d \ll n\)
It remains to define what “optimally transform” means
transform*: not necessarily a projection
How
Select a subset of the original features/attributes/dimensions
Example with 1) LASSO (least-squares regression with an L1 penalty), 2) a decision tree classifier; a sketch follows below.
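A minimal sketch of both selectors with scikit-learn (the synthetic data and the number of kept features are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeClassifier

# 1) LASSO: the features with a non-zero coefficient are kept
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       random_state=0)
lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)       # indices of the selected features

# 2) Decision tree: keep the features with the largest importances
Xc, yc = make_classification(n_samples=200, n_features=30, n_informative=5,
                             random_state=0)
selector = SelectFromModel(DecisionTreeClassifier(random_state=0),
                           max_features=5, threshold=-np.inf).fit(Xc, yc)
Xc_sel = selector.transform(Xc)          # only the 5 kept features remain
```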
Univariate filters rank every feature independently of the others, according to its relationship with the target.
| Feature \ Target | Categorical | Numerical |
| --- | --- | --- |
| Categorical | Chi-2, Mutual information | ANOVA |
| Numerical | ANOVA | Correlation (Pearson, Spearman) |
Univariate filters are handy in large feature spaces, where we cannot afford more elaborate multivariate filters/wrappers.
Statistical quantification (we will rank all the features) of the independence between the feature and the target: the \(\chi^2\) independence test.
Assume one categorical feature (\(3\) values) and a categorical target (\(4\) values) with the following contingency table in the training set (each cell holds the observed count), with \(p_{ij} = P(feature=f_i, target=t_j)\):

| Feature \ Target | \(t_0\) | \(t_1\) | \(t_2\) | \(t_3\) |
| --- | --- | --- | --- | --- |
| \(f_0\) |  |  |  |  |
| \(f_1\) |  |  |  |  |
| \(f_2\) |  |  |  |  |
Independence \(\iff P(F,T) = P(F) P(T)\), with \(P(F) = \sum_j P(F/T=t_j) P(T=t_j)\) and \(P(T) = \sum_i P(T/F=f_i)P(F=f_i)\)
We perform a hypothesis test:
- Null hypothesis \(H_0\): the feature and the target are independent
- Alternative hypothesis \(H_1\): \(H_0\) is wrong.
If \(H_0\) is true, i.e. the feature and the target are independent, then the joint probability is the product of the marginals. How well do the observations agree with this?
\[
\chi^2 = \sum_{i,j} \frac{(N \hat{p}_{i,j} - N p_{i,j})^2}{N p_{i,j}}
\] with:
- N the total number of samples
- \(\hat{p}_{i,j}\) the observed frequency of the case \(F=f_i, T=t_j\)
- \(p_{i,j}\) the expected frequency of the case \(F=f_i, T=t_j\), were \(H_0\) true
The expected frequency, under \(H_0\), is given by :
\[ p_{i,j} = P(F=f_i) P(T=t_j) = \left[ \frac{\sum_k \#[F=f_i, T=t_k]}{N} \right]\left[ \frac{\sum_k \#[F=f_k, T=t_j]}{N} \right] \]
The observed frequency, is given by :
\[ \hat{p}_{i,j} = \frac{\#[F=f_i, T=t_j]}{N} \]
For filtering, we select the \(n_f\) features with the highest \(\chi^2\).
Example: sentiment analysis on the Stanford 50K reviews dataset.
Preprocessing: removing HTML tags, line breaks, links, punctuation, repeated letters, words containing numbers, …
Features: n-gram presence/absence (\(432{,}746\) (1,2)-grams)
\(H_0\): the presence or absence of an n-gram is independent of the sentiment
| Word | \(\#(f=0,t=0)\) | \(\#(f=0,t=1)\) | \(\#(f=1,t=0)\) | \(\#(f=1,t=1)\) | \(\chi^2\) |
| --- | --- | --- | --- | --- | --- |
| bad | 8094 | 11002 | 4406 | 1498 | 1875.17 |
| the worst | 10836 | 12336 | 1664 | 164 | 1327.95 |
| the best | 11588 | 10485 | 912 | 2015 | 470.77 |
| … | … | … | … | … | … |
| recently still | 12499 | 12499 | 1 | 1 | 0.0 |
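A minimal sketch of this filter with scikit-learn on binary presence/absence features (the toy reviews and labels are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["a really bad movie", "the worst film ever",
        "the best movie this year", "an excellent film"]
y = [0, 0, 1, 1]                         # 0: negative, 1: positive

vec = CountVectorizer(binary=True, ngram_range=(1, 2))
X = vec.fit_transform(docs)              # presence/absence of each n-gram

# Keep the 5 n-grams with the highest chi2 statistic w.r.t. the sentiment
selector = SelectKBest(chi2, k=5).fit(X, y)
scores = selector.scores_                # one chi2 value per n-gram
X_sel = selector.transform(X)            # the reduced feature matrix
```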
Measure the mutual information between one categorical feature and a categorical target.
Mutual information is:
- a measure of statistical dependence between two random variables
- defined as the Kullback-Leibler divergence between the joint distribution and the product of the marginals.
- null iff \(X\) and \(Y\) are independent
- \(I(X,Y) \geq 0\) and upper bounded
\[ I(X, Y) = \sum_{x,y} P(x, y) \log\left(\frac{P(x,y)}{P(x)P(y)}\right) \]
| Word | \(\#(f=0,t=0)\) | \(\#(f=0,t=1)\) | \(\#(f=1,t=0)\) | \(\#(f=1,t=1)\) | \(\chi^2\) | \(I(X,Y)\) |
| --- | --- | --- | --- | --- | --- | --- |
| bad | 8094 | 11002 | 4406 | 1498 | 1875.17 | 0.039 |
| worst | 10458 | 12273 | 2042 | 227 | 1596.76 | 0.036 |
| the worst | 10836 | 12336 | 1664 | 164 | 1327.95 | 0.031 |
| waste | 11292 | 12404 | 1208 | 96 | 1000.45 | 0.023 |
| awful | 11208 | 12347 | 1292 | 153 | 952.88 | 0.022 |
| great | 10410 | 8276 | 2090 | 4224 | 964.96 | 0.020 |
| excellent | 12150 | 11070 | 350 | 1430 | 705.51 | 0.015 |
| … | … | … | … | … | … | … |
For filtering, we select the \(n_f\) features with the highest \(I(X,Y)\). The ranking is strongly correlated with, but not identical to, the one obtained from the \(\chi^2\) statistic.
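The same filtering can be sketched with scikit-learn, reusing the toy matrix X and labels y from the chi2 sketch above:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Keep the 5 features with the highest estimated mutual information
selector = SelectKBest(
    lambda X, y: mutual_info_classif(X, y, discrete_features=True), k=5
).fit(X, y)
mi_scores = selector.scores_             # one I(X,Y) estimate per feature
X_sel = selector.transform(X)
```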
Analysis of VARiance (ANOVA) F-test:
- are the means of \(X/y\) significantly different between the \(k\) populations?
- compute and rank the features w.r.t. the F-values
Hypothesis \(H_0\): the means \(\mu_1 = E[X / y=1]\), \(\mu_2 = E[X / y=2]\), …, \(\mu_k = E[X / y=k]\) are all equal.
Notes: it is better to have similar group sizes and normally distributed values.
The statistic of interest \(F\) is: \[ F = \frac{\frac{1}{k-1}\sum_{j=1}^{k} n_j (\mu_j - \mu)^2}{\frac{1}{N-k} \sum_{j=1}^{k} \sum_{i \in \mbox{group } j} (x_{i} - \mu_j)^2} = \frac{\mbox{between group variability}}{\mbox{within group variability}} \] It follows a Fisher distribution (under the conditions of the test). A high \(F\) invalidates \(H_0\).
For filtering, we select the features with the highest \(F\).
Rank the features w.r.t. their correlation with the target; e.g. the Pearson correlation between \(x\in \mathbb{R}, y\in \mathbb{R}\): \[ \begin{align} r(x, y) &= \frac{\frac{1}{n}\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y}\\ \sigma_x &= \sqrt{\frac{1}{n} \sum_i (x_i - \bar{x})^2}\\ \sigma_y &= \sqrt{\frac{1}{n} \sum_i (y_i - \bar{y})^2} \end{align} \]
For filtering, we select the features with the highest correlation magnitude \(|r|\).
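A minimal sketch of both filters (the synthetic data are illustrative only; scikit-learn's f_classif implements the ANOVA F-test, and the correlation filter is written by hand with numpy):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=20, n_informative=4,
                           random_state=0)

# ANOVA F-test: numerical features, categorical target
anova = SelectKBest(f_classif, k=4).fit(X, y)
print(anova.scores_)                     # one F-value per feature

# Pearson correlation: numerical features, numerical (here synthetic) target
y_num = X[:, 0] + 0.1 * np.random.default_rng(0).normal(size=len(X))
r = np.array([np.corrcoef(X[:, j], y_num)[0, 1] for j in range(X.shape[1])])
top = np.argsort(-np.abs(r))[:4]         # features with the largest |r|
```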
Univariate filtering is computationally efficient but not always optimal.
See (Guyon et al., 2006), Introduction. Link
Problem: Finding the best subset of features (John, Kohavi, & Pfleger, 1994)
Denote \(\chi\) a subset of the dimensions/attributes/features:
- suppose we are provided a measure \(J(\chi)\) of how good this subset is
- we optimize \(J(\chi)\) over the possible subsets: \(\chi^* = \mathop{\mathrm{argmax}}_{\chi \in \chi_d} J(\chi)\)
If \(x \in \mathbb{R}^n\), we have \(2^n\) possible subsets :
\[
\chi_d = \{\emptyset, \{x_1\}, \{x_2\}, \cdots, \{x_1, x_2\}, \cdots\}
\]
In multivariate filters, a feature subset is evaluated by a heuristic, e.g. statistics between the features themselves and between the features and the target (see (Duch, 2006), (Hall & Smith, 1997)).
e.g. correlation-based feature subset evaluation: select the features that are correlated with the target, yet not correlated with each other
\[ \begin{align} J(\chi) &= \frac{k r_{ky}}{\sqrt{k + k (k-1) r_{kk}}}\\ \mbox{Correlation : } r(w, z) &= \frac{\sum (w_i - \bar{w})(z_i - \bar{z}) }{\sqrt{\sum_i(w_i - \bar{w})^2\sum_j(z_j - \bar{z})^2}}\\ \mbox{Average pairwise correlation : } r_{kk}&= \frac{2}{k(k-1)} \sum_{j_1 < j_2} r(x_{j_1}, x_{j_2})\\ \mbox{Average feature-target correlation : } r_{ky} &= \frac{1}{k} \sum_j r(x_j, y) \\ \end{align} \] with \(k = |\chi|\) the size of the subset
In multivariate wrappers, a feature subset is ranked by the real risk of a trained estimator, estimated e.g. by cross-validation; a sketch follows the example below.
Example (Somol, Novovicova, & Pudil, 2010) : Ionosphere data
- 351 samples, 34 real valued features, 2 classes (225/126),
- \(80\%\) Train, \(20\%\) Test split.
- \(10\)-fold cross validation
- wrapper : \(3\)-Nearest neighbor
- Sequential Forward Search / Sequential Forward Floating Search / Best Individual Feature / Oscillating Search
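A minimal wrapper sketch with scikit-learn, mirroring the setup above with sequential forward search scored by the cross-validated accuracy of a 3-nearest-neighbor classifier (synthetic data stand in for Ionosphere):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=350, n_features=34, n_informative=6,
                           random_state=0)

sfs = SequentialFeatureSelector(KNeighborsClassifier(n_neighbors=3),
                                n_features_to_select=6,
                                direction="forward", cv=10).fit(X, y)
X_sel = sfs.transform(X)                 # only the selected features remain
print(sfs.get_support(indices=True))     # indices of the kept features
```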
Careful with cross-validation performances […]: used with computationally intensive search algorithms, these estimates may overfit and yield biased predictions (Reunanen, 2003). Independent test data are needed. Said differently, model selection is part of the algorithm to be cross-validated.
Principle: the features are ranked by analyzing, locally, the sensitivity of the loss with respect to the features.
Algorithm (Recursive Feature Elimination, RFE):
- fit a predictor
- locally analyze the importance/relevance of the features
- discard the least relevant features
- loop
Example: Optimal Brain Damage (OBD, (LeCun, Denker, & Solla, 1990)):
Denote \(J : \mathbb{R}^n \mapsto \mathbb{R}\) a loss to be optimized, under the “Diagonal”, “Extremal”, “Quadratic” assumptions : \[
\begin{align}
J(w) &\approx J(w_0) + (w - w_0)^T\frac{d J}{dw}(w_0) + \frac{1}{2}(w-w_0)^T.H.(w-w_0) + ... \\
\Delta J_i &\approx \frac{1}{2} H_{i,i} (\Delta w)_i^2
\end{align}
\] For feature selection, we look for dimensions that can be canceled out (so \((\Delta w)_i = -w_i\)) while affecting the loss the least.
How to compute \(H_{ii}\) ?
For the last 2 cases, if your input data are normalized (centered, reduced), \(h_{ii}\) does not depend on \(i\), and a criterion to select a feature is based on \(\Delta w_i^2 = w_i^2\) only: remove the feature \(i\) with the smallest \(|w_i|\)
Principle: the features are ranked by analyzing, locally, the sensitivity of the loss with respect to the features.
Algorithm (Recursive Feature Elimination, RFE):
- fit a predictor
- locally analyze the importance/relevance of the features
- discard the least relevant features
- loop
Example: RFE-SVM (Guyon, Weston, Barnhill, & Vapnik, 2002). For a linear SVM, we maximize the margin, i.e. we minimize \(J(w) = \frac{1}{2}\|w\|^2\) under constraints.
You can drop more than one feature at a time. See also (Guyon et al., 2006), p. 145. See (Guyon et al., 2002) for the generalization to non-linear SVMs.
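A minimal RFE-SVM sketch with scikit-learn: at each iteration a linear SVM is fitted and the feature with the smallest \(|w_i|\) is discarded (the synthetic data are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           random_state=0)

rfe = RFE(LinearSVC(C=1.0, max_iter=10000), n_features_to_select=5,
          step=1).fit(X, y)              # step > 1 drops several at a time
print(rfe.support_)                      # boolean mask of the kept features
print(rfe.ranking_)                      # 1 = kept, larger = dropped earlier
```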
Given \(N\) samples \(x_i \in \mathbb{R}^d\), we compute \(r \ll d\) new features from the original \(d\) features.
Objective Find an affine transformation of the data minimizing the reconstruction error.
Intuition and formalisation
In 1D, we seek a line \((w_0, w_1)\) minimizing the sum of the squared length of the red segments.
Note: it is not unique!
Statement : Find an affine transformation of the data minimizing the reconstruction error
Formally :
\[\begin{align} \label{pca:problem}\min_{\{w_0, w_1, ..w_r\} \in \mathbb{R}^d} \sum_{i=0}^{N-1} \left|x_i - (w_0 + \sum_{j=1}^r (w_j^T (x_i-w_0))w_j )\right|_2^2 \end{align}\]
subject to \(w_i^T w_j = \delta_{i,j}\)
Matrix form :
Denote \(\mathbf{W} = (w_1 | \dots | w_r) \in \mathcal{M}_{d \times r}(\mathbb{R})\)
\[\begin{align} \min_{\{w_0, w_1, ..w_r\} \in \mathbb{R}^d} \sum_{i=0}^{N-1} \left| (\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)(x_i - w_0)\right|_2^2 \end{align}\]
subject to \(\mathbf{W}^T \mathbf{W} = \mathbf{I}_r\)
Let us simplify the equation :
\[\begin{align} \min_{\{w_0, w_1, ..w_r\} \in \mathbb{R}^d} \sum_{i=0}^{N-1} (x_i - w_0)^T (\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)(x_i - w_0) \end{align}\] subject to \(\mathbf{W}^T \mathbf{W} = \mathbf{I}_r\)
Remember : For \(u:\mathbb{R}^n\mapsto\mathbb{R}^m\), \(v:\mathbb{R}^n\mapsto\mathbb{R}^m\), \(A\in\mathcal{M}_{m,m}(\mathbb{R})\): \[\frac{d u^T A v}{dx} = \frac{du}{dx}A v + \frac{dv}{dx}A^T u\]
Finding \(w_0\) :
\[\begin{align} J &= \sum_{i=0}^{N-1} (x_i - w_0)^T (\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)(x_i - w_0)\\ \frac{\partial J}{\partial w_0} &= -2 (\mathbf{I}_d - \mathbf{W} \mathbf{W}^T) \sum_{i=0}^{N-1}(x_i - w_0) \end{align}\]
Finding \(w_0\)
\[\begin{align} J &= \sum_{i=0}^{N-1} (x_i - w_0)^T (\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)(x_i - w_0)\\ \frac{\partial J}{\partial w_0} &= -2 (\mathbf{I}_d - \mathbf{W} \mathbf{W}^T) \sum_{i=0}^{N-1}(x_i - w_0)\\ \frac{\partial J}{\partial w_0} = 0 &\Leftrightarrow \exists h \in \mathop{\mathrm{span}}\{w_1, ..., w_r\}, w_0 = h + \frac{1}{N}\sum_i x_i \end{align}\]
\((\mathbf{I}_d - \mathbf{W} \mathbf{W}^T) h\) is the residual vector by the orthogonal projection on the column vectors of \(\mathbf{W}\).
Finding \(w_0\)
\[\begin{align} J &= \sum_{i=0}^{N-1} (x_i - w_0)^T (\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)(x_i - w_0)\\ \mathop{\mathrm{argmin}}_{w_0} J &\Rightarrow w_0 = h + \frac{1}{N}\sum_i x_i, h \in \mathop{\mathrm{span}}\{w_1, ..., w_r\} \end{align}\]
For example, \(h=0\) is one solution.
The offset \(w_0\) is the mean of the data points, up to a translation in the space spanned by the principal component vectors.
Step 1: Center the data
Denote:
\[\begin{align} \widetilde{x}_i &= x_i - \bar{x}\\ \bar{x} &= \frac{1}{N}\sum_i x_i\\ \widetilde{\mathbf{X}} &= (\widetilde{x}_0|\dots|\widetilde{x}_{N-1})\\ J &= \sum_{i=0}^{N-1} \widetilde{x}_i^T(\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)\widetilde{x}_i\\ \end{align}\]
We will derive the principal vectors with a (later justified) greedy approach.
\[\begin{align} J &= \sum_{i=0}^{N-1} \widetilde{x}_i^T(\mathbf{I}_d - \mathbf{W} \mathbf{W}^T)\widetilde{x}_i\label{eq:loss_pv} \\ \mathop{\mathrm{argmin}}_{w_1,..w_r} J &= \mathop{\mathrm{argmax}}_{w_1,..w_r} \sum_{i=0}^{N-1} \widetilde{x}_i^T\mathbf{W} \mathbf{W}^T\widetilde{x}_i, \mbox{ s.t. } \mathbf{W}^T\mathbf{W} = \mathbf{I}_r\\ \label{eq:pca_maxvar} &= \mathop{\mathrm{argmax}}_{w_1,..w_r} \sum_{i=1}^{r} w_i^T \mathbf{\widetilde{X}} \mathbf{\widetilde{X}}^Tw_i, \mbox{ s.t. } \mathbf{W}^T\mathbf{W} = \mathbf{I}_r\\ \end{align}\]
Introduce the Lagrangian for this constrained optimization problem.
\[ \mathcal{L}(w_i, \lambda_i) = \sum_{i=1}^r w_i^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_i + \lambda_i (1 - w_i^T w_i) \]
Let us proceed greedily, first looking for \(w_1\) only:
\[ \mathcal{L}(w_1, \lambda_1) = w_1^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_1 + \lambda_1 (1 - w_1^T w_1) \]
Then
\[ \frac{\partial\mathcal{L}}{\partial w_1} = 0 \Rightarrow \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_1 = \lambda_1 w_1 \]
Therefore \(w_1\) must be an eigenvector of \(\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T\), but for which \(\lambda_1\)? \[ w_1^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_1 = \lambda_1 \]
\(\lambda_1\) is the largest eigenvalue of \(\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T\)
First principal component vector The first principal component vector is a normalized eigenvector associated with the largest eigenvalue of the “sample covariance matrix” \(\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T\)
Suppose \(w_1\) is a normalized eigenvector of \(\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T\) associated with its largest eigenvalue. Denote \(\lambda_1 \geq \lambda_2 \geq \dots \lambda_d \geq 0\) the eigenvalues.
We want to optimize :
\[\begin{align} \mathop{\mathrm{argmax}}_{w_2} w_1^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_1 + w_2^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_2 &= \mathop{\mathrm{argmax}}_{w_2} \lambda_1 + w_2^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_2 \\ &= \mathop{\mathrm{argmax}}_{w_2} w_2^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_2, \mbox{ s.t. } w_i^T w_j = \delta_{i,j}\\ &=\mathop{\mathrm{argmax}}_{w_2} w_2^T (\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T - \lambda_1 w_1 w_1^T)w_2, \mbox{ s.t. } w_i^T w_j = \delta_{i,j} \end{align}\]
Therefore,
\(w_2\) is a normalized eigenvector associated with the largest eigenvalue of \((\widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T-\lambda_1 w_1w_1^T)\), i.e. \(\lambda_2\)
And so on. But does the greedy algorithm find the optimum?
Theorem For any symmetric positive semi-definite matrix \(\mathbf{M}\in\mathcal{M}_{d\times d}(\mathbb{R})\), denote \(\{\lambda_i\}_{i=1..d}\) its eigenvalues with \(\lambda_1 \geq \lambda_2\cdots\geq \lambda_d\geq 0\). For any set of \(r\in[|1,d|]\) orthogonal unit vectors, \(\{v_1,\dots,v_r\}\), we have : \[\sum_{j=1}^r v_j^T \mathbf{M} v_j \leq \sum_{j=1}^r \lambda_j\] And this upper bound is reached by eigenvectors associated with the largest eigenvalues of \(\mathbf{M}\)
Given \(\{x_0, \dots, x_{N-1}\} \in \mathbb{R}^d\), to compute the \(r\) principal component vectors:
See a simple (naive) implementation at https://github.com/rougier/ML-Recipes
```python
import numpy as np

class PCA:
    def __init__(self, n_components):
        # Number of projection vectors to keep when transforming
        self.n_components = n_components

    def fit(self, X):
        '''
        Performs the PCA of X and stores the principal components.
        The datapoints are supposed to be stored in the row vectors of X.
        It keeps only the n_components projection vectors, associated
        with the n_components largest eigenvalues of the covariance matrix X.T X
        '''
        # Center the datapoints
        self.centroid = np.mean(X, axis=0)
        # Compute the covariance matrix
        sigma = np.dot((X - self.centroid).T, X - self.centroid)
        # Compute the eigenvalue/eigenvector decomposition of sigma
        eigvals, eigvecs = np.linalg.eigh(sigma)
        # Note: the eigenvalues returned by eigh are in ascending order.
        # We could store only the n_components eigenvectors/eigenvalues
        # associated with the largest eigenvalues:
        # self.eigvals = eigvals[-self.n_components:]
        # self.eigvecs = eigvecs[:, -self.n_components:]
        # Instead, store all the eigenvectors/eigenvalues:
        # useful for later computing some statistics of the variance we keep
        self.eigvals = eigvals
        self.eigvecs = eigvecs

    def transform(self, Z):
        '''
        Uses a fitted PCA to transform the row vectors of Z.
        Remember the eigenvectors are ordered by ascending eigenvalue.
        Denoting Z_trans = transform(Z),
        the first component is Z_trans[:, -1],
        the second component is Z_trans[:, -2],
        ...
        '''
        return np.dot(Z - self.centroid, self.eigvecs[:, -self.n_components:])
```
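A hypothetical usage of this class on random data, also computing the fraction of variance kept (the names follow the class above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))           # 500 samples in R^10

pca = PCA(n_components=2)
pca.fit(X)
Z = pca.transform(X)                     # (500, 2): the principal components

# Fraction of the total variance kept by the r=2 largest eigenvalues
kept = pca.eigvals[-2:].sum() / pca.eigvals.sum()
```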
Given \(x \in \mathbb{R}^d\), its principal components are its coordinates in the selected eigenspace :
\[\begin{align*} \mathbb{R}^d & \mapsto \mathbb{R}^r\\ x&\rightarrow ((x-\bar{x})^T w_1, (x-\bar{x})^T w_2, \dots, (x-\bar{x})^T w_r) \end{align*}\]
The whole space \(\mathbb{R}^d\) is projected, not just the training samples \(x_i\).
If \(x \in \{x_0, \dots, x_{N-1}\}\), you should rather use the SVD, which directly gives you the principal components in addition to the principal component vectors.
Some dimensionality reduction methods (e.g. t-SNE) do not project the whole original space \(\mathbb{R}^d\) but only assign a representative in \(\mathbb{R}^r\) to some key points, and hence cannot be called “projection methods”.
What is the reconstruction error if we keep \(r\) components ?
From equation \(\eqref{eq:loss_pv}\), the square of the reconstruction error is :
\[\begin{align*}
J &= \sum_{i=0}^{N-1} \| x_i - \hat{x_i}\|_2^2 = \sum_{i=0}^{N-1} \widetilde{x}_i^T(\mathbf{I}_d - \mathbf{W_r} \mathbf{W_r}^T)\widetilde{x}_i = \sum_{i=0}^{N-1} \widetilde{x}_i^T\widetilde{x}_i - \sum_{i=1}^{r} w_i^T \mathbf{\widetilde{X}} \mathbf{\widetilde{X}}^Tw_i \\
& = \mathop{Tr}(\mathbf{\widetilde{X}}^T\mathbf{\widetilde{X}}) - \sum_{i=1}^{r} \lambda_i = \sum_{i=1}^{d} \lambda_i - \sum_{i=1}^{r} \lambda_i = \sum_{i=1}^{d} \lambda_i ( 1 - \frac{\sum_{i=1}^{r} \lambda_i}{\sum_{i=1}^{d} \lambda_i}) \\
\end{align*}\]
The two matrices \(\mathbf{\widetilde{X}}^T\mathbf{\widetilde{X}}\) and \(\mathbf{\widetilde{X}}\mathbf{\widetilde{X}^T}\) have the same eigenvalues (see the section on the relationship between the Gram and sample Covariance matrices).
Denote \(z_i = \mathbf{W_r}^T \widetilde{x}_i\). We have \(\bar{z} = 0\) and
\[ \begin{align} \mathbf{\Sigma}_z &= \frac{1}{N-1} \sum_i z_i z_i^T = \frac{1}{N-1} \sum_i \mathbf{W_r}^T \widetilde{x}_i \widetilde{x_i}^T \mathbf{W_r}\\ Tr(\mathbf{\Sigma}_z) &= \sum_{j=1}^r (\mathbf{\Sigma}_z)_{j,j} = \frac{1}{N-1} \sum_j \sum_i (w_j^T \widetilde{x}_i)(\widetilde{x}_i^T w_j) = \frac{1}{N-1} \sum_{j=1}^r w_j^T \widetilde{\mathbf{X}}\widetilde{\mathbf{X}}^T w_j = \frac{1}{N-1}\sum_{j=1}^r \lambda_j \end{align} \]
This is exactly what we maximized in equation \(\eqref{eq:pca_maxvar}\). PCA is also the solution to “finding an affine transformation which maximizes the variance of the projections”.
And :
\[ Tr(\mathbf{\Sigma}_x) = Tr(\frac{1}{N-1} \sum_i \widetilde{x}_i \widetilde{x}_i^T) = \frac{1}{N-1} Tr(\mathbf{P} \mathbf{\Lambda} \mathbf{P}^{-1}) = \frac{1}{N-1}Tr(\mathbf{\Lambda}) = \frac{1}{N-1}\sum_{j=1}^{d} \lambda_j \] Note: for any matrix \(\mathbf{M}\) and orthogonal matrix \(\mathbf{P}\), \(Tr(\mathbf{P}^{-1} \mathbf{M} \mathbf{P}) = Tr(\mathbf{P} \mathbf{P}^{-1} \mathbf{M})= Tr(\mathbf{M})\)
Therefore, the quality of the PCA can be evaluated in terms of the fraction of variance that we keep: \[ \frac{\sum_{j=1}^{r}\lambda_j}{\sum_{j=1}^{d} \lambda_j} \]
Let us sample \(N=500\) samples from a normal distribution \(\mathcal{N}(\mu=\begin{pmatrix}0.5 \\ 1.0 \end{pmatrix}, \Sigma=\begin{pmatrix} 0.43 & 0.36 \\ 0.36 & 0.76 \end{pmatrix})\)
We perform a PCA from \(d=2\) to \(r=2\). Then \(w_0 = \begin{pmatrix} 0.49 \\ 1.0 \end{pmatrix}\), \(w_1 = \begin{pmatrix} -0.83 \\ 0.54\end{pmatrix}\), \(w_2 = \begin{pmatrix} 0.54 \\ 0.83\end{pmatrix}\)
Fraction of the variance over the principal axes : \(\lambda_1 = 0.95 (\lambda_1 + \lambda_2)\), \(\lambda_2 = 0.045 (\lambda_1 + \lambda_2)\)
Example from https://github.com/rougier/ML-Recipes
\(N=4000\) grayscale \(28\times28\) images of handwritten digits, \(d=784\).
We perform a PCA from \(d=784\) to \(r=2\),
Note \(\frac{\sum_{i=1}^{10} \lambda_i}{\sum_{i=1}^{784} \lambda_i} = 47.67\%\) of the variance is kept with the first \(10\) components.
Example from https://github.com/rougier/ML-Recipes
Theorem For any matrix \(\mathbf{M} \in \mathcal{M}_{d, N}(\mathbb{R})\), there exists an orthogonal matrix \(\mathbf{U} \in \mathcal{M}_{d,d}(\mathbb{R})\), a diagonal matrix \(\mathbf{D} \in \mathcal{M}_{d, N}(\mathbb{R})\) and an orthogonal matrix \(\mathbf{V} \in \mathcal{M}_{N,N}(\mathbb{R})\) such that
\[ \mathbf{M} = \mathbf{U} \mathbf{D} \mathbf{V}^T \]
Note: orthogonal matrices \(\equiv \mathbf{U}^T = \mathbf{U}^{-1}\).
For the PCA, given \(\mathbf{\widetilde{X}} = \mathbf{U} \mathbf{D} \mathbf{V}^T\), we have the diagonalization of \(\mathbf{\widetilde{X}} \mathbf{\widetilde{X}}^T\) :
\[ \mathbf{\widetilde{X}} \mathbf{\widetilde{X}}^T = \mathbf{U} \mathbf{D} \mathbf{D}^T \mathbf{U}^{-1} \]
The principal component vectors are the column vectors of \(\mathbf{U}\) and the principal components of the training set are the first \(r\) rows of :
\[ \mathbf{U}^T \mathbf{\widetilde{X}} = \mathbf{D} \mathbf{V}^T \]
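A minimal numpy sketch of PCA via the SVD, following the matrix conventions above (the columns of \(\mathbf{\widetilde{X}}\) are the centered samples; random data for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # 200 samples in R^5, one per row
Xt = (X - X.mean(axis=0)).T              # d x N centered data matrix

U, D, Vt = np.linalg.svd(Xt, full_matrices=False)

r = 2
W = U[:, :r]                             # principal component vectors
Z = (D[:, None] * Vt)[:r, :]             # first r rows of D V^T = U^T X~
```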
Suppose your data are centered : \(\sum_i x_i = 0\).
The sample covariance matrix is : \[ \mathbf{\Sigma}_x = \frac{1}{N-1}\sum_i x_i x_i^T = \frac{1}{N-1}\mathbf{X}\mathbf{X}^T \] The Gram matrix is (the matrix of the dot-products):
\[\mathbf{G} = \mathbf{X}^T \mathbf{X}= \begin{bmatrix} x_0^T x_0 & x_0^T x_1 & \cdots & x_0^T x_{N-1} \\ \vdots & \vdots & \vdots & \vdots \\ x_{N-1}^T x_0 & x_{N-1}^T x_1 & \cdots & x_{N-1}^T x_{N-1} \end{bmatrix}\]
It turns out that the eigenvectors and eigenvalues of \(\mathbf{G}\) and \(\mathbf{\Sigma}\) are strongly related!
Lemma:\(\forall \mathbf{A} \in \mathbb{R}^{n \times m}, ker(\mathbf{A}) = ker(\mathbf{A}^T \mathbf{A})\)
Theorem (Rank-nullity) \(\forall \mathbf{A} \in \mathbb{R}^{n \times m}, rk({\mathbf{A}}) + dim(ker(\mathbf{A})) = m\).
Theorem \(\forall \mathbf{A} \in \mathbb{R}^{n \times m}, rk(\mathbf{A}^T\mathbf{A}) = rk(\mathbf{A}\mathbf{A}^T) \leq \mathrm{min}(n,m)\)
Lemma (Eigenvalues of the sample covariance and Gram matrices) The non-zero eigenvalues of the scaled covariance matrix \((N-1) \mathbf{\Sigma} = \mathbf{X} \mathbf{X}^T\) and Gram matrix \(\mathbf{G} = \mathbf{X}^T \mathbf{X}\) are the same : \[ \{\lambda \in \mathbb{R}^*, \exists v\neq 0, (N-1) \mathbf{\Sigma} v = \lambda v\} = \{ \lambda \in \mathbb{R}^*, \exists v\neq 0, \mathbf{G} v = \lambda v \} \] And, during the proof we show some relationships :
We will see two applications of this property:
If \(N\ll d\), it is much more efficient to “diagonalize” \(\mathbf{G} \in \mathbb{R}^{N \times N}\) than \(\mathbf{\Sigma} \in \mathbb{R}^{d \times d}\). Eigenfaces are the PCA computed by “diagonalizing” the Gram matrix:
See a simple (naive) implementation at https://github.com/rougier/ML-Recipes
AT&T dataset : 400 images of size \((112, 92)\), i.e. \(N=400, d=10304\).
\(59.95\%\) of the variance is kept with \(10\) components. Below we show some samples ordered by their first PC, going from hairy to shaved people.
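A minimal numpy sketch of this Gram-matrix route, using the relationship above (if \(\mathbf{G}v = \lambda v\), then \(u = \mathbf{X}v/\sqrt{\lambda}\) is a unit eigenvector of \(\mathbf{X}\mathbf{X}^T\)); random data stand in for the face images:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 40, 1000                          # few samples, many dimensions
X = rng.normal(size=(d, N))              # columns are the samples
Xc = X - X.mean(axis=1, keepdims=True)   # center

G = Xc.T @ Xc                            # N x N Gram matrix
eigvals, V = np.linalg.eigh(G)           # eigenvalues in ascending order

r = 2
lam = eigvals[-r:]                       # the r largest eigenvalues
U = (Xc @ V[:, -r:]) / np.sqrt(lam)      # d x r principal vectors
Z = U.T @ Xc                             # principal components of the samples
```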
We can perform a PCA with only dot products:
A linear algorithm involving only dot products can be rendered non-linear using the kernel trick (see SVM). The PCA is then performed in the feature space!
But… we need to ensure that the vectors in the feature space are centered.
We can build the Gram matrix \(\widetilde{\mathbf{G}}\) of the vectors centered in the feature space from the Gram matrix \(\mathbf{G}\) of the original vectors in the feature space with the double centering transform: \[ \widetilde{\mathbf{G}} = (\mathbf{I}_N - \frac{1}{N}\mathbf{1})\mathbf{G}(\mathbf{I}_N - \frac{1}{N}\mathbf{1}) \]
We now have a non-linear PCA, the Kernel PCA (Scholkopf et al., 1999)
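A minimal sketch of the double centering transform, together with the off-the-shelf Kernel PCA of scikit-learn for comparison (random data; the RBF kernel and gamma value are chosen for illustration):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Double centering of a (here linear-kernel) Gram matrix
G = X @ X.T
N = G.shape[0]
H = np.eye(N) - np.ones((N, N)) / N      # I_N - (1/N) 1
G_centered = H @ G @ H

# Kernel PCA with an RBF kernel: PCA in the implicit feature space
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
Z = kpca.fit_transform(X)
```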
Methods which seek low dimensional representations that are as similar as possible to the high dimensional data layout (note: PCA does not). You want to focus on the local structure and preserve neighborhood relationships.
It usually involves four steps :
Given \(x_i \in \mathbb{R}^d\), \(y_i \in \mathbb{R}^r\), \(r \ll d\):
1- Quantify the similarity between pairs of points in \(\mathbb{R}^d\)
2- Quantify the similarity between pairs of points in \(\mathbb{R}^r\)
3- Quantify the discrepancy between these similarities
4- Minimize with respect to \(y_i\)
These are not mappings/projections: they do not transform \(\mathbb{R}^d\) down to \(\mathbb{R}^r\).
A dissimilarity \(\delta : S \times S \mapsto \mathbb{R}\) satisfies: non-negativity \(\delta(x,y) \geq 0\), symmetry \(\delta(x,y) = \delta(y,x)\), and \(\delta(x,x) = 0\)
A metric \(d : S \times S \mapsto \mathbb{R}\) further satisfies: \(d(x,y) = 0 \iff x = y\), and the triangle inequality \(d(x,z) \leq d(x,y) + d(y,z)\)
If \(S\) is a Euclidean space, the Euclidean distance is defined as:
\[
d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}
\]
Lots of examples of dissimilarities are given in (Cox & Cox, 2008)
Given the Euclidean distances between pairs of points \(x_i \in \mathbb{R}^d\), stored in \(D_{i,j} = \|x_i - x_j\|\), can we recover \(d\) and the points \(x_i\)?
This can be done up to a Euclidean transformation (which preserves distances) (Young & Householder, 1938); e.g. we can find the centered points \(x_i\) (\(\sum_i x_i = 0\)):
Note: we can more generally consider dissimilarities in \(D\), but this does not necessarily guarantee that \(G\) is positive semidefinite (because of the possible lack of the triangle inequality)
Multidimensional scaling : “Find a configuration of points in a space, usually Euclidean, where each point represents one of the objects or individuals, and the distances between pairs of points in the configuration match as well as possible the original dissimilarities between the pairs of objects or individuals.” (Cox & Cox, 2008)
Note: 1) we do not necessarily have data points from which the dissimilarities are computed. 2) even if we do, we are interested by low dimensional representatives.
In (Torgerson, 1952) it is suggested to approximate the Gram matrix, i.e. find a matrix \(Y \in \mathcal{M}_{r,N}(\mathbb{R})\) so that \(y_i^Ty_j \approx (-\frac{1}{2}H\delta^2H)_{i,j}\). This can be solved in closed form!
And if you preserve the Gram matrix built from the dissimilarities, the Euclidean distances between the \(y\) will exactly match the dissimilarities:
\[ \begin{align} \forall i,j \|x_i - x_j\|^2 &= \|x_i\|^2 + \|x_j\|^2 - 2 <x_i, x_j>\\ D_{i,j} &= G_{i,i} + G_{j,j} -2 G_{i,j} \end{align} \]
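A minimal numpy sketch of classical (Torgerson) MDS: double-center the squared dissimilarities to approximate a Gram matrix, then embed with the top eigenpairs (toy points for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))             # hidden points in R^5
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances

N = D2.shape[0]
H = np.eye(N) - np.ones((N, N)) / N
B = -0.5 * H @ D2 @ H                    # approximates the Gram matrix

eigvals, V = np.linalg.eigh(B)           # ascending order
r = 2
lam = np.clip(eigvals[-r:], 0.0, None)   # guard against tiny negatives
Y = V[:, -r:] * np.sqrt(lam)             # N x r configuration of points
# The pairwise distances of Y approximate the original dissimilarities
```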
Sammon’s non linear mapping (Sammon, 1969) seeks to preserve the pairwise dissimilarities :
\[ E_{NLM} = \frac{1}{c} \sum_{i, j< i} \frac{(\|x_i - x_j\|_2 - \delta_{i,j})^2}{\delta_{i,j}} \]
With the \(\frac{1}{\delta_{i,j}}\) weighting, it is more important to approximate small distances well than large ones
This is solved by gradient descent using a quasi-Newton method (Newton's method, assuming the Hessian is diagonal):
\[ (x_i)_k(t+1) = (x_i)_k(t) - \epsilon_{MF} \frac{\frac{\partial E_{NLM}}{\partial (x_i)_k}}{ \left|\frac{\partial^2 E_{NLM}}{\partial (x_i)_k^2}\right| } \]
with \(\epsilon_{MF} \approx 0.3\) in (Sammon, 1969). See the paper (also (Lee & Verleysen, 2010), p82) for the equations.
Idea: If we are given datapoints on a non-linear manifold, the euclidean distance may not be an appropriate choice for a dissimilarity measure. Better to compute the distance along the manifold.
Algorithm (Isomap, (Tenenbaum, Silva, & Langford, 2000)):
- build a \(k\)-nearest-neighbor graph over the samples
- approximate the geodesic distances by shortest-path distances in this graph
- apply MDS to the resulting distance matrix
t-SNE is introduced in (Maaten & Hinton, 2008). The idea is to focus on preserving local distances while allowing large distances in \(\mathbb{R}^d\) to be even larger in \(\mathbb{R}^r\).
Similarity in \(\mathbb{R}^d\) :
\[ \begin{align} \forall i,j, p_{i,j} &= \frac{p_{i/j} + p_{j/i}}{2N}\\ \forall i,j, p_{i/j} &= \frac{\exp(-\frac{|x_i - x_j|^2}{2\sigma_i^2})}{\sum_{k\neq i}\exp(-\frac{|x_i - x_k|^2}{2\sigma_i^2})} \end{align} \]
Similarity in \(\mathbb{R}^r\) :
\[ \begin{align} \forall i,j, q_{i,j} = \frac{(1+|y_i - y_j|_2^2)^{-1}}{\sum_{k\neq l}(1+|y_k - y_l|_2^2)^{-1}} \end{align} \]
Discrepancy measured with the Kullback-Leibler divergence between \(p_{i,j}\) and \(q_{i,j}\) :
\[ \begin{aligned} C = \sum_{i,j} p_{i,j} \log(\frac{p_{i,j}}{q_{i,j}}) \end{aligned} \]
Complexity: \(O(N^2)\). Optimized with Barnes-Hut, complexity \(O(N\log N)\).
See also t-SNE on distill.pub
Example on MNIST, samples in \(\mathbb{R}^{784}\) down to \(\mathbb{R}^2\)
Example on visualizing the last convolution layer features of a MobileNet. Samples in \(\mathbb{R}^{20480}\) down to \(\mathbb{R}^2\)
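A minimal t-SNE sketch with scikit-learn (the Barnes-Hut approximation is the default; the small digits dataset stands in for MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)      # 1797 samples in R^64

tsne = TSNE(n_components=2, perplexity=30.0, init="pca", random_state=0)
Y = tsne.fit_transform(X)                # one 2D representative per sample
# Note: there is no transform() for new points; t-SNE is not a projection
```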
Rather check the full online document references.pdf
Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2016). Enriching word vectors with subword information. Retrieved from http://arxiv.org/abs/1607.04606
Cox, M. A. A., & Cox, T. F. (2008). Multidimensional scaling. In Handbook of data visualization (pp. 315–347). Berlin, Heidelberg: Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-540-33037-0_14
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805. Retrieved from http://arxiv.org/abs/1810.04805
Duch, W. (2006). Filter methods. In I. Guyon, S. Gunn, M. Nikravesh, & L. Zadeh (Eds.), Feature extraction: Foundations and applications (studies in fuzziness and soft computing) (pp. 89–117). Springer-Verlag New York, Inc. Secaucus, NJ, USA.
Goodfellow, I. J., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA, USA: http://www.deeplearningbook.org; MIT Press.
Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. (2006). Feature extraction: Foundations and applications (studies in fuzziness and soft computing). Springer-Verlag New York, Inc. Secaucus, NJ, USA.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Mach. Learn., 46(1–3), 389–422. https://doi.org/10.1023/A:1012487302797
HaCohen-Kerner, Y., Miller, D., & Yigal, Y. (2020). The influence of preprocessing on text classification using a bag-of-words representation. PLOS ONE, 15(5), 1–22. https://doi.org/10.1371/journal.pone.0232525
Hall, M., & Smith, L. (1997). Feature subset selection: A correlation based filter approach. In Proc. Intl. Conf. Neural inform. Processing intell. Inform. Syst. (pp. 855–858).
Hastie, T., Tibshirani, R., & Friedman, J. (2017). The elements of statistical learning, 2nd edition. New York, NY, USA: https://web.stanford.edu/~hastie/ElemStatLearn/; Springer New York Inc.
Hu, T.-Y., Armandpour, M., Shrivastava, A., Chang, J.-H. R., Koppula, H., & Tuzel, O. (2022). SYNT++: Utilizing imperfect synthetic data to improve speech recognition. In ICASSP 2022 - 2022 ieee international conference on acoustics, speech and signal processing (icassp) (pp. 7682–7686). https://doi.org/10.1109/ICASSP43922.2022.9746217
John, G. H., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. In MACHINE learning: PROCEEDINGS of the eleventh international (pp. 121–129). Morgan Kaufmann.
Koehn, P. (2012). Europarl: A parallel corpus for statistical machine translation. https://www.statmt.org/europarl/.
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., … al. (2020). The open images dataset v4. International Journal of Computer Vision, 128(7), 1956–1981. https://doi.org/10.1007/s11263-020-01316-z
LeCun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in neural information processing systems 2 (pp. 598–605). Morgan-Kaufmann. Retrieved from http://papers.nips.cc/paper/250-optimal-brain-damage.pdf
Lee, J. A., & Verleysen, M. (2010). Nonlinear dimensionality reduction (1st ed.). Springer Publishing Company, Incorporated.
Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics (pp. 142–150).
Maaten, L. van der, & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579–2605.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. In Y. Bengio & Y. LeCun (Eds.), 1st international conference on learning representations, ICLR 2013, scottsdale, arizona, usa, may 2-4, 2013, workshop track proceedings. Retrieved from http://arxiv.org/abs/1301.3781
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(6), 559–572.
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Empirical methods in natural language processing (emnlp) (pp. 1532–1543). Retrieved from http://www.aclweb.org/anthology/D14-1162
Reunanen, J. (2003). Overfitting in making comparisons between variable selection methods. J. Mach. Learn. Res., 3, 1371–1382.
Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, 18(5), 401–409. Retrieved from http://scholar.google.com/scholar.bib?q=info:XPZzktNQEYUJ:scholar.google.com/&output=citation&hl=en&as_sdt=2000&scfhb=1&ct=citation&cd=0
Scholkopf, B., Smola, A., & Müller, K.-R. (1999). Kernel principal component analysis. In ADVANCES in kernel methods - support vector learning (pp. 327–352). MIT Press.
Silva, L. O., & Zárate, L. E. (2014). A brief review of the main approaches for treatment of missing data. Intell. Data Anal., 18(6), 1177–1198.
Somol, P., Novovicova, J., & Pudil, P. (2010). Efficient feature subset selection and subset size optimization. In A. Herout (Ed.), Pattern recognition. Rijeka: IntechOpen. https://doi.org/10.5772/9356
Tenenbaum, J. B., Silva, V. de, & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319–2323.
Torgerson, W. S. (1952). Multidimensional scaling: I. Theory and method. Psychometrika, 17, 401–419.
Young, G., & Householder, A. (1938). Discussion of a set of points in terms of their mutual distances. Psychometrika, 3, 19–22.