An introduction to deep learning
Jeremy Fix
February 15, 2024
Slides made with slidemaker.
…but these networks are hard to train (except the CNN), and SVMs come into play in the 1990s: the second winter.
For an overview: (Schmidhuber, 2015)
See also deep reinforcement learning (Atari, AlphaGo, AlphaStar, AlphaChem), graph neural networks, etc.
Some of the reasons for the current success:
Libraries allow us to easily implement/test/deploy neural networks:
Some of the major contributors to the field:
Some of the most important conferences: NIPS/NeurIPS, ICLR, CVPR (ICML, ICASSP, …)
Online resources:
- distill.pub, blog posts (e.g. the pytorch.org blog),
- FastAI lectures, CS231n, MIT 6.S191
- Awesome Deep Learning, Awesome Deep Learning Papers
Lecture 1/2 (03/11): Introduction, linear networks, RBF
Lecture 3/4 (07/11, 10/11): Feedforward networks, differentiable programming, initialization and gradient descent
Lab work 1 (04/12): Introduction to PyTorch, TensorBoard, FCNs, CNNs
Lecture 5 (11/12): Regularization and convolutional neural network architectures
Lecture 6 (11/12): Convolutional neural networks: applications
Lab work 2 (18/12): Convolutional neural networks: semantic segmentation
Lecture 7 (12/01): Recurrent neural networks: architectures
Lecture 8 (12/01): Recurrent neural networks: applications
Lab work 3 (19/01): Recurrent neural networks: Seq2Seq for speech-to-text
Lab work 4 (26/01): Generative neural networks (RBM, GAN, DBN)
Lab works: on our DCE GPU clusters (1080, 2080 Ti, PyTorch), in pairs, remotely with VNC (because of TensorBoard).
Warning: we use dcejs. You must install and test it before the lab works. If you have issues, drop a message on the course forums.
You will be evaluated through your participation in the Kaggle competition GeoLifeCLEF 2022 - LifeCLEF 2022 x FGVC9 on location-based species presence prediction.
Training: 1,663,996 observations covering 17,037 plant and animal species
Testing: one prediction every 10th day in 2017
Evaluation metric: top-30 error (lower is better)
Data: you can download the data yourself, but it is already available on the frontal and compute nodes of the DCE under /mounts/Datasets4/GeoLifeCLEF2022/
Some key points:
Constraints
Deliverables
You will be evaluated on:
Deadline (to be confirmed): Friday, February 16th, 2024.
Suggested milestones
Write a full basic pipeline with all the boilerplate code before going into depth.
Then you can move on to extensions.
A neural network is a directed graph:
There are two types of graphs: acyclic ones (feedforward networks) and cyclic ones (recurrent networks).
But why exactly do we care about convolutional neural networks with a softmax output, ReLU hidden activations, a cross-entropy loss, batch normalization layers, trained with RMSprop with Nesterov momentum and regularized with dropout?
Given fixed, predefined feature functions \(\phi_j\), with \(\phi_0(x) = 1, \forall x \in \mathbb{R}^n\), the perceptron classifies \(x\) as:
\[\begin{align} y &= g(w^T \phi(x))\\ g(x) &= \begin{cases}-1 &\text{if }\quad x < 0 \\ +1 & \text{if }\quad x \geq 0 \end{cases} \end{align}\]
with \(\phi(x) \in \mathbb{R}^{n_a+1}\), \(\phi(x) = \begin{bmatrix} 1 \\ \phi_1(x) \\ \phi_2(x) \\ \vdots \end{bmatrix}\)
Given \((x_i, y_i)\), \(y_i \in \{-1,1\}\), the perceptron learning rule operates online: \[\begin{align} w = \begin{cases} w &\text{ if the input is correctly classified}\\ w + \phi(x_i) &\text{ if the input is incorrectly classified as -1}\\ w - \phi(x_i) &\text{ if the input is incorrectly classified as +1} \end{cases} \end{align}\]
The intersection of the valid halfspaces is called the cone of feasibility (it may be empty).
Consider two samples \(x_1, x_2\) with \(y_1=+1\), \(y_2=-1\): a feasible \(w\) must satisfy \(w^T\phi(x_1) \geq 0\) and \(w^T\phi(x_2) < 0\), i.e. lie in the intersection of two halfspaces.
Rewriting the learning rule in terms of the prediction \(\hat{y}_i = g(w^T\phi(x_i))\):
\[\begin{align} w = \begin{cases} w &\text{ if } g(w^T\phi(x_i)) = y_i\\ w + \phi(x_i) &\text{ if } g(w^T \phi(x_i)) = -1 \text{ and } y_i = +1\\ w - \phi(x_i) &\text{ if } g(w^T \phi(x_i)) = +1 \text{ and } y_i = -1 \end{cases} \end{align}\]
i.e., merging the two error cases into one:
\[\begin{align*} w = \begin{cases} w &\text{ if } g(w^T\phi(x_i)) = y_i\\ w + y_i \phi(x_i) &\text{ if } g(w^T \phi(x_i)) \neq y_i \end{cases} \end{align*}\]
Both cases can be written as a single update:
\[\begin{align*} w = w + \frac{1}{2} (y_i - \hat{y}_i) \phi(x_i) \end{align*}\]
with \(\hat{y}_i = g(w^T \phi(x_i))\). This is called the delta rule.
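To make the rule concrete, here is a minimal NumPy sketch of this online delta rule. It is an illustration only: the function names are ours, and we assume the features are simply the raw inputs with a bias, \(\phi(x) = [1, x]\).

```python
import numpy as np

def g(z):
    """Threshold transfer function: -1 if z < 0, +1 otherwise."""
    return np.where(z < 0, -1, 1)

def train_perceptron(X, y, n_epochs=100):
    """Online perceptron training with the delta rule.

    X: (N, d) raw inputs; y: (N,) labels in {-1, +1}.
    Here phi(x) = [1, x]: a bias input followed by the identity features.
    """
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])  # (N, d+1)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for phi_i, y_i in zip(Phi, y):
            y_hat = g(w @ phi_i)
            w += 0.5 * (y_i - y_hat) * phi_i  # the delta rule
    return w

# Toy usage on a linearly separable 2D problem
X = np.array([[0., 0.], [0., 1.], [2., 1.], [2., 2.]])
y = np.array([-1, -1, 1, 1])
w = train_perceptron(X, y)
print(g(np.hstack([np.ones((4, 1)), X]) @ w))  # [-1 -1  1  1]
```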
Definition (Linear separability)
A binary classification problem \((x_i, y_i) \in \mathbb{R}^d \times \{-1,1\}, i \in [1..N]\) is said to be linearly separable if there exists \(\textbf{w} \in \mathbb{R}^d\) such that:
\[\begin{align*} \forall i, \mbox{sign}(\textbf{w}^T x_i) = y_i \end{align*}\]
with \(\forall x < 0, \mbox{sign}(x) = -1, \forall x \geq 0, \mbox{sign}(x) = +1\).
Theorem (Perceptron convergence theorem)
A classification problem \((x_i, y_i) \in \mathbb{R}^d \times \{-1,1\}, i \in [1..N]\) is linearly separable if and only if the perceptron learning rule converges to a separating weight vector in a finite number of steps.
\(\Leftarrow\): easy; \(\Rightarrow\): we upper- and lower-bound \(\|w_t\|_2^2\)
\[\begin{equation} w_t = w_0 + \sum_{i}\frac{1}{2} (y_i - \hat{y}_i) \phi(x_i) \end{equation}\]
\((y_i - \hat{y}_i)\) is the prediction error
Any linear predictor involving only scalar products can be kernelized (kernel trick, cf. SVM).
Decision rule: \(\mbox{sign}(<w, x>)\)
Given \(w(t) = w_0 + \sum_{i \in \mathcal{I}} y_i x_i\)
\[\begin{align*} <w,x> &= <w_0,x> + \sum_{i \in \mathcal{I}} y_i <x_i, x> \\ \Rightarrow k(w,x) &= k(w_0, x) + \sum_{i \in \mathcal{I}} y_i k(x_i, x) \end{align*}\]
Polynomial kernel of degree \(d=3\):
\[k(x, y) = (1 + <x, y>)^3\]
Training set: 50 samples
Real risk: 92%
Code: https://github.com/rougier/ML-Recipes/blob/master/recipes/ANN/kernel-perceptron.py
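For illustration, a minimal NumPy sketch of a kernel perceptron (assuming \(w_0 = 0\); the function names are ours, and the linked recipe is the reference implementation):

```python
import numpy as np

def poly_kernel(x, y, degree=3):
    """Polynomial kernel k(x, y) = (1 + <x, y>)^degree."""
    return (1.0 + x @ y) ** degree

def train_kernel_perceptron(X, y, n_epochs=50):
    """Kernel perceptron: store coefficients alpha instead of w.

    The decision function is f(x) = sum_i alpha_i k(x_i, x), where
    alpha_i accumulates y_i each time sample i is misclassified,
    which kernelizes the update w <- w + y_i phi(x_i).
    """
    N = len(X)
    alpha = np.zeros(N)
    K = np.array([[poly_kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix
    for _ in range(n_epochs):
        for i in range(N):
            y_hat = 1 if alpha @ K[:, i] >= 0 else -1
            if y_hat != y[i]:
                alpha[i] += y[i]
    return alpha

def predict(alpha, X, x):
    """Decision rule sign(sum_i alpha_i k(x_i, x))."""
    return 1 if sum(a * poly_kernel(xi, x) for a, xi in zip(alpha, X)) >= 0 else -1
```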
Problem: Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \mathbb{R}\), minimize
\[ J(w) = \frac{1}{N} \sum_i ||y_i - w^T x_i||^2 \]
We assume that \(x_i[0] = 1, \forall i\), so that \(w[0]\) hosts the bias term.
Analytically: introduce \(X = [x_0 | x_1 | \ldots]\), whose columns are the samples, so that \(J(w) = \|y-X^Tw\|^2\). In numerator layout (see later), the chain rule gives
\[ \nabla_w J(w) = \left.\nabla_z \|z\|_2^2\right|_{z = y - X^T w} \, \nabla_w (y-X^T w) = -2(y - X^Tw)^T X^T \]
Setting \(\nabla_w J(w) = 0\) yields the normal equations \(X X^T w = X y\).
This requires computing \(XX^T\), i.e. a pass over the whole training set…
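As a quick sanity check, a NumPy sketch that solves the normal equations on a toy 1D problem (the data here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
x = rng.uniform(-1, 1, size=N)
y = 3.0 * x - 1.0 + 0.1 * rng.normal(size=N)  # ground truth: w = [-1, 3]

X = np.vstack([np.ones(N), x])       # columns are the samples, x_i[0] = 1
w = np.linalg.solve(X @ X.T, X @ y)  # normal equations X X^T w = X y
print(w)                             # approximately [-1.  3.]
```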
Alternatively, we can minimize the same cost \(J(w)\) iteratively with online gradient descent:
start at \(w_0\)
take each sample one after the other (online) \(x_i, y_i\)
denote \(\hat{y}_i = w^T x_i\) the prediction
update: \[w_{t+1}= w_t - \epsilon \nabla_w L(w_t, x_i, y_i) = w_t + \epsilon (y_i - \hat{y}_i) x_i\] (absorbing the constant factor into \(\epsilon\))
this is the delta rule again, with prediction error \(\delta = (y_i - \hat{y}_i)\): \[w_{t+1} = w_t + \epsilon \delta x_i\]
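A minimal NumPy sketch of this online update (the LMS rule); as above, we assume \(x_i[0] = 1\) so that \(w[0]\) is the bias, and the names are illustrative:

```python
import numpy as np

def lms(X, y, lr=0.05, n_epochs=50):
    """Online least-mean-squares: one delta-rule update per sample.

    X: (N, n+1) inputs with X[:, 0] = 1; y: (N,) real targets.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = w @ x_i                # prediction
            w += lr * (y_i - y_hat) * x_i  # delta rule
    return w
```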
\[J(w,x,y) = \frac{1}{N} \sum_{i=1}^N L(w,x_i,y_i)\] e.g. \(L(w,x_i,y_i) = ||y_i - w^T x_i||^2\)
Batch gradient descent
compute the gradient of the loss \(J(w)\) over the whole training set
perform one step in the direction of \(-\nabla_w J(w,x,y)\): \[w_{t+1} = w_t - \epsilon_t \textcolor{red}{\nabla_w J(w,x,y)}\]
\(\epsilon\): the learning rate
Stochastic gradient descent (SGD)
one sample at a time, noisy estimate of \(\nabla_w J\)
perform one step in the direction of \(-\nabla_w L(w,x_i,y_i)\): \[w_{t+1} = w_t - \epsilon_t \textcolor{red}{\nabla_w L(w,x_i,y_i)}\]
often faster to converge than batch gradient descent
Minibatch gradient descent: average the gradient over a random subset \(\mathcal{J}\) of \(M\) samples
\[ w_{t+1} = w_t - \epsilon_t \textcolor{red}{\frac{1}{M} \sum_{j \in \mathcal{J}} \nabla_w L(w,x_j,y_j)} \]
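For illustration, a minimal PyTorch sketch of minibatch SGD on the linear regression problem (sizes, names and hyperparameters are arbitrary):

```python
import torch

# Toy data: N samples, n features, linear ground truth plus noise
N, n, M = 1000, 5, 32
X = torch.randn(N, n)
y = X @ torch.arange(1., n + 1) + 0.1 * torch.randn(N)

model = torch.nn.Linear(n, 1)  # holds w and the bias
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=M, shuffle=True)

for epoch in range(20):
    for x_batch, y_batch in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x_batch).squeeze(-1), y_batch)
        loss.backward()   # gradient averaged over the M samples of the batch
        optimizer.step()  # w <- w - lr * grad
```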
If the batch size is too large, a generalization gap appears (LeCun, Bottou, Orr, & Müller, 1998), possibly due to convergence to sharp minima (Keskar, Mudigere, Nocedal, Smelyanskiy, & Tang, 2017); see also (Hoffer, Hubara, & Soudry, 2017).
Convex function: a function \(f: \mathbb{R}^n \mapsto \mathbb{R}\) is convex
\(\iff \forall x_1, x_2 \in \mathbb{R}^n, \forall t \in [0,1]\), \(f(t x_1 + (1-t)x_2) \leq t f(x_1) + (1-t) f(x_2)\)
or, for \(f\) twice differentiable,
\(\iff \forall x \in \mathbb{R}^n, H = \nabla^2 f(x)\) is positive semidefinite,
i.e. \(\forall v \in \mathbb{R}^n, v^T H v \geq 0\)
For a convex function \(f\), all local minima are global minima. Our losses are lower bounded, so these minima exist. Under mild conditions, gradient descent and stochastic gradient descent converge, typically requiring \(\sum_t \epsilon_t =\infty, \sum_t \epsilon_t^2 < \infty\), e.g. \(\epsilon_t = \epsilon_0/(1+t)\) (cf. lectures on convex optimization).
Problem: Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \mathbb{R}\)
Other choices may also be considered (Huber loss, MAE, …).
Possibly regularized (more on regularization later).
This problem is convex in \(w\). Indeed,
\[ \begin{align*} \nabla_w L &= (w^T x_i - y_i) x_i\\ \nabla_w^2 L &= x_i x_i^T\\ \forall u \in \mathbb{R}^n,\ u^T x_i x_i^T u &= (x_i^T u)^2 \geq 0 \end{align*} \]
Problem: Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n}, y_i \in \{0, 1\}\)
Assuming that \(P(y=1 | x) = p(x; w)\), parametrized by \(w\), and that the samples are independent, the conditional likelihood of the labels is:
\[ \mathcal{L}(w) = \prod_i P(y=y_i | x_i) = \prod_i p(x_i; w)^{y_i} (1- p(x_i; w))^{1-y_i} \]
With maximum likelihood estimation, we equivalently minimize the average negative log-likelihood:
\[ J(w) = -\frac{1}{N} \log(\mathcal{L}(w)) = \frac{1}{N} \sum_i -y_i\log(p(x_i; w))-(1-y_i)\log(1-p(x_i; w)) \]
Problem: Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \{0, 1\}\)
Linear logit model: \(o(x) = w^Tx\) (we still assume \(x[0] = 1\) for the bias)
Sigmoid transfer function: \(\hat{y}(x) = p(x; w) = \sigma(o(x)) = \sigma(w^T x)\)
Following maximum likelihood estimation, we minimize: \[ J(w) = \frac{1}{N} \sum_i -y_i\log(p(x_i; w))-(1-y_i)\log(1-p(x_i; w)) \]
The loss \(L(\hat{y}, y) = -y \log(\hat{y}) - (1-y)\log(1 - \hat{y})\) is called the cross-entropy loss, or negative log-likelihood.
The gradient of the cross-entropy loss with \(\hat{y} = \sigma(w^T x)\) is: \[ \nabla_w L(w,x_i,y_i) = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w} = -(y_i - \hat{y}_i) x_i \]
Indeed, \(\frac{\partial L}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})}\) and \(\frac{\partial \hat{y}}{\partial w} = \sigma'(w^T x)\, x = \hat{y}(1-\hat{y})\, x\), so the \(\hat{y}(1-\hat{y})\) factors cancel.
Why not an L2 loss? Compute the gradient to see why: with \(L(\hat{y}, y) = \frac{1}{2}||\hat{y} - y||^2\), we get \(\nabla_w L = (\hat{y}_i - y_i)\, \sigma'(w^T x_i)\, x_i\), and the factor \(\sigma'(w^T x_i)\) vanishes when the unit saturates, even if the prediction is wrong: learning stalls.
With a cross-entropy loss, \(\nabla_w L(w,x_i,y_i)\) is proportional to the prediction error.
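A small NumPy illustration of the two gradients on a single, badly misclassified sample (the values are made up to force saturation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 5.0])    # x[0] = 1 is the bias input
y = 1.0
w = np.array([-2.0, -2.0])  # w^T x = -12, so y_hat ~ 6e-6 although y = 1

y_hat = sigmoid(w @ x)
grad_ce = -(y - y_hat) * x                       # cross-entropy gradient
grad_l2 = (y_hat - y) * y_hat * (1 - y_hat) * x  # L2 gradient: extra sigma' factor

print(grad_ce)  # about [-1, -5]: the error drives a large update
print(grad_l2)  # about [-6e-6, -3e-5]: the saturated unit barely learns
```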
Problem: Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in [|0, K-1|]\)
Assuming that \(P(y=c | x) = \frac{e^{w_c^T x}}{\sum_k e^{w_k^T x}}\), parametrized by \(w_0, w_1, w_2, \ldots\), and that the samples are independent, the conditional likelihood of the labels is:
\[ \mathcal{L}(w) = \prod_i P(y=y_i | x_i) \]
With maximum likelihood estimation, we equivalently minimize the average negative log-likelihood:
\[ J(w) = -\frac{1}{N} \log(\mathcal{L}(w)) = -\frac{1}{N} \sum_i \log(P(y=y_i | x_i)) \]
With a one-hot encoding of the target class (i.e. \(y_i = [0, \ldots, 0, 1, 0, \ldots]\)), it can be written as:
\[ J(w) = -\frac{1}{N} \log(\mathcal{L}(w)) = -\frac{1}{N} \sum_i \sum_c y_{i,c} \log(P(y=c | x_i)) \]
Softmax regression is convex.
Large exponentials
If you compute the softmax naïvely, the \(\exp(o_i)\) terms quickly overflow.
Fortunately, with \(o^\star = \max_j o_j\):
\[ \mathrm{softmax}_i(o_1, o_2, \ldots) = \mathrm{softmax}_i(o_1 - o^\star, o_2 - o^\star, \ldots) = \frac{\exp(o_i - o^\star)}{\sum_j \exp(o_j - o^\star)} \]
This way, you always compute \(\exp(z)\) with \(z \leq 0\).
Some exponentials can be avoided altogether with the log-sum-exp trick: \(\log(\sum_j \exp(o_j)) = o^\star + \log(\sum_j \exp(o_j-o^\star))\)
In particular, you never really need to compute \(\log(\hat{y}_j) = \log(\mathrm{softmax}_j(o))\) through the softmax, since:
\[ \log(\hat{y}_i) = \log\left(\frac{\exp(o_i-o^\star)}{\sum_j \exp(o_j - o^\star)}\right) = o_i - o^\star - \log\left(\sum_j \exp(o_j - o^\star)\right) \]
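A direct NumPy transcription of this stable log-softmax:

```python
import numpy as np

def log_softmax(o):
    """Numerically stable log(softmax(o)) for a vector of logits o."""
    o_star = np.max(o)  # subtract the max logit before exponentiating
    return o - o_star - np.log(np.sum(np.exp(o - o_star)))

logits = np.array([1000.0, 1001.0, 1002.0])
print(np.exp(logits))       # naive exponentials overflow to inf
print(log_softmax(logits))  # stable: [-2.408, -1.408, -0.408]
```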
In practice, this explains why we feed raw logits to a cross-entropy loss (e.g. PyTorch's nn.CrossEntropyLoss) rather than composing Softmax with a negative log-likelihood loss, or even LogSoftmax with NLLLoss (which, despite its name, expects log-probabilities and does not apply the log itself… yes, confusing…).
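For instance, a short PyTorch sanity check of this equivalence:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)  # a batch of 4 samples, 10 classes
targets = torch.tensor([0, 3, 9, 2])

loss_ce = F.cross_entropy(logits, targets)                    # expects raw logits
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)  # expects log-probs
print(torch.allclose(loss_ce, loss_nll))  # True
```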
Perceptrons and logistic regression perform linear separation in a predefined, fixed feature space.
What about learning these features \(\phi_j(x)\)?
For the full bibliography, check the online document references.pdf.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … Amodei, D. (2020). Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, & H. Lin (Eds.), Advances in neural information processing systems (Vol. 33, pp. 1877–1901). Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [Cs]. Retrieved from http://arxiv.org/abs/1810.04805
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations. Retrieved from https://openreview.net/forum?id=YicbFdNTTy
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202. https://doi.org/10.1007/BF00344251
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Weinberger (Eds.), Advances in neural information processing systems (Vol. 27). Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4), 500–544. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1392413/
Hoffer, E., Hubara, I., & Soudry, D. (2017). Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. arXiv:1705.08741 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1705.08741
Joly, A., Goëau, H., Kahl, S., Picek, L., Lorieul, T., Cole, E., … Hrúz, M. (2022). Overview of lifeclef 2022: An evaluation of machine-learning based species identification and species distribution prediction. In A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald, G. Pasi, … N. Ferro (Eds.), Experimental ir meets multilinguality, multimodality, and interaction (pp. 257–285). Cham: Springer International Publishing.
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv:1609.04836 [Cs, Math]. Retrieved from http://arxiv.org/abs/1609.04836
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386
LeCun, Y. A., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). Efficient BackProp. In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade: Second Edition (pp. 9–48). Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-35289-8_3
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541–551. https://doi.org/10.1162/neco.1989.1.4.541
Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J. M., … Gadde, R. T. (2019). Jasper: An End-to-End Convolutional Neural Acoustic Model. arXiv:1904.03288 [Cs, Eess]. Retrieved from http://arxiv.org/abs/1904.03288
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4), 115–133. https://doi.org/10.1007/BF02478259
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2021). NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM, 65(1), 99–106. https://doi.org/10.1145/3503250
Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. Retrieved from https://github.com/CompVis/latent-diffusion and https://arxiv.org/abs/2112.10752
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408. https://doi.org/10.1037/h0042519
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117. https://doi.org/10.1016/j.neunet.2014.09.003
Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In F. Bach & D. Blei (Eds.), Proceedings of the 32nd international conference on machine learning (Vol. 37, pp. 2256–2265). Lille, France: PMLR. Retrieved from https://proceedings.mlr.press/v37/sohl-dickstein15.html
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (Vol. 30). Curran Associates, Inc.
Widrow, B., & Hoff, M. E. (1962). Associative Storage and Retrieval of Digital Information in Networks of Adaptive “Neurons”. In E. E. Bernard & M. R. Kare (Eds.), Biological Prototypes and Synthetic Systems: Volume 1 Proceedings of the Second Annual Bionics Symposium sponsored by Cornell University and the General Electric Company, Advanced Electronics Center, held at Cornell University, August 30–September 1, 1961 (pp. 160–160). Boston, MA: Springer US. https://doi.org/10.1007/978-1-4684-1716-6_25
Ze, H., Senior, A., & Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 7962–7966). Vancouver, BC, Canada: IEEE. https://doi.org/10.1109/ICASSP.2013.6639215