Deep learning

An introduction to deep learning

Jeremy Fix

February 15, 2024

Slides made with slidemaker

Introduction

Historical perspective

  • Hodgkin-Huxley (Hodgkin & Huxley, 1952) : giant squid axon
  • Formal neuron (McCulloch & Pitts, 1943) : the community gets very excited
  • Perceptron (Rosenblatt, 1958) : linear classifier
  • AdaLinE (Widrow & Hoff, 1962) : linear regressor
  • Minsky/Papert (Minsky & Papert, 1969) : first winter
  • Convolutional Neural networks (Fukushima, 1980),(LeCun et al., 1989) : great !
  • Multilayer Perceptron and backprop (Rumelhart, Hinton, & Williams, 1986) : great !

but these networks are hard to train (except the CNNs), and SVMs come into play in the 1990s… : second winter

  • 2006 : pretraining greatly helps
  • 2012 : AlexNet (Krizhevsky, Sutskever, & Hinton, 2012) on ImageNet (about 10% better on the test set than the runner-up)
  • Since then : lots of SOTA neural networks and much more
  • GANs (Goodfellow et al., 2014), Transformers for NLP (Vaswani et al., 2017), Vision transformers (Dosovitskiy et al., 2021), NERF (Mildenhall et al., 2021), Diffusion models (Sohl-Dickstein, Weiss, Maheswaranathan, & Ganguli, 2015) (Rombach, Blattmann, Lorenz, Esser, & Ommer, 2022)

For an overview : (Schmidhuber, 2015)

Success stories

Detectron2

  • Speech synthesis/recognition (Ze, Senior, & Schuster, 2013), (Li et al., 2019)
  • Automatic translation (Google Neural Machine Translation)
  • Language models (BERT, GPT-3 175B, ChatGPT) (Devlin, Chang, Lee, & Toutanova, 2019),(Brown et al., 2020)

See also deep reinforcement learning : Atari / AlphaGo / AlphaStar / AlphaChem; graph neural networks, etc.

Why is deep learning working “now” ?

Some of the reasons for the current success :

  • GPUs (processing speed) / data (regularization)
  • theoretical insights into the difficulty of training deep networks (from 2006 onwards)

Libraries make it easy to implement/test/deploy neural networks :

Resources

Books

  • Introduction to the theory of neural computation
  • Neural networks for pattern recognition
  • Deep learning book
  • Dive into deep learning
  • Deep learning with PyTorch

People and conferences

Some of the major contributors to the field:

  • N-2 : McCulloch/Pitts, Rumelhart, Rosenblatt, Hopfield
  • N-1 : Hinton, Bengio, LeCun, Schmidhuber
  • N : Goodfellow, Dauphin, Graves, Sutskever, Karpathy, Krizhevsky, Hochreiter

Some of the most important conferences : NeurIPS (formerly NIPS), ICLR, CVPR (and ICML, ICASSP, …)
Online resources :
- distill.pub, blog posts (e.g. the pytorch.org blog),
- FastAI lectures, CS231n, MIT 6.S191
- awesome deep learning, Awesome deep learning papers

Syllabus

Lecture 1/2 (03/11): Introduction, Linear networks, RBF
Lecture 3/4 (07/11, 10/11): Feedforward networks, differential programming, initialization and gradient descent

Lab work 1 (04/12) : Introduction to pytorch, tensorboard, FCN, CNNs

Lecture 5 (11/12): Regularization, and Convolutional neural networks architectures
Lecture 6 (11/12) : Convolutional Neural Networks : applications

Lab work 2 (18/12) : Convolutional neural networks : Semantic segmentation

Lecture 7 (12/01): Recurrent neural networks : architectures
Lecture 8 (12/01): Recurrent neural networks : applications

Lab work 3 (19/01): Recurrent neural networks : Seq2Seq for Speech to text
Lab work 4 (26/01): Generative Neural Networks (RBM, GAN, DBN)

Labworks : on our DCE/GPU clusters (1080, 2080 Ti, pytorch), in pairs, remotely with VNC (because of tensorboard).

warning We use dcejs. You must install it and test it before the labworks. If you have issues, drop a message on the course forums.

Evaluation [1/3]

You will be evaluated based on your participation in the Kaggle competition GeoLifeCLEF 2022 - LifeCLEF 2022 x FGVC9 on location-based species presence prediction.

GeoLifeCLEF 2022

Training: 1,663,996 observations covering 17,037 plant and animal species

Testing: one prediction every 10th day in 2017

Evaluation metric : top-30 error rate (lower is better)

Data: you can download the data yourself, but it is already available on the frontal and compute nodes of the DCE under /mounts/Datasets4/GeoLifeCLEF2022/

Some key points :

  • long-tailed distribution, large dataset (about 1 hour/epoch with a ResNet50 on an RTX 3090), image (rasters) + tabular data
  • the challenge is over (best score to beat : 0.68472); an overview is available in (Joly et al., 2022) (see references.pdf)
  • some literature is available, as some competitors described their solutions, e.g. in the leaderboard

Evaluation [2/3]

Constraints

  • work in teams of \(3\) to \(4\) students : 7 teams of \(4\) students, \(3\) teams of \(3\) students; warning DEADLINE on the 17th of November on Edunao,
  • you must use PyTorch,
  • you must use and adapt the pytorch_template_code as the base code,
  • you must run your experiments on the DCE cluster,
  • you can discuss between teams but must not share code !

Deliverables

You will be evaluated on :

  • your code (create a team repository on gitlab-student and invite me) and experiments : \(10\) points
  • a 20-minute video explaining your solution and experiments : \(5\) points
  • your performance (\(5\) points) : \(2\) extra points for the top-ranked team

Deadline (to be confirmed) : Friday 16th February 2024.

Evaluation [3/3]

Suggested milestones

Write a full basic pipeline with all the boilerplate code before going into depth.

  • [DATA] write a GeoLifeClefDataset as a PyTorch Dataset
  • [MODEL] implement a basic baseline model,
  • [METRICS] implement the basic metrics of interest, in particular the top-k error
  • [SUBMISSION] Write the test function to generate the submission file
  • [LOG] Consider logging your experiments to online dashboards (e.g. wandb)
  • [DCE] Run your first experiments with basic models and basic pipelines and make your first submissions

Then you can move on to extensions. A minimal sketch of a toy dataset and the top-k error metric is given below.
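The sketch below illustrates two of these milestones, assuming the usual (logits, integer labels) convention; ToyGeoDataset and topk_error are illustrative names, not part of the pytorch_template_code.

    import torch
    from torch.utils.data import Dataset, DataLoader


    class ToyGeoDataset(Dataset):
        """Stand-in for a GeoLifeClefDataset: returns (features, label) pairs."""

        def __init__(self, num_samples=256, num_features=16, num_classes=100):
            self.x = torch.randn(num_samples, num_features)
            self.y = torch.randint(0, num_classes, (num_samples,))

        def __len__(self):
            return len(self.y)

        def __getitem__(self, idx):
            return self.x[idx], self.y[idx]


    def topk_error(logits, targets, k=30):
        """Fraction of samples whose true class is not among the k largest logits."""
        topk = logits.topk(k, dim=1).indices               # (B, k)
        hits = (topk == targets.unsqueeze(1)).any(dim=1)   # (B,)
        return 1.0 - hits.float().mean().item()


    if __name__ == "__main__":
        loader = DataLoader(ToyGeoDataset(), batch_size=64)
        x, y = next(iter(loader))
        logits = torch.randn(x.shape[0], 100)              # random "model" predictions
        print(f"top-30 error on random logits: {topk_error(logits, y):.3f}")

The real dataset will of course read the rasters and tabular features from /mounts/Datasets4/GeoLifeCLEF2022/ instead of generating random tensors.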

What is a neural network ?

Definition

A neural network is a directed graph :

  • nodes : computational units
  • edges : weighted connections
Feedforward neural network
Recurrent neural network

There are two types of graphs :

  • no cycle : feedforward neural network
  • with at least one cycle : recurrent neural networks

But why exactly do we care about convolutional neural networks with a softmax output, ReLU hidden activations, a cross-entropy loss, batch normalization layers, trained with RMSprop with Nesterov momentum and regularized with dropout ?

Linear Neural networks

Perceptron (Rosenblatt, 1958)

Perceptron (Rosenblatt, 1958)

  • Classification : given \((x_i, y_i) \in \mathbb{R}^n \times \{-1, 1\}\)
  • Sensory - Associative - Response architecture, \(\phi_j(x)\) with \(\phi_0(x) = 1\)
  • Algorithm and geometrical interpretation
SAR Architecture

Architecture of the classifier

Given fixed, predefined feature functions \(\phi_j\), with \(\phi_0(x) = 1, \forall x \in \mathbb{R}^n\), the perceptron classifies \(x\) as :

\[\begin{align} y &= g(w^T \Phi(x))\\ g(x) &= \begin{cases}-1 &\text{if }\quad x < 0 \\ +1 & \text{if }\quad x \geq 0 \end{cases} \end{align}\]

with \(\phi(x) \in \mathbb{R}^{n_a+1}\), \(\phi(x) = \begin{bmatrix} 1 \\ \phi_1(x) \\ \phi_2(x) \\ \vdots \end{bmatrix}\)

SAR Architecture

Online training algorithm

Given \((x_i, y_i)\), \(y_i \in \{-1,1\}\), the perceptron learning rule operates online: \[\begin{align} w = \begin{cases} w &\text{ if the input is correctly classified}\\ w + \phi(x_i) &\text{ if the input is incorrectly classified as -1}\\ w - \phi(x_i) &\text{ if the input is incorrectly classified as +1} \end{cases} \end{align}\]
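A minimal sketch of this online rule on synthetic data, assuming the identity feature map plus the constant bias feature \(\phi_0(x)=1\) (the toy blobs and the number of epochs are illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: two Gaussian blobs, labels in {-1, +1}
    X = np.vstack([rng.normal(+2.0, 1.0, (50, 2)), rng.normal(-2.0, 1.0, (50, 2))])
    y = np.hstack([np.ones(50), -np.ones(50)])


    def phi(x):
        return np.concatenate(([1.0], x))   # phi_0(x) = 1 carries the bias


    def predict(w, x):
        return 1.0 if w @ phi(x) >= 0 else -1.0


    w = np.zeros(3)
    for epoch in range(10):
        for xi, yi in zip(X, y):
            if predict(w, xi) != yi:        # update only on mistakes
                w += yi * phi(xi)           # +phi if misclassified as -1, -phi otherwise

    errors = sum(predict(w, xi) != yi for xi, yi in zip(X, y))
    print(f"training errors: {errors}, w = {w}")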

Geometrical interpretation : correct classification

Decision rule : \(y = g(w^T \Phi(x))\)
Algorithm:
\[\begin{align} w = \begin{cases} w &\text{ if the input is correctly classified}\\ w + \phi(x_i) &\text{ if the input is incorrectly classified as -1}\\ w - \phi(x_i) &\text{ if the input is incorrectly classified as +1} \end{cases} \end{align}\]

A correctly classified sample, either positive or negative

Geometrical interpretation : misclassification

Decision rule : \(y = g(w^T \Phi(x))\)
Algorithm:
\[\begin{align} w = \begin{cases} w &\text{ if the input is correctly classified}\\ w + \phi(x_i) &\text{ if the input is incorrectly classified as -1}\\ w - \phi(x_i) &\text{ if the input is incorrectly classified as +1} \end{cases} \end{align}\]

An incorrectly classified sample, either positive or negative

Geometrical interpretation : multiple samples

Decision rule : \(y = g(w^T \Phi(x))\)

The intersection of the valid halfspaces is called the cone of feasibility (it may be empty).

Consider two samples \(x_1, x_2\) with \(y_1=+1\), \(y_2=-1\)

The cone of feasibility for \(y_1=+1\) and \(y_2=-1\)

Toward a canonical learning rule

Given \((x_i, y_i)\), \(y_i \in \{-1,1\}\), the perceptron learning rule operates online: \[\begin{align} w = \begin{cases} w &\text{ if the input is correctly classified}\\ w + \phi(x_i) &\text{ if the input is incorrectly classified as -1}\\ w - \phi(x_i) &\text{ if the input is incorrectly classified as +1} \end{cases} \end{align}\]

\[\begin{align} w = \begin{cases} w &\text{ if } g(w^T\phi(x_i)) = y_i\\ w + \phi(x_i) &\text{ if } g(w^T \phi(x_i)) = -1 \text{ and } y_i = +1\\ w - \phi(x_i) &\text{ if } g(w^T \phi(x_i)) = +1 \text{ and } y_i = -1 \end{cases} \end{align}\]

\[\begin{align*} w = \begin{cases} w &\text{ if } g(w^T\phi(x_i)) = y_i\\ w + y_i \phi(x_i) &\text{ if } g(w^T \phi(x_i)) \neq y_i \end{cases} \end{align*}\]

Toward a canonical learning rule

Given \((x_i, y_i)\), \(y_i \in \{-1,1\}\), the perceptron learning rule operates online:

\[\begin{align*} w = \begin{cases} w &\text{ if } g(w^T\phi(x_i)) = y_i\\ w + y_i \phi(x_i) &\text{ if } g(w^T \phi(x_i)) \neq y_i \end{cases} \end{align*}\]

\[\begin{align*} w = w + \frac{1}{2} (y_i - \hat{y}_i) \phi(x_i) \end{align*}\]

with \(\hat{y}_i = g(w^T \phi(x_i))\). This is called the delta rule.

Perceptron convergence theorem

Definition (Linear separability)

A binary classification problem \((x_i, y_i) \in \mathbb{R}^d \times \{-1,1\}, i \in [1..N]\) is said to be linearly separable if there exists \(\textbf{w} \in \mathbb{R}^d\) such that :

\[\begin{align*} \forall i, \mbox{sign}(\textbf{w}^T x_i) = y_i \end{align*}\]

with \(\forall x < 0, \mbox{sign}(x) = -1, \forall x \geq 0, \mbox{sign}(x) = +1\).

Theorem (Perceptron convergence theorem)

A classification problem \((x_i, y_i) \in \mathbb{R}^d \times \{-1,1\}, i \in [1..N]\) is linearly separable if and only if the perceptron learning rule converges to a separating solution in a finite number of steps.

\(\Leftarrow\): easy; \(\Rightarrow\) : we upper/lower bound \(|w(t)|_2^2\)

Various facts

  • \(w_t = w_0 + \sum_{i \in \mathcal{I}(t)} y_i \phi(x_i)\), with \(\mathcal{I}(t)\) the set of misclassified samples
  • it minimizes a loss : \(J(w) = \frac{1}{N} \sum_i \max(0, -y_i w^T \phi(x_i))\)
  • the solution can be written as

\[\begin{equation} w_t = w_0 + \sum_{i}\frac{1}{2} (y_i - \hat{y}_i) \phi(x_i) \end{equation}\]

\((y_i - \hat{y}_i)\) is the prediction error

Kernel perceptron

Any linear predictor involving only scalar products can be kernelized (kernel trick, cf SVM);

Decision rule : \(\mbox{sign}(<w, x>)\)

Given \(w(t) = w_0 + \sum_{i \in \mathcal{I}} y_i x_i\)

\[\begin{align*} <w,x> &= <w_0,x> + \sum_{i \in \mathcal{I}} y_i <x_i, x> \\ \Rightarrow k(w,x) &= k(w_0, x) + \sum_{i \in \mathcal{I}} y_i k(x_i, x) \end{align*}\]

Kernel perceptron

Polynomial kernel of degree \(d=3\) :

\[k(x, y) = (1 + <x, y>)^3\]

Training set : 50 samples

Real risk : \(92\%\)

Code : https://github.com/rougier/ML-Recipes/blob/master/recipes/ANN/kernel-perceptron.py

AdaLinE (Widrow & Hoff, 1962)

Linear regression analytically

Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \mathbb{R}\), minimize

\[ J(w) = \frac{1}{N} \sum_i ||y_i - w^T x_i||^2 \]

We assume that \(x_i[0] = 1 \forall i\) so that \(w[0]\) hosts the bias term.

Analytically : introduce \(X = [x_0 | x_1 | \dots ]\) (one sample per column), and write \(J(w) = \|y-X^Tw\|^2\) (dropping the constant \(\frac{1}{N}\)). In numerator layout (see later) :

\[ \nabla_w J(w) = \left.\nabla_z \|z\|_2^2\right|_{z=y-X^T w}\, \nabla_w (y-X^T w) = -2(y - X^Tw)^T X^T \]
\[ \nabla_w J(w) = 0 \Rightarrow X X^T w = X y \]

  • \(XX^T\) non singular : \(w = (X X^T)^{-1} X y\)
  • \(XX^T\) singular (e.g. points along a line in 2D) : infinitely many solutions
    • One solution can be found with regularized least squares : \[ \min_w G(w) = J(w) + \alpha w^T w \]
    • \(\nabla_w G(w) = 0 \Rightarrow (X X^T + \alpha I) w = X y\)
    • as soon as \(\alpha > 0\), \((X X^T + \alpha I)\) is not singular

This requires computing \(XX^T\), i.e. a pass over the whole training set… (see the sketch below)
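A minimal numpy sketch of this closed-form, ridge-regularized solution on synthetic data, with samples stored as columns of \(X\) so that \(w = (X X^T + \alpha I)^{-1} X y\) (the dimensions, noise level and \(\alpha\) are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    N, n = 200, 3
    w_true = np.array([0.5, -1.0, 2.0, 0.3])              # first entry plays the role of the bias

    X = np.vstack([np.ones(N), rng.normal(size=(n, N))])  # shape (n+1, N), x_i[0] = 1
    y = w_true @ X + 0.1 * rng.normal(size=N)             # noisy linear targets

    alpha = 1e-3
    w_hat = np.linalg.solve(X @ X.T + alpha * np.eye(n + 1), X @ y)
    print("estimated w:", np.round(w_hat, 2))             # close to w_true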

Linear regression with stochastic gradient descent

Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \mathbb{R}\), minimize

\[ J(w) = \frac{1}{N} \sum_i ||y_i - w^T x_i||^2 \]

We assume that \(x_i[0] = 1 \forall i\) so that \(w[0]\) hosts the bias term.

  • start at \(w_0\)

  • take each sample one after the other (online) \(x_i, y_i\)

  • denote \(\hat{y}_i = w^T x_i\) the prediction

  • update \[w_{t+1}= w_t - \epsilon \nabla_w L(w_t,x_i,y_i) = w_t + \epsilon (y_i - \hat{y}_i) x_i\]

  • delta rule, \(\delta = (y_i - \hat{y}_i)\) prediction error \[w_{t+1} = w_t + \epsilon \delta x_i\]
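A minimal numpy sketch of this online delta rule on synthetic data, with samples stored as rows and \(x_i[0] = 1\) for the bias; the constant learning rate \(\epsilon\) and the number of epochs are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)

    N, n = 200, 3
    w_true = np.array([0.5, -1.0, 2.0, 0.3])
    X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, n))])   # one sample per row, x_i[0] = 1
    y = X @ w_true + 0.1 * rng.normal(size=N)

    w = np.zeros(n + 1)
    epsilon = 0.01
    for epoch in range(50):
        for xi, yi in zip(X, y):
            delta = yi - w @ xi               # prediction error
            w += epsilon * delta * xi         # delta rule: w <- w + eps * delta * x
    print("estimated w:", np.round(w, 2))     # close to w_true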

Gradient descent

Batch gradient descent

\[J(w,x,y) = \frac{1}{N} \sum_{i=1}^N L(w,x_i,y_i)\] e.g. \(L(w,x_i,y_i) = ||y_i - w^T x_i||^2\)

Batch gradient descent

  • compute the gradient of the loss \(J(w)\) over the whole training set

  • perform one step in the direction of \(-\nabla_w J(w,x,y)\) \[w_{t+1} = w_t - \epsilon_t \textcolor{red}{\nabla_w J(w,x,y)}\]

  • \(\epsilon\) : learning rate

Stochastic gradient descent

\[J(w,x,y) = \frac{1}{N} \sum_{i=1}^N L(w,x_i,y_i)\] e.g. \(L(w,x_i,y_i) = ||y_i - w^T x_i||^2\)

Stochastic gradient descent (SGD)

  • one sample at a time, noisy estimate of \(\nabla_w J\)

  • perform one step in the direction of \(-\nabla_w L(w,x_i,y_i)\) \[w_{t+1} = w_t - \epsilon_t \textcolor{red}{\nabla_w L(w,x_i,y_i)}\]

  • faster to converge than gradient descent

Minibatch gradient descent

\[J(w,x,y) = \frac{1}{N} \sum_{i=1}^N L(w,x_i,y_i)\] e.g. \(L(w,x_i,y_i) = ||y_i - w^T x_i||^2\)

Minibatch

  • noisy estimate of the true gradient with \(M\) samples (e.g. \(M=64, 128\)); \(M\) is the minibatch size
  • draw a random subset \(\mathcal{J}\) of indices with \(|\mathcal{J}| = M\), one minibatch at a time

\[ w_{t+1} = w_t - \epsilon_t \textcolor{red}{\frac{1}{M} \sum_{j \in \mathcal{J}} \nabla_w L(w,x_j,y_j)} \]

  • smoother estimate than SGD
  • great for parallel architectures (GPU)

If the batch size is too large, there is a generalization gap (LeCun, Bottou, Orr, & Müller, 1998), maybe due to convergence to sharp minima (Keskar, Mudigere, Nocedal, Smelyanskiy, & Tang, 2017); see also (Hoffer, Hubara, & Soudry, 2017).
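A minimal PyTorch sketch of minibatch SGD on a toy linear regression problem, with a DataLoader drawing random minibatches of size \(M = 64\); the data, model and hyperparameters are illustrative:

    import torch
    from torch.utils.data import TensorDataset, DataLoader

    torch.manual_seed(0)
    N, n = 1000, 3
    w_true = torch.tensor([0.5, -1.0, 2.0])
    X = torch.randn(N, n)
    y = X @ w_true + 0.3 + 0.1 * torch.randn(N)          # 0.3 is the bias

    loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)
    model = torch.nn.Linear(n, 1)                        # the bias is handled by nn.Linear
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
    loss_fn = torch.nn.MSELoss()

    for epoch in range(20):
        for xb, yb in loader:                            # one noisy gradient step per minibatch
            optimizer.zero_grad()
            loss = loss_fn(model(xb).squeeze(1), yb)
            loss.backward()
            optimizer.step()

    print(model.weight.data, model.bias.data)            # close to w_true and 0.3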

Does it make sense to use gradient descent ?

Convex function : a function \(f: \mathbb{R}^n \rightarrow \mathbb{R}\) is convex :

  1. \(\iff \forall x_1, x_2 \in \mathbb{R}^n, \forall t \in [0,1]\) \(f(t x_1 + (1-t)x_2) \leq t f(x_1) + (1-t) f(x_2)\)

  2. with \(f\) twice diff.,
    \(\iff \forall x \in \mathbb{R}^n, H = \nabla^2 f(x)\) is positive semidefinite
    i.e. \(\forall x \in \mathbb{R}^n, x^T H x \geq 0\)

For a convex function \(f\), all local minima are global minima. Our losses are lower bounded, so these minima exist. Under mild conditions, gradient descent and stochastic gradient descent converge, typically \(\sum \epsilon_t =\infty, \sum \epsilon_t^2 < \infty\) (cf lectures on convex optimization).

Linear regression

Summary

Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \mathbb{R}\)

  • We assume that \(x[0] = 1\) to encompass the bias
  • Linear model : \(\hat{y} = w^T x\)
  • L2 loss : \(L(\hat{y}, y) = \|\hat{y} - y \|^2\)
  • by gradient descent \[ \nabla_w L(w,x_i,y_i) = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w} = -(y_i - \hat{y}_i) x_i \]

Other choices may also be considered (Huber loss, MAE, …).

Possibly regularized (but more on regularization later).

Linear regression with L2 loss is convex

Indeed,

  • Given \(x_i, y_i\), \(L(w) = \frac{1}{2}(w^T x_i - y_i)^2\) is convex:

\[ \begin{align*} \nabla_w L &= (w^T x_i - y_i) x_i\\ \nabla_w^2 L &= x_i x_i^T\\ \forall x \in \mathbb{R}^n, \quad x^T x_i x_i^T x &= (x_i^T x)^2 \geq 0 \end{align*} \]

  • a non negative weighted sum of convex functions is convex

Linear classification

Maximum likelihood (binary classification)

Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n}, y_i \in \{0, 1\}\)

Assume that \(P(y=1 | x) = p(x; w)\), parametrized by \(w\), and that the samples are independent; the conditional likelihood of the labels is:

\[ \mathcal{L}(w) = \prod_i P(y=y_i | x_i) = \prod_i p(x_i; w)^{y_i} (1- p(x_i; w))^{1-y_i} \]

With maximum likelihood estimation, we equivalently minimize the averaged negative log-likelihood :

\[ J(w) = -\frac{1}{N} \log(\mathcal{L}(w)) = \frac{1}{N} \sum_i -y_i\log(p(x_i; w))-(1-y_i)\log(1-p(x_i; w)) \]

Logistic regression (binary classification)

Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \{0, 1\}\)

  • Linear logit model : \(o(x) = w^Tx\) (we still assume \(x[0] = 1\) for the bias)

  • Sigmoid transfer function : \(\hat{y}(x) = p(x; w) = \sigma(o(x)) = \sigma(w^T x)\)

    • \(\sigma(x) = \frac{1}{1 + \exp(-x)}\), \(\sigma(x) \in [0, 1]\)
    • \(\frac{d}{dx}\sigma(x) = \sigma(x) (1 - \sigma(x))\)
  • Following maximum likelihood estimation, we minimize : \[ J(w) = \frac{1}{N} \sum_i -y_i\log(p(x_i; w))-(1-y_i)\log(1-p(x_i; w)) \]

  • The loss \(L(\hat{y}, y) = -y \log(\hat{y}) - (1-y)\log(1 - \hat{y})\) is called the cross entropy loss, or negative log-likelihood

  • The gradient of the cross entropy loss with \(\hat{y}(x) = \sigma(w^T x)\) is : \[ \nabla_w L(w,x_i,y_i) = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w} = -(y_i - \hat{y}_i) x_i \]
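A minimal PyTorch sketch of this logistic regression on two toy Gaussian blobs; BCEWithLogitsLoss applies the sigmoid internally, so the model only outputs the logit \(o(x)\) (the data and hyperparameters are illustrative):

    import torch

    torch.manual_seed(0)
    N, n = 500, 2
    X = torch.cat([torch.randn(N // 2, n) + 2.0, torch.randn(N // 2, n) - 2.0])
    y = torch.cat([torch.ones(N // 2), torch.zeros(N // 2)])

    model = torch.nn.Linear(n, 1)                        # logit o(x) = w^T x + b
    loss_fn = torch.nn.BCEWithLogitsLoss()               # sigmoid + cross entropy, numerically stable
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(200):
        optimizer.zero_grad()
        loss = loss_fn(model(X).squeeze(1), y)
        loss.backward()
        optimizer.step()

    accuracy = ((model(X).squeeze(1) > 0).float() == y).float().mean()
    print(f"final loss {loss.item():.3f}, accuracy {accuracy:.2f}")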

Logistic regression is convex

Indeed,

  • Given \(x_i, y_i=1\), \(L_1(w) = -\log(\sigma(w^T x_i)) = \log(1 + \exp(-w^Tx_i))\),
    \(\nabla_w L_1 = -(1 - \sigma(w^Tx_i)) x_i\)
    \(\nabla_w^2 L_1 = \underbrace{\sigma(w^T x_i) (1-\sigma(w^Tx_i))}_{>0} x_i x_i^T\)
  • Given \(x_i, y_i=0\), \(L_2(w) = -\log(1-\sigma(w^T x_i))\)
    \(\nabla_w L_2 = \sigma(w^T x_i) x_i\)
    \(\nabla_w^2 L_2 = \underbrace{\sigma(w^T x_i) (1-\sigma(w^Tx_i))}_{>0} x_i x_i^T\)
  • a non negative weighted sum of convex functions is convex

Do not use a L2 loss

Compute the gradient to see why

Take L2 loss \(L(\hat{y}, y) = \frac{1}{2}||\hat{y} - y||^2\)

  • Take the “linear” model : \(\hat{y}_i = \sigma(w^T x_i)\)
  • Check that \(\frac{d}{dx} \sigma(x) = \sigma(x) (1 - \sigma(x))\)
  • Compute the gradient wrt \(w\):
    \[ \nabla_w L(w,x_i,y_i) = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w} = (\hat{y}_i - y_i) \textcolor{red}{\sigma(w^T x_i) (1 - \sigma(w^T x_i))} x_i \]
  • If \(x_i\) is strongly misclassified (e.g. \(y_i=1\) and \(w^T x_i \ll 0\)), then \(\sigma(w^T x_i) (1 - \sigma(w^T x_i)) \approx 0\), i.e. \(\nabla_w L(w,x_i,y_i) \approx 0\) \(\Rightarrow\) the step size is very small even though the sample is misclassified

With a cross entropy loss, \(\nabla_w L(w,x_i,y_i)\) is proportional to the error
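A small numeric check of this argument (the logit values are illustrative): for a strongly misclassified sample (\(y=1\), \(w^T x\) very negative), the gradient of the L2 loss vanishes while the cross-entropy gradient stays proportional to the error.

    import torch

    x = torch.tensor([1.0])
    y = torch.tensor([1.0])

    for o in [-2.0, -5.0, -10.0]:                        # increasingly misclassified logits
        w = torch.tensor([o], requires_grad=True)        # logit = w * x = o
        y_hat = torch.sigmoid(w * x)
        l2 = 0.5 * (y_hat - y).pow(2).sum()              # L2 loss on the sigmoid output
        (g_l2,) = torch.autograd.grad(l2, w)

        w2 = torch.tensor([o], requires_grad=True)
        ce = torch.nn.functional.binary_cross_entropy_with_logits(w2 * x, y)
        (g_ce,) = torch.autograd.grad(ce, w2)

        print(f"logit={o:6.1f}  |grad L2|={g_l2.abs().item():.2e}  |grad CE|={g_ce.abs().item():.2e}")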

Softmax regression (multiclass classification)

Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in [|0, K-1|]\)

Assume that \(P(y=c | x) = \frac{e^{w_c^T x}}{\sum_k e^{w_k^T x}}\), parametrized by \(w_0, w_1, w_2, ..\), and our samples to be independent, the conditional likelihood of the labels is:

\[ \mathcal{L}(w) = \prod_i P(y=y_i | x_i) \]

With maximum likelihood estimation, we equivalently minimize the averaged negative log-likelihood:

\[ J(w) = -\frac{1}{N} \log(\mathcal{L}(w)) = -\frac{1}{N} \sum_i \log(P(y=y_i | x_i)) \]

With a one-hot encoding of the target class (i.e. \(y_i = [0, ..., 0, 1, 0, .. ]\)), it can be written as :

\[ J(w) = -\frac{1}{N} \log(\mathcal{L}(w)) = -\frac{1}{N} \sum_i \sum_c y_{i,c} \log(P(y=c | x_i)) \]

Softmax regression (multiclass classification)

Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in [|0, K-1|]\)

  • Linear models for each class \(o_j(x) = w_j^T x\) (we still assume \(x[0] = 1\))
  • Softmax transfer function : \(P[y=j|x] = \hat{y}_j = \frac{\exp(o_j(x))}{\sum_k \exp(o_k(x))}\)
  • Generalization of the sigmoid for a vectorial output
  • Following maximum likelihood estimation, we minimize \[ J(w) = -\frac{1}{N} \log(\mathcal{L}(w)) = -\frac{1}{N} \sum_i \log(P(y=y_i | x_i)) \]
  • The loss \(L(\hat{y}, y) = -\log(\hat{y}_y)\) is called the cross-entropy loss
  • by gradient descent: \[\nabla_{w_j} L(w,x,y) = \sum_k \frac{\partial L}{\partial \hat{y}_k} \frac{\partial \hat{y}_k}{\partial w_j} = -(\delta_{j,y} - \hat{y}_j) x\]

Softmax regression is convex.
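A minimal PyTorch sketch of softmax regression: one linear model per class producing the logits \(o_j(x)\), trained with nn.CrossEntropyLoss, which takes the logits directly (for the numerical reasons discussed on the next slide); the toy data and hyperparameters are illustrative.

    import torch

    torch.manual_seed(0)
    N, n, K = 600, 2, 3
    X = torch.randn(N, n)
    y = torch.randint(0, K, (N,))
    # shift each class to a distinct mean so the problem is learnable
    means = torch.tensor([[3.0, 0.0], [-3.0, 0.0], [0.0, 3.0]])
    X = X + means[y]

    model = torch.nn.Linear(n, K)                        # one linear model o_j per class
    loss_fn = torch.nn.CrossEntropyLoss()                # log-softmax + NLL computed on the logits
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(300):
        optimizer.zero_grad()
        loss_fn(model(X), y).backward()
        optimizer.step()

    accuracy = (model(X).argmax(dim=1) == y).float().mean()
    print(f"training accuracy: {accuracy:.2f}")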

Numerical issues with the softmax and CE loss

Large exponentials

If you compute the softmax naïvely, you evaluate \(\exp(\cdot)\), which quickly becomes very large (overflow).

Fortunately:

\[ \mathrm{softmax}_i(o_1, o_2, o_3, \dots) = \mathrm{softmax}_i(o_1 - o^\star, o_2 - o^\star, o_3-o^\star, \dots) = \frac{\exp(o_i - o^\star)}{\sum_j \exp(o_j - o^\star)}, \quad o^\star = \max_j o_j \]

You always compute \(\exp(z)\) with \(z \leq 0\).

Avoiding some exponentials with the log-sum-exp trick \(\log(\sum_j \exp(o_j)) = o^\star + \log(\sum_j \exp(o_j-o^\star))\)

You do not really need to compute \(\log(\hat{y}_j) = \log(\mathrm{softmax}_j(x))\) explicitly since :

\[ \log(\hat{y}_i) = \log(\frac{\exp(o_i-o^\star)}{\sum_j \exp(o_j - o^\star)}) = o_i - o^\star - \log(\sum_j \exp(o_j - o^\star)) \]

In practice, this explains why we use the cross-entropy loss on the logit outputs directly rather than Softmax + negative log-likelihood, or even LogSoftmax + NLLLoss (which, despite its name, does not apply the log… confusing indeed). A sketch of the stabilization is given below.
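A minimal numpy sketch of this stabilization: subtract the largest logit before exponentiating, and compute the log-softmax directly with the log-sum-exp trick.

    import numpy as np

    def log_softmax(o):
        o_star = o.max()                                  # o* = max_j o_j
        return o - o_star - np.log(np.exp(o - o_star).sum())

    o = np.array([1000.0, 1001.0, 999.0])                 # naive exp(1000) would overflow
    print(log_softmax(o))                                 # finite log-probabilities
    print(np.exp(log_softmax(o)).sum())                   # ~1.0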

Toward non linear networks

Limits of linear classification

Perceptrons and logistic regression perform linear separation in a predefined, fixed feature space.

The XOR and its transformation

What about learning these features \(\phi_j(x)\)?

Bibliography

References

Rather check the full online document references.pdf

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … Amodei, D. (2020). Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, & H. Lin (Eds.), Advances in neural information processing systems (Vol. 33, pp. 1877–1901). Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [Cs]. Retrieved from http://arxiv.org/abs/1810.04805

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations. Retrieved from https://openreview.net/forum?id=YicbFdNTTy

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202. https://doi.org/10.1007/BF00344251

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Weinberger (Eds.), Advances in neural information processing systems (Vol. 27). Curran Associates, Inc. Retrieved from https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf

Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4), 500–544. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1392413/

Hoffer, E., Hubara, I., & Soudry, D. (2017). Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. arXiv:1705.08741 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1705.08741

Joly, A., Goëau, H., Kahl, S., Picek, L., Lorieul, T., Cole, E., … Hrúz, M. (2022). Overview of lifeclef 2022: An evaluation of machine-learning based species identification and species distribution prediction. In A. Barrón-Cedeño, G. Da San Martino, M. Degli Esposti, F. Sebastiani, C. Macdonald, G. Pasi, … N. Ferro (Eds.), Experimental ir meets multilinguality, multimodality, and interaction (pp. 257–285). Cham: Springer International Publishing.

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv:1609.04836 [Cs, Math]. Retrieved from http://arxiv.org/abs/1609.04836

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386

LeCun, Y. A., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). Efficient BackProp. In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade: Second Edition (pp. 9–48). Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-35289-8_3

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541–551. https://doi.org/10.1162/neco.1989.1.4.541

Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J. M., … Gadde, R. T. (2019). Jasper: An End-to-End Convolutional Neural Acoustic Model. arXiv:1904.03288 [Cs, Eess]. Retrieved from http://arxiv.org/abs/1904.03288

McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4), 115–133. https://doi.org/10.1007/BF02478259

Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., & Ng, R. (2021). NeRF: Representing scenes as neural radiance fields for view synthesis. Commun. ACM, 65(1), 99–106. https://doi.org/10.1145/3503250

Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry.

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. Retrieved from https://github.com/CompVis/latent-diffusion and https://arxiv.org/abs/2112.10752

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408. https://doi.org/10.1037/h0042519

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117. https://doi.org/10.1016/j.neunet.2014.09.003

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. In F. Bach & D. Blei (Eds.), Proceedings of the 32nd international conference on machine learning (Vol. 37, pp. 2256–2265). Lille, France: PMLR. Retrieved from https://proceedings.mlr.press/v37/sohl-dickstein15.html

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (Vol. 30). Curran Associates, Inc.

Widrow, B., & Hoff, M. E. (1962). Associative Storage and Retrieval of Digital Information in Networks of Adaptive “Neurons”. In E. E. Bernard & M. R. Kare (Eds.), Biological Prototypes and Synthetic Systems: Volume 1 Proceedings of the Second Annual Bionics Symposium sponsored by Cornell University and the General Electric Company, Advanced Electronics Center, held at Cornell University, August 30–September 1, 1961 (pp. 160–160). Boston, MA: Springer US. https://doi.org/10.1007/978-1-4684-1716-6_25

Ze, H., Senior, A., & Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 7962–7966). Vancouver, BC, Canada: IEEE. https://doi.org/10.1109/ICASSP.2013.6639215