Deep learning

An introduction to deep learning

Jeremy Fix

February 15, 2024

Radial basis function networks (RBFN) (Broomhead & Lowe, 1988)

Architecture (Broomhead & Lowe, 1988)

RBFNs are prototype-based function approximators.
specific architecture with a single layer of learnable feature vectors with “weights” (parameters) $(\mu_j, \sigma_j)_{j\in[0..N_a-1]}$

\[ \begin{eqnarray*} \phi(x) = \begin{pmatrix} 1 \\ \exp{\frac{-||x-\mu_0||^2}{2\sigma_0^2}} \\ \vdots\\ \exp{\frac{-||x-\mu_{N_a-1}||^2}{2\sigma_{N_a-1}^2}} \\ \end{pmatrix} \end{eqnarray*} \]
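As a minimal sketch of the feature map $\phi$ above, assuming NumPy and hypothetical centers and widths chosen only for illustration:

```python
import numpy as np

def rbf_features(x, mu, sigma):
    """Map one input x of shape (d,) to phi(x) of shape (N_a + 1,).

    mu has shape (N_a, d) (the centers), sigma has shape (N_a,) (the widths).
    """
    sq_dist = np.sum((x - mu) ** 2, axis=1)              # ||x - mu_j||^2
    activations = np.exp(-sq_dist / (2.0 * sigma ** 2))  # Gaussian kernels
    return np.concatenate(([1.0], activations))          # leading 1 for the bias

# Hypothetical centers and widths
mu = np.array([[0.0, 0.0], [1.0, 1.0]])
sigma = np.array([1.0, 0.5])
phi = rbf_features(np.array([0.0, 0.0]), mu, sigma)
```

The input sits exactly on the first center, so the corresponding feature equals 1, while the second feature decays with the squared distance to the second center.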

Regression

identity transfer function $y = w^T \phi(x)$
L2 loss
$L(y, \hat{y}) = \|\hat{y} - y\|^2$

Binary classification

sigmoidal transfer function $y = \sigma(w^T \phi(x))$
CE loss \[ \begin{array}{rl} L(y, \hat{y}) =&-y \log(\hat{y})\\ &-(1-y) \log(1-\hat{y}) \end{array} \]

Multiclass classification

softmax transfer function (see Lecture 1)
CE loss (see Lecture 1)

Learning

We know how to learn the weights $w$ : minibatch gradient descent (or a variant thereof)
What about the centers and variances ? (Schwenker, Kestler, & Palm, 2001)
- place them uniformly, randomly, or by vector quantization (K-means++ (Arthur & Vassilvitskii, 2007), GNG (Fritzke, 1994))
- two phases : fix the centers/variances, fit the weights
- three phases : fix the centers/variances, fit the weights, fit everything ($\nabla_{\mu} L, \nabla_{\sigma} L, \nabla_w L$)

Universal approximator

Theorem : Universal approximation (Park & Sandberg, 1991)

Denote $\mathcal{S}$ the family of functions based on RBF in $\mathbb{R}^d$: \[\mathcal{S} = \{g \in \mathbb{R}^d \to \mathbb{R}, g(x) = \sum_i w_i K(\frac{x-\mu_i}{\sigma}), w \in \mathbb{R}^N\}\] with $K : \mathbb{R}^d \rightarrow \mathbb{R}$ continuous almost everywhere and $\int_{\mathbb{R}^d}K(x)dx \neq 0$,
Then $\mathcal{S}$ is dense in $L^p(\mathbb{R}^d)$ for every $p \in [1, \infty)$

In particular, it applies to the Gaussian kernel introduced before.

Example

Feedforward neural networks (FNN)

Architecture

Vocabulary

Depth : number of weight layers
Width : number of units per layer
Parameters : Weights and biases for every unit
Skip layer connections can bypass layers
one hidden transfer function $g$, one task-specific output transfer function $f$

Hidden transfer function

historically: hyperbolic tangent $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ or sigmoid $\sigma(x) = \frac{1}{1 + \exp(-x)}$
now mainly Rectified Linear Units (ReLU) (Nair & Hinton, 2010), (Krizhevsky, Sutskever, & Hinton, 2012) or variants : \[ \mbox{relu}(x) = \max(x, 0) \]

ReLUs are more favorable for the gradient flow than the saturating functions (more on that later when discussing computational graphs and gradient computation).

Some other recent hidden transfer functions

Relu (Nair & Hinton, 2010)

\[\begin{equation*} \scriptstyle f(x) = \begin{cases} x & \mbox{ if } x \geq 0\\ 0 & \mbox { if } x < 0 \end{cases} \end{equation*}\]

Leaky Relu
(Maas, Hannun, & Ng, 2013)
Parametric ReLu
(He, Zhang, Ren, & Sun, 2015)

\[\begin{equation*} \scriptstyle f(x) = \begin{cases} x & \mbox{ if } x \geq 0\\ \alpha x & \mbox { if } x < 0 \end{cases} \end{equation*}\]

Exponential Linear Unit
(Clevert, Unterthiner, & Hochreiter, 2016)

\[\begin{equation*} \scriptstyle f(x) = \begin{cases} x & \mbox{ if } x \geq 0\\ \alpha (\exp(x) - 1) & \mbox { if } x < 0 \end{cases} \end{equation*}\]
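The three piecewise definitions above translate directly into code. This is a scalar sketch using only the standard library. The default values of `alpha` are assumptions for illustration and are not given in the slides (for the Parametric ReLU, $\alpha$ is learned rather than fixed).

```python
import math

def relu(x):
    return x if x >= 0 else 0.0

def leaky_relu(x, alpha=0.01):
    # alpha is fixed for the Leaky ReLU and learned for the Parametric ReLU
    return x if x >= 0 else alpha * x

def elu(x, alpha=1.0):
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)

values = [relu(-2.0), leaky_relu(-2.0, alpha=0.1), elu(-1.0)]
```

All three functions are the identity for $x \geq 0$ and differ only in how they treat negative inputs.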

Output transfer function

Exactly as for the RBF networks, this is task dependent.

Regression

identity transfer function $y = w^T \phi(x)$
L2 loss
$L(y, \hat{y}) = \|\hat{y} - y\|^2$

Binary classification

sigmoidal transfer function $y = \sigma(w^T \phi(x))$
CE loss \[ \begin{array}{rl} L(y, \hat{y}) =&-y \log(\hat{y})\\ &-(1-y) \log(1-\hat{y}) \end{array} \]

Multiclass classification

softmax transfer function (see Lecture 1)
CE loss (see Lecture 1)

Universal approximation

Any well behaved function can be approximated arbitrarily well by an FNN with a single hidden layer (Cybenko, 1989), (Hornik, 1991)

Intuition

Transform the input with a linear transform $y=w^Tx$
Take a sigmoid transfer function $z = f(y) = \frac{1}{1+e^{-y}}$ : this is the output of the hidden layer
combine multiple activities in the $z-$layer to build up gaussian like kernels

Subtracting $z$-layer activities to produce RBF kernels

weight such subtractions and you are back to the RBF universal approximation theorem
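A one-dimensional sketch of this intuition: the difference of two shifted sigmoids is a localized bump, close to 1 around a center and close to 0 far from it. The names `bump`, `center`, `half_width` and `slope` are hypothetical and only serve the illustration.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def bump(x, center=0.0, half_width=1.0, slope=10.0):
    """Difference of two hidden sigmoid units sharing the same slope."""
    left = sigmoid(slope * (x - (center - half_width)))
    right = sigmoid(slope * (x - (center + half_width)))
    return left - right

inside, outside = bump(0.0), bump(5.0)
```

In higher dimensions more hidden units have to be combined to localize the bump along every direction, but the principle is the same.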

At that point, you may wonder why we bother about deep learning, right ?

Why do we bother about deep learning ?

FNNs with a single hidden layer are universal approximators, but the hidden layer may have to be arbitrarily large

a deep network (large number of layers) builds high level features by composing/factoring lower level features which can be reused by multiple units. Image analogy :
- first layer : extract oriented contours, texture filters, ..
- second layer : learn corners, crosses, curves, by combining contours
- next layers : build up more and more complex features
a shallow network must learn all the possibly complex filters at once, no real way to compose

early theoretical results on logic gate circuits (Hastad, 1986). More recent works on ReLU FNN (Montufar, Pascanu, Cho, & Bengio, 2014)

A logic circuit as studied in (Hastad, 1986)

Space folding as discussed in (Montufar et al., 2014)

Training : error backpropagation

Training is performed by gradient descent, with the gradient computed by an algorithm popularized by (Rumelhart, Hinton, & Williams, 1986), who called it error backpropagation (but (Werbos, 1981) had already introduced the idea, see (Schmidhuber, 2015)).

Gradient descent is an iterative algorithm :

initialize the weights and biases : $w_0$
at every iteration compute : \[ w \leftarrow w - \epsilon \nabla_w J \]

Remember : by minibatch gradient descent (see Lecture 1)

The question is : how do you compute $\frac{\partial J}{\partial w_i}$ ??

But let us first see PyTorch in action.

Example on a regression problem

Overall steps :

Training

0- Imports
1- Loading the data
2- Define the network
3- Define the loss, optimizer, callbacks, …
4- Iterate and monitor

Testing

0- Imports
1- Loading the data
2- Define the network and load the trained parameters
3- Define the loss
4- Iterate

Example on a regression problem

0- Imports


import torch
import torch.nn as nn
import torch.optim as optim

import sklearn
import sklearn.datasets

import tqdm

import matplotlib.pyplot as plt

1- Loading the data


# Load the data and build up our dataloader
data = sklearn.datasets.fetch_california_housing()
# X is (20640, 8), y is (20640, )
X, y = data.data, data.target

# At least normalize the input for an easier optimization
mean, std = X.mean(axis=0), X.std(axis=0)
X = (X - mean)/std

X_train = torch.tensor(X).float()
y_train = torch.tensor(y).float()

# A map-style dataset defines __len__ and __getitem__
# (the alternative is an iterable-style dataset, which defines __iter__)
train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
# A dataloader will create the minibatches
train_dataloader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=64,
                                               shuffle=True,
                                               pin_memory=True)

Doc: Dataset, DataLoader. Pin memory
Iterating over train_dataloader gives a pair of tensors of shape $(64, 8)$ and $(64,)$.

Example on a regression problem

2- Define the network


if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Build up the model
Nh = 64
model = nn.Sequential(
    nn.Linear(8, Nh), nn.ReLU(),
    nn.Linear(Nh, Nh), nn.ReLU(),
    nn.Linear(Nh, Nh), nn.ReLU(),
    nn.Linear(Nh, 1)
)

model.to(device)

Doc: Linear, Sequential

Example on a regression problem

3- Define the loss, optimizer, callbacks, …


# Define the loss
loss = nn.MSELoss()

# Define the gradient descent algorithm
optimizer = optim.Adam(model.parameters(),
                       lr=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer,
                                      step_size=10,
                                      gamma=0.5,
                                      verbose=True)

Doc: MSELoss, Adam, StepLR

Example on a regression problem

4- Iterate and monitor


num_epochs = 50  # assumed value, the number of epochs is not given in the slides
for e in range(num_epochs):
    # Switch the network in train mode
    model.train()

    print(f"Epoch {e}")
    for X, y in tqdm.tqdm(train_dataloader):
        # Send the data to the GPU if necessary
        X, y = X.to(device), y.to(device)

        # Reset the gradient accumulator
        optimizer.zero_grad()

        # Forward pass
        y_pred = model(X).squeeze()
        loss_value = loss(y_pred, y)

        # Backward pass
        loss_value.backward()

        # Weight update
        optimizer.step()

    print(f"MSELoss on the training set : {mseloss(train_dataloader)}")

    # Update the learning rate after one epoch
    scheduler.step()

Evaluation


def mseloss(loader):
    # After every epoch, compute the risk
    # on the loader
    cum_loss = 0.0
    n_samples = 0
   
    # Switch the network in eval mode
    model.eval()

    with torch.no_grad():
        for X, y in tqdm.tqdm(loader):
            X, y = X.to(device), y.to(device)

            # Forward pass
            y_pred = model(X).squeeze()
            loss_value = loss(y_pred, y)

            # The loss is 'mean' reduced, so weight it by the batch size
            # before accumulating; .item() converts it to a Python float
            cum_loss += loss_value.item() * y_pred.size(0)
            n_samples += y_pred.size(0)
    return cum_loss/n_samples

Computational graph and differentiable programming

Computational graph

A computational graph is a directed acyclic graph where nodes are :

variables (weights, inputs, outputs, targets, …)
operations (ReLu, Softmax, $w^Tx + b$, losses, updates, ..)

Example graph for a linear regression $\mathbb{R}^8 \mapsto \mathbb{R}$ with minibatch $(X, y)$

\[ J = \frac{1}{M} \sum_{i=0}^{63} (w_1^T x_i + b_1 - y_i)^2 \]

Computational graph

Problem : computing the partial derivatives $\frac{\partial J}{\partial var}$ of the loss with respect to every variable.

You just need to provide the local derivatives of the output w.r.t the inputs.

And then apply the chain rule.

ex : $\frac{\partial J}{\partial w_1} \in \mathcal{M}_{1, 8}(\mathbb{R})$, assuming numerator layout

Computational graph : the chain rule

Numerator layout convention (otherwise, we transpose and reverse the jacobian product order):

The derivative of a scalar with respect to a vector is a row vector : \[ y \in \mathbb{R}, x \in \mathbb{R}^n, \frac{dy}{dx} \in \mathcal{M}_{1, n}(\mathbb{R}) \]

More generally, the derivative of a vector valued function $y : \mathbb{R}^{n_x} \mapsto \mathbb{R}^{n_y}$ with respect to its input (the Jacobian) is a $n_y \times n_x$ matrix :

\[ x \in \mathbb{R}^{n_x}, y(x) \in \mathbb{R}^{n_y}, \frac{dy}{dx}(x) = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_{1}}{\partial x_{n_x}} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_{2}}{\partial x_{n_x}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_{n_y}}{\partial x_1} & \frac{\partial y_{n_y}}{\partial x_2} & \cdots & \frac{\partial y_{n_y}}{\partial x_{n_x}} \end{bmatrix}(x) \]

Computational graph : the chain rule

For a (single-path) chain $y_1 \rightarrow y_2 = f_1(y_1) \rightarrow y_3 = f_2(y_2) \cdots y_n = f_{n-1}(y_{n-1})$, of vector valued functions $y_1 \in \mathbb{R}^{n_1}, y_2\in\mathbb{R}^{n_2}, \cdots y_n \in \mathbb{R}^{n_n}$,

\[ \frac{\partial y_n}{\partial y_1} = \frac{\partial y_n}{\partial y_{n-1}}\frac{\partial y_{n-1}}{\partial y_{n-2}}\cdots\frac{\partial y_2}{\partial y_1} \]

ex : $\frac{\partial J}{\partial w_1} = \frac{\partial J}{\partial y_1} \frac{\partial y_1}{\partial w_1} = \frac{2}{M} (y_1-y)^T X \in \mathcal{M}_{1, 8}(\mathbb{R})$, assuming numerator layout
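The closed form $\frac{2}{M} (y_1-y)^T X$ can be checked numerically against central finite differences. This is a sketch assuming NumPy and random data standing in for the minibatch; `h` is a hypothetical step size.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 64, 8
X = rng.normal(size=(M, d))   # stand-in minibatch inputs
y = rng.normal(size=M)        # stand-in targets
w = rng.normal(size=d)
b = 0.5

def J(w):
    return np.mean((X @ w + b - y) ** 2)

# Analytical gradient from the chain rule: (2/M) (y_1 - y)^T X
y1 = X @ w + b
grad_analytic = (2.0 / M) * (y1 - y) @ X

# Central finite differences, one coordinate of w at a time
h = 1e-6
grad_numeric = np.array([
    (J(w + h * e) - J(w - h * e)) / (2 * h) for e in np.eye(d)
])
max_abs_diff = np.max(np.abs(grad_analytic - grad_numeric))
```

Such a gradient check is a cheap way to validate a hand-written derivative before trusting it in an optimizer.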

For matrix variables, we should be introducing tensors. See also this and this

The chain rule : multiple paths

For multiple paths, in principle we sum over all the paths : \[ \frac{\partial y}{\partial x} = \sum_{i=3,4} \frac{\partial y}{\partial y_i}\frac{\partial y_i}{\partial x} = y_4 \frac{\partial y_3}{\partial x} + y_3 \frac{\partial y_4}{\partial x} = y_4 f'(y_1) w_1^T + y_3 f'(y_2) w_2^T \in \mathcal{M}_{1, 8}(\mathbb{R}) \]

The chain rule : multiple paths

But this can be computationally (too) expensive :

there can be many paths you need to identify and sum over : L layers, N units, $N^L$ paths
and you must repeat the process for every variable w.r.t. which you want to differentiate
some computations can be factored (e.g. $\frac{\partial y_2}{\partial x}$, $\frac{\partial y_1}{\partial x}$)

Automatic differentiation : forward mode

Let us be more efficient : forward mode differentiation

Idea: To compute $\frac{\partial y}{\partial x}$, forward propagate $\frac{\partial }{\partial x}$
e.g. $\frac{\partial y}{\partial x} = y_3 e^{y_1} \left[ w_2^T + y_2 w_1^T\right] + y_4 e^{y_2}\left[ w_1^T + y_1 w_2^T\right]$

Welcome to the field of automatic differentiation (AD). For more, see (Griewank, 2012), (Griewank & Walther, 2008) (see also (Olah, 2015), (Paszke et al., 2017))
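Forward mode can be sketched with dual numbers: every value carries its derivative with respect to one chosen input, and each operation propagates both. The function `f` below is a hypothetical example, not the graph of the slide, and only `+`, `*` and `exp` are supported.

```python
import math

class Dual:
    """A value together with its derivative w.r.t. one chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

def dual_exp(a):
    e = math.exp(a.value)
    return Dual(e, e * a.deriv)

def f(x1, x2):
    # Hypothetical function f(x1, x2) = x1 * exp(x2) + x2
    return x1 * dual_exp(x2) + x2

# One forward pass per input: seed the derivative of that input with 1
df_dx1 = f(Dual(2.0, 1.0), Dual(0.0, 0.0)).deriv
df_dx2 = f(Dual(2.0, 0.0), Dual(0.0, 1.0)).deriv
```

Note that two forward passes were needed to get the two partial derivatives: the cost of forward mode grows with the number of inputs.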

Automatic differentiation : reverse mode

Let us be (sometimes) even more efficient : reverse mode differentiation

Idea: To compute $\frac{\partial y}{\partial x}$, backward propagate $\frac{\partial y}{\partial }$ (compute the adjoint)
e.g. $\frac{\partial y}{\partial x} = (y_4y_1e^{y_2} + y_3 e^{y_1})w_2^T + (y_3y_2e^{y_1} + y_4 e^{y_2})w_1^T$

Oh ! We also got $\frac{\partial y}{\partial w_2} = \frac{\partial y}{\partial y_2}\frac{\partial y_2}{\partial w_2}$, $\frac{\partial y}{\partial b_2} = \frac{\partial y}{\partial y_2}\frac{\partial y_2}{\partial b_2}$, …

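This is more efficient than forward mode when we have many more inputs ($n$) than outputs ($m$) for $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$, computing $\frac{df}{dx}(x)$ : reverse mode needs one backward pass per output, while forward mode needs one forward pass per input.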
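A minimal reverse-mode sketch on scalars: each node records its parents together with the local derivatives, and a single backward sweep in reverse topological order accumulates the adjoints. The class `Var` and the example function are hypothetical and unrelated to any framework API.

```python
import math

class Var:
    """Scalar node recording its parents and the local derivatives."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuples (parent_node, d self / d parent)
        self.grad = 0.0          # adjoint d output / d self

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def var_exp(a):
    e = math.exp(a.value)
    return Var(e, ((a, e),))

def backward(output):
    # Topological order of the graph by depth-first search
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    # Every node is processed after all the nodes that depend on it
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local

x1, x2 = Var(2.0), Var(0.0)
y = x1 * var_exp(x2) + x2   # hypothetical function x1 * exp(x2) + x2
backward(y)
```

A single backward sweep fills `x1.grad` and `x2.grad` at once, which is why reverse mode suits a scalar loss with many parameters.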

See also : A Review of Automatic Differentiation and its Efficient Implementation

Gradient error backpropagation

In (Rumelhart et al., 1986), the algorithm was called “error backpropagation” : why ?

Consider a 2-layer feedforward network through which we propagate one sample, with a scalar loss : \[ L = g( y_i, \begin{bmatrix} & & \\ & W_2 (n_2 \times n_1) & \\ & & \end{bmatrix} f( \begin{bmatrix} & & \\ & W_1 (n_1 \times n_x) & \\ & & \end{bmatrix} \begin{bmatrix} \\ x_i \\ \phantom{} \end{bmatrix} )) \in \mathbb{R} \]

$g$ could be a squared loss for regression (with $n_2=1$), or CrossEntropyLoss (with logits and $n_2=n_{class}$) for multiclass classification.

We denote $z_1 = W_1 x_i, z_2 = W_2 f(z_1)$ and $\delta_i = \frac{\partial L}{\partial z_i} \in \mathbb{R}^{n_i}$. Then : \[ \begin{align} \delta_2 &= \frac{\partial L}{\partial z_2} = \frac{\partial g(x_1, x_2)}{\partial x_2}(y_i, z_2) \\ \delta_1 &= \frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial z_2} \frac{\partial z_2}{\partial z_1} = \begin{bmatrix} & & \delta_2 & & \end{bmatrix} \begin{bmatrix} & & \\ & W_2(n_2 \times n_1) & \\ & & \phantom{}\end{bmatrix} \text{diag}(f'(z_1)) \end{align} \] The errors of $\delta_2$ are integrated back through the weight matrix that was used for the forward pass. (See also (Nielsen, 2015), chap 2).
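The two equations for $\delta_2$ and $\delta_1$ can be sketched in NumPy. The choices below are assumptions made for the illustration: $f = \tanh$, a squared loss for $g$, small hypothetical layer sizes, and a finite-difference check on one entry of $W_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_1, n_2 = 4, 5, 1
W1 = rng.normal(size=(n_1, n_x))
W2 = rng.normal(size=(n_2, n_1))
x = rng.normal(size=n_x)
y = np.array([0.3])

f = np.tanh
def f_prime(z):
    return 1.0 - np.tanh(z) ** 2

def loss(W1, W2):
    return np.sum((W2 @ f(W1 @ x) - y) ** 2)

# Forward pass, keeping the intermediate values
z1 = W1 @ x
a1 = f(z1)
z2 = W2 @ a1

# Backward pass
delta2 = 2.0 * (z2 - y)                # dL/dz2, shape (n_2,)
delta1 = (delta2 @ W2) * f_prime(z1)   # delta2 W2 diag(f'(z1)), shape (n_1,)
grad_W2 = np.outer(delta2, a1)         # dL/dW2, same shape as W2
grad_W1 = np.outer(delta1, x)          # dL/dW1, same shape as W1

# Finite-difference check on one entry of W1
h = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[2, 1] += h
W1_minus[2, 1] -= h
numeric = (loss(W1_plus, W2) - loss(W1_minus, W2)) / (2 * h)
```

The backward pass reuses `z1` and `a1` from the forward pass, and `W2` is the same matrix that carried the activations forward.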

Gradient descent in deep learning

Training in two phases

Evaluation of the outputs : forward propagation / forward pass
Evaluation of the gradient : reverse-mode differentiation / backward pass

The reverse-mode differentiation uses the variables computed in the forward pass

$\rightarrow$ we can efficiently apply stochastic gradient descent to optimize the parameters of our neural networks !

Note the computational graph can be extended to encompass the operations of the backward pass.

Gradient descent in practice

The deep learning frameworks all compute the backward pass automatically.

optimizer = optim.Adam(model.parameters())

for e in range(epochs):
    for X,y in train_dataloader:
        optimizer.zero_grad()
        ...
        loss.backward()
        optimizer.step()

The computational graphs can be built dynamically (eager mode) or statically

If you want/need to extend the frameworks with new operations, e.g. extending pytorch autograd


from torch.autograd import Function

class MyFunction(Function):

    @staticmethod
    def forward(ctx, input, *args):
        ...

    @staticmethod
    def backward(ctx, grad_output):
        ...

The way toward differentiable programming

The computational graph is a central notion in modern neural networks/deep learning. Differentiable programming broadens the scope.

In recent years, “fancier” differentiable blocks other than $f(W f(W..))$ have appeared, possibly built dynamically (eager mode vs static graph).

Spatial Transformer Networks
(Jaderberg, Simonyan, & Zisserman, 2015)

Content/Location based addressing
Neural Turing Machine / Differentiable Neural Computer (Graves et al., 2016)

Gradient descent algorithms

Does it make sense to use gradient descent ?

Indeed :

we cannot do better than a local minimum
neural networks lead to non-convex optimization. For example, consider a 2-layer FNN :

\[ \begin{align*} & \begin{bmatrix} \\ x \\ \phantom{} \end{bmatrix}\\ \begin{bmatrix} & & \\ & W_1 & \\ & & \phantom{} \end{bmatrix} & \begin{bmatrix} \\ y_1 \\ \phantom{} \end{bmatrix} \end{align*} \]

\[ \begin{align*} & \begin{bmatrix} \\ f(y_1) \\ \phantom{} \end{bmatrix}\\ \begin{bmatrix} & & \\ & W_2 & \\ & & \phantom{} \end{bmatrix} & \begin{bmatrix} \\ y_2 \\ \phantom{} \end{bmatrix} \end{align*} \]

But empirically, most local minima are close (in performance) to the global minimum, especially with large/deep networks. See (Dauphin et al., 2014), (Pascanu, Dauphin, Ganguli, & Bengio, 2014), (Choromanska, Henaff, Mathieu, Arous, & LeCun, 2015). Saddle points seem to be more critical.

First order methods : Minibatch stochastic gradient descent

Algorithm

Start at $\theta_0$
for every minibatch : \[ \begin{align*} \theta(t+1) &= \theta(t) - \epsilon \nabla_\theta L(\theta(t))\\ L(\theta) &= \frac{1}{M} \sum_i J(\theta, x_i, y_i) \end{align*} \]

Rationale (Taylor expansion) : $L(\theta_{t+1}) \approx L(\theta_{t}) + (\theta_{t+1} - \theta_{t})^T \nabla_{\theta} L(\theta_{t})$

The choice of the batch size :

Stochastic gradient descent (small minibatch, $M=1$) : noisy estimate, not GPU friendly
Batch Gradient descent ($M=N$) : More GPU friendly. But more prone to bad generalization (generalization gap) and to local minima (Keskar, Mudigere, Nocedal, Smelyanskiy, & Tang, 2017). Roughly speaking, we should avoid sharp minima.

The optimization may converge slowly or even diverge if the learning rate $\epsilon$ is not appropriate.

Choosing a learning rate

The impact of the learning rate on the optimization (LeCun, Bottou, Orr, & Müller, 1998)

Bengio: “The optimal learning rate is usually close to the largest learning rate that does not cause divergence of the training criterion” (Bengio, 2012)

Karpathy “$0.0003$ is the best learning rate for Adam, hands down.” (Twitter, 2016)

(Note: Adam will be discussed in a few slides)

See also :
- Practical Recommendations for gradient-based training of deep architectures (Bengio, 2012)
- Efficient Backprop (LeCun et al., 1998)

Example regression problem

Setup

$N=30$ samples generated with : \[ y = 3 x + 2 + \mathcal{U}(-0.1, 0.1) \]
Model : $f_\theta(x) = \theta^T\begin{bmatrix} 1 \\ x\end{bmatrix}$,
L2 loss : $L(y_i, f_{\theta}(x_i)) = (y_i - f_{\theta}(x_i))^2$

Example using SGD

Parameters : $\epsilon=0.005$, $\theta_0 = \begin{bmatrix} 10 \\ 5\end{bmatrix}$

Converges to $\theta_{\infty} = \begin{bmatrix} 1.9882 \\ 2.9975\end{bmatrix}$
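A sketch of this experiment with minibatches of size $M=1$, assuming NumPy. The input range, the random seed and the number of epochs are assumptions, so the run is not expected to reproduce the exact $\theta_\infty$ reported above, only to land near $(2, 3)$.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30
x = rng.uniform(-5.0, 5.0, size=N)   # assumed input range
y = 3.0 * x + 2.0 + rng.uniform(-0.1, 0.1, size=N)

theta = np.array([10.0, 5.0])        # theta_0 from the slide
epsilon = 0.005                      # learning rate from the slide

for epoch in range(200):             # assumed number of epochs
    for i in rng.permutation(N):     # visit the samples in random order
        features = np.array([1.0, x[i]])
        error = theta @ features - y[i]
        grad = 2.0 * error * features   # gradient of (y_i - f_theta(x_i))^2
        theta = theta - epsilon * grad
```

The first component of `theta` estimates the intercept and the second one the slope.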

The components of $\nabla_\theta J$

First order methods : momentum

Algorithm : Let us damp the oscillations with a low pass on $\nabla_{\theta}$

Start at $\theta_0$, $v_0 = 0$
for every minibatch : \[ \begin{align*} v(t+1) &= \mu v(t) - \epsilon \nabla_{\theta} J(\theta(t))\\ \theta(t+1) &= \theta(t) + v(t+1) \end{align*} \]

Usually $\mu \approx 0.9$ or $0.99$.

as an exponential moving average, it low-pass filters the gradient and therefore dampens oscillations along fast varying dimensions
it can accelerate (increase the learning rate) in constant directions (or low curvature).
If $\nabla_{\theta} J = g$, $v(0) = 0$ \[ v(t) = -\epsilon g \sum_{i=0}^{t-1} \mu^i = -\epsilon g \frac{1-\mu^{t}}{1-\mu} \]
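The closed form for a constant gradient can be checked by iterating the update rule. This is a pure Python sketch with hypothetical values for $\epsilon$, $\mu$ and $g$.

```python
def momentum_step(theta, v, grad, epsilon, mu):
    """One update of SGD with momentum, following the two equations above."""
    v_next = mu * v - epsilon * grad
    return theta + v_next, v_next

epsilon, mu, g = 0.005, 0.9, 2.0   # hypothetical values
theta, v = 0.0, 0.0
T = 50
for _ in range(T):
    theta, v = momentum_step(theta, v, g, epsilon, mu)

v_closed_form = -epsilon * g * (1 - mu ** T) / (1 - mu)
terminal_velocity = -epsilon * g / (1 - mu)   # limit for large T
```

With $\mu = 0.9$ the terminal velocity is $10$ times the plain gradient step $-\epsilon g$, which is the sense in which momentum increases the learning rate along constant directions.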

See also distill.pub. Note the frameworks may implement subtle variations.

Example using SGD with momentum

Parameters : $\epsilon=0.005$, $\mu=0.6$, $\theta_0 = \begin{bmatrix} 10 \\ 5\end{bmatrix}$

Converges to $\theta_{\infty} = \begin{bmatrix} 1.9837 \\ 2.9933\end{bmatrix}$

First order methods : Nesterov momentum

Idea Look ahead to potentially correct the update. Based on Nesterov Accelerated Gradient. Formulation of (Sutskever, Martens, Dahl, & Hinton, 2013)

Algorithm

Start at $\theta_0$, $v_0 = 0$
for every minibatch : \[ \begin{align*} \overline{\theta}(t) &= \theta(t) + \mu v(t)\\ v(t+1) &= \mu v(t) - \epsilon \nabla_{\theta}J(\overline{\theta}(t))\\ \theta(t+1) &= \theta(t) + v(t+1) \end{align*} \]
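The three update equations can be sketched as follows in pure Python. The objective $J(\theta) = (\theta - 3)^2$ and the starting point are hypothetical, chosen only to have a known minimum.

```python
def nesterov_step(theta, v, grad_fn, epsilon, mu):
    theta_ahead = theta + mu * v                       # look-ahead point
    v_next = mu * v - epsilon * grad_fn(theta_ahead)   # gradient taken there
    return theta + v_next, v_next

def grad_J(theta):
    # Gradient of the hypothetical objective J(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

theta, v = 10.0, 0.0
for _ in range(500):
    theta, v = nesterov_step(theta, v, grad_J, epsilon=0.005, mu=0.8)
```

The only difference with classical momentum is the point at which the gradient is evaluated.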

Example using SGD with Nesterov momentum

Parameters : $\epsilon=0.005$, $\mu=0.8$, $\theta_0 = \begin{bmatrix} 10 \\ 5\end{bmatrix}$

Converges to $\theta_{\infty} = \begin{bmatrix} 1.9738 \\ 2.9914\end{bmatrix}$

Comparison of the 1st order methods

On our simple regression problem

First order : adaptive learning rate

You should always adapt your learning rate with a learning rate scheduler

decrease linearly from $\epsilon_0$ down to $\epsilon_f$
halve the learning rate when the validation error stops improving
halve the learning rate on a fixed schedule (every 50 epochs)

Resnet training curves. “The learning rate starts from 0.1 and is divided by 10 when the error plateaus”

Adaptive first order : adaptive learning rate

Some more recent approaches are changing the picture of “decreasing learning rate” (“Robbins Monro conditions”)

See (Smith, 2018), The 1cycle policy - S. Gugger

Stochastic Gradient Descent with Warm Restart (Loshchilov & Hutter, 2017)

The improved performance may be linked to reaching flatter minima (i.e. minima where the predictions are less sensitive to the parameters than at sharper minima). The models reached before the warm restarts can be averaged (see Snapshot ensemble).

It also seems that initial large learning rates tend to lead to better models in the long run (Li, Wei, & Ma, 2019)

Adaptive first order : Adagrad

Adagrad Adaptive Gradient (Duchi, Hazan, & Singer, 2011)

Accumulate the square of the gradient \[ r(t+1) = r(t) + \nabla_{\theta}J(\theta(t)) \odot \nabla_{\theta}J(\theta(t))\\ \]
Scale individually the learning rates \[ \theta(t+1) = \theta(t) - \frac{\epsilon}{\delta + \sqrt{r(t+1)}} \odot \nabla_{\theta}J(\theta(t)) \]

The $\sqrt{.}$ is experimentally critical ; $\delta \in [10^{-8}, 10^{-4}]$ for numerical stability.

Small gradients $\rightarrow$ bigger learning rate for moving fast along flat directions
Big gradients $\rightarrow$ smaller learning rate to calm down on high curvature.

But accumulation from the beginning is too aggressive. Learning rates decrease too fast.
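A sketch of one Adagrad update, assuming NumPy. The default `epsilon` and `delta` and the example gradient are hypothetical.

```python
import numpy as np

def adagrad_step(theta, r, grad, epsilon=0.1, delta=1e-8):
    r_next = r + grad * grad                                  # accumulate squared gradients
    theta_next = theta - epsilon / (delta + np.sqrt(r_next)) * grad
    return theta_next, r_next

theta = np.array([1.0, 1.0])
r = np.zeros(2)
grad = np.array([100.0, 0.01])   # two very different gradient scales
theta_next, r = adagrad_step(theta, r, grad)
step = theta - theta_next
```

On this first update both coordinates move by about $\epsilon$ although their gradients differ by four orders of magnitude. Since `r` can only grow, the later steps keep shrinking.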

Adaptive first order : RMSprop

RMSprop Hinton(unpublished, Coursera)

Idea: we should be using an exponential moving average when accumulating the gradient.

Accumulate the square of the gradient \[ r(t+1) = \rho r(t) + (1-\rho)\nabla_{\theta}J(\theta(t)) \odot \nabla_{\theta}J(\theta(t))\\ \]
Scale individually the learning rates \[ \theta(t+1) = \theta(t) - \frac{\epsilon}{\delta + \sqrt{r(t+1)}} \odot \nabla_{\theta}J(\theta(t)) \] $\rho \approx 0.9$
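The same sketch for RMSprop, assuming NumPy and a hypothetical constant gradient: the exponential moving average makes the accumulator converge instead of growing without bound.

```python
import numpy as np

def rmsprop_step(theta, r, grad, epsilon=0.01, rho=0.9, delta=1e-8):
    r_next = rho * r + (1.0 - rho) * grad * grad              # moving average
    theta_next = theta - epsilon / (delta + np.sqrt(r_next)) * grad
    return theta_next, r_next

theta, r = np.array([0.0]), np.zeros(1)
grad = np.array([2.0])           # hypothetical constant gradient
for _ in range(200):
    theta, r = rmsprop_step(theta, r, grad)
```

Here `r` tends to the squared gradient, so the step size settles around $\epsilon$ rather than decaying to zero as it does with Adagrad.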

Adaptive first order : ADAM

Adaptive Moments (ADAM) (Kingma & Ba, 2015)

Like momentum and RMSprop, store running averages of past gradients : \[ \begin{align*} m(t+1) &= \beta_1 m(t) + (1-\beta_1)\nabla_{\theta}J(\theta(t))\\ v(t+1) &= \beta_2 v(t) + (1-\beta_2)\nabla_{\theta}J(\theta(t))\odot \nabla_{\theta}J(\theta(t)) \end{align*} \] $m(t)$ and $v(t)$ estimate the first and second (uncentered) moments of $\nabla_{\theta} J$. Starting from $m(0) = v(0) = 0$, they are bias corrected, $\hat{m}(t) = \frac{m(t)}{1-\beta_1^t}$, $\hat{v}(t) = \frac{v(t)}{1-\beta_2^t}$, and then :

\[ \theta(t+1) = \theta(t) - \frac{\epsilon}{\delta + \sqrt{\hat{v}(t+1)}} \hat{m}(t+1) \]
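A sketch of one Adam update with the bias correction made explicit, assuming NumPy. The defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ are those suggested by (Kingma & Ba, 2015). The gradient values are hypothetical.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, epsilon=0.001,
              beta1=0.9, beta2=0.999, delta=1e-8):
    """One Adam update; t is the 1-based index of the step."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)   # bias correction of the first moment
    v_hat = v / (1.0 - beta2 ** t)   # bias correction of the second moment
    theta = theta - epsilon / (delta + np.sqrt(v_hat)) * m_hat
    return theta, m, v

theta = np.array([1.0, -2.0])
m, v = np.zeros(2), np.zeros(2)
grad = np.array([50.0, -0.02])
theta_next, m, v = adam_step(theta, m, v, grad, t=1)
first_step = theta_next - theta
```

Thanks to the bias correction, the very first step has magnitude close to $\epsilon$ in every coordinate, whatever the scale of the gradient.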

and some others : Adadelta (Zeiler, 2012), … , YellowFin (Zhang & Mitliagkas, 2018).

See Sebastian Ruder's blog post, or John Chen's blog post

First order : to sum up

(Goodfellow, Bengio, & Courville, 2016) There is currently no consensus […] no single best algorithm has emerged […] the most popular and actively in use include SGD, SGD with momentum, RMSprop, RMSprop with momentum, Adadelta and Adam.

See also Chap. 8 of (Goodfellow et al., 2016)

A glimpse into second order methods

Rationale : \[ J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla_{\theta}J(\theta_0) + \frac{1}{2}(\theta - \theta_0)^T \nabla^2_{\theta} J(\theta_0) (\theta - \theta_0) \] with $H = \nabla^2J$ the Hessian matrix, a $n_\theta \times n_\theta$ matrix hosting the second derivatives of $J$.

The second derivatives are much noisier than the first derivatives (gradient) ; a larger batch size is usually required to prevent instabilities.

Conjugate gradient : using line search (or hessian) along $\nabla_{\theta}J(\theta_k)$
Newton : never use except if you want to find critical points (Dauphin et al., 2014). Setting the gradient of the quadratic approximation above to zero gives the update as the solution of $\nabla_\theta^2J(\theta_0) . (\theta - \theta_0) = -\nabla_\theta J(\theta_0)$
Quasi Newton : BFGS (approximating $H^{-1}$), L-BFGS, and saddle-free versions (Dauphin et al., 2014).
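A sketch of a single Newton step, assuming NumPy and a hypothetical convex quadratic objective, for which the quadratic approximation is exact:

```python
import numpy as np

# Hypothetical objective J(theta) = 0.5 theta^T A theta - b^T theta
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # Hessian, symmetric positive definite
b = np.array([1.0, -1.0])

def grad_J(theta):
    return A @ theta - b

theta0 = np.array([10.0, 5.0])
# Newton step: solve H (theta - theta0) = -grad J(theta0)
theta1 = theta0 + np.linalg.solve(A, -grad_J(theta0))
residual = np.max(np.abs(grad_J(theta1)))
```

The step lands on a critical point in one iteration. With an indefinite Hessian the same computation would jump just as readily to a saddle point, which is the reason for the warning above.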

Initialization and the distributions of activations and gradients

The starting point is important : XOR

XOR is easy right ?

Model : 2-4-1, Sigmoid activations (great!); 17 parameters
Init : $\mathcal{U}(-10, 10)$, bias=0 (hum hum)
Loss : Binary cross entropy (great!)
Optimizer : SGD ($\epsilon = 0.1$, momentum=0.99)

But it fails miserably (6/20 fails). Tmax=1000

BCE Loss and accuracy on the training set

The starting point is important : XOR

XOR is easy right ?

Model : 2-4-1, Sigmoid activations (great!); 17 parameters

Init : $\mathcal{N}(0, \frac{1}{\sqrt{fan_{in}}})$, bias=0 (great!)

Loss : Binary cross entropy (great!)
Optimizer : SGD ($\epsilon = 0.1$, momentum=0.99)

Now it is better (0/20 fails). Tmax=1000

Pretraining

Historically, training deep FNN was known to be hard, i.e. bad generalization errors.

The starting point of a gradient descent has a dramatic impact :

neural history compressors (Schmidhuber, 1992)
competitive learning (Maclin & Shavlik, 1995)
unsupervised pretraining based on Boltzmann machines (Hinton, 2006)
unsupervised pretraining based on auto-encoders (Bengio, Lamblin, Popovici, & Larochelle, 2006)

Pretraining is no longer used (thanks to the ReLU variants, initialization schemes, ..)

Standardizing your inputs

Gradient descent converges faster if your data are normalized and decorrelated. Denote by $x_i \in \mathbb{R}^d$ your input data and by $\hat{x}_i$ its normalized version.

Min-max scaling \[ \forall i,j \hat{x}_{i,j} = \frac{x_{i,j} - \min_k x_{k,j}}{\max_k x_{k,j} - \min_k x_{k,j} + \epsilon} \]
Z-score normalization (goal: $\hat{\mu}_j = 0, \hat{\sigma}_j = 1$) \[ \forall i,j, \hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j + \epsilon} \]
ZCA whitening (goal: $\hat{\mu}_j = 0, \hat{\sigma}_j = 1$, $\frac{1}{n-1} \hat{X} \hat{X}^T = I$)

\[ \hat{X} = W X, W = \sqrt{n-1} (XX^T)^{-1/2} \] where the columns of $X$ are the centered inputs, so that $\frac{1}{n-1} \hat{X} \hat{X}^T = I$.
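A NumPy sketch of z-score normalization and ZCA whitening on hypothetical correlated data, with the samples stored as the columns of $X$. The whitening matrix is computed as $C^{-1/2}$ with $C = \frac{1}{n-1} X X^T$ the sample covariance of the centered data, which is what the goal $\frac{1}{n-1} \hat{X} \hat{X}^T = I$ requires.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 500
# Hypothetical correlated, non-centered data of shape (d, n)
mixing = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, 0.3, 0.2]])
X = mixing @ rng.normal(size=(d, n)) + np.array([[5.0], [-1.0], [0.0]])

eps = 1e-8
# Z-score normalization, feature by feature
X_z = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + eps)

# ZCA whitening from the eigendecomposition of the covariance matrix
X_c = X - X.mean(axis=1, keepdims=True)
C = X_c @ X_c.T / (n - 1)
eigvals, eigvecs = np.linalg.eigh(C)
W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
X_white = W @ X_c
cov_white = X_white @ X_white.T / (n - 1)
```

After z-scoring the features are still correlated; after whitening the covariance matrix is the identity up to the regularization `eps`.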

Z-score normalization / Standardizing the inputs

Remember our linear regression : $y = 3x+2+\mathcal{U}(-0.1, 0.1)$, L2 loss, 30 1D samples

General strategy

A good initialization should break the symmetry : with a constant initialization, all the units learn the same thing
A good initialization should start optimization in a region of low capacity : linear neural network

A good initialization scheme should preserve the distribution of the activations and gradients : exploding/vanishing gradients

The exploding and vanishing gradient problem

The Fundamental Deep Learning Problem, first observed by (Hochreiter, 1991) for RNNs : the gradient can either vanish or explode, especially in deep networks (RNNs are very deep).

Remember that the backpropagated gradient involves : \[ \frac{\partial J}{\partial x_l} = \frac{\partial J}{\partial y_L} W_L \text{diag}(f'(y_{L-1})) W_{L-1} \text{diag}(f'(y_{L-2})) \cdots W_l \] with $y_l = W_l x_l + b_l, x_l = f(y_{l-1})$.
We see a pattern like $(W.f')^L$ which can diverge or vanish for large $L$.
especially with the sigmoid, for which $f' \leq \frac{1}{4}$.

With a ReLu, the positive part has $f' = 1$.

Preventing vanishing/exploding gradient

We must ensure a good flow of gradient :
- using appropriate transfer functions ReLU, PReLU, etc..
- using architectural elements :
  - ResNet (CNN) : shortcut connections
  - LSTM (RNN): constant error carousel
We can prevent exploding gradient by clipping (Pascanu, Mikolov, & Bengio, 2013)
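Clipping by the norm can be sketched as follows, assuming NumPy: the gradient is rescaled whenever its L2 norm exceeds a threshold, so its direction is preserved. The function name and the threshold value are hypothetical.

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

clipped = clip_by_norm(np.array([30.0, 40.0]), threshold=5.0)    # norm 50 -> 5
unchanged = clip_by_norm(np.array([0.3, 0.4]), threshold=5.0)    # norm 0.5, kept
```

In a training loop the clipping is applied between the backward pass and the weight update.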

Exploding gradient and the effect of clipping. Experiment with 50 layers, single unit, sigmoid transfer function

LeCun Initialization

In (LeCun et al., 1998), Y. LeCun provided some guidelines on the design:

Aim Initialize the weights/biases to keep $f$ in its linear part through multiple layers:

Use a symmetric transfer function $f(x) = 1.7159 \tanh(\frac{2}{3}x)$, $\rightarrow$ $f(1) = 1$, $f(-1) = -1$

set the biases to $0$
initialize randomly and independently from $\mathcal{N}(\mu=0, \sigma^2=\frac{1}{fan_{in}})$.

If $x \in \mathbb{R}^n$ is $\mathcal{N}(0, \Sigma = I)$, $w \in \mathbb{R}^n$ is $\mathcal{N}(\mu=0, \Sigma=\frac{1}{n}I)$, then :

\[ \begin{align*} E[w^T x + b] &= E[w^T x] = \sum_i E[w_i x_i] = \sum_i E[w_i] E[x_i] = 0\\ var[w^T x + b] &= var[w^T x] \\ & = \sum_i \sigma^2_{w_i}\sigma^2_{x_i} + \sigma^2_{w_i}\mu^2_{x_i} + \mu^2_{w_i}\sigma^2_{x_i}\\ &= \sum_i \sigma^2_{w_i}\sigma^2_{x_i} = \frac{1}{n}\sum \sigma^2_{x_i} = 1 \end{align*} \]

$x_i, w_i$ are all pairwise independent.
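The computation above can be checked empirically by sampling. This is a sketch assuming NumPy, with a hypothetical dimension and number of trials.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_trials = 64, 20000
x = rng.normal(0.0, 1.0, size=(n_trials, n))                # x ~ N(0, I)
w = rng.normal(0.0, np.sqrt(1.0 / n), size=(n_trials, n))   # w ~ N(0, I/n)
pre_activation = np.sum(w * x, axis=1)                      # w^T x, with b = 0
empirical_mean = pre_activation.mean()
empirical_var = pre_activation.var()
```

The empirical mean and variance should be close to 0 and 1 respectively, up to sampling noise.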

Xavier (Glorot) Initialization

Idea we must preserve the same distribution along the forward and backward pass (Glorot & Bengio, 2010).

This prevents:

the saturation of saturating transfer functions (e.g. tanh, sigmoid)
vanishing/exploding gradient

Glorot (Xavier) initialization scheme for a feedforward network $f(W_{n}..f(W_1f(W_0 x+b_0)+b_1)...+b_n)$ with layer sizes $n_i$:

the input dimensions should be centered, normalized, uncorrelated
symmetric activation function, with $f'(0) = 1$ (e.g. $f(x)=\tanh(x), f(x)=4(\frac{1}{1+e^{-x}}-0.5)$)

Assuming the linear regime $f'() = 1$ of the network : \[ \begin{align*} \mbox{Forward propagation variance constraint :} \forall i, fan_{in_i} \sigma^2_{W_i} &= 1\\ \mbox{Backward propagation variance constraint :} \forall i, fan_{out_i} \sigma^2_{W_i} &= 1 \end{align*} \] Compromise : $\forall i, \frac{1}{\sigma^2_{W_i}} = \frac{fan_{in} + fan_{out}}{2}$
- Glorot (Xavier) uniform : $\mathcal{U}(-\frac{\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}}, \frac{\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}})$, b=0
- Glorot (Xavier) normal : $\mathcal{N}(0, \frac{\sqrt{2}}{\sqrt{fan_{in}+fan_{out}}})$, b=0

He Initialization

Idea we must preserve the same distribution along the forward and backward pass for rectifiers (He et al., 2015). For a feedforward network $f(W_{n}..f(W_1f(W_0 x+b_0)+b_1)...+b_n)$ with layer sizes $n_i$:

the input dimensions should be centered, normalized, uncorrelated
ReLu activation $f(x) = \max(x, 0)$
weights initialized with symmetric distribution, zero mean $\mu_{w_l} = 0$, independently. Bias set to $b=0$
the components of $x_l$ are assumed i.i.d.; Note they are not centered (because of ReLu) \[ \begin{align*} \mathbf{y}_l &= \begin{bmatrix} \vdots \\ y_l \\ \vdots\end{bmatrix} = W_l \mathbf{x}_l + \mathbf{b} = W_l f(\mathbf{y}_{l-1}) + \mathbf{b}\\ \mu_{y_l} &= E[\sum_i w_{l,i}x_{l,i}] = \mu_{w_l}\sum_i \mu_{x_{l,i}} = 0\\ \sigma^2_{y_l} &= n_l\sigma^2_{w_l x_l} = n_l (\mu_{w_l^2x_l^2}-\mu_{w_lx_l}^2) = n_l \mu_{w_l^2} \mu_{x_l^2} = n_l \sigma^2_{w_l} \mu_{x_l^2} (\mbox{because $\mu_{w_l} = 0$})\\ \mu_{x_l^2} &= \int_{y_{l-1}} \max(0, y_{l-1})^2dp_{y_{l-1}} = \frac{1}{2} \mu_{y_{l-1}^2} =\frac{1}{2} \sigma^2_{y_{l-1}} (\mbox{$\mu_{y_{l-1}}=0$ and $y_{l-1}$ has symmetric distrib.}) \end{align*} \]

So, $\sigma^2_{y_l} = \frac{1}{2}n_l \sigma^2_{w_l} \sigma^2_{y_{l-1}}$. To preserve the variance, we must guarantee $\frac{1}{2} n_l \sigma^2_{w_l} = 1$.

We used : if $X$ and $Y$ are independent : $\sigma^2_{X.Y} = \mu_{X^2}\mu_{Y^2} - \mu_X^2 \mu_Y^2$
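The recursion $\sigma^2_{y_l} = \frac{1}{2}n_l \sigma^2_{w_l} \sigma^2_{y_{l-1}}$ can be checked numerically. The sketch below is plain Python with an arbitrary width, depth and seed (all assumptions made for illustration): it pushes one random input through a stack of ReLU layers and measures how the second moment of the pre-activations evolves. With $\sigma^2_w = 1/n$ the derivation predicts a factor of roughly $(1/2)^{depth}$.

```python
import math
import random

def second_moment(values):
    return sum(v * v for v in values) / len(values)

def propagate(width, depth, weight_var, rng):
    """Push one random input through `depth` ReLU layers with zero biases.

    Returns the ratio E[y_L^2] / E[y_0^2], estimated across the units.
    """
    y = [rng.gauss(0.0, 1.0) for _ in range(width)]
    initial = second_moment(y)
    std = math.sqrt(weight_var)
    for _ in range(depth):
        x = [max(0.0, v) for v in y]  # x_l = f(y_{l-1})
        # every output unit draws its own row of i.i.d. zero-mean weights
        y = [sum(rng.gauss(0.0, std) * xi for xi in x) for _ in range(width)]
    return second_moment(y) / initial

width, depth = 300, 5
ratio_he = propagate(width, depth, 2.0 / width, random.Random(0))
ratio_naive = propagate(width, depth, 1.0 / width, random.Random(0))
print(f"sigma_w^2 = 2/n : ratio = {ratio_he:.3f}")
print(f"sigma_w^2 = 1/n : ratio = {ratio_naive:.3f}")
```

The first ratio should stay of order one (up to sampling noise at this width), while the second shrinks geometrically with the depth.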

He initialization

the input dimensions should be centered, normalized, uncorrelated
ReLu activation $f(x) = \max(x, 0)$
weights initialized with symmetric distribution, zero mean $\mu_{w_l} = 0$, independently. Bias set to $b=0$
the components of $x_l$ are assumed i.i.d.; Note they are not centered (because of ReLu)

\[ \begin{align*} \mbox{Forward propagation variance constraint :} \forall i, \frac{1}{2}fan_{in_i} \sigma^2_{W_i} &= 1\\ \mbox{Backward propagation variance constraint :} \forall i, \frac{1}{2}fan_{out_i} \sigma^2_{W_i} &= 1 \end{align*} \]

He suggests to use either one or the other, e.g. $\sigma^2_{W_i} = \frac{2}{fan_{in}}$
- He uniform : $\mathcal{U}(-\frac{\sqrt{6}}{\sqrt{fan_{in}}}, \frac{\sqrt{6}}{\sqrt{fan_{in}}})$, b=0
- He normal : $\mathcal{N}(0, \frac{\sqrt{2}}{\sqrt{fan_{in}}})$, b=0

Note: for PReLU with negative slope $a$ : $\frac{1}{2} (1 + a^2) fan_{in} \sigma^2_{W_i} = 1$
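The forward constraint, including the rectifier slope $a$, translates directly into a standard deviation and a uniform bound. A small sketch in plain Python; `he_std` and `he_uniform_bound` are illustrative names, and the fan-in and slope values are arbitrary:

```python
import math

def he_std(fan_in, a=0.0):
    """Std satisfying 1/2 (1 + a^2) fan_in sigma^2 = 1 (a = 0 is the plain ReLU)."""
    return math.sqrt(2.0 / ((1.0 + a * a) * fan_in))

def he_uniform_bound(fan_in, a=0.0):
    """Bound of U(-bound, bound) with the same variance: bound = sqrt(3) * std."""
    return math.sqrt(3.0) * he_std(fan_in, a)

fan_in = 256  # hypothetical layer size
for slope in (0.0, 0.25):
    print(slope, he_std(fan_in, slope), he_uniform_bound(fan_in, slope))
```

A larger slope lets more signal through the negative side, so a smaller weight variance is enough to preserve the variance of the activations.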

Weight initialization in practice (PyTorch)

By default, the parameters are initialized randomly. e.g. in torch.nn.Linear :

class Linear(torch.nn.Module):
    def __init__(self):
        ...
        self.reset_parameters()
    
    def reset_parameters(self) -> None:
        # Setting a=sqrt(5) in kaiming_uniform is the same as initializing with
        # uniform(-1/sqrt(in_features), 1/sqrt(in_features)). For details, see
        # https://github.com/pytorch/pytorch/issues/57109
        torch.nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        fan_in, _ = torch.nn.init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1/math.sqrt(fan_in)
        torch.nn.init.uniform_(self.bias, -bound, bound)

Oh, but that’s not what we should use for ReLU ?! Indeed, you are right, see this issue. The default is kept to avoid breaking with the way Torch (Lua) initialized its layers.
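The comment in `reset_parameters` can be checked against the PReLU constraint given earlier, assuming `kaiming_uniform_` interprets `a` as the negative slope of a leaky rectifier, which is how its documentation describes it:

\[ \begin{align*} \sigma^2_{W} &= \frac{2}{(1+a^2) fan_{in}} = \frac{2}{(1+5) fan_{in}} = \frac{1}{3 fan_{in}} \quad (a=\sqrt{5})\\ \mbox{bound} &= \sqrt{3}\, \sigma_{W} = \frac{1}{\sqrt{fan_{in}}} \end{align*} \]

The He uniform bound for ReLU is $\frac{\sqrt{6}}{\sqrt{fan_{in}}}$, so the default weights are drawn from a range about $\sqrt{6} \approx 2.45$ times narrower.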


import torch
import torch.nn as nn
import torch.nn.init as init

class MyModel(nn.Module):
    def __init__(self, input_size, num_classes):
        super().__init__()
        self.classifier = nn.Sequential(
            *linear_relu(input_size, 256),
            *linear_relu(256, 256),
            nn.Linear(256, num_classes)
        )
        self.init()

    def init(self):
        @torch.no_grad()
        def finit(m):
            # He (Kaiming) uniform for the weights, zeros for the biases
            if isinstance(m, nn.Linear):
                init.kaiming_uniform_(m.weight,
                                      a=0,
                                      mode='fan_in',
                                      nonlinearity='relu')
                m.bias.fill_(0.0)
        # apply() calls finit on every submodule
        self.apply(finit)

def linear_relu(dim_in, dim_out):
    return [nn.Linear(dim_in, dim_out),
            nn.ReLU(inplace=True)]

Internal covariate shift

(Ioffe & Szegedy, 2015) observed the change in distribution of network activations due to the change in network parameters during training.

Experiment : 3 fully connected layers (100 units), sigmoid, softmax output, MNIST dataset

Left : test accuracy. Right : distribution of the activations of the last hidden layer during training ({15, 50, 85}th percentiles).

Batch Normalization

Idea : standardize the activations of every layer to keep the same distributions during training (Ioffe & Szegedy, 2015)

The gradient must be aware of this normalization, otherwise the parameters may explode (see (Ioffe & Szegedy, 2015)) $\rightarrow$ we need a differentiable normalization layer
This introduces a differentiable Batch Normalization layer : \[ z = g(W x + b) \rightarrow z = g(BN(W x)) \]

BN operates element-wise : \[ \begin{align*} y_i &= BN_{\gamma,\beta} (x_i) = \gamma \hat{x}_i + \beta\\ \hat{x}_i &= \frac{x_i - \mu_{\mathcal{B}, i} }{\sqrt{\sigma^2_{\mathcal{B}, i} + \epsilon}} \end{align*} \] with $\mu_{\mathcal{B},i}$ and $\sigma^2_{\mathcal{B},i}$ the mean and variance computed on the mini-batch during training.
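The training-time transform can be sketched in a few lines of plain Python for a $(B, N)$ batch. This is only an illustration under simplifying assumptions (lists instead of tensors, biased variance, a made-up batch), not the layer used in practice:

```python
import math

def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Standardize each feature over the mini-batch, then scale and shift.

    batch: list of B samples, each a list of N features.
    gamma, beta: lists of N scales and shifts (learned in a real layer).
    """
    B, N = len(batch), len(batch[0])
    mean = [sum(x[i] for x in batch) / B for i in range(N)]
    # biased variance (division by B)
    var = [sum((x[i] - mean[i]) ** 2 for x in batch) / B for i in range(N)]
    out = [[gamma[i] * (x[i] - mean[i]) / math.sqrt(var[i] + eps) + beta[i]
            for i in range(N)] for x in batch]
    return out, mean, var

# Two features with very different scales
batch = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0], [6.0, 70.0]]
out, mean, var = batch_norm_train(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With $\gamma = 1$ and $\beta = 0$, each output feature has zero mean and (almost) unit variance over the batch, whatever the scale of the input feature.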

Learning faster, with better generalization.

Batch normalization

During training

put BN layers everywhere along the network, after the linear/conv layers, before the ReLUs
evaluate the statistics $\mu, \sigma$ over the minibatches (over $B$ for $(B, N)$ and over $B\times H \times W$ for $(B, C, H, W)$)
update an exponential moving average of the mean $\mu_{\mathcal{B}}$ and variance $\sigma^2_{\mathcal{B}}$

During inference (test) :

use the running average as the statistics to standardize : this is now just a fixed affine transform.
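To make the "fixed affine transform" explicit, here is a hedged sketch in plain Python. The momentum convention (weight $0.1$ on the newest batch statistic) is an assumption matching common practice, and the batch statistics are made up:

```python
import math

def update_running(running, batch_stat, momentum=0.1):
    """Exponential moving average: running <- (1 - m) * running + m * batch_stat."""
    return [(1.0 - momentum) * r + momentum * s for r, s in zip(running, batch_stat)]

def inference_affine(running_mean, running_var, gamma, beta, eps=1e-5):
    """Fold BN at test time into y = scale * x + shift, per feature."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, running_var)]
    shift = [b - s * m for b, s, m in zip(beta, scale, running_mean)]
    return scale, shift

# Hypothetical statistics observed on 200 training mini-batches
running_mean, running_var = [0.0], [1.0]
for batch_mean, batch_var in [([2.0], [4.0])] * 200:
    running_mean = update_running(running_mean, batch_mean)
    running_var = update_running(running_var, batch_var)

scale, shift = inference_affine(running_mean, running_var, gamma=[1.0], beta=[0.0])
y = scale[0] * 4.0 + shift[0]  # one test sample, no batch statistics needed
```

Because `scale` and `shift` no longer depend on the current batch, the output for a sample at test time does not depend on the other samples it is batched with.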

Do not forget to switch to test mode :


model = MyModel()  # a pytorch nn.Module
# For training
model.train()
# For testing
model.eval()

Some recent works challenge the covariate shift explanation (Santurkar, Tsipras, Ilyas, & Ma, 2018), (Bjorck, Gomes, Selman, & Weinberger, 2018). Instead, BN seems to make the loss smoother, which allows larger learning rates and brings better generalization and robustness to hyperparameters.

Regularization

L2 penalty

Add an L2 penalty on the weights, with $\alpha > 0$

\[ \begin{align*} J(\theta) &= L(\theta) + \frac{\alpha}{2} \|\theta\|^2_2 = L(\theta) + \frac{\alpha}{2}\theta^T \theta\\ \nabla_\theta J &= \nabla_\theta L + \alpha \theta\\ \theta &\leftarrow \theta - \epsilon \nabla_\theta J = (1 - \alpha \epsilon) \theta - \epsilon \nabla_\theta L \end{align*} \] Called L2 regularization, Tikhonov regularization, weight decay
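The name "weight decay" comes from the last line: one gradient step on $J$ is the same as shrinking $\theta$ by $(1-\alpha\epsilon)$ and then stepping on $L$ alone. A small plain-Python check, with made-up values for $\theta$ and $\nabla_\theta L$:

```python
def step_with_penalty(theta, grad_L, alpha, eps):
    """Gradient step on J = L + alpha/2 ||theta||^2."""
    return [t - eps * (g + alpha * t) for t, g in zip(theta, grad_L)]

def step_with_decay(theta, grad_L, alpha, eps):
    """Same update written as a shrink of theta followed by a step on L."""
    return [(1.0 - alpha * eps) * t - eps * g for t, g in zip(theta, grad_L)]

theta = [1.0, -2.0, 0.5]
grad_L = [0.3, -0.1, 0.0]  # made-up gradient of the data term
a = step_with_penalty(theta, grad_L, alpha=0.01, eps=0.1)
b = step_with_decay(theta, grad_L, alpha=0.01, eps=0.1)
```

The third component has a zero data gradient, so it only decays, from $0.5$ to $0.5 \times (1 - 0.001)$.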

Example : RBF, 1 kernel per sample, $N=30$, noisy inputs. [Figure : $\theta^\star$]

See chap 7 of (Goodfellow et al., 2016) for a geometrical interpretation

Intuition : for linear layers, the gradient of the function equals the weights. Small weights $\rightarrow$ small gradient $\rightarrow$ smooth function.

L2 penalty

In theory, regularizing the bias will cause underfitting

Example

\[ \begin{align*} J(w, b) &= \frac{1}{N} \sum_{i=1}^N \| y_i - b - w^T x_i\|_2^2\\ \nabla_b J(w,b) = 0 &\implies b = (\frac{1}{N} \sum_i y_i) - w^T (\frac{1}{N} \sum_i x_i) \end{align*} \]

If your data are centered (as they should be), the optimal bias is the mean of the targets.
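A quick numerical check of this statement, in plain Python with a scalar input and made-up data (the closed-form least-squares fit is used here only for illustration):

```python
def fit_line(xs, ys):
    """Closed-form least squares for y = w * x + b with a scalar x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - w * mx  # b = mean(y) - w * mean(x)
    return w, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
w_raw, b_raw = fit_line(xs, ys)

mean_x = sum(xs) / len(xs)
w_c, b_c = fit_line([x - mean_x for x in xs], ys)  # centered inputs
```

Centering leaves the slope unchanged and moves the bias to the mean of the targets. A penalty on $b$ would pull it toward zero, away from that mean, which is the underfitting mentioned above.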

L1 penalty

Add an L1 penalty to the weights : \[ \begin{align*} J(\theta) &= L(\theta) + \alpha \|\theta\|_1 = L(\theta) + \alpha \sum_i |\theta_i|\\ \nabla_\theta J &= \nabla_\theta L + \alpha \mbox{sign}(\theta) \end{align*} \]
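The difference with the L2 penalty shows up when the data gradient is set to zero: L1 removes the same amount $\alpha\epsilon$ from every weight, whereas L2 removes an amount proportional to the weight. A plain-Python sketch with made-up values, taking $\mbox{sign}(0) = 0$ as the subgradient convention:

```python
def sign(v):
    return (v > 0) - (v < 0)

def l1_step(theta, grad_L, alpha, eps):
    """Subgradient step on J = L + alpha * ||theta||_1."""
    return [t - eps * (g + alpha * sign(t)) for t, g in zip(theta, grad_L)]

def l2_step(theta, grad_L, alpha, eps):
    """Gradient step on J = L + alpha/2 * ||theta||_2^2, for comparison."""
    return [t - eps * (g + alpha * t) for t, g in zip(theta, grad_L)]

theta = [5.0, 0.05, -0.05]
zero_grad = [0.0, 0.0, 0.0]  # isolate the effect of the penalty
after_l1 = l1_step(theta, zero_grad, alpha=0.1, eps=0.1)
after_l2 = l2_step(theta, zero_grad, alpha=0.1, eps=0.1)
```

Small weights lose a large fraction of their magnitude at each L1 step, which is one way to see why L1 tends to produce sparse solutions.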

Example : RBF, 1 kernel per sample, $N=30$, noisy inputs. [Figure : $\theta^\star$]

See chap 7 of (Goodfellow et al., 2016) for a mathematical explanation in a specific case. Sparsity used for feature selection with LASSO (filter/wrapper/embedded).

Dropout

Introduced in (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014):

Idea 1 : preventing co-adaptation. A pattern is robust by itself not because of others doing part of the job.
Idea 2 : average of all the sub-networks (ensemble learning)

How :

for every minibatch, zero the hidden and input activations with probability $p$ ($p=0.5$ for hidden units, $p=0.2$ for inputs). At test time, multiply every activation by the keep probability $1-p$
“Inverted” dropout : divide the kept activations by $1-p$ at train time. At test time, just do a normal forward pass.
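A minimal sketch of inverted dropout in plain Python, assuming $p$ denotes the probability of zeroing an activation, so that the kept activations are rescaled by $1/(1-p)$. The function name and the seed are illustrative:

```python
import random

def inverted_dropout(activations, p_drop, training, rng):
    """Zero each activation with probability p_drop and rescale the kept ones.

    Kept activations are divided by (1 - p_drop) so that the expected value
    is unchanged; at test time the input is returned as is.
    """
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
x = [1.0] * 100_000
train_out = inverted_dropout(x, p_drop=0.5, training=True, rng=rng)
test_out = inverted_dropout(x, p_drop=0.5, training=False, rng=rng)
mean_train = sum(train_out) / len(train_out)
```

The average of `train_out` stays close to the input value, which is why nothing needs to be rescaled at test time.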

Dropout

Usually, after all fully connected layers (p=0.5) and input layer
less usual on convolutional layers (because these are already regularized)

Can be interpreted as if training/averaging all the possible subnetworks.

L1/L2/Dropout in pytorch

L1/L2

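import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self, l2_reg):
        super().__init__()
        self.lin1 = nn.Linear(784, 256)
        self.lin2 = nn.Linear(256, 256)
        self.l2_reg = l2_reg

    def penalty(self):
        # alpha/2 * ||W||^2 on the weights only, not on the biases
        # (for L1, use .abs().sum() instead of .pow(2).sum())
        return 0.5 * self.l2_reg * (self.lin1.weight.pow(2).sum()
                                    + self.lin2.weight.pow(2).sum())

def train():
    ...
    optimizer.zero_grad()
    # The two backward calls accumulate their gradients in .grad
    loss.backward()
    model.penalty().backward()
    optimizer.step()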

Dropout


import torch.nn as nn

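class MyModel(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.classifier = nn.Sequential(
            *dropout_linear_relu(784, 128, 0.5),
            *dropout_linear_relu(128, 256, 0.5),
            nn.Linear(256, num_classes)
        )

def dropout_linear_relu(dim_in, dim_out, p_zeroed):
    # nn.Dropout zeroes with probability p_zeroed and rescales the kept
    # activations by 1/(1 - p_zeroed) during training ("inverted" dropout)
    return [nn.Dropout(p_zeroed),
            nn.Linear(dim_in, dim_out),
            nn.ReLU(inplace=True)]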

Early stopping

Split your data in three sets :

training set : for training ..
validation set: for choosing the hyperparameters (learning rates, number of layers, layer size, momentum, …)
test set : for estimating the generalization error

Everything can be placed in a cross validation loop.

Early stopping is about keeping the model with the lowest validation loss.


best_val_loss = float("inf")
for e in range(nepochs):
    # Training over an epoch
    for X, y in tqdm.tqdm(train_dataloader):
        X, y = X.to(device), y.to(device)
        ...
        optimizer.step()

    # Model checkpoint
    val_loss = test(model, valid_loader)
    if val_loss < best_val_loss:
        torch.save(model.state_dict(), filepath)
        best_val_loss = val_loss

Data, data, we need data !

The best regularizer you may find is data. The more you have, the better you learn.

you can use models pretrained on other tasks as an initialization for learning your task (but this may fail due to domain shift) : check the PyTorch Hub, timm, Hugging Face Hub
you can use unlabeled data for pretraining your networks (as done around 2006) with auto-encoders / RBM : unsupervised/semi-supervised learning. Note also the recent works on self supervision (Balestriero et al., 2023)
you can apply random transformations to your data : dataset augmentation, see for example albumentations.ai

Label smoothing

Introduced in (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2015) in the context of Convolutional Neural Networks.

Idea : prevent the network from being overconfident in its predictions on the training set.

Recipe : in a $k$-class problem, instead of using hard targets $\in \{0, 1\}$, use soft targets $\in \{\frac{\alpha}{k}, 1-\alpha\frac{k-1}{k}\}$ (weighted average between the hard targets and uniform target). $\alpha \approx 0.1$.
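The recipe can be written as a weighted average between the one-hot target and the uniform distribution. A plain-Python sketch; the function names and the two probability vectors are made up for illustration:

```python
import math

def smooth_targets(label, k, alpha=0.1):
    """(1 - alpha) * one_hot(label) + alpha * uniform over the k classes."""
    return [(1.0 - alpha) * (1.0 if c == label else 0.0) + alpha / k
            for c in range(k)]

def cross_entropy(targets, probs):
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

k = 4
targets = smooth_targets(label=2, k=k, alpha=0.1)
confident = [0.001, 0.001, 0.997, 0.001]  # made-up network outputs
moderate = [0.03, 0.03, 0.91, 0.03]
print(targets)
print(cross_entropy(targets, confident), cross_entropy(targets, moderate))
```

With the smoothed targets, the nearly one-hot prediction gets a higher loss than the moderate one, whereas with hard targets it would get the lower loss.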

See also (Müller, Kornblith, & Hinton, 2020) for several experiments.

See also Mixup regularization (Zhang, Cisse, Dauphin, & Lopez-Paz, 2017).

References

Bibliography

Rather check the full online document references.pdf

Arthur, D., & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms.

Balestriero, R., Ibrahim, M., Sobal, V., Morcos, A., Shekhar, S., Goldstein, T., … Goldblum, M. (2023). A cookbook of self-supervised learning. Retrieved from http://arxiv.org/abs/2304.12210

Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures. arXiv:1206.5533 [Cs]. Retrieved from http://arxiv.org/abs/1206.5533

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy Layer-Wise Training of Deep Networks. In (p. 8).

Bjorck, N., Gomes, C. P., Selman, B., & Weinberger, K. Q. (2018). Understanding Batch Normalization. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018) (p. 12).

Broomhead, D., & Lowe, D. (1988). Multivariable Functional Interpolation and Adaptive Networks. Complex Systems, 2, 321–355.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., & LeCun, Y. (2015). The Loss Surfaces of Multilayer Networks. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (p. 13).

Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2016). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv:1511.07289 [Cs]. Retrieved from http://arxiv.org/abs/1511.07289

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314. https://doi.org/10.1007/BF02551274

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, 27, 2933–2941. Retrieved from https://papers.nips.cc/paper/2014/hash/17e23e50bedc63b4095e3d8204ce063b-Abstract.html

Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 12, 2121–2159.

Fritzke, B. (1994). A growing neural gas network learns topologies. In Proceedings of the 7th International Conference on Neural Information Processing Systems (pp. 625–632). Cambridge, MA, USA: MIT Press.

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010 (p. 8).

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.

Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., … Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471–476. https://doi.org/10.1038/nature20101

Griewank, A. (2012). Who Invented the Reverse Mode of Differentiation? Documenta Mathematica, 12.

Griewank, A., & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd edition). Philadelphia, PA: Society for Industrial and Applied Mathematics.

Hastad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the eighteenth annual ACM symposium on Theory of computing (pp. 6–20). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/12130.12132

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852 [Cs]. Retrieved from http://arxiv.org/abs/1502.01852

Hinton, G. E. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504–507. https://doi.org/10.1126/science.1127647

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251–257. https://doi.org/10.1016/0893-6080(91)90009-T

Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (pp. 448–456). PMLR. Retrieved from http://proceedings.mlr.press/v37/ioffe15.html

Jaderberg, M., Simonyan, K., & Zisserman, A. (2015). Spatial Transformer Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems (p. 9).

Hochreiter, J. (1991). Untersuchungen zu dynamischen neuronalen Netzen (PhD thesis). Retrieved from http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv:1609.04836 [Cs, Math]. Retrieved from http://arxiv.org/abs/1609.04836

Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. In arXiv:1412.6980 [cs]. Retrieved from http://arxiv.org/abs/1412.6980

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386

LeCun, Y. A., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). Efficient BackProp. In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade: Second Edition (pp. 9–48). Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-35289-8_3

Li, Y., Wei, C., & Ma, T. (2019). Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. In NIPS 2019 (p. 12).

Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. In arXiv:1608.03983 [cs, math]. Retrieved from http://arxiv.org/abs/1608.03983

Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning (ICML13) (p. 6).

Maclin, R., & Shavlik, J. W. (1995). Combining the predictions of multiple classifiers: Using competitive learning to initialize neural networks. In Proceedings of the 14th international joint conference on Artificial intelligence - Volume 1 (pp. 524–530). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Montufar, G. F., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the Number of Linear Regions of Deep Neural Networks. In Advances in Neural Information Processing Systems (Vol. 27, pp. 2924–2932). Retrieved from https://papers.nips.cc/paper/2014/hash/109d2dd3608f669ca17920c511c2a41e-Abstract.html

Müller, R., Kornblith, S., & Hinton, G. (2020). When Does Label Smoothing Help? arXiv:1906.02629 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1906.02629

Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (pp. 807–814). Madison, WI, USA: Omnipress.

Nielsen, M. A. (2015). Neural Networks and Deep Learning. Retrieved from http://neuralnetworksanddeeplearning.com

Olah, C. (2015). Calculus on computational graphs: Backpropagation.

Park, J., & Sandberg, I. W. (1991). Universal Approximation Using Radial-Basis-Function Networks. Neural Computation, 3(2), 246–257. https://doi.org/10.1162/neco.1991.3.2.246

Pascanu, R., Dauphin, Y. N., Ganguli, S., & Bengio, Y. (2014). On the saddle point problem for non-convex optimization. arXiv:1405.4604 [Cs]. Retrieved from http://arxiv.org/abs/1405.4604

Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (p. 9).

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., … Lerer, A. (2017). Automatic differentiation in PyTorch. In Proceedings of the 31st International Conference on Neural Information Processing Systems (p. 4).

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0

Santurkar, S., Tsipras, D., Ilyas, A., & Ma, A. (2018). How Does Batch Normalization Help Optimization? In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018) (p. 11).

Schmidhuber, J. (1992). Learning Complex, Extended Sequences Using the Principle of History Compression. Neural Computation, 4(2), 234–242. https://doi.org/10.1162/neco.1992.4.2.234

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117. https://doi.org/10.1016/j.neunet.2014.09.003

Schwenker, F., Kestler, H. A., & Palm, G. (2001). Three learning phases for radial-basis-function networks. Neural Networks, 14(4-5), 439–458. Retrieved from http://dblp.uni-trier.de/db/journals/nn/nn14.html#SchwenkerKP01

Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv:1803.09820 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1803.09820

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(56), 1929–1958. Retrieved from http://jmlr.org/papers/v15/srivastava14a.html

Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In ICML (p. 14).

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. arXiv:1512.00567 [Cs]. Retrieved from http://arxiv.org/abs/1512.00567

Werbos, P. (1981). Application of advances in nonlinear sensitivity analysis. In Proc. Of the 10th IFIP conference (pp. 762–770).

Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701 [Cs]. Retrieved from http://arxiv.org/abs/1212.5701

Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). Mixup: Beyond empirical risk minimization. arXiv. https://doi.org/10.48550/ARXIV.1710.09412

Zhang, J., & Mitliagkas, I. (2018). YellowFin and the Art of Momentum Tuning. arXiv:1706.03471 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1706.03471