# Deep learning

An introduction to deep learning

September 18, 2021

## Architecture (Broomhead & Lowe, 1988)

• Radial basis function networks (RBFN) are prototype-based function approximators.
• specific architecture with a single layer of learnable feature vectors with “weights” (parameters) $$(\mu_j, \sigma_j)_{j\in[0..N_a-1]}$$

$\begin{eqnarray*} \phi(x) = \begin{pmatrix} 1 \\ \exp{\frac{-||x-\mu_0||^2}{2\sigma_0^2}} \\ \vdots\\ \exp{\frac{-||x-\mu_{N_a-1}||^2}{2\sigma_{N_a-1}^2}} \\ \end{pmatrix} \end{eqnarray*}$

Regression

• identity transfer function $$y = w^T \phi(x)$$
• L2 loss
$$L(y, \hat{y}) = \|\hat{y} - y\|^2$$

Binary classification

• sigmoidal transfer function $$y = \sigma(w^T \phi(x))$$
• CE loss $\begin{array}{l} L(y, \hat{y}) =&-y \log(\hat{y})\\ &-(1-y) \log(1-\hat{y}) \end{array}$

Multiclass classification

• softmax transfer function (see Lecture 1)
• CE loss (see Lecture 1)

## Learning

• We know how to learn the weights $$w$$ : minibatch gradient descent (or a variant thereof)

• What about the centers and variances ? (Schwenker, Kestler, & Palm, 2001)

• place them uniformly, randomly, by vector quantization (K-means++(Arthur & Vassilvitskii, 2007), GNG (Fritzke, 1994))

• two phases : fix the centers/variances, fit the weights

• three phases : fix the centers/variances, fit the weights, fit everything ($$\nabla_{\mu} L, \nabla_{\sigma} L, \nabla_w L$$)
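The two-phase strategy can be sketched as follows. This is a minimal illustration under assumptions that are not in the slides: the centers are random training samples, a single shared width `sigma` is used, and the weights are fitted by ordinary least squares (instead of minibatch gradient descent) since the L2 problem is linear in $$w$$. The toy target and all names are hypothetical.

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """phi(x): a constant 1 followed by one Gaussian per center."""
    # x: (n, d), centers: (n_a, d) -> squared distances: (n, n_a)
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    gauss = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), gauss])

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0, size=(200, 1))   # toy 1D inputs (assumption)
y = np.sin(x[:, 0])                         # toy regression target (assumption)

# Phase 1: fix the centers (random training samples) and a shared width
centers = x[rng.choice(len(x), size=10, replace=False)]
sigma = 1.0

# Phase 2: fit the output weights w of y = w^T phi(x) by least squares
phi = rbf_features(x, centers, sigma)
w, *_ = np.linalg.lstsq(phi, y, rcond=None)

train_mse = np.mean((phi @ w - y) ** 2)
```

A third phase would start from these values and refine centers, widths and weights jointly by gradient descent.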

## Universal approximator

Theorem : Universal approximation (Park & Sandberg, 1991)

Denote $$\mathcal{S}$$ the family of functions based on RBF in $$\mathbb{R}^d$$: $\mathcal{S} = \{g : \mathbb{R}^d \to \mathbb{R}, g(x) = \sum_{i=1}^{N} w_i K(\frac{x-\mu_i}{\sigma}), N \in \mathbb{N}, w \in \mathbb{R}^N, \mu_i \in \mathbb{R}^d, \sigma > 0\}$ with $$K : \mathbb{R}^d \rightarrow \mathbb{R}$$ integrable, bounded, continuous almost everywhere and $$\int_{\mathbb{R}^d}K(x)dx \neq 0$$.
Then $$\mathcal{S}$$ is dense in $$L^p(\mathbb{R}^d)$$ for every $$p \in [1, \infty)$$

In particular, it applies to the Gaussian kernel introduced before.

# Feedforward neural networks (FNN)

## Architecture

Vocabulary

• Depth : number of weight layers
• Width : number of units per layer
• Parameters : Weights and biases for every unit
• Skip layer connections can bypass layers
• one hidden transfer function $$f$$, one task-specific output transfer function $$g$$

## Hidden transfer function

• historically: hyperbolic tangent $$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$ or sigmoid $$\sigma(x) = \frac{1}{1 + \exp(-x)}$$
• now mainly Rectified Linear Units (ReLu) (Nair & Hinton, 2010), (Krizhevsky, Sutskever, & Hinton, 2012) or variants : $\mbox{relu}(x) = \max(x, 0)$

ReLu are more favorable for the gradient flow than the saturating functions (more on that later when discussing computational graphs and gradient computation).

## Some other recent hidden transfer functions

Relu (Nair & Hinton, 2010)

$\begin{equation*} f(x) = \begin{cases} x & \mbox{ if } x \geq 0\\ 0 & \mbox { if } x < 0 \end{cases} \end{equation*}$

Leaky Relu
(Maas, Hannun, & Ng, 2013)
Parametric ReLu
(He, Zhang, Ren, & Sun, 2015)

$\begin{equation*} f(x) = \begin{cases} x & \mbox{ if } x \geq 0\\ \alpha x & \mbox { if } x < 0 \end{cases} \end{equation*}$

Exponential Linear Unit
(Clevert, Unterthiner, & Hochreiter, 2016)

$\begin{equation*} f(x) = \begin{cases} x & \mbox{ if } x \geq 0\\ \alpha (\exp(x) - 1) & \mbox { if } x < 0 \end{cases} \end{equation*}$
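The three piecewise definitions above translate directly into code. The sketch below uses scalar inputs and plain Python; the default values of `alpha` are illustrative assumptions, not values prescribed by the slides.

```python
import math

def relu(x):
    return max(x, 0.0)

def leaky_relu(x, alpha=0.01):
    # alpha is fixed for Leaky ReLu and learned for Parametric ReLu
    return x if x >= 0 else alpha * x

def elu(x, alpha=1.0):
    # Smoothly saturates to -alpha for very negative inputs
    return x if x >= 0 else alpha * (math.exp(x) - 1.0)
```

All three are the identity for positive inputs; they only differ in how much signal (and gradient) they let through for negative inputs.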

## Output transfer function

Exactly as for the RBF networks, this is task dependent.

Regression

• identity transfer function $$y = w^T \phi(x)$$
• L2 loss
$$L(y, \hat{y}) = \|\hat{y} - y\|^2$$

Binary classification

• sigmoidal transfer function $$y = \sigma(w^T \phi(x))$$
• CE loss $\begin{array}{l} L(y, \hat{y}) =&-y \log(\hat{y})\\ &-(1-y) \log(1-\hat{y}) \end{array}$

Multiclass classification

• softmax transfer function (see Lecture 1)
• CE loss (see Lecture 1)

## Universal approximation

Any well behaved function can be approximated arbitrarily well with a single hidden layer FNN (Cybenko, 1989), (Hornik, 1991)

Intuition

• Transform the input with a linear transform $$y=w^Tx$$
• Take a sigmoid transfer function $$z = f(y) = \frac{1}{1+e^{-y}}$$ : this is the output of the hidden layer
• combine multiple activities in the $$z-$$layer to build up Gaussian-like kernels : subtracting $$z-$$layer activities produces RBF-like kernels
• weight such subtractions and you are back to the RBF universal approximation theorem
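The intuition can be checked numerically in one dimension: the difference of two shifted sigmoids is a localized bump. This is only a sketch; the `steepness` parameter and the function names are assumptions introduced for the illustration.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def bump(x, center, half_width, steepness=20.0):
    """Difference of two hidden sigmoid units with shifted thresholds."""
    left = sigmoid(steepness * (x - (center - half_width)))
    right = sigmoid(steepness * (x - (center + half_width)))
    return left - right
```

The bump is close to 1 inside `[center - half_width, center + half_width]` and close to 0 far from it, so a weighted sum of such bumps behaves like a weighted sum of kernels.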

At that point, you may wonder why we bother about deep learning, right ?

## Why do we bother about deep learning ?

• Single hidden layer FNN are universal approximators but the hidden layer may need to be arbitrarily large
• a deep network (large number of layers) builds high level features by composing lower level features which can be reused by multiple units. Image analogy :
• first layer : extract oriented contours, texture filters, ..
• second layer : learn corners, crosses, curves, by combining contours
• next layers : build up more and more complex features
• a shallow network must learn all the possibly complex filters at once, no real way to compose
• early theoretical results on logic gates circuits (Hastad, 1986). More recent works on ReLU FNN (Montufar, Pascanu, Cho, & Bengio, 2014)

## Training : error backpropagation

Training is performed by gradient descent which was popularized by (Rumelhart, Hinton, & Williams, 1986) who called it error backpropagation (but (Werbos, 1981) already introduced the idea, see (Schmidhuber, 2015)).

Gradient descent is an iterative algorithm :

• initialize the weights and biases : $$w_0$$
• at every iteration compute : $w \leftarrow w - \epsilon \nabla_w J$

Remember : by minibatch gradient descent (see Lecture 1)

The question is : how do you compute $$\frac{\partial J}{\partial w_i}$$ ??

## Example on a regression problem

0- Imports


import torch
import torch.nn as nn
import torch.optim as optim

import sklearn
import sklearn.datasets

import tqdm

import matplotlib.pyplot as plt


1- Load the data

data = sklearn.datasets.fetch_california_housing()
# X is (20640, 8), y is (20640, )
X, y = data.data, data.target

# At least normalize the input for an easier optimization
mean, std = X.mean(axis=0), X.std(axis=0)
X = (X - mean)/std

X_train = torch.tensor(X).float()
y_train = torch.tensor(y).float()

# A dataset defines __len__ and __getitem__
# it can also be an iterable dataset
train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
# A dataloader will create the minibatches
train_dataloader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=64,
                                               shuffle=True,
                                               pin_memory=True)

Doc: Dataset, DataLoader, pin memory.

Iterating over train_dataloader gives a pair of tensors of shape $$(64, 8)$$ and $$(64,)$$.

## Example on a regression problem

2- Define the network


if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Build up the model
Nh = 64
model = nn.Sequential(
    nn.Linear(8, Nh), nn.ReLU(),
    nn.Linear(Nh, Nh), nn.ReLU(),
    nn.Linear(Nh, Nh), nn.ReLU(),
    nn.Linear(Nh, 1)
)

model.to(device)

Doc: Linear, Sequential

## Example on a regression problem

3- Define the loss, optimizer, callbacks, …


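# Define the loss
loss = nn.MSELoss()

# Define the gradient descent algorithm
# (Adam is assumed here, as in the later snippets of this lecture)
optimizer = optim.Adam(model.parameters(),
                       lr=0.001)
# Halve the learning rate every 10 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer,
                                      step_size=10,
                                      gamma=0.5)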


## Example on a regression problem

4- Iterate and monitor


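num_epochs = 50  # assumed value, not given on the slide
for e in range(num_epochs):
    model.train()
    print(f"Epoch {e}")
    for X, y in tqdm.tqdm(train_dataloader):
        X, y = X.to(device), y.to(device)

        # Forward pass
        y_pred = model(X).squeeze()
        loss_value = loss(y_pred, y)

        # Backward pass : reset the gradients, backpropagate, update
        optimizer.zero_grad()
        loss_value.backward()
        optimizer.step()

    # mseloss is the function of the Evaluation part (define it before this loop)
    print(f"MSELoss on the training set : {mseloss(train_dataloader)}")
    scheduler.step()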

Evaluation


# After every epoch, compute the risk
def mseloss(loader):
    cum_loss = 0.0
    n_samples = 0
    model.eval()
    # No gradient is needed for the evaluation
    with torch.no_grad():
        for X, y in loader:
            X, y = X.to(device), y.to(device)

            # Forward pass
            y_pred = model(X).squeeze()
            loss_value = loss(y_pred, y)

            # The loss is 'mean' reduced so be careful when cumulating it
            cum_loss += loss_value.item() * y_pred.size(0)
            n_samples += y_pred.size(0)
    return cum_loss/n_samples

# Computational graph and differentiable programming

## Computational graph

A computational graph is a directed acyclic graph where nodes are :

• variables (weights, inputs, outputs, targets, …)
• operations (ReLu, Softmax, $$w^Tx + b$$, losses, updates, ..)

Example graph for a linear regression $$\mathbb{R}^8 \mapsto \mathbb{R}$$ with minibatch $$(X, y)$$

$J = \frac{1}{M} \sum_{i=0}^{63} (w_1^T x_i + b_1 - y_i)^2$

## Computational graph

Problem : computing the partial derivatives with respect to the variables $$\frac{\partial J}{\partial var}$$.

You just need to provide the local derivatives of the output w.r.t the inputs.

And then apply the chain rule.

ex : $$\frac{\partial J}{\partial w_1} \in \mathcal{M}_{1, 8}(\mathbb{R})$$, assuming numerator layout

## Computational graph : the chain rule

Numerator layout convention (otherwise, we transpose and reverse the jacobian product order):

The derivative of a scalar with respect to a vector is a row vector : $y \in \mathbb{R}, x \in \mathbb{R}^n, \frac{dy}{dx} \in \mathcal{M}_{1, n}(\mathbb{R})$

More generally, the derivative of a vector valued function $$y : \mathbb{R}^{n_x} \mapsto \mathbb{R}^{n_y}$$ with respect to its input (the Jacobian) is a $$n_y \times n_x$$ matrix :

$x \in \mathbb{R}^{n_x}, y(x) \in \mathbb{R}^{n_y}, \frac{dy}{dx}(x) = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_{1}}{\partial x_{n_x}} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_{2}}{\partial x_{n_x}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_{n_y}}{\partial x_1} & \frac{\partial y_{n_y}}{\partial x_2} & \cdots & \frac{\partial y_{n_y}}{\partial x_{n_x}} \end{bmatrix}(x)$

## Computational graph : the chain rule

For a (single-path) chain $$y_1 \rightarrow y_2 = f_1(y_1) \rightarrow y_3 = f_2(y_2) \cdots y_n = f_{n-1}(y_{n-1})$$, of vector valued functions $$y_1 \in \mathbb{R}^{n_1}, y_2\in\mathbb{R}^{n_2}, \cdots y_n \in \mathbb{R}^{n_n}$$,

$\frac{\partial y_n}{\partial y_1} = \frac{\partial y_n}{\partial y_{n-1}}\frac{\partial y_{n-1}}{\partial y_{n-2}}\cdots\frac{\partial y_2}{\partial y_1}$

ex : $$\frac{\partial J}{\partial w_1} = \frac{\partial J}{\partial y_1} \frac{\partial y_1}{\partial w_1} = \frac{2}{M} (y_1-y)^T X \in \mathcal{M}_{1, 8}(\mathbb{R})$$, assuming numerator layout

For matrix variables, we should be introducing tensors.
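The chain-rule expression for the linear regression example, $$\frac{2}{M}(y_1 - y)^T X$$ with $$M$$ the minibatch size, can be checked against finite differences. The sketch below uses random data and NumPy; the variable names and the value of `b1` are assumptions made for the check.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d = 64, 8                      # minibatch size and input dimension
X = rng.normal(size=(M, d))
y = rng.normal(size=M)
w1 = rng.normal(size=d)
b1 = 0.5

def J(w):
    """Mean squared error of the linear model on the minibatch."""
    return np.mean((X @ w + b1 - y) ** 2)

# Chain rule: dJ/dy1 = (2/M)(y1 - y)^T and dy1/dw1 = X
y1 = X @ w1 + b1
grad_chain = (2.0 / M) * (y1 - y) @ X

# Central finite differences, one coordinate of w1 at a time
eps = 1e-6
grad_fd = np.array([
    (J(w1 + eps * e) - J(w1 - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
```

Both vectors should agree up to rounding errors, which is a cheap sanity check whenever a derivative is written by hand.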

## The chain rule : multiple paths

For multiple paths, in principle we sum over all the paths : $\frac{\partial y}{\partial x} = \sum_{i=3,4} \frac{\partial y}{\partial y_i}\frac{\partial y_i}{\partial x} = y_4 \frac{\partial y_3}{\partial x} + y_3 \frac{\partial y_4}{\partial x} = y_4 f'(y_1) w_1^T + y_3 f'(y_2) w_2^T \in \mathbb{R}^8$

## The chain rule : multiple paths

But this can be computationally (too) expensive :

• there can be many paths you need to identify and sum over : L layers, N units, $$N^L$$ paths
• and you must repeat the process for every variable w.r.t. which you want to differentiate
• some computations can be factored (e.g. $$\frac{\partial y_2}{\partial x}$$, $$\frac{\partial y_1}{\partial x}$$)

## Automatic differentiation : forward mode

Let us be more efficient : forward mode differentiation

Idea: To compute $$\frac{\partial y}{\partial x}$$, forward propagate $$\frac{\partial }{\partial x}$$
e.g. $$\frac{\partial y}{\partial x} = e^{y_1+y_2} \left[ y_2(1+y_1)w_1^T + y_1(1+y_2)w_2^T\right]$$

Welcome to the field of automatic differentiation (AD). For more, see (Griewank, 2012), (Griewank & Walther, 2008) (see also (Olah, 2015), (Paszke et al., 2017))
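Forward mode can be illustrated with dual numbers: every intermediate value carries its derivative with respect to one chosen input, and both are propagated together. This is a minimal sketch supporting only `+`, `*` and `exp`; the class `Dual` and the test function `f` are hypothetical names, not part of any library.

```python
import math

class Dual:
    """A value together with its derivative w.r.t. the chosen input."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule applied locally
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

def exp(d):
    return Dual(math.exp(d.value), math.exp(d.value) * d.deriv)

def f(x, w):
    return exp(w * x) * x + 3.0

# Seed dx/dx = 1 to obtain df/dx along with f(x)
out = f(Dual(2.0, 1.0), 0.5)
```

Here `out.deriv` holds $$\frac{df}{dx} = e^{wx}(1 + wx)$$ evaluated at $$x=2, w=0.5$$. Getting $$\frac{df}{dw}$$ requires a second forward sweep seeded on `w`, which is why forward mode scales with the number of inputs.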

## Automatic differentiation : reverse mode

Let us be (sometimes) even more efficient : reverse mode differentiation

Idea: To compute $$\frac{\partial y}{\partial x}$$, backward propagate $$\frac{\partial y}{\partial }$$ (compute the adjoint)
e.g. $$\frac{\partial y}{\partial x} = (y_3 y_4 + y_3 e^{y_1})w_2^T + (y_4y_3 + y_4 e^{y_2})w_1^T$$

Oh ! We also got $$\frac{\partial y}{\partial w_2}$$, $$\frac{\partial y}{\partial b_2}$$, …

This is more efficient than forward mode when we have many more inputs than outputs.

See : A Review of Automatic Differentiation and its Efficient Implementation
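Reverse mode can be sketched with a tiny scalar graph: the forward pass records, for every node, its parents and the local derivatives, and a single backward sweep accumulates the adjoints of all the variables. The class `Var` and the function `backward` below are hypothetical names for the illustration; only `+` and `*` are supported.

```python
class Var:
    """Scalar node recording its parents and the local derivatives."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuples (parent_node, d self / d parent)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def backward(output):
    # Topological order of the graph, then one sweep from output to inputs
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            # Sum over all the paths reaching the parent
            parent.grad += node.grad * local

a, b = Var(2.0), Var(3.0)
y = a * b + a * a      # several paths from a to y
backward(y)
```

After one call to `backward`, both `a.grad` and `b.grad` are available, whereas forward mode would need one sweep per input.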

In (Rumelhart et al., 1986), the algorithm was called “error backpropagation” : why ?

Suppose a 2-layer multi-layer feedforward network and propagating one sample, with a scalar loss : $L = g( y_i, \begin{bmatrix} & & \\ & W_2 (n_2 \times n_1) & \\ & & \end{bmatrix} f( \begin{bmatrix} & & \\ & W_1 (n_1 \times n_x) & \\ & & \end{bmatrix} \begin{bmatrix} \\ x_i \\ \phantom{} \end{bmatrix} )) \in \mathbb{R}$

$$g$$ could be a squared loss for regression (with $$n_2=1$$), or CrossEntropyLoss (with logits and $$n_2=n_{class}$$) for multiclass classification.

We denote $$z_1 = W_1 x_i, z_2 = W_2 f(z_1)$$ and $$\delta_i = \frac{\partial L}{\partial z_i} \in \mathbb{R}^{n_i}$$.

Then : \begin{align} \delta_2 &= \frac{\partial L}{\partial z_2} = \frac{\partial g(x_1, x_2)}{\partial x_2}(y_i, z_2) \\ \delta_1 &= \frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial z_2} \frac{\partial z_2}{\partial z_1} = \begin{bmatrix} & & \delta_2 & & \end{bmatrix} \begin{bmatrix} & & \\ & W_2(n_2 \times n_1) & \\ & & \phantom{}\end{bmatrix} \text{diag}(f'(z_1)) \end{align} The errors of $$\delta_2$$ are integrated back through the weight matrix that was used for the forward pass. (See also (Nielsen, 2015), chap 2).

## Gradient descent in deep learning

Training in two phases

• Evaluation of the outputs : forward propagation / forward pass
• Evaluation of the gradient : reverse-mode differentiation / backward pass

Warning : the reverse-mode differentiation uses the variables computed in the forward pass

$$\rightarrow$$ we can efficiently apply stochastic gradient descent to optimize the parameters of our neural networks !

Note the computational graph can be extended to encompass the operations of the backward pass.

• The deep learning frameworks all compute the backward pass automatically.

optimizer = optim.Adam(model.parameters())

for e in range(epochs):
    ...
    loss.backward()
    optimizer.step()

• The computational graphs can be built dynamically (eager mode) or statically (JIT)

• You can also define your own differentiable function by providing its forward and backward passes

from torch.autograd import Function

class MyFunction(Function):

    @staticmethod
    def forward(ctx, input, *args):
        ...

    @staticmethod
    def backward(ctx, grad_output):
        ...
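As a concrete instance of the skeleton above, here is a hedged sketch of a custom function computing $$x^2$$ element-wise with a hand-written backward pass. The class name `Square` is an assumption for the example; the calls used (`save_for_backward`, `saved_tensors`, `apply`) are part of the `torch.autograd.Function` interface.

```python
import torch
from torch.autograd import Function

class Square(Function):
    @staticmethod
    def forward(ctx, x):
        # Keep what the backward pass will need
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Chain rule: incoming gradient times the local derivative 2x
        return grad_output * 2.0 * x

x = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
y = Square.apply(x).sum()
y.backward()
```

After `y.backward()`, `x.grad` holds the gradient computed by the custom `backward`, here $$2x$$.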

## The way toward differentiable programming

The computational graph is a central notion in modern neural networks/deep learning. Differentiable programming broadens the scope.

In recent years, “fancier” differentiable blocks other than $$f(W f(W..))$$ have been introduced, some of which are dynamically built (eager mode).

Spatial Transformer Networks

Neural Turing Machine (Graves et al., 2016)

## Does it make sense to use gradient descent ?

Indeed :

• we cannot do better than a local minimum
• neural networks lead to non convex optimization. For example, consider a 2-layer FNN :

$y_1 = W_1 x, \quad y_2 = W_2 f(y_1)$

But empirically, most local minima are close (in performance) to the global minimum, especially with large/deep networks. See (Dauphin et al., 2014), (Pascanu, Dauphin, Ganguli, & Bengio, 2014), (Choromanska, Henaff, Mathieu, Arous, & LeCun, 2015). Saddle points seem to be more critical.

## First order methods : Minibatch stochastic gradient descent

Algorithm

• Start at $$\theta_0$$
• for every minibatch : \begin{align*} \theta(t+1) &= \theta(t) - \epsilon \nabla_\theta L(\theta(t))\\ L(\theta) &= \frac{1}{M} \sum_i J(\theta, x_i, y_i) \end{align*}

Rationale (Taylor expansion) : $$L(\theta_{t+1}) \approx L(\theta_{t}) + (\theta_{t+1} - \theta_{t})^T \nabla_{\theta} L(\theta_{t})$$

The choice of the batch size :

• Stochastic gradient descent (small minibatch, $$M=1$$) : noisy estimate, not GPU friendly
• Batch Gradient descent ($$M=N$$) : More GPU friendly. But more prone to bad generalization (generalization gap) and to local minima (Keskar, Mudigere, Nocedal, Smelyanskiy, & Tang, 2017). Roughly speaking, we should avoid sharp minima.

The optimization may converge slowly or even diverge if the learning rate $$\epsilon$$ is not appropriate.

## Choosing a learning rate

The impact of the learning rate on the optimization (LeCun, Bottou, Orr, & Müller, 1998)

Bengio: “The optimal learning rate is usually close to the largest learning rate that does not cause divergence of the training criterion” (Bengio, 2012)

Karpathy: “$$0.0003$$ is the best learning rate for Adam, hands down.” (Twitter, 2016)

(Note: Adam will be discussed in a few slides)

- Practical Recommendations for gradient-based training of deep architectures (Bengio, 2012)
- Efficient Backprop (LeCun et al., 1998)

## Example regression problem

Setup

• $$N=30$$ samples generated with : $y = 3 x + 2 + \mathcal{U}(-0.1, 0.1)$
• Model : $$f_\theta(x) = \theta^T\begin{bmatrix} 1 \\ x\end{bmatrix}$$,
• L2 loss : $$L(y_i, f_{\theta}(x_i)) = (y_i - f_{\theta}(x_i))^2$$

## Example using SGD

Parameters : $$\epsilon=0.005$$, $$\theta_0 = \begin{bmatrix} 10 \\ 5\end{bmatrix}$$

Converges to $$\theta_{\infty} = \begin{bmatrix} 1.9882 \\ 2.9975\end{bmatrix}$$

Figure : the components of $$\nabla_\theta J$$
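A pure Python sketch of this experiment is given below, with minibatches of size 1. The distribution of the inputs is not specified in the setup, so $$x \sim \mathcal{U}(0, 1)$$ is an assumption, as are the random seed and the number of epochs; the values reached will therefore differ slightly from the ones reported above.

```python
import random

random.seed(0)
N = 30
xs = [random.uniform(0.0, 1.0) for _ in range(N)]     # assumed input range
ys = [3.0 * x + 2.0 + random.uniform(-0.1, 0.1) for x in xs]

epsilon = 0.005
theta = [10.0, 5.0]            # theta[0]: bias, theta[1]: slope

for epoch in range(2000):
    order = list(range(N))
    random.shuffle(order)
    for i in order:                                   # minibatch of size 1
        residual = theta[0] + theta[1] * xs[i] - ys[i]
        # Gradient of (y_i - f_theta(x_i))^2 w.r.t. theta
        grad = [2.0 * residual, 2.0 * residual * xs[i]]
        theta = [t - epsilon * g for t, g in zip(theta, grad)]
```

With such a small learning rate, the bias and slope end up close to the generating values 2 and 3, after a slow drift along the poorly conditioned direction of the loss.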

## First order methods : momentum

Algorithm : Let us damp the oscillations with a low pass on $$\nabla_{\theta}$$

• Start at $$\theta_0$$, $$v_0 = 0$$
• for every minibatch : \begin{align*} v(t+1) &= \mu v(t) - \epsilon \nabla_{\theta} J(\theta(t))\\ \theta(t+1) &= \theta(t) + v(t+1) \end{align*}

Usually $$\mu \approx 0.9$$ or $$0.99$$.

• as an exponential moving average, it low-pass filters the gradient and therefore dampens oscillations along fast varying dimensions
• it can accelerate (increase the effective learning rate) in constant directions (or low curvature).
If $$\nabla_{\theta} J = g$$ is constant and $$v(0) = 0$$, then $v(t) = -\epsilon g \sum_{i=0}^{t-1} \mu^i = -\epsilon g \frac{1-\mu^{t}}{1-\mu}$ so the effective learning rate tends to $$\frac{\epsilon}{1-\mu}$$.

## Example using SGD with momentum

Parameters : $$\epsilon=0.005$$, $$\mu=0.6$$, $$\theta_0 = \begin{bmatrix} 10 \\ 5\end{bmatrix}$$

Converges to $$\theta_{\infty} = \begin{bmatrix} 1.9837 \\ 2.9933\end{bmatrix}$$

## First order methods : Nesterov momentum

Idea: look ahead to potentially correct the update. Based on Nesterov Accelerated Gradient, in the formulation of (Sutskever, Martens, Dahl, & Hinton, 2013)

Algorithm

• Start at $$\theta_0$$, $$v_0 = 0$$
• for every minibatch : \begin{align*} \overline{\theta}(t) &= \theta(t) + \mu v(t)\\ v(t+1) &= \mu v(t) - \epsilon \nabla_{\theta}J(\overline{\theta}(t))\\ \theta(t+1) &= \theta(t) + v(t+1) \end{align*}

## Example using SGD with Nesterov momentum

Parameters : $$\epsilon=0.005$$, $$\mu=0.8$$, $$\theta_0 = \begin{bmatrix} 10 \\ 5\end{bmatrix}$$

Converges to $$\theta_{\infty} = \begin{bmatrix} 1.9738 \\ 2.9914\end{bmatrix}$$
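The momentum and Nesterov updates only differ by the point where the gradient is evaluated, which the following sketch makes explicit. To keep it deterministic, it assumes noise-free data on a regular grid in $$[0, 1]$$ and full-batch gradients, which is not the exact setting of the experiments above; the helper names `grad` and `run` are hypothetical.

```python
def grad(theta, xs, ys):
    """Gradient of the mean squared error of f_theta(x) = theta[0] + theta[1] x."""
    n = len(xs)
    g0 = sum(2.0 * (theta[0] + theta[1] * x - y) for x, y in zip(xs, ys)) / n
    g1 = sum(2.0 * (theta[0] + theta[1] * x - y) * x for x, y in zip(xs, ys)) / n
    return [g0, g1]

def run(xs, ys, epsilon, mu, nesterov, steps):
    theta, v = [10.0, 5.0], [0.0, 0.0]
    for _ in range(steps):
        # Nesterov evaluates the gradient at the look-ahead point theta + mu v
        point = [t + mu * vi for t, vi in zip(theta, v)] if nesterov else theta
        g = grad(point, xs, ys)
        v = [mu * vi - epsilon * gi for vi, gi in zip(v, g)]
        theta = [t + vi for t, vi in zip(theta, v)]
    return theta

# Noise-free data so that the minimum is exactly (2, 3)
xs = [i / 29.0 for i in range(30)]
ys = [3.0 * x + 2.0 for x in xs]

theta_momentum = run(xs, ys, epsilon=0.005, mu=0.6, nesterov=False, steps=10000)
theta_nesterov = run(xs, ys, epsilon=0.005, mu=0.8, nesterov=True, steps=10000)
```

Setting `mu=0` in `run` recovers plain gradient descent, which makes the three first order methods easy to compare on the same data.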

## Comparison of the 1st order methods

On our simple regression problem

## First order : adaptive learning rate

You should always adapt your learning rate with a learning rate scheduler

• Linear decrease from $$\epsilon_0$$ down to $$\epsilon_f$$
• halve the learning rate when the validation error stops improving
• halve the learning rate on a fixed schedule (every $$50$$ epochs)

Figure : ResNet training curves. “The learning rate starts from 0.1 and is divided by 10 when the error plateaus”

Some more recent approaches are changing the picture of “decreasing learning rate” (“Robbins Monro conditions”)

See (Smith, 2018), The 1cycle policy - S. Gugger

Stochastic Gradient Descent with Warm Restart (Loshchilov & Hutter, 2017)

The improved performance may be linked to reaching flatter minima (i.e. with predictions less sensitive to the parameters than at sharper minima). The models reached before the warm restarts can be averaged (see Snapshot ensemble).

It also seems that initial large learning rates tend to lead to better models in the long run (Li, Wei, & Ma, 2019)
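Two of the schedules mentioned above can be written as plain functions of the epoch. This is a sketch with assumed default values; the warm restart variant uses a fixed period, whereas (Loshchilov & Hutter, 2017) also allow the period to grow after each restart.

```python
import math

def step_decay(epoch, lr0=0.1, factor=0.5, every=50):
    """Multiply the learning rate by `factor` every `every` epochs."""
    return lr0 * factor ** (epoch // every)

def cosine_warm_restarts(epoch, lr_max=0.1, lr_min=0.0, period=10):
    """Cosine annealing from lr_max to lr_min, restarted every `period` epochs."""
    t_cur = epoch % period
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t_cur / period))
```

In PyTorch, the same behaviours are available as schedulers in `torch.optim.lr_scheduler`, so such functions are mostly useful for plotting or reasoning about a schedule.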

## Adaptive first order : Adagrad

Adagrad (Duchi, Hazan, & Singer, 2011)

• Accumulate the square of the gradient $r(t+1) = r(t) + \nabla_{\theta}J(\theta(t)) \odot \nabla_{\theta}J(\theta(t))$
• Scale individually the learning rates $\theta(t+1) = \theta(t) - \frac{\epsilon}{\delta + \sqrt{r(t+1)}} \odot \nabla_{\theta}J(\theta(t))$

The $$\sqrt{.}$$ is experimentally critical ; $$\delta \in [10^{-8}, 10^{-4}]$$ for numerical stability.

Small gradients $$\rightarrow$$ bigger learning rate for moving fast along flat directions
Big gradients $$\rightarrow$$ smaller learning rate to calm down on high curvature.

But accumulation from the beginning is too aggressive. Learning rates decrease too fast.

## Adaptive first order : RMSprop

RMSprop, Hinton (unpublished, Coursera)

Idea: we should be using an exponential moving average when accumulating the gradient.

• Accumulate the square of the gradient $r(t+1) = \rho r(t) + (1-\rho)\nabla_{\theta}J(\theta(t)) \odot \nabla_{\theta}J(\theta(t))$
• Scale individually the learning rates $\theta(t+1) = \theta(t) - \frac{\epsilon}{\delta + \sqrt{r(t+1)}} \odot \nabla_{\theta}J(\theta(t))$

Usually $$\rho \approx 0.9$$.
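The difference between accumulating from the beginning, $$r(t+1) = r(t) + g \odot g$$ (the Adagrad rule), and using an exponential moving average (RMSprop) shows up even with a constant gradient. The sketch below tracks the size of the update for a single parameter; the function names and the default hyperparameters are assumptions.

```python
import math

def adagrad_steps(grads, epsilon=0.1, delta=1e-8):
    """Sizes of the successive updates with a plain sum of squared gradients."""
    r, steps = 0.0, []
    for g in grads:
        r = r + g * g
        steps.append(epsilon / (delta + math.sqrt(r)) * g)
    return steps

def rmsprop_steps(grads, epsilon=0.1, rho=0.9, delta=1e-8):
    """Same, with an exponential moving average of the squared gradients."""
    r, steps = 0.0, []
    for g in grads:
        r = rho * r + (1.0 - rho) * g * g
        steps.append(epsilon / (delta + math.sqrt(r)) * g)
    return steps

constant_gradient = [1.0] * 100
ada = adagrad_steps(constant_gradient)
rms = rmsprop_steps(constant_gradient)
```

With a constant gradient, the plain accumulation shrinks the update like $$\epsilon / \sqrt{t}$$, while the moving average keeps it close to $$\epsilon$$.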

## Adaptive first order : Adam

Adam (Kingma & Ba, 2015)

• Like momentum and RMSprop, store running averages of past gradients : \begin{align*} m(t+1) &= \beta_1 m(t) + (1-\beta_1)\nabla_{\theta}J(\theta(t))\\ v(t+1) &= \beta_2 v(t) + (1-\beta_2)\nabla_{\theta}J(\theta(t))\odot \nabla_{\theta}J(\theta(t)) \end{align*} $$m(t)$$ and $$v(t)$$ are estimates of the first and second (uncentered) moments of $$\nabla_{\theta} J$$. They are bias corrected, $$\hat{m}(t) = \frac{m(t)}{1-\beta_1^t}$$, $$\hat{v}(t) = \frac{v(t)}{1-\beta_2^t}$$, and then :

$\theta(t+1) = \theta(t) - \frac{\epsilon}{\delta + \sqrt{\hat{v}(t+1)}} \hat{m}(t+1)$

and some others : Adadelta (Zeiler, 2012), … , YellowFin (Zhang & Mitliagkas, 2018).
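The update with running averages of the gradient and of its square, followed by bias correction, can be sketched in a few lines. The bias correction uses the standard factors $$1-\beta_1^t$$ and $$1-\beta_2^t$$; the function name, the quadratic test problem and the hyperparameter values are assumptions for the illustration.

```python
import math

def adam_minimize(grad_fn, theta, epsilon=0.1, beta1=0.9, beta2=0.999,
                  delta=1e-8, steps=500):
    m = [0.0] * len(theta)
    v = [0.0] * len(theta)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        # Bias correction: m and v start at 0 and are biased towards 0 early on
        m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
        v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
        theta = [th - epsilon / (delta + math.sqrt(vh)) * mh
                 for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

# Badly conditioned quadratic J = 0.5 (100 a^2 + b^2), minimum at (0, 0)
quadratic_grad = lambda th: [100.0 * th[0], th[1]]
first_step = adam_minimize(quadratic_grad, [1.0, 1.0], steps=1)
theta = adam_minimize(quadratic_grad, [1.0, 1.0])
```

Thanks to the bias correction, the very first update has a magnitude close to $$\epsilon$$ in every coordinate, whatever the scale of the gradient (here 100 versus 1).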

## First order : to sum up

(Goodfellow, Bengio, & Courville, 2016) : “There is currently no consensus […] no single best algorithm has emerged […] the most popular and actively in use include SGD, SGD with momentum, RMSprop, RMSprop with momentum, Adadelta and Adam.”

## A glimpse into second order methods

Rationale : $J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla_{\theta}J(\theta_0) + \frac{1}{2}(\theta - \theta_0)^T \nabla^2_{\theta} J(\theta_0) (\theta - \theta_0)$ with $$H = \nabla^2J$$ the Hessian matrix, a $$n_\theta \times n_\theta$$ matrix hosting the second derivatives of $$J$$.

The second derivatives are much noisier than the first derivatives (gradient), so a larger batch size is usually required to prevent instabilities.

• Conjugate gradient : using line search (or the Hessian) along $$\nabla_{\theta}J(\theta_k)$$
• Newton : never use it, except if you want to find critical points (Dauphin et al., 2014). It minimizes the quadratic approximation above with respect to $$\theta$$, which amounts to solving $$\nabla_\theta^2J(\theta_0) . (\theta - \theta_0) = -\nabla_\theta J(\theta_0)$$
• Quasi Newton : BFGS (approximating $$H^{-1}$$), L-BFGS, and saddle-free versions (Dauphin et al., 2014).

# Initialization and the distributions of activations and gradients

## The starting point is important : XOR

XOR is easy right ?

• Model : 2-4-1, Sigmoid activations (great!); 17 parameters
• Init : $$\mathcal{U}(-10, 10)$$, bias=0 (hum hum)
• Loss : Binary cross entropy (great!)
• Optimizer : SGD ($$\epsilon$$ = 0.1, momentum=0.99)

But it fails miserably (6/20 fails). Tmax=1000

## The starting point is important : XOR

XOR is easy right ?

• Model : 2-4-1, Sigmoid activations (great!); 17 parameters
• Init : $$\mathcal{N}(0, \frac{1}{\sqrt{fan_{in}}})$$, bias=0 (great!)
• Loss : Binary cross entropy (great!)
• Optimizer : SGD ($$\epsilon$$ = 0.1, momentum=0.99)

Now it is better (0/20 fails). Tmax=1000

## Pretraining

Historically, training deep FNN was known to be hard, i.e. bad generalization errors.

The starting point of a gradient descent has a dramatic impact :

• neural history compressors (Schmidhuber, 1992)
• competitive learning (Maclin & Shavlik, 1995)
• unsupervised pretraining based on Boltzmann machines (Hinton, 2006)
• unsupervised pretraining based on auto-encoders (Bengio, Lamblin, Popovici, & Larochelle, 2006)

Pretraining is no longer used (because of the ReLu variants, initialization schemes, ..)

Gradient descent converges faster if your data are normalized and decorrelated. Denote by $$x_i \in \mathbb{R}^d$$ your input data and $$\hat{x}_i$$ its normalized version.

• Min-max scaling $\forall i,j, \hat{x}_{i,j} = \frac{x_{i,j} - \min_k x_{k,j}}{\max_k x_{k,j} - \min_k x_{k,j} + \epsilon}$
• Z-score normalization (goal: $$\hat{\mu}_j = 0, \hat{\sigma}_j = 1$$) $\forall i,j, \hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j + \epsilon}$

• ZCA whitening (goal: $$\hat{\mu}_j = 0, \hat{\sigma}_j = 1$$, $$\frac{1}{n-1} \hat{X} \hat{X}^T = I$$), with $$X$$ the $$d \times n$$ matrix of centered inputs

$\hat{X} = W X, W = \sqrt{n-1} (XX^T)^{-1/2}$
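Z-score normalization and ZCA whitening can be sketched with NumPy as below. The data are synthetic and stored with one sample per column, and the whitening matrix is taken as the inverse square root of the empirical covariance of the centered data, computed from its eigendecomposition; all names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 500
# Correlated data with a non zero mean, one sample per column: X is (d, n)
A = rng.normal(size=(d, d))
X = A @ rng.normal(size=(d, n)) + rng.normal(size=(d, 1))

# Z-score normalization of every dimension
eps = 1e-8
X_z = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + eps)

# ZCA whitening: center, then multiply by the inverse square root of the covariance
Xc = X - X.mean(axis=1, keepdims=True)
cov = Xc @ Xc.T / (n - 1)
eigval, eigvec = np.linalg.eigh(cov)
W = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T
X_zca = W @ Xc
cov_zca = X_zca @ X_zca.T / (n - 1)
```

After whitening, `cov_zca` is the identity up to rounding: the dimensions have unit variance and are decorrelated, whereas the z-score only fixes the mean and variance of each dimension separately.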

## Z-score normalization / Standardizing the inputs

Remember our linear regression : $$y = 3x+2+\mathcal{U}(-0.1, 0.1)$$, L2 loss, 30 1D samples

## General strategy

• A good initialization should break the symmetry : with a constant initialization, all the units of a layer learn the same thing

• A good initialization should start optimization in a region of low capacity : linear neural network

• A good initialization scheme should preserve the distribution of the activations and gradients : exploding/vanishing gradients

## The exploding and vanishing gradient problem

The Fundamental Deep Learning Problem : first observed by (Hochreiter, 1991) for RNN, the gradient can either vanish or explode, especially in deep networks (RNN are very deep).

• Remember that the backpropagated gradient involves : $\frac{\partial J}{\partial x_l} = \frac{\partial J}{\partial y_L} W_L \text{diag}(f'(y_{L-1})) W_{L-1} \text{diag}(f'(y_{L-2})) \cdots \text{diag}(f'(y_l)) W_l$ with $$y_l = W_l x_l + b_l, x_l = f(y_{l-1})$$.

• We see a pattern like $$(W.f')^L$$ which can diverge or vanish for large $$L$$.

• especially, with the sigmoid : $$f' \leq \frac{1}{4}$$. With a ReLu, the positive part has $$f' = 1$$.

• We must ensure a good flow of gradient :
• using appropriate transfer functions ReLu, PReLu, etc.
• using architectural elements :
• ResNet (CNN) : shortcut connections
• LSTM (RNN) : constant error carousel
• We can prevent exploding gradient by clipping (Pascanu, Mikolov, & Bengio, 2013)

Figure : exploding gradient and the effect of clipping. Experiment with 50 layers, single unit, sigmoid transfer function
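The product of local derivatives can be computed explicitly for a chain of single-unit sigmoid layers. The sketch below assumes layers without bias and identical weights, which is a simplification of the experiment mentioned above, and adds a scalar version of clipping by norm; all names are hypothetical.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def input_gradient(x, weights):
    """d output / d input for a chain of single-unit layers x <- sigmoid(w x)."""
    grad = 1.0
    for w in weights:
        y = w * x
        x = sigmoid(y)
        grad *= w * x * (1.0 - x)      # local derivative: w * sigmoid'(y)
    return grad

def clip(grad, threshold):
    """Rescale the gradient when its magnitude exceeds the threshold."""
    norm = abs(grad)
    return grad * threshold / norm if norm > threshold else grad

# 50 layers with unit weights: every factor is at most 1/4
g_vanishing = input_gradient(0.5, [1.0] * 50)
```

Since the derivative of the sigmoid never exceeds $$\frac{1}{4}$$, unit weights over 50 layers give a gradient below $$4^{-50}$$. Clipping only addresses the opposite problem: it bounds the magnitude of a gradient that became too large while preserving its direction.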

## LeCun Initialization

In (LeCun et al., 1998), Y. LeCun provided some guidelines on the design:

Aim: initialize the weights/biases to keep $$f$$ in its linear part through multiple layers:

• Use a symmetric transfer function $$f(x) = 1.7159 \tanh(\frac{2}{3}x)$$, $$\rightarrow$$ $$f(1) = 1$$, $$f(-1) = -1$$
• set the biases to $$0$$

• initialize randomly and independently from $$\mathcal{N}(\mu=0, \sigma^2=\frac{1}{fan_{in}})$$.

If $$x \in \mathbb{R}^n$$ is $$\mathcal{N}(0, \Sigma = I)$$, $$w \in \mathbb{R}^n$$ is $$\mathcal{N}(\mu=0, \Sigma=\frac{1}{n}I)$$, independent of $$x$$, then :

\begin{align*} E[w^T x + b] &= E[w^T x] = \sum_i E[w_i x_i] = \sum_i E[w_i] E[x_i] = 0\\ var[w^T x + b] &= var[w^T x] \\ & = \sum_i \sigma^2_{w_i}\sigma^2_{x_i} + \sigma^2_{w_i}\mu^2_{x_i} + \mu^2_{w_i}\sigma^2_{x_i}\\ &= \sum_i \sigma^2_{w_i}\sigma^2_{x_i} = \frac{1}{n}\sum \sigma^2_{x_i} = 1 \end{align*}

## Xavier (Glorot) Initialization

Idea: we must preserve the same distribution along the forward and backward pass (Glorot & Bengio, 2010).

This prevents:

• the saturation of saturating transfer functions (e.g. tanh, sigmoid)

Glorot (Xavier) initialization scheme for a feedforward network $$f(W_{n}..f(W_1f(W_0 x+b_0)+b_1)...+b_n)$$ with layer sizes $$n_i$$:

• the input dimensions should be centered, normalized, uncorrelated
• symmetric activation function, with $$f'(0) = 1$$ (e.g. $$f(x)=\tanh(x), f(x)=4(\frac{1}{1+e^{-x}}-0.5)$$)

Assuming the linear regime $$f'() = 1$$ of the network : \begin{align*} \mbox{Forward propagation variance constraint :} \forall i, fan_{in_i} \sigma^2_{W_i} &= 1\\ \mbox{Backward propagation variance constraint :} \forall i, fan_{out_i} \sigma^2_{W_i} &= 1 \end{align*} Compromise : $$\forall i, \sigma^2_{W_i} = \frac{2}{fan_{in} + fan_{out}}$$
- Glorot (Xavier) uniform : $$\mathcal{U}(-\frac{\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}}, \frac{\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}})$$, b=0
- Glorot (Xavier) normal : $$\mathcal{N}(0, \frac{\sqrt{2}}{\sqrt{fan_{in}+fan_{out}}})$$, b=0

## He Initialization

Idea: we must preserve the same distribution along the forward and backward pass for rectifiers (He et al., 2015). For a feedforward network $$f(W_{n}..f(W_1f(W_0 x+b_0)+b_1)...+b_n)$$ with layer sizes $$n_i$$:

• the input dimensions should be centered, normalized, uncorrelated
• ReLu activation $$f(x) = \max(x, 0)$$
• weights initialized with symmetric distribution, zero mean $$\mu_{w_l} = 0$$, independently. Bias set to $$b=0$$
• the components of $$x_l$$ are assumed i.i.d.; Note they are not centered (because of ReLu)

\begin{align*} \mathbf{y}_l &= \begin{bmatrix} \vdots \\ y_l \\ \vdots\end{bmatrix} = W_l \mathbf{x}_l + \mathbf{b} = W_l f(\mathbf{y}_{l-1}) + \mathbf{b}\\ \mu_{y_l} &= E[\sum_i w_{l,i}x_{l,i}] = \mu_{w_l}\sum_i \mu_{x_{l,i}} = 0\\ \sigma^2_{y_l} &= n_l\sigma^2_{w_l x_l} = n_l \mu_{w_l^2} \mu_{x_l^2} = n_l \sigma^2_{w_l} \mu_{x_l^2} \quad (\mbox{because } \mu_{w_l} = 0)\\ \mu_{x_l^2} &= \int_{y_{l-1}} \max(0, y_{l-1})^2dp_{y_{l-1}} = \frac{1}{2} \mu_{y_{l-1}^2} =\frac{1}{2} \sigma^2_{y_{l-1}} \quad (\mu_{y_{l-1}}=0 \mbox{ and } y_{l-1} \mbox{ has a symmetric distribution}) \end{align*}

So, $$\sigma^2_{y_l} = \frac{1}{2}n_l \sigma^2_{w_l} \sigma^2_{y_{l-1}}$$. To preserve the variance, we must guarantee $$\frac{1}{2} n_l \sigma^2_{w_l} = 1$$.

We used : if $$X$$ and $$Y$$ are independent : $$\sigma^2_{X.Y} = \mu_{X^2}\mu_{Y^2} - \mu_X^2 \mu_Y^2$$

## He initialization

\begin{align*} \mbox{Forward propagation variance constraint :} \forall i, \frac{1}{2}fan_{in_i} \sigma^2_{W_i} &= 1\\ \mbox{Backward propagation variance constraint :} \forall i, \frac{1}{2}fan_{out_i} \sigma^2_{W_i} &= 1 \end{align*}

He et al. suggest using either one or the other, e.g. $$\sigma^2_{W_i} = \frac{2}{fan_{in}}$$
- He uniform : $$\mathcal{U}(-\frac{\sqrt{6}}{\sqrt{fan_{in}}}, \frac{\sqrt{6}}{\sqrt{fan_{in}}})$$, b=0
- He normal : $$\mathcal{N}(0, \frac{\sqrt{2}}{\sqrt{fan_{in}}})$$, b=0

Note: for PReLu with negative slope $$a$$ : $$\frac{1}{2} (1 + a^2) fan_{in} \sigma^2_{W_i} = 1$$
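The forward variance constraint can be observed numerically by pushing random inputs through a stack of ReLu layers and comparing the second moment of the pre-activations at the first and last layers. The sketch below assumes square layers, Gaussian weights and zero biases; the function name and the sizes are arbitrary choices for the illustration.

```python
import numpy as np

def second_moment_ratio(weight_std_fn, width=1024, depth=10, batch=256, seed=0):
    """E[y^2] at the last layer divided by E[y^2] at the first layer."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, width))          # centered, normalized inputs
    y = x @ (rng.normal(size=(width, width)) * weight_std_fn(width))
    first = np.mean(y ** 2)
    for _ in range(depth - 1):
        x = np.maximum(y, 0.0)                   # ReLu
        y = x @ (rng.normal(size=(width, width)) * weight_std_fn(width))
    return np.mean(y ** 2) / first

# sigma_w^2 = 2 / fan_in (He) versus sigma_w^2 = 1 / fan_in
he_ratio = second_moment_ratio(lambda fan_in: np.sqrt(2.0 / fan_in))
lecun_ratio = second_moment_ratio(lambda fan_in: np.sqrt(1.0 / fan_in))
```

With $$\sigma^2_w = \frac{2}{fan_{in}}$$ the ratio stays of order 1, while with $$\sigma^2_w = \frac{1}{fan_{in}}$$ every ReLu layer roughly halves the second moment, which then decays geometrically with the depth.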

## Weight initialization in practice (PyTorch)

By default, the parameters are initialized randomly. e.g. in torch.nn.Linear :

class Linear(torch.nn.Module):
    def __init__(self):
        ...
        self.reset_parameters()

    def reset_parameters(self) -> None:
        torch.nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        fan_in, _ = torch.nn.init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1/math.sqrt(fan_in)
        torch.nn.init.uniform_(self.bias, -bound, bound)

Oh, but that’s not what we should use for ReLu ?!?! Indeed you are right (this is discussed in a PyTorch issue). This is to avoid breaking with the way Lua Torch was initializing.


import torch
import torch.nn as nn
import torch.nn.init as init

def linear_relu(dim_in, dim_out):
    return [nn.Linear(dim_in, dim_out),
            nn.ReLU(inplace=True)]

class MyModel(torch.nn.Module):
    def __init__(self, input_size, num_classes):
        super(MyModel, self).__init__()
        self.classifier = nn.Sequential(
            *linear_relu(input_size, 256),
            *linear_relu(256, 256),
            nn.Linear(256, num_classes)
        )
        self.init()

    def init(self):
        def finit(m):
            if type(m) == nn.Linear:
                # He (Kaiming) uniform for the weights, zero biases
                init.kaiming_uniform_(m.weight,
                                      a=0,
                                      mode='fan_in',
                                      nonlinearity='relu')
                init.zeros_(m.bias)
        self.apply(finit)

## Internal covariate shift

(Ioffe & Szegedy, 2015) observed the change in distribution of network activations due to the change in network parameters during training.

Experiment : 3 fully connected layers (100 units), sigmoid, softmax output, MNIST dataset. Left : test accuracy. Right : distribution of the activations of the last hidden layer during training ({15, 50, 85}th percentiles).

## Batch Normalization

Idea : standardize the activations of every layer to keep the same distributions during training (Ioffe & Szegedy, 2015)

• The gradient must be aware of this normalization, otherwise the parameters may blow up (see (Ioffe & Szegedy, 2015)) $$\rightarrow$$ we need a differentiable normalization layer

• introduces a differentiable Batch Normalization layer : $z = g(W x + b) \rightarrow z = g(BN(W x))$

BN operates element-wise, i.e. independently on each component $$i$$ of its input : \begin{align*} y_i &= BN_{\gamma,\beta} (x_i) = \gamma \hat{x}_i + \beta\\ \hat{x}_i &= \frac{x_i - \mu_{\mathcal{B}, i} }{\sqrt{\sigma^2_{\mathcal{B}, i} + \epsilon}} \end{align*} with $$\mu_{\mathcal{B},i}$$ and $$\sigma^2_{\mathcal{B},i}$$ the mean and variance of component $$i$$ computed on the mini batch during training, and $$\gamma$$, $$\beta$$ learned parameters.

Learning faster, with better generalization.
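
As an illustration of the formula, here is a minimal sketch of the training-time forward pass in plain Python. It is not the lecture's code: the function name `batch_norm`, the toy minibatch and the default `eps` are assumptions. Each feature $$i$$ is standardized with its own minibatch mean and (biased) variance, then scaled by $$\gamma$$ and shifted by $$\beta$$.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Apply BN feature-wise to a minibatch given as a list of feature vectors."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for i in range(d):
        column = [x[i] for x in batch]
        mu = sum(column) / n
        var = sum((v - mu) ** 2 for v in column) / n  # biased variance over the minibatch
        for k in range(n):
            x_hat = (column[k] - mu) / math.sqrt(var + eps)
            out[k][i] = gamma[i] * x_hat + beta[i]
    return out

# Two features with very different scales
batch = [[1.0, 200.0], [2.0, 180.0], [3.0, 220.0], [6.0, 240.0]]
y = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
for i in range(2):
    col = [row[i] for row in y]
    mean = sum(col) / len(col)
    var = sum((v - mean) ** 2 for v in col) / len(col)
    print(f"feature {i}: mean={mean:.4f}, var={var:.4f}")
```

With $$\gamma = 1$$ and $$\beta = 0$$, both features end up with a mean close to 0 and a variance close to 1, whatever their original scale.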

## Batch normalization

During training

• put BN layers everywhere along the network, after the linear layers, before the ReLUs
• evaluate the statistics $$\mu, \sigma$$ over the minibatches
• update an exponential moving average of the mean $$\mu_{\mathcal{B}}$$ and variance $$\sigma^2_{\mathcal{B}}$$

During inference (test) :

• use the running average as the statistics to standardize : this is now just a fixed affine transform.
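
The two modes can be sketched for a single scalar feature as follows. This is an illustrative toy and not PyTorch's implementation: the class name `BatchNorm1Feature`, the momentum of 0.1 and the use of the biased variance for the running estimate are assumptions of the sketch.

```python
import math

class BatchNorm1Feature:
    """BN for one scalar feature, with running statistics (toy sketch)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, xs):
        if self.training:
            # minibatch statistics, also used to update the moving averages
            mu = sum(xs) / len(xs)
            var = sum((x - mu) ** 2 for x in xs) / len(xs)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # inference: frozen statistics, independent of the current batch
            mu, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (x - mu) + self.beta for x in xs]

    def as_affine(self):
        """Inference-time BN written as y = a * x + b."""
        a = self.gamma / math.sqrt(self.running_var + self.eps)
        return a, self.beta - a * self.running_mean

bn = BatchNorm1Feature()
for step in range(200):
    bn([4.0, 6.0, 8.0, 10.0, 12.0])  # minibatch with mean 8 and variance 8
bn.training = False
a, b = bn.as_affine()
print(bn([9.0])[0], a * 9.0 + b)
```

In inference mode the output for a sample no longer depends on the other samples of the batch, and the layer reduces to the affine map returned by `as_affine`.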

Warning : do not forget to switch to evaluation (test) mode :


    model = MyModel()  # a pytorch nn.Module
    # For training
    model.train()
    # For testing
    model.eval()

Some recent works challenge the idea that batch normalization helps by reducing covariate shift (Santurkar, Tsipras, Ilyas, & Madry, 2018), (Bjorck, Gomes, Selman, & Weinberger, 2018). They rather suggest that the loss becomes smoother, which allows larger learning rates and brings better generalization and robustness to hyperparameters.

# Regularization

## L2 penalty

Add an L2 penalty on the weights, with $$\alpha > 0$$

\begin{align*} J(\theta) &= L(\theta) + \frac{\alpha}{2} \|\theta\|^2_2 = L(\theta) + \frac{\alpha}{2}\theta^T \theta\\ \nabla_\theta J &= \nabla_\theta L + \alpha \theta\\ \theta &\leftarrow \theta - \epsilon \nabla_\theta J = (1 - \alpha \epsilon) \theta - \epsilon \nabla_\theta L \end{align*} Called L2 regularization, Tikhonov regularization, weight decay
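
The name weight decay comes from the last line: a gradient step on $$J$$ is the same as first shrinking $$\theta$$ by the factor $$(1 - \alpha \epsilon)$$ and then taking a gradient step on $$L$$. A small sketch in plain Python checks this on a toy loss $$L(\theta) = \frac{1}{2}\|\theta - t\|^2$$, which is an assumption chosen for the example, as are the function names.

```python
def grad_L(theta, target):
    # gradient of the toy loss L(theta) = 1/2 * ||theta - target||^2
    return [t - y for t, y in zip(theta, target)]

def step_on_J(theta, target, alpha, eps):
    """theta <- theta - eps * (grad L + alpha * theta)"""
    g = grad_L(theta, target)
    return [t - eps * (gi + alpha * t) for t, gi in zip(theta, g)]

def step_weight_decay(theta, target, alpha, eps):
    """theta <- (1 - alpha * eps) * theta - eps * grad L"""
    g = grad_L(theta, target)
    return [(1 - alpha * eps) * t - eps * gi for t, gi in zip(theta, g)]

target = [3.0, -1.5]
alpha, eps = 2.0, 0.1
theta_a = theta_b = [0.5, 0.5]
for _ in range(200):
    theta_a = step_on_J(theta_a, target, alpha, eps)
    theta_b = step_weight_decay(theta_b, target, alpha, eps)

print(theta_a)
print(theta_b)
print([t / (1 + alpha) for t in target])  # minimizer of J for this toy loss
```

Both updates give the same iterates up to rounding errors, and they converge to $$t / (1 + \alpha)$$, i.e. the unregularized optimum shrunk toward zero.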

Example : RBF, 1 kernel per sample, $$N=30$$, noisy inputs. Figure panels : $$\alpha=0$$, $$\alpha=2$$, $$\theta^\star$$.

See chap 7 of (Goodfellow et al., 2016) for a geometrical interpretation

Intuition : for a linear layer, the gradient of the output with respect to the input equals the weights. Small weights $$\rightarrow$$ small gradient $$\rightarrow$$ smooth function.

## L2 penalty

In theory, regularizing the bias will cause underfitting

Example

\begin{align*} J(w, b) &= \frac{1}{N} \sum_{i=1}^N \| y_i - b - w^T x_i\|_2^2\\ \nabla_b J(w,b) = 0 &\implies b = (\frac{1}{N} \sum_i y_i) - w^T (\frac{1}{N} \sum_i x_i) \end{align*}

If your data are centered (as they should be), the optimal bias is the mean of the targets; penalizing the bias pulls it away from this value.
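
A small numerical illustration, as a sketch under stated assumptions: the inputs are scalar, $$w$$ is kept fixed, and a hypothetical penalty $$\frac{\alpha}{2} b^2$$ is added on the bias. Setting the derivative with respect to $$b$$ to zero then gives $$b = \frac{1}{1 + \alpha/2} \frac{1}{N}\sum_i (y_i - w x_i)$$, which is what the function `optimal_bias` (a name made up for the example) computes.

```python
def optimal_bias(xs, ys, w, alpha=0.0):
    """Minimizer over b of 1/N * sum_i (y_i - b - w * x_i)^2 + alpha/2 * b^2, for a fixed w."""
    n = len(xs)
    mean_residual = sum(y - w * x for x, y in zip(xs, ys)) / n
    return mean_residual / (1.0 + alpha / 2.0)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]   # centered inputs
ys = [3.0 + 0.5 * x for x in xs]   # targets generated with b = 3 and w = 0.5

print(sum(ys) / len(ys))                       # mean of the targets
print(optimal_bias(xs, ys, w=0.5))             # unpenalized optimal bias
print(optimal_bias(xs, ys, w=0.5, alpha=2.0))  # bias shrunk by the penalty
```

Without penalty the optimal bias equals the mean of the targets. With the penalty it is shrunk toward zero, so every prediction is shifted by the same offset: the model underfits.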

## L1 penalty

Add an L1 penalty to the weights : \begin{align*} J(\theta) &= L(\theta) + \alpha \|\theta\|_1 = L(\theta) + \alpha \sum_i |\theta_i|\\ \nabla_\theta J &= \nabla_\theta L + \alpha \mbox{sign}(\theta) \end{align*}

Example : RBF, 1 kernel per sample, $$N=30$$, noisy inputs. Figure panels : $$\alpha=0$$, $$\alpha=0.003$$, $$\theta^\star$$.

See chap 7 of (Goodfellow et al., 2016) for a mathematical explanation in a specific case. The induced sparsity is used for feature selection, e.g. with LASSO (an embedded method, among the filter/wrapper/embedded families).
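
The sparsity effect can be seen in a simple special case, assumed here for the illustration: a separable quadratic loss $$L(\theta) = \frac{1}{2}\|\theta - \theta^\star\|^2$$. The L2-penalized optimum is then $$\theta^\star / (1 + \alpha)$$, while the L1-penalized optimum is the soft thresholding of $$\theta^\star$$. The function names below are made up for this sketch.

```python
import math

def l2_solution(theta_star, alpha):
    """argmin of 1/2 * (theta - t)^2 + alpha/2 * theta^2, coordinate-wise."""
    return [t / (1.0 + alpha) for t in theta_star]

def l1_solution(theta_star, alpha):
    """argmin of 1/2 * (theta - t)^2 + alpha * |theta|, coordinate-wise (soft thresholding)."""
    return [math.copysign(max(abs(t) - alpha, 0.0), t) for t in theta_star]

theta_star = [3.0, 0.2, -1.0, -0.05]
print("L2 :", l2_solution(theta_star, alpha=0.5))
print("L1 :", l1_solution(theta_star, alpha=0.5))
```

The L2 penalty rescales every coordinate without zeroing any of them, whereas the L1 penalty sets to exactly zero every coordinate whose magnitude is below $$\alpha$$.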

## Dropout

Introduced in (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014):

Idea 1 : preventing co-adaptation. A pattern is robust by itself not because of others doing part of the job.
Idea 2 : average of all the sub-networks (ensemble learning)

How :

• for every minibatch, zero the hidden and input activations with probability $$p$$ ($$p=0.5$$ for hidden, $$p=0.2$$ for input). At test time, multiply every activation by the keep probability $$1-p$$

• “Inverted” dropout : divide the kept activations by $$1-p$$ at train time. At test time, just do a normal forward pass.
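
A minimal sketch of inverted dropout in plain Python, with $$p$$ denoting the probability of zeroing an activation. The function name `dropout` and its signature are choices made for this example, not an existing API. Kept activations are rescaled by $$1/(1-p)$$ so that the expected value of each activation is unchanged.

```python
import random

def dropout(activations, p, training, rng=random):
    """Inverted dropout; p is the probability of zeroing an activation (p < 1)."""
    if not training or p == 0.0:
        return list(activations)  # test time: plain forward pass
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
x = [1.0] * 100_000
y = dropout(x, p=0.2, training=True, rng=rng)
print(sum(y) / len(y))        # close to 1.0: the expectation is preserved
print(y.count(0.0) / len(y))  # close to p = 0.2
print(dropout(x[:3], p=0.2, training=False))
```

Because the rescaling is done at train time, nothing has to change at test time.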

## Dropout

• Usually, after all fully connected layers (p=0.5) and input layer
• less usual on convolutional layers (because these are already regularized)

Can be interpreted as if training/averaging all the possible subnetworks.

## L1/L2/Dropout in pytorch

L1/L2

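    class MyModel(nn.Module):
        def __init__(self, l2_reg):
            super().__init__()
            self.lin1 = nn.Linear(784, 256)
            self.lin2 = nn.Linear(256, 256)
            self.l2_reg = l2_reg

        def penalty(self):
            # alpha/2 * ||theta||_2^2 on the weights (for L1, use .abs().sum() instead)
            return 0.5 * self.l2_reg * (self.lin1.weight.pow(2).sum()
                                        + self.lin2.weight.pow(2).sum())

    def train():
        ...
        loss.backward()
        model.penalty().backward()  # gradients accumulate with those of the loss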

Dropout


    import torch.nn as nn

    def dropout_linear_relu(dim_in, dim_out, p_zeroed):
        return [nn.Dropout(p_zeroed),
                nn.Linear(dim_in, dim_out),
                nn.ReLU(inplace=True)]

    class MyModel(nn.Module):
        def __init__(self, num_classes):
            super().__init__()
            self.classifier = nn.Sequential(
                *dropout_linear_relu(784, 128, 0.5),
                *dropout_linear_relu(128, 256, 0.5),
                nn.Linear(256, num_classes)
            )

## Early stopping

Split your data in three sets :

• training set : for training ..
• validation set: for choosing the hyperparameters (learning rates, number of layers, layer size, momentum, …)
• test set : for estimating the generalization error

Everything can be placed in a cross validation loop.

Early stopping is about keeping the model with the lowest validation loss.


    # Training over an epoch
    X, y = X.to(device), y.to(device)
    ...
    optimizer.step()
    # Model checkpoint : keep the parameters with the lowest validation loss so far
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), filepath)
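
The bookkeeping can be sketched independently of any framework. The function below is a hypothetical helper, not part of PyTorch, and the `patience` parameter (number of epochs without improvement tolerated before stopping) is an assumption added for the illustration; the lecture only requires keeping the model with the lowest validation loss.

```python
def early_stopping(val_losses, patience):
    """Return (best_epoch, best_loss, stop_epoch) for a sequence of validation losses."""
    best_epoch, best_loss = None, float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # this is where the model checkpoint would be saved
            best_epoch, best_loss = epoch, loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, best_loss, epoch
    return best_epoch, best_loss, len(val_losses) - 1

losses = [1.00, 0.80, 0.70, 0.75, 0.72, 0.71, 0.90, 0.60]
print(early_stopping(losses, patience=3))   # (2, 0.7, 5)
print(early_stopping(losses, patience=10))  # (7, 0.6, 7)
```

With a small patience the training stops at epoch 5 and keeps the model of epoch 2, missing the later improvement at epoch 7: the patience trades computation time against the risk of stopping too early.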

## Data, data, we need data !

The best regularizer you may find is data. The more you have, the better you learn.

• you can use pretrained models on some tasks as an initialization for learning your task (but may fail due to domain shift)

• you can use unlabeled data for pretraining your networks (as done around 2006) with auto-encoders / RBM : unsupervised/semi-supervised learning

• you can apply random transformations to your data : dataset augmentation
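
As an illustration of dataset augmentation, here is a minimal sketch in plain Python on an image stored as a list of rows. The function name `random_augment` and the two transformations (horizontal flip, horizontal shift with zero padding) are choices made for the example. In practice, the transformations must be picked so that they leave the label unchanged.

```python
import random

def random_augment(image, max_shift, rng):
    """Random horizontal flip followed by a random horizontal shift (zero padded)."""
    width = len(image[0])
    flip = rng.random() < 0.5
    shift = rng.randint(-max_shift, max_shift)
    out = []
    for row in image:
        row = row[::-1] if flip else list(row)
        if shift > 0:    # shift to the right
            row = [0] * shift + row[:width - shift]
        elif shift < 0:  # shift to the left
            row = row[-shift:] + [0] * (-shift)
        out.append(row)
    return out

rng = random.Random(0)
image = [[1, 2, 3, 4],
         [5, 6, 7, 8]]
for _ in range(3):
    # a new random variant of the same sample each time it is drawn
    print(random_augment(image, max_shift=1, rng=rng))
```

The transformation is drawn again every time a sample is presented, so the network rarely sees exactly the same input twice.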

## Label smoothing

Introduced in (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2015) in the context of Convolutional Neural Networks.

Idea : prevent the network from being overconfident in its predictions on the training set.

Recipe : in a $$k$$-class problem, instead of using hard targets $$\in \{0, 1\}$$, use soft targets $$\in \{\frac{\alpha}{k}, 1-\alpha\frac{k-1}{k}\}$$ (weighted average between the hard targets and uniform target). $$\alpha \approx 0.1$$.
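
The recipe can be sketched as follows in plain Python. The function names and the two example prediction vectors are assumptions made for the illustration.

```python
import math

def smooth_targets(label, k, alpha=0.1):
    """(1 - alpha) * one_hot(label) + alpha * uniform distribution over the k classes."""
    return [(1.0 - alpha) * (1.0 if c == label else 0.0) + alpha / k
            for c in range(k)]

def cross_entropy(targets, probs):
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

targets = smooth_targets(label=2, k=4, alpha=0.1)
print(targets)  # alpha/k for the wrong classes, 1 - alpha*(k-1)/k for the true class

confident = [0.001, 0.001, 0.997, 0.001]
moderate = [0.025, 0.025, 0.925, 0.025]
print(cross_entropy(targets, confident))
print(cross_entropy(targets, moderate))
```

With soft targets, the cross entropy is minimized when the predicted probabilities equal the targets, so the very confident prediction gets a larger loss than the moderate one.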

## Bibliography

Rather check the full online document references.pdf

Arthur, D., & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms.

Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy Layer-Wise Training of Deep Networks. In (p. 8).

Bjorck, N., Gomes, C. P., Selman, B., & Weinberger, K. Q. (2018). Understanding Batch Normalization. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018) (p. 12).

Broomhead, D., & Lowe, D. (1988). Multivariable Functional Interpolation and Adaptive Networks. Complex Systems, 2, 321–355.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., & LeCun, Y. (2015). The Loss Surfaces of Multilayer Networks. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (p. 13).

Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2016). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv:1511.07289 [Cs]. Retrieved from http://arxiv.org/abs/1511.07289

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314. https://doi.org/10.1007/BF02551274

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, 27, 2933–2941. Retrieved from https://papers.nips.cc/paper/2014/hash/17e23e50bedc63b4095e3d8204ce063b-Abstract.html

Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 12, 2121–2159.

Fritzke, B. (1994). A growing neural gas network learns topologies. In Proceedings of the 7th International Conference on Neural Information Processing Systems (pp. 625–632). Cambridge, MA, USA: MIT Press.

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010 (p. 8).

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.

Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., … Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471–476. https://doi.org/10.1038/nature20101

Griewank, A. (2012). Who Invented the Reverse Mode of Differentiation? Documenta Mathematica, 12.

Griewank, A., & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd edition). Philadelphia, PA: Society for Industrial and Applied Mathematics.

Hastad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the eighteenth annual ACM symposium on Theory of computing (pp. 6–20). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/12130.12132

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852 [Cs]. Retrieved from http://arxiv.org/abs/1502.01852

Hinton, G. E. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504–507. https://doi.org/10.1126/science.1127647

Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251–257. https://doi.org/10.1016/0893-6080(91)90009-T

Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (pp. 448–456). PMLR. Retrieved from http://proceedings.mlr.press/v37/ioffe15.html

Jaderberg, M., Simonyan, K., & Zisserman, A. (2015). Spatial Transformer Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems (p. 9).

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv:1609.04836 [Cs, Math]. Retrieved from http://arxiv.org/abs/1609.04836

Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. In arXiv:1412.6980 [cs]. Retrieved from http://arxiv.org/abs/1412.6980

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386

LeCun, Y. A., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). Efficient BackProp. In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade: Second Edition (pp. 9–48). Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-35289-8_3

Li, Y., Wei, C., & Ma, T. (2019). Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. In NIPS 2019 (p. 12).

Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. In arXiv:1608.03983 [cs, math]. Retrieved from http://arxiv.org/abs/1608.03983

Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning (ICML13) (p. 6).

Maclin, R., & Shavlik, J. W. (1995). Combining the predictions of multiple classifiers: Using competitive learning to initialize neural networks. In Proceedings of the 14th international joint conference on Artificial intelligence - Volume 1 (pp. 524–530). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Montufar, G. F., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the Number of Linear Regions of Deep Neural Networks. In Advances in Neural Information Processing Systems (Vol. 27, pp. 2924–2932). Retrieved from https://papers.nips.cc/paper/2014/hash/109d2dd3608f669ca17920c511c2a41e-Abstract.html

Müller, R., Kornblith, S., & Hinton, G. (2020). When Does Label Smoothing Help? arXiv:1906.02629 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1906.02629

Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning (pp. 807–814). Madison, WI, USA: Omnipress.

Nielsen, M. A. (2015). Neural Networks and Deep Learning. Retrieved from http://neuralnetworksanddeeplearning.com

Olah, C. (2015). Calculus on computational graphs: Backpropagation.

Park, J., & Sandberg, I. W. (1991). Universal Approximation Using Radial-Basis-Function Networks. Neural Computation, 3(2), 246–257. https://doi.org/10.1162/neco.1991.3.2.246

Pascanu, R., Dauphin, Y. N., Ganguli, S., & Bengio, Y. (2014). On the saddle point problem for non-convex optimization. arXiv:1405.4604 [Cs]. Retrieved from http://arxiv.org/abs/1405.4604

Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (p. 9).

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., … Lerer, A. (2017). Automatic differentiation in PyTorch. In Proceedings of the 31st International Conference on Neural Information Processing Systems (p. 4).

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0

Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How Does Batch Normalization Help Optimization? In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018) (p. 11).

Schmidhuber, J. (1992). Learning Complex, Extended Sequences Using the Principle of History Compression. Neural Computation, 4(2), 234–242. https://doi.org/10.1162/neco.1992.4.2.234

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117. https://doi.org/10.1016/j.neunet.2014.09.003

Schwenker, F., Kestler, H. A., & Palm, G. (2001). Three learning phases for radial-basis-function networks. Neural Networks, 14(4-5), 439–458. Retrieved from http://dblp.uni-trier.de/db/journals/nn/nn14.html#SchwenkerKP01

Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv:1803.09820 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1803.09820

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(56), 1929–1958. Retrieved from http://jmlr.org/papers/v15/srivastava14a.html

Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In ICML (p. 14).

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. arXiv:1512.00567 [Cs]. Retrieved from http://arxiv.org/abs/1512.00567

Werbos, P. (1981). Application of advances in nonlinear sensitivity analysis. In Proc. Of the 10th IFIP conference (pp. 762–770).