An introduction to deep learning
Jeremy Fix
February 15, 2024
Slides made with slidemaker

An RBF network builds its features from \(N_a\) Gaussian kernels :
\[ \phi(x) = \begin{pmatrix} 1 \\ \exp{\frac{-||x-\mu_0||^2}{2\sigma_0^2}} \\ \vdots\\ \exp{\frac{-||x-\mu_{N_a-1}||^2}{2\sigma_{N_a-1}^2}} \end{pmatrix} \]
Regression
Binary classification
Multi classification
We know how to learn the weights \(w\) : minibatch gradient descent (or a variant thereof)
What about the centers and variances ? (Schwenker, Kestler, & Palm, 2001)
place them uniformly, randomly, by vector quantization (K-means++(Arthur & Vassilvitskii, 2007), GNG (Fritzke, 1994))
two phases : fix the centers/variances, fit the weights
three phases : fix the centers/variances, fit the weights, fit everything (\(\nabla_{\mu} L, \nabla_{\sigma} L, \nabla_w L\))
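A minimal PyTorch/scikit-learn sketch of this recipe (the class name, the log-width parametrization and the single regression output are illustrative choices) :

import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class RBF(nn.Module):
    def __init__(self, X, n_centers):
        super().__init__()
        # Place the centers by vector quantization (K-means++)
        centers = KMeans(n_clusters=n_centers).fit(X).cluster_centers_
        self.mu = nn.Parameter(torch.tensor(centers).float())   # (N_a, d)
        self.log_sigma = nn.Parameter(torch.zeros(n_centers))   # one width per kernel
        self.w = nn.Linear(n_centers, 1)                        # output weights

    def forward(self, x):
        # Squared distances to every center, shape (batch, N_a)
        d2 = ((x[:, None, :] - self.mu[None, :, :]) ** 2).sum(-1)
        phi = torch.exp(-d2 / (2 * torch.exp(self.log_sigma) ** 2))
        return self.w(phi)

# Two phases   : optimize model.w.parameters() only (centers/widths frozen)
# Three phases : then optimize model.parameters() to fine-tune everything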
Theorem : Universal approximation (Park & Sandberg, 1991)
Denote \(\mathcal{S}\) the family of functions based on RBF in \(\mathbb{R}^d\): \[\mathcal{S} = \{g : \mathbb{R}^d \to \mathbb{R},\ g(x) = \sum_i w_i K(\frac{x-\mu_i}{\sigma}),\ w \in \mathbb{R}^N\}\] with \(K : \mathbb{R}^d \rightarrow \mathbb{R}\) continuous almost everywhere and \(\int_{\mathbb{R}^d}K(x)dx \neq 0\),
Then \(\mathcal{S}\) is dense in \(L^p(\mathbb{R}^d)\) for every \(p \in [1, \infty)\)
In particular, it applies to the gaussian kernel introduced before.
Vocabulary
ReLUs are more favorable for the gradient flow than the saturating functions (more on that later when discussing computational graphs and gradient computation).
ReLU (Nair & Hinton, 2010)
\[\begin{equation*} \scriptstyle f(x) = \begin{cases} x & \mbox{ if } x \geq 0\\ 0 & \mbox { if } x < 0 \end{cases} \end{equation*}\]
Leaky ReLU
(Maas, Hannun, & Ng, 2013)
Parametric ReLu
(He, Zhang, Ren, & Sun, 2015)
\[\begin{equation*} \scriptstyle f(x) = \begin{cases} x & \mbox{ if } x \geq 0\\ \alpha x & \mbox { if } x < 0 \end{cases} \end{equation*}\]
Exponential Linear Unit
(Clevert, Unterthiner, & Hochreiter, 2016)
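For reference, the ELU is defined as (with \(\alpha > 0\)) :

\[\begin{equation*} \scriptstyle f(x) = \begin{cases} x & \mbox{ if } x \geq 0\\ \alpha (e^x - 1) & \mbox { if } x < 0 \end{cases} \end{equation*}\]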
Exactly as when we discussed the RBFs, this is task dependent.
Regression
Binary classification
Multi classification
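A reminder of the usual output layer / loss pairings, as a hedged PyTorch sketch (the layer sizes are placeholders) :

import torch.nn as nn

# Regression : linear output and mean squared error
head, criterion = nn.Linear(64, 1), nn.MSELoss()

# Binary classification : a single logit, the sigmoid is folded into the loss
head, criterion = nn.Linear(64, 1), nn.BCEWithLogitsLoss()

# Multi-class classification : one logit per class, the softmax is folded into the loss
head, criterion = nn.Linear(64, 10), nn.CrossEntropyLoss()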
Any well-behaved function can be approximated arbitrarily well by a single hidden layer FNN (Cybenko, 1989), (Hornik, 1991)
Intuition
At this point, you may wonder why we bother with deep learning, right ?
Training is performed by gradient descent which was popularized by (Rumelhart, Hinton, & Williams, 1986) who called it error backpropagation (but (Werbos, 1981) already introduced the idea, see (Schmidhuber, 2015)).
Gradient descent is an iterative algorithm :
Remember : by minibatch gradient descent (see Lecture 1)
The question is : how do you compute \(\frac{\partial J}{\partial w_i}\) ??
But let us first see PyTorch in action.
Overall steps :
Training
0- Imports
1- Loading the data
2- Define the network
3- Define the loss, optimizer, callbacks, …
4- Iterate and monitor
Testing
0- Imports
1- Loading the data
2- Define the network and load the trained parameters
3- Define the loss
4- Iterate
0- Imports
import torch
import torch.nn as nn
import torch.optim as optim
import torch.utils.data
import sklearn
import sklearn.datasets
import tqdm
import matplotlib.pyplot as plt
1- Loading the data
# Load the data and build up our dataloader
data = sklearn.datasets.fetch_california_housing()
# X is (20640, 8), y is (20640, )
X, y = data.data, data.target
# At least normalize the input for an easier optimization
mean, std = X.mean(axis=0), X.std(axis=0)
X = (X - mean)/std
X_train = torch.tensor(X).float()
y_train = torch.tensor(y).float()
# A mapable dataset defines __len__ and __getitem__ (see also iterable datasets)
# it can also be an iterable dataset
train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
# A dataloader will create the minibatches
train_dataloader = torch.utils.data.DataLoader(train_dataset,
                                               batch_size=64,
                                               shuffle=True,
                                               pin_memory=True)
Doc: Dataset, DataLoader. Pin memory
Iterating over train_dataloader gives a pair of tensors of shape \((64, 8)\) and \((64,)\).
2- Define the network
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
# Build up the model
Nh = 64
model = nn.Sequential(
    nn.Linear(8, Nh), nn.ReLU(),
    nn.Linear(Nh, Nh), nn.ReLU(),
    nn.Linear(Nh, Nh), nn.ReLU(),
    nn.Linear(Nh, 1)
)
model.to(device)
Doc: Linear, Sequential
3- Define the loss, optimizer, callbacks, …
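The corresponding code is not reproduced above; a minimal sketch of what it could look like (the learning rate, the choice of Adam and the StepLR schedule are illustrative) :

num_epochs = 50

# The loss for this regression problem
loss = nn.MSELoss()

# The optimizer, working on the parameters of the model
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# A learning rate scheduler, stepped once per epoch
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)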
4- Iterate and monitor
for e in range(num_epochs):
    # Switch the network in train mode
    model.train()
    print(f"Epoch {e}")
    for X, y in tqdm.tqdm(train_dataloader):
        # Send the data to the GPU if necessary
        X, y = X.to(device), y.to(device)
        # Reset the gradient accumulator
        optimizer.zero_grad()
        # Forward pass
        y_pred = model(X).squeeze()
        loss_value = loss(y_pred, y)
        # Backward pass
        loss_value.backward()
        # Weight update
        optimizer.step()
    print(f"MSELoss on the training set : {mseloss(train_dataloader)}")
    # Update the learning rate after one epoch
    scheduler.step()
Evaluation
def mseloss(loader):
    # After every epoch, compute the risk
    # on the loader
    cum_loss = 0.0
    n_samples = 0
    # Switch the network in eval mode
    model.eval()
    with torch.no_grad():
        for X, y in tqdm.tqdm(loader):
            X, y = X.to(device), y.to(device)
            # Forward pass
            y_pred = model(X).squeeze()
            loss_value = loss(y_pred, y)
            # The loss is 'mean' reduced so be careful when accumulating it
            cum_loss += loss_value * y_pred.size()[0]
            n_samples += y_pred.size()[0]
    return cum_loss / n_samples
A computational graph is a directed acyclic graph whose nodes are either variables (inputs, parameters, intermediate results, outputs) or operations acting on them.
Example graph for a linear regression \(\mathbb{R}^8 \mapsto \mathbb{R}\) with minibatch \((X, y)\)
\[ J = \frac{1}{M} \sum_{i=0}^{63} (w_1^T x_i + b_1 - y_i)^2 \]
Problem : computing the partial derivatives with respect to the variables, \(\frac{\partial J}{\partial \mathrm{var}}\).
You just need to provide the local derivatives of the output w.r.t the inputs.
And then apply the chain rule.
ex : \(\frac{\partial J}{\partial w_1} \in \mathcal{M}_{1, 8}(\mathbb{R})\), assuming numerator layout
Numerator layout convention (otherwise, we transpose and reverse the jacobian product order):
The derivative of a scalar with respect to a vector is a row vector : \[ y \in \mathbb{R}, x \in \mathbb{R}^n, \frac{dy}{dx} \in \mathcal{M}_{1, n}(\mathbb{R}) \]
More generally, the derivative of a vector valued function \(y : \mathbb{R}^{n_x} \mapsto \mathbb{R}^{n_y}\) with respect to its input (the Jacobian) is a \(n_y \times n_x\) matrix :
\[ x \in \mathbb{R}^{n_x}, y(x) \in \mathbb{R}^{n_y}, \frac{dy}{dx}(x) = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_{1}}{\partial x_{n_x}} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_{2}}{\partial x_{n_x}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_{n_y}}{\partial x_1} & \frac{\partial y_{n_y}}{\partial x_2} & \cdots & \frac{\partial y_{n_y}}{\partial x_{n_x}} \end{bmatrix}(x) \]
For a (single-path) chain \(y_1 \rightarrow y_2 = f_1(y_1) \rightarrow y_3 = f_2(y_2) \cdots y_n = f_{n-1}(y_{n-1})\), of vector valued functions \(y_1 \in \mathbb{R}^{n_1}, y_2\in\mathbb{R}^{n_2}, \cdots y_n \in \mathbb{R}^{n_n}\),
\[ \frac{\partial y_n}{\partial y_1} = \frac{\partial y_n}{\partial y_{n-1}}\frac{\partial y_{n-1}}{\partial y_{n-2}}\cdots\frac{\partial y_2}{\partial y_1} \]
ex : \(\frac{\partial J}{\partial w_1} = \frac{\partial J}{\partial y_1} \frac{\partial y_1}{\partial w_1} = \frac{2}{M} (y_1-y)^T X \in \mathcal{M}_{1, 8}(\mathbb{R})\), assuming numerator layout
For matrix variables, we should be introducing tensors. See also this and this
But this can be computationally (too) expensive :
Let us be more efficient : forward mode differentiation
Idea: To compute \(\frac{\partial y}{\partial x}\), forward propagate \(\frac{\partial }{\partial x}\)
e.g. \(\frac{\partial y}{\partial x} = y_3 e^{y_1} \left[ w_2^T + y_2 w_1^T\right] + y_4 e^{y_2}\left[ w_1^T + y_1 w_2^T\right]\)
Welcome to the field of automatic differentiation (AD). For more, see (Griewank, 2012), (Griewank & Walther, 2008) (see also (Olah, 2015), (Paszke et al., 2017))
Let us be (sometimes) even more efficient : reverse mode differentiation
Idea: To compute \(\frac{\partial y}{\partial x}\), backward propagate \(\frac{\partial y}{\partial }\) (compute the adjoint)
e.g. \(\frac{\partial y}{\partial x} = (y_4y_1e^{y_2} + y_3 e^{y_1})w_2^T + (y_3y_2e^{y_1} + y_4 e^{y_2})w_1^T\)
Oh ! We also got \(\frac{\partial y}{\partial w_2} = \frac{\partial y}{\partial y_2}\frac{\partial y_2}{\partial w_2}\), \(\frac{\partial y}{\partial b_2} = \frac{\partial y}{\partial y_2}\frac{\partial y_2}{\partial b_2}\), …
This is more efficient than forward mode when we have many more inputs (\(n\)) than outputs (\(m\)) for \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\), computing \(\frac{df}{dx}(x)\)
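A tiny illustration with torch.autograd.functional (the function f below is arbitrary) : forward mode pushes a tangent vector \(v\) through the graph and yields one column of the Jacobian (\(Jv\)), reverse mode pulls a cotangent \(u\) back and yields one row (\(u^T J\)) :

import torch
from torch.autograd.functional import jvp, vjp

def f(x):
    # An arbitrary map from R^3 to R^2
    return torch.stack([x[0] * x[1], x[1].exp() + x[2]])

x = torch.randn(3)

# Forward mode : cheap when there are few inputs
_, Jv = jvp(f, x, torch.tensor([1., 0., 0.]))   # J @ v, shape (2,)

# Reverse mode : cheap when there are few outputs (e.g. a scalar loss)
_, uJ = vjp(f, x, torch.tensor([1., 0.]))       # u^T @ J, shape (3,)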
A Review of Automatic Differentiation and its Efficient Implementation
In (Rumelhart et al., 1986), the algorithm was called “error backpropagation” : why ?
Suppose a 2-layer multi-layer feedforward network and propagating one sample, with a scalar loss : \[ L = g( y_i, \begin{bmatrix} & & \\ & W_2 (n_2 \times n_1) & \\ & & \end{bmatrix} f( \begin{bmatrix} & & \\ & W_1 (n_1 \times n_x) & \\ & & \end{bmatrix} \begin{bmatrix} \\ x_i \\ \phantom{} \end{bmatrix} )) \in \mathbb{R} \]
\(g\) could be a squared loss for regression (with \(n_2=1\)), or CrossEntropyLoss (with logits and \(n_2=n_{class}\)) for multiclass classification.
We denote \(z_1 = W_1 x_i, z_2 = W_2 f(z_1)\) and \(\delta_i = \frac{\partial L}{\partial z_i} \in \mathbb{R}^{n_i}\). Then : \[ \begin{align} \delta_2 &= \frac{\partial L}{\partial z_2} = \frac{\partial g(x_1, x_2)}{\partial x_2}(y_i, z_2) \\ \delta_1 &= \frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial z_2} \frac{\partial z_2}{\partial z_1} = \begin{bmatrix} & & \delta_2 & & \end{bmatrix} \begin{bmatrix} & & \\ & W_2(n_2 \times n_1) & \\ & & \phantom{}\end{bmatrix} \text{diag}(f'(z_1)) \end{align} \] The errors of \(\delta_2\) are integrated back through the weight matrix that was used for the forward pass. (See also (Nielsen, 2015), chap 2).
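A literal transcription of these two equations on a toy example (the squared loss and the tanh activation are arbitrary choices) :

import torch

n_x, n_1, n_2 = 8, 16, 1
W1, W2 = torch.randn(n_1, n_x), torch.randn(n_2, n_1)
x, y = torch.randn(n_x), torch.randn(n_2)
f, fprime = torch.tanh, lambda z: 1 - torch.tanh(z) ** 2

# Forward pass, keeping the intermediate values
z1 = W1 @ x
z2 = W2 @ f(z1)
L = ((z2 - y) ** 2).sum()

# Backward pass : the errors delta_2, delta_1 flow back through W2
delta2 = 2 * (z2 - y)                  # dL/dz2
delta1 = (delta2 @ W2) * fprime(z1)    # dL/dz1 = delta2 W2 diag(f'(z1))

# Gradients with respect to the weight matrices
dW2 = torch.outer(delta2, f(z1))
dW1 = torch.outer(delta1, x)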
Training in two phases
Warning : the reverse-mode differentiation uses the variables computed during the forward pass
\(\rightarrow\) we can apply efficiently stochastic gradient descent to optimize the parameters of our neural networks !
Note the computational graph can be extended to encompass the operations of the backward pass.
optimizer = optim.Adam(model.parameters())

for e in range(epochs):
    for X, y in train_dataloader:
        optimizer.zero_grad()
        ...
        loss.backward()
        optimizer.step()
The computational graph is a central notion in modern neural networks/deep learning. Broaden the scope with differentiable programming.
In recent years, “fancier” differentiable blocks other than \(f(W f(W \ldots))\) have appeared, possibly built dynamically (eager mode vs static graph).
Spatial Transformer Networks
(Jaderberg, Simonyan, & Zisserman, 2015)
Content/Location based addressing
Neural Turing Machine / Differentiable Neural Computer (Graves et al., 2016)
Indeed :
\[ y_1 = W_1 x, \qquad y_2 = W_2 f(y_1) \]
But empirically, most local minima are close (in performance) to the global minimum, especially with large/deep networks. See (Dauphin et al., 2014), (Pascanu, Dauphin, Ganguli, & Bengio, 2014), (Choromanska, Henaff, Mathieu, Arous, & LeCun, 2015). Saddle points seem to be more critical.
Algorithm
Rationale (Taylor expansion) : \(L(\theta_{t+1}) \approx L(\theta_{t}) + (\theta_{t+1} - \theta_{t})^T \nabla_{\theta} L(\theta_{t})\)
The choice of the batch size :
The optimization may converge slowly or even diverge if the learning rate \(\epsilon\) is not appropriate.
Bengio: “The optimal learning rate is usually close to the largest learning rate that does not cause divergence of the training criterion” (Bengio, 2012)
Karpathy “\(0.0003\) is the best learning rate for Adam, hands down.” (Twitter, 2016)
(Note: Adam will be discussed in a few slides)
See also :
- Practical Recommendations for gradient-based training of deep architectures (Bengio, 2012)
- Efficient Backprop (LeCun et al., 1998)
Setup
Parameters : \(\epsilon=0.005\), \(\theta_0 = \begin{bmatrix} 10 \\ 5\end{bmatrix}\)
Converges to \(\theta_{\infty} = \begin{bmatrix} 1.9882 \\ 2.9975\end{bmatrix}\)
Algorithm : let us damp the oscillations with a low-pass filter on \(\nabla_{\theta}\)
Usually \(\mu \approx 0.9\) or \(0.99\).
See also distill.pub. Note the frameworks may implement subtle variations.
Parameters : \(\epsilon=0.005\), \(\mu=0.6\), \(\theta_0 = \begin{bmatrix} 10 \\ 5\end{bmatrix}\)
Converges to \(\theta_{\infty} = \begin{bmatrix} 1.9837 \\ 2.9933\end{bmatrix}\)
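Written out, the momentum step amounts to the following sketch (PyTorch tensors; the frameworks use slightly different but equivalent parametrizations, as noted above) :

import torch

def sgd_momentum_step(theta, v, grad, epsilon=0.005, mu=0.9):
    # Low-pass filter on the gradients, then move the parameters
    v.mul_(mu).add_(grad, alpha=-epsilon)   # v <- mu * v - epsilon * grad
    theta.add_(v)                           # theta <- theta + v
    return theta, v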
Idea : look ahead to potentially correct the update. Based on Nesterov Accelerated Gradient. Formulation of (Sutskever, Martens, Dahl, & Hinton, 2013)
Algorithm
Parameters : \(\epsilon=0.005\), \(\mu=0.8\), \(\theta_0 = \begin{bmatrix} 10 \\ 5\end{bmatrix}\)
Converges to \(\theta_{\infty} = \begin{bmatrix} 1.9738 \\ 2.9914\end{bmatrix}\)
On our simple regression problem
You should always adapt your learning rate with a learning rate scheduler
Some more recent approaches are changing the picture of “decreasing learning rate” (“Robbins Monro conditions”)
See (Smith, 2018), The 1cycle policy - S. Gugger
Stochastic Gradient Descent with Warm Restart (Loshchilov & Hutter, 2017)
The improved performances may be linked to reaching flatter minima (i.e. minima whose predictions are less sensitive to parameter perturbations than sharper ones). The models reached before the warm restarts can be averaged (see Snapshot ensemble).
It seems also that initial large learning rates tend to lead to better models on the long run (Li, Wei, & Ma, 2019)
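With the built-in PyTorch schedulers, a minimal sketch (the hyperparameter values and the train_one_epoch helper are placeholders) :

optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Step decay : multiply the learning rate by 0.1 every 30 epochs
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

# Or warm restarts (Loshchilov & Hutter, 2017) : restarted cosine annealing
scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2)

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, train_dataloader)
    scheduler.step()   # update the learning rate once per epoch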
Adagrad Adaptive Gradient (Duchi, Hazan, & Singer, 2011)
The \(\sqrt{.}\) is experimentally critical ; \(\delta \approx [1e-8, 1e-4]\) for numerical stability.
Small gradients \(\rightarrow\) bigger learning rate for moving fast along flat directions
Big gradients \(\rightarrow\) smaller learning rate to calm down on high curvature.
But accumulation from the beginning is too aggressive. Learning rates decrease too fast.
RMSprop (Hinton, unpublished, Coursera lecture)
Idea: we should be using an exponential moving average when accumulating the gradient.
Adaptive Moments (ADAM) (Kingma & Ba, 2015)
\[ \theta(t+1) = \theta(t) - \frac{\epsilon}{\delta + \sqrt{\hat{v}(t+1)}} \hat{m}(t+1) \]
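The full update, written out (a sketch following the equations above; \(\beta_1, \beta_2\) are the defaults of the paper and \(\delta\) is the numerical-stability constant) :

import torch

def adam_step(theta, grad, m, v, t, epsilon=1e-3, beta1=0.9, beta2=0.999, delta=1e-8):
    # Exponential moving averages of the gradient and of its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction of both moments (t starts at 1)
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update
    theta = theta - epsilon * m_hat / (delta + torch.sqrt(v_hat))
    return theta, m, v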
and some others : Adadelta (Zeiler, 2012), … , YellowFin (Zhang & Mitliagkas, 2018).
(Goodfellow, Bengio, & Courville, 2016) There is currently no consensus […] no single best algorithm has emerged […] the most popular and actively in use include SGD, SGD with momentum, RMSprop, RMSprop with momentum, Adadelta and Adam.
See also Chap. 8 of (Goodfellow et al., 2016)
Rationale : \[ J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla_{\theta}J(\theta_0) + \frac{1}{2}(\theta - \theta_0)^T \nabla^2_{\theta} J(\theta_0) (\theta - \theta_0) \] with \(H = \nabla^2J\) the Hessian matrix, a \(n_\theta \times n_\theta\) matrix hosting the second derivatives of \(J\).
The second derivatives are much noisier than the first derivative (gradient); a larger batch size is usually required to prevent instabilities.
XOR is easy right ?
But it fails miserably (6/20 fails). Tmax=1000
XOR is easy right ?
Now it is better (0/20 fails). Tmax=1000
Historically, training deep FNNs was known to be hard, i.e. to lead to bad generalization errors.
The starting point of a gradient descent has a dramatic impact :
Pretraining is no longer used (because of ReLU variants, initialization schemes, …)
Gradient descent converges faster if your data are normalized and decorrelated. Denote by \(x_i \in \mathbb{R}^d\) your input data and \(\hat{x}_i\) its normalized version.
Z-score normalization (goal: \(\hat{\mu}_j = 0, \hat{\sigma}_j = 1\)) \[ \forall i,j, \hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j + \epsilon} \]
ZCA whitening (goal: \(\hat{\mu}_j = 0, \hat{\sigma}_j = 1\), \(\frac{1}{n-1} \hat{X} \hat{X}^T = I\))
\[ \hat{X} = W X, \quad W = \left(\frac{1}{n-1} XX^T\right)^{-1/2} \] with \(X\) centered and the samples stored as columns.
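A numpy sketch of both normalizations (here the samples are stored as rows, hence the transposed covariance) :

import numpy as np

X = np.random.randn(1000, 8)              # n samples of dimension d, stored as rows

# Z-score normalization : zero mean, unit variance per feature
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_z = (X - mu) / (sigma + 1e-8)

# ZCA whitening : zero mean, identity covariance
Xc = X - mu
C = Xc.T @ Xc / (X.shape[0] - 1)          # (d, d) covariance matrix
U, S, _ = np.linalg.svd(C)                # C = U diag(S) U^T
W = U @ np.diag(1.0 / np.sqrt(S + 1e-8)) @ U.T
X_zca = Xc @ W                            # decorrelated features with unit variance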
Remember our linear regression : \(y = 3x+2+\mathcal{U}(-0.1, 0.1)\), L2 loss, 30 1D samples
A good initialization should break the symmetry : constant initialization schemes make units learning all the same thing
A good initialization should start optimization in a region of low capacity : linear neural network
The Fundamental Deep Learning Problem : first observed by (Josef Hochreiter, 1991) for RNNs, the gradient can either vanish or explode, especially in deep networks (unrolled RNNs are very deep).
Remember that the backpropagated gradient involves : \[ \frac{\partial J}{\partial x_l} = \frac{\partial J}{\partial y_L} W_L f'(y_l) W_{L-1} f'(y_{l-1}) \cdots \] with \(y_l = W_l x_l + b, x_l = f(y_{l-1})\).
We see a pattern like \((W.f')^L\) which can diverge or vanish for large \(L\).
especially with the sigmoid : \(f' \leq \frac{1}{4} < 1\).
With a ReLU, the positive part has \(f' = 1\).
In (LeCun et al., 1998), Y. LeCun provided some guidelines on the design:
Aim : initialize the weights/biases to keep \(f\) in its linear part through multiple layers:
set the biases to \(0\)
initialize randomly and independently from \(\mathcal{N}(\mu=0, \sigma^2=\frac{1}{fan_{in}})\).
If \(x \in \mathbb{R}^n\) is \(\mathcal{N}(0, \Sigma = I)\), \(w \in \mathbb{R}^n\) is \(\mathcal{N}(μ=0, \Sigma=\frac{1}{n}I)\), then :
\[ \begin{align*} E[w^T x + b] &= E[w^T x] = \sum_i E[w_i x_i] = \sum_i E[w_i] E[x_i] = 0\\ var[w^T x + b] &= var[w^T x] \\ & = \sum_i \sigma^2_{w_i}\sigma^2_{x_i} + \sigma^2_{w_i}\mu^2_{x_i} + \mu^2_{w_i}\sigma^2_{x_i}\\ &= \sum_i \sigma^2_{w_i}\sigma^2_{x_i} = \frac{1}{n}\sum \sigma^2_{x_i} = 1 \end{align*} \]
\(x_i, w_i\) are all pairwise independent.
Idea : we must preserve the same distribution along the forward and backward pass (Glorot & Bengio, 2010).
This prevents:
Glorot (Xavier) initialization scheme for a feedforward network \(f(W_n \cdots f(W_1 f(W_0 x+b_0)+b_1) \cdots +b_n)\) with layer sizes \(n_i\):
Assuming the linear regime \(f'() = 1\) of the network :
\[ \begin{align*} \mbox{Forward propagation variance constraint :} \forall i,\ fan_{in_i} \sigma^2_{W_i} &= 1\\ \mbox{Backward propagation variance constraint :} \forall i,\ fan_{out_i} \sigma^2_{W_i} &= 1 \end{align*} \]
Compromise : \(\forall i, \frac{1}{\sigma^2_{W_i}} = \frac{fan_{in} + fan_{out}}{2}\)
- Glorot (Xavier) uniform : \(\mathcal{U}(-\frac{\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}}, \frac{\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}})\), b=0
- Glorot (Xavier) normal : \(\mathcal{N}(0, \sigma^2 = \frac{2}{fan_{in}+fan_{out}})\), b=0
Idea : we must preserve the same distribution along the forward and backward pass for rectifiers (He et al., 2015). For a feedforward network \(f(W_n \cdots f(W_1 f(W_0 x+b_0)+b_1) \cdots +b_n)\) with layer sizes \(n_i\):
With a ReLU, half of a zero-centered symmetric input distribution is zeroed, so \(E[x_l^2] = \frac{1}{2}\sigma^2_{y_{l-1}}\) and \(\sigma^2_{y_l} = \frac{1}{2}n_l \sigma^2_{w_l} \sigma^2_{y_{l-1}}\). To preserve the variance, we must guarantee \(\frac{1}{2} n_l \sigma^2_{w_l} = 1\).
We used : if \(X\) and \(Y\) are independent : \(\sigma^2_{X.Y} = \mu_{X^2}\mu_{Y^2} - \mu_X^2 \mu_Y^2\)
With the same derivation for the backward pass, the constraints become :
\[ \begin{align*} \mbox{Forward propagation variance constraint :} \forall i, \frac{1}{2}fan_{in_i} \sigma^2_{W_i} &= 1\\ \mbox{Backward propagation variance constraint :} \forall i, \frac{1}{2}fan_{out_i} \sigma^2_{W_i} &= 1 \end{align*} \]
He et al. suggest using either one or the other, e.g. \(\sigma^2_{W_i} = \frac{2}{fan_{in}}\)
- He uniform : \(\mathcal{U}(-\frac{\sqrt{6}}{\sqrt{fan_{in}}}, \frac{\sqrt{6}}{\sqrt{fan_{in}}})\), b=0
- He normal : \(\mathcal{N}(0, \sigma^2 = \frac{2}{fan_{in}})\), b=0
Note: for PReLU : \(\frac{1}{2} (1 + a^2)\, fan_{in} \sigma^2_{W_i} = 1\)
By default, the parameters are initialized randomly. e.g. in torch.nn.Linear :
class Linear(torch.nn.Module):
    def __init__(self):
        ...
        self.reset_parameters()

    def reset_parameters(self) -> None:
        # Setting a=sqrt(5) in kaiming_uniform is the same as initializing with
        # uniform(-1/sqrt(in_features), 1/sqrt(in_features)). For details, see
        # https://github.com/pytorch/pytorch/issues/57109
        torch.nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        fan_in, _ = torch.nn.init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        torch.nn.init.uniform_(self.bias, -bound, bound)
Oh, but that’s not what we should use for ReLU ?!? Indeed, you are right, see this issue. This is to avoid breaking with the way torch (lua) was initializing.
import torch.nn.init as init

def linear_relu(dim_in, dim_out):
    # Helper assumed from an earlier slide : a Linear layer followed by a ReLU
    return [nn.Linear(dim_in, dim_out), nn.ReLU()]

class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.classifier = nn.Sequential(
            *linear_relu(input_size, 256),
            *linear_relu(256, 256),
            nn.Linear(256, num_classes)
        )
        self.init()

    def init(self):
        @torch.no_grad()
        def finit(m):
            if type(m) == nn.Linear:
                init.kaiming_uniform_(m.weight,
                                      a=0,
                                      mode='fan_in',
                                      nonlinearity='relu')
                m.bias.fill_(0.0)
        self.apply(finit)
(Ioffe & Szegedy, 2015) observed the change in distribution of network activations due to the change in network parameters during training.
Experiment : 3 fully connected layers (100 units), sigmoid, softmax output, MNIST dataset
Idea : standardize the activations of every layer to keep the same distributions during training (Ioffe & Szegedy, 2015)
The gradient must be aware of this normalization, otherwise we may get a parameter explosion (see (Ioffe & Szegedy, 2015)) \(\rightarrow\) we need a differentiable normalization layer
introduces a differentiable Batch Normalization layer : \[ z = g(W x + b) \rightarrow z = g(BN(W x)) \]
BN operates element-wise : \[ \begin{align*} y_i &= BN_{\gamma,\beta} (x_i) = \gamma \hat{x}_i + \beta\\ \hat{x}_i &= \frac{x_i - \mu_{\mathcal{B}, i} }{\sqrt{\sigma^2_{\mathcal{B}, i} + \epsilon}} \end{align*} \] with \(\mu_{\mathcal{B},i}\) and \(\sigma_{\mathcal{B},i}\) statistics computed on the mini batch during training.
Learning faster, with better generalization.
During training : \(\mu_{\mathcal{B}}\) and \(\sigma_{\mathcal{B}}\) are computed on the current minibatch, and running averages of them are accumulated.
During inference (test) : the accumulated running averages are used in place of the minibatch statistics.
Warning : do not forget to switch to test mode :
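In PyTorch, BN amounts to inserting nn.BatchNorm1d after the linear layers and switching carefully between train and eval modes; a minimal sketch on the regression model used earlier (the exact placement of BN with respect to the activation is a design choice) :

model = nn.Sequential(
    nn.Linear(8, Nh), nn.BatchNorm1d(Nh), nn.ReLU(),
    nn.Linear(Nh, Nh), nn.BatchNorm1d(Nh), nn.ReLU(),
    nn.Linear(Nh, 1)
)

model.train()   # BN normalizes with the minibatch statistics and updates its running averages
# ... training loop ...
model.eval()    # BN normalizes with the running averages accumulated during training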
Some recent works challenge the idea of covariate shift (Santurkar, Tsipras, Ilyas, & Ma, 2018), (Bjorck, Gomes, Selman, & Weinberger, 2018). The loss seems smoother allowing larger learning rates, better generalization, robustness to hyperparameters.
Add an L2 penalty on the weights, \(\alpha > 0\)
\[ \begin{align*} J(\theta) &= L(\theta) + \frac{\alpha}{2} \|\theta\|^2_2 = L(\theta) + \frac{\alpha}{2}\theta^T \theta\\ \nabla_\theta J &= \nabla_\theta L + \alpha \theta\\ \theta &\leftarrow \theta - \epsilon \nabla_\theta J = (1 - \alpha \epsilon) \theta - \epsilon \nabla_\theta L \end{align*} \] Called L2 regularization, Tikhonov regularization, weight decay
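In PyTorch, L2 regularization is exposed as the weight_decay argument of the optimizers; a sketch (the coefficient values are arbitrary, and for Adam the decoupled variant AdamW is often preferred) :

# alpha, the L2 penalty coefficient, is called weight_decay
optimizer = optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)

# Decoupled weight decay for Adam
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)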
Example RBF, 1 kernel per sample, \(N=30\), noisy inputs,
See chap 7 of (Goodfellow et al., 2016) for a geometrical interpretation
Intuition : for linear layers, the gradient of the function equals the weights. Small weights \(\rightarrow\) small gradient \(\rightarrow\) smooth function.
In theory, regularizing the bias will cause underfitting
Example
\[ \begin{align*} J(w, b) &= \frac{1}{N} \sum_{i=1}^N \| y_i - b - w^T x_i\|_2^2\\ \nabla_b J(w,b) = 0 &\implies b = \left(\frac{1}{N} \sum_i y_i\right) - w^T \left(\frac{1}{N} \sum_i x_i\right) \end{align*} \]
If your data are centered (as they should), the optimal bias is the mean of the targets.
Add an L1 penalty to the weights : \[ \begin{align*} J(\theta) &= L(\theta) + \alpha \|\theta\|_1 = L(\theta) + \alpha \sum_i |\theta_i|\\ \nabla_\theta J &= \nabla_\theta L + \alpha \mbox{sign}(\theta) \end{align*} \]
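L1 is not built into the PyTorch optimizers; a sketch of adding the penalty to the loss by hand (the coefficient is arbitrary, and biases are usually not penalized) :

alpha = 1e-4
y_pred = model(X).squeeze()
l1_penalty = sum(p.abs().sum() for name, p in model.named_parameters()
                 if name.endswith("weight"))
loss_value = loss(y_pred, y) + alpha * l1_penalty
loss_value.backward()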
Example RBF, 1 kernel per sample, \(N=30\), noisy inputs,
See chap 7 of (Goodfellow et al., 2016) for a mathematical explanation in a specific case. Sparsity used for feature selection with LASSO (filter/wrapper/embedded).
Introduced in (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014):
Idea 1 : preventing co-adaptation. A pattern is robust by itself not because of others doing part of the job.
Idea 2 : average of all the sub-networks (ensemble learning)
How :
for every minibatch, zeroes hidden and input activations with probability \(p\) (\(p=0.5\) for hidden, \(p=0.2\) for input). At test time, multiply every activation by the keep probability \(1-p\)
“Inverted” dropout : divide the kept activations by \(1-p\) at train time. At test time, just do a normal forward pass.
Can be interpreted as if training/averaging all the possible subnetworks.
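With the inverted formulation used by nn.Dropout (scaling at train time, plain forward pass at test time), a sketch on the same architecture as before :

p_input, p_hidden = 0.2, 0.5
model = nn.Sequential(
    nn.Dropout(p_input),                  # dropout on the input activations
    nn.Linear(8, Nh), nn.ReLU(), nn.Dropout(p_hidden),
    nn.Linear(Nh, Nh), nn.ReLU(), nn.Dropout(p_hidden),
    nn.Linear(Nh, 1)
)
# model.train() : activations are zeroed with probability p and the kept ones divided by 1-p
# model.eval()  : dropout is disabled, plain forward pass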
L1/L2
Dropout
Split your data in three sets :
Everything can be placed in a cross validation loop.
Early stopping is about keeping the model with the lowest validation loss.
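A minimal sketch of early stopping on the validation risk (valid_dataloader and train_one_epoch are placeholders; mseloss is the evaluation function defined earlier) :

import copy

best_val, best_state = float("inf"), None
for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, train_dataloader)
    val_loss = mseloss(valid_dataloader)                  # risk on the validation set
    if val_loss < best_val:
        best_val = val_loss
        best_state = copy.deepcopy(model.state_dict())    # keep the best model so far

model.load_state_dict(best_state)                         # restore it before testing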
The best regularizer you may find is data. The more you have, the better you learn.
you can use pretrained models on some tasks as an initialization for learning your task (but may fail due to domain shift) : check the Pytorch Hub, timm, Hugging face hub
you can use unlabeled data for pretraining your networks (as was done around 2006) with auto-encoders / RBM : unsupervised/semi-supervised learning. Note also the recent works on self supervision (Balestriero et al., 2023)
you can apply random transformations to your data : dataset augmentation, see for example albumentations.ai
Introduced in (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2015) in the context of Convolutional Neural Networks.
Idea : prevent the network from being overconfident in its predictions on the training set.
Recipe : in a \(k\)-class problem, instead of using hard targets \(\in \{0, 1\}\), use soft targets \(\in \{\frac{\alpha}{k}, 1-\alpha\frac{k-1}{k}\}\) (weighted average between the hard targets and uniform target). \(\alpha \approx 0.1\).
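In PyTorch (\(\geq\) 1.10), this is a single argument of the cross-entropy loss; a sketch with \(\alpha = 0.1\) :

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)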
See also (Müller, Kornblith, & Hinton, 2020) for several experiments.
See also Mixup regularization (Zhang, Cisse, Dauphin, & Lopez-Paz, 2017).
Rather check the full online document references.pdf
Arthur, D., & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms.
Balestriero, R., Ibrahim, M., Sobal, V., Morcos, A., Shekhar, S., Goldstein, T., … Goldblum, M. (2023). A cookbook of self-supervised learning. Retrieved from http://arxiv.org/abs/2304.12210
Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures. arXiv:1206.5533 [Cs]. Retrieved from http://arxiv.org/abs/1206.5533
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy Layer-Wise Training of Deep Networks. In (p. 8).
Bjorck, N., Gomes, C. P., Selman, B., & Weinberger, K. Q. (2018). Understanding Batch Normalization. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018) (p. 12).
Broomhead, D., & Lowe, D. (1988). Multivariable Functional Interpolation and Adaptive Networks. Complex Systems, 2, 321–355.
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., & LeCun, Y. (2015). The Loss Surfaces of Multilayer Networks. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (p. 13).
Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2016). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv:1511.07289 [Cs]. Retrieved from http://arxiv.org/abs/1511.07289
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314. https://doi.org/10.1007/BF02551274
Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, 27, 2933–2941. Retrieved from https://papers.nips.cc/paper/2014/hash/17e23e50bedc63b4095e3d8204ce063b-Abstract.html
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.
Fritzke, B. (1994). A growing neural gas network learns topologies. In Proceedings of the 7th International Conference on Neural Information Processing Systems (pp. 625–632). Cambridge, MA, USA: MIT Press.
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010 (p. 8).
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., … Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471–476. https://doi.org/10.1038/nature20101
Griewank, A. (2012). Who Invented the Reverse Mode of Differentiation? Documenta Mathematica, 12.
Griewank, A., & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd edition). Philadelphia, PA: Society for Industrial and Applied Mathematics.
Hastad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the eighteenth annual ACM symposium on Theory of computing (pp. 6–20). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/12130.12132
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852 [Cs]. Retrieved from http://arxiv.org/abs/1502.01852
Hinton, G. E. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504–507. https://doi.org/10.1126/science.1127647
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251–257. https://doi.org/10.1016/0893-6080(91)90009-T
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (pp. 448–456). PMLR. Retrieved from http://proceedings.mlr.press/v37/ioffe15.html
Jaderberg, M., Simonyan, K., & Zisserman, A. (2015). Spatial Transformer Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems (p. 9).
Josef Hochreiter. (1991). Untersuchungen zu dynamischen neuronalen Netzen (PhD thesis). Retrieved from http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv:1609.04836 [Cs, Math]. Retrieved from http://arxiv.org/abs/1609.04836
Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. In arXiv:1412.6980 [cs]. Retrieved from http://arxiv.org/abs/1412.6980
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386
LeCun, Y. A., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). Efficient BackProp. In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade: Second Edition (pp. 9–48). Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-35289-8_3
Li, Y., Wei, C., & Ma, T. (2019). Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. In NIPS 2019 (p. 12).
Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. In arXiv:1608.03983 [cs, math]. Retrieved from http://arxiv.org/abs/1608.03983
Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning (ICML13) (p. 6).
Maclin, R., & Shavlik, J. W. (1995). Combining the predictions of multiple classifiers: Using competitive learning to initialize neural networks. In Proceedings of the 14th international joint conference on Artificial intelligence - Volume 1 (pp. 524–530). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Montufar, G. F., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the Number of Linear Regions of Deep Neural Networks. In Advances in Neural Information Processing Systems (Vol. 27, pp. 2924–2932). Retrieved from https://papers.nips.cc/paper/2014/hash/109d2dd3608f669ca17920c511c2a41e-Abstract.html
Müller, R., Kornblith, S., & Hinton, G. (2020). When Does Label Smoothing Help? arXiv:1906.02629 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1906.02629
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning (pp. 807–814). Madison, WI, USA: Omnipress.
Nielsen, M. A. (2015). Neural Networks and Deep Learning. Retrieved from http://neuralnetworksanddeeplearning.com
Olah, C. (2015). Calculus on computational graphs: Backpropagation.
Park, J., & Sandberg, I. W. (1991). Universal Approximation Using Radial-Basis-Function Networks. Neural Computation, 3(2), 246–257. https://doi.org/10.1162/neco.1991.3.2.246
Pascanu, R., Dauphin, Y. N., Ganguli, S., & Bengio, Y. (2014). On the saddle point problem for non-convex optimization. arXiv:1405.4604 [Cs]. Retrieved from http://arxiv.org/abs/1405.4604
Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (p. 9).
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., … Lerer, A. (2017). Automatic differentiation in PyTorch. In Proceedings of the 31st International Conference on Neural Information Processing Systems (p. 4).
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0
Santurkar, S., Tsipras, D., Ilyas, A., & Ma, A. (2018). How Does Batch Normalization Help Optimization? In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018) (p. 11).
Schmidhuber, J. (1992). Learning Complex, Extended Sequences Using the Principle of History Compression. Neural Computation, 4(2), 234–242. https://doi.org/10.1162/neco.1992.4.2.234
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117. https://doi.org/10.1016/j.neunet.2014.09.003
Schwenker, F., Kestler, H. A., & Palm, G. (2001). Three learning phases for radial-basis-function networks. Neural Networks, 14(4-5), 439–458. Retrieved from http://dblp.uni-trier.de/db/journals/nn/nn14.html#SchwenkerKP01
Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv:1803.09820 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1803.09820
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(56), 1929–1958. Retrieved from http://jmlr.org/papers/v15/srivastava14a.html
Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In ICML (p. 14).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. arXiv:1512.00567 [Cs]. Retrieved from http://arxiv.org/abs/1512.00567
Werbos, P. (1981). Application of advances in nonlinear sensitivity analysis. In Proc. Of the 10th IFIP conference (pp. 762–770).
Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701 [Cs]. Retrieved from http://arxiv.org/abs/1212.5701
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). Mixup: Beyond empirical risk minimization. arXiv. https://doi.org/10.48550/ARXIV.1710.09412
Zhang, J., & Mitliagkas, I. (2018). YellowFin and the Art of Momentum Tuning. arXiv:1706.03471 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1706.03471