An introduction to deep learning
Jeremy Fix
February 15, 2024
Slides made with slidemaker
… but they are hard to train (except the CNN) and the SVM comes into play in the 1990s : second winter
For an overview : (Schmidhuber, 2015)
See also Deep reinforcement learning : Atari / AlphaGO / AlphaStar / AlphaChem; Graph neural networks, etc..
Some of the reasons of the current success :
Libraries allow to easily implement/test/deploy neural networks :
Some of the major contributors to the field:
Some of the most important conferences: NIPS/NeurIPS, ICLR, (ICML, ICASSP, ..)
Online resources :
- distill.pub, blog posts (e.g. pytorch.org blog),
- FastAI lectures, CS231n, MIT S191
- awesome deep learning, Awesome deep learning papers
Lecture 1/2 (08/02): Introduction, Linear networks, RBF
Lecture 3/4 (10/02): Feedforward networks, differential programming, initialization and gradient descent
Lecture 5 (17/02): Regularization, and Convolutional neural networks architectures
Lecture 6 (17/02-07/03) : Convolutional Neural Networks : applications
Lab work 1 (21/02-28/02) : Introduction to pytorch, tensorboard, FCN, CNNs
Lecture 7 (07/03-14/03): Recurrent neural networks : architectures
Lecture 8 (14/03): Recurrent neural networks : applications
Lab work 3 (21/03): Recurrent neural networks : Seq2Seq for Speech to text
Labworks : on our GPU clusters (1080, 2080 Ti, pytorch), in pairs, remotely with VNC.
Exam (22/03): 2h paper and pen exam
A neural network is a directed graph :
There are two types of graphs :
But why do we care about convolutional neural networks with a softmax output, ReLu hidden activations, cross entropy loss, batch normalization layers, trained with RMSprop with Nesterov momentum, regularized with dropout, exactly ?
Given fixed, predefined feature functions \(\phi_j\), with \(\phi_0(x) = 1, \forall x \in \mathbb{R}^n\), the perceptron classifies \(x\) as :
\[\begin{align} y &= g(w^T \Phi(x))\\ g(x) &= \begin{cases}-1 &\text{if }\quad x < 0 \\ +1 & \text{if }\quad x \geq 0 \end{cases} \end{align}\]
with \(\phi(x) \in \mathbb{R}^{n_a+1}\), \(\phi(x) = \begin{bmatrix} 1 \\ \phi_1(x) \\ \phi_2(x) \\ \vdots \end{bmatrix}\)
Given \((x_i, y_i)\), \(y_i \in \{-1,1\}\), the perceptron learning rule operates online: \[\begin{align} w = \begin{cases} w &\text{ if the input is correctly classified}\\ w + \phi(x_i) &\text{ if the input is incorrectly classified as -1}\\ w - \phi(x_i) &\text{ if the input is incorrectly classified as +1} \end{cases} \end{align}\]
Decision rule : \(y = g(w^T \Phi(x))\)
Algorithm:
\[\begin{align}
w = \begin{cases}
w &\text{ if the input is correctly classified}\\
w + \phi(x_i) &\text{ if the input is incorrectly classified as -1}\\
w - \phi(x_i) &\text{ if the input is incorrectly classified as +1}
\end{cases}
\end{align}\]
Decision rule : \(y = g(w^T \Phi(x))\)
The intersection of the valid halfspaces is called the cone of feasibility (it may be empty).
Consider two samples \(x_1, x_2\) with \(y_1=+1\), \(y_2=-1\)
Given \((x_i, y_i)\), \(y_i \in \{-1,1\}\), the perceptron learning rule operates online: \[\begin{align} w = \begin{cases} w &\text{ if the input is correctly classified}\\ w + \phi(x_i) &\text{ if the input is incorrectly classified as -1}\\ w - \phi(x_i) &\text{ if the input is incorrectly classified as +1} \end{cases} \end{align}\]
\[\begin{align} w = \begin{cases} w &\text{ if } g(w^T\phi(x_i)) = y_i\\ w + \phi(x_i) &\text{ if } g(w^T \phi(x_i)) = -1 \text{ and } y_i = +1\\ w - \phi(x_i) &\text{ if } g(w^T \phi(x_i)) = +1 \text{ and } y_i = -1 \end{cases} \end{align}\]
\[\begin{align*} w = \begin{cases} w &\text{ if } g(w^T\phi(x_i)) = y_i\\ w + y_i \phi(x_i) &\text{ if } g(w^T \phi(x_i)) \neq y_i \end{cases} \end{align*}\]
\[\begin{align*} w = w + \frac{1}{2} (y_i - \hat{y}_i) \phi(x_i) \end{align*}\]
with \(\hat{y}_i = g(w^T \phi(x_i))\). This is called the delta rule.
Definition (Linear separability)
A binary classification problem \((x_i, y_i) \in \mathbb{R}^d \times \{-1,1\}, i \in [1..N]\) is said to be linearly separable if there exists \(\textbf{w} \in \mathbb{R}^d\) such that :
\[\begin{align*} \forall i, \mbox{sign}(\textbf{w}^T x_i) = y_i \end{align*}\]
with \(\forall x < 0, \mbox{sign}(x) = -1, \forall x \geq 0, \mbox{sign}(x) = +1\).
Theorem (Perceptron convergence theorem)
A classification problem \((x_i, y_i) \in \mathbb{R}^d \times \{-1,1\}, i \in [1..N]\) is linearly separable if and only if the perceptron learning rule converges to a separating solution in a finite number of steps.
\(\Leftarrow\): easy; \(\Rightarrow\) : we upper/lower bound \(|w(t)|_2^2\)
\[\begin{equation} w_t = w_0 + \sum_{i}\frac{1}{2} (y_i - \hat{y}_i) \phi(x_i) \end{equation}\]
\((y_i - \hat{y}_i)\) is the prediction error
Any linear predictor involving only scalar products can be kernelized (kernel trick, cf SVM);
Decision rule : \(\mbox{sign}(<w, x>)\)
Given \(w(t) = w_0 + \sum_{i \in \mathcal{I}} y_i x_i\)
\[\begin{align*} <w,x> &= <w_0,x> + \sum_{i \in \mathcal{I}} y_i <x_i, x> \\ \Rightarrow k(w,x) &= k(w_0, x) + \sum_{i \in \mathcal{I}} y_i k(x_i, x) \end{align*}\]
Polynomial kernel of degree \(d=3\) :
\[k(x, y) = (1 + <x, y>)^3\]
Training set : 50 samples
Real risk : \(92\%\)
Code : https://github.com/rougier/ML-Recipes/blob/master/recipes/ANN/kernel-perceptron.py
Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \mathbb{R}\), minimize
\[ J(w) = \frac{1}{N} \sum_i ||y_i - w^T x_i||^2 \]
We assume that \(x_i[0] = 1 \forall i\) so that \(w[0]\) hosts the bias term.
Analytically Introduce \(X = [x_0 | x_1 | ... ]\), \(J(w) = \|y-X^Tw\|^2\). In numerator layout (see later)
\[ \nabla_w J(w) = 0 \Rightarrow \left.\nabla_z \|z\|_2^2\right|_{z=y-X^T w} \, \nabla_w (y-X^T w) = -2(y - X^Tw)^T X^T = 0 \Rightarrow X X^T w = X y \]
Needs to compute \(XX^T\), i.e. over the whole training set…
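As a minimal illustration (the synthetic data is an assumption, reusing the toy regression \(y = 3x+2\) that appears later), the normal equations \(XX^T w = Xy\) can be solved directly:

import torch

# Toy data: y = 3x + 2 + noise, with a first row of ones hosting the bias
N = 30
x = torch.rand(N)
X = torch.stack([torch.ones(N), x])           # (2, N), one column per sample
y = 3 * x + 2 + 0.1 * (2 * torch.rand(N) - 1)

# Solve the normal equations X X^T w = X y
w = torch.linalg.solve(X @ X.T, X @ y)
print(w)                                      # approximately [2, 3]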
Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \mathbb{R}\), minimize
\[ J(w) = \frac{1}{N} \sum_i ||y_i - w^T x_i||^2 \]
We assume that \(x_i[0] = 1 \forall i\) so that \(w[0]\) hosts the bias term.
start at \(w_0\)
take each sample one after the other (online) \(x_i, y_i\)
denote \(\hat{y}_i = w^T x_i\) the prediction
update \[w_{t+1}= w_t - \epsilon \nabla_w J(w_t) = w_t + \epsilon (y_i - \hat{y}_i) x_i\]
delta rule, \(\delta = (y_i - \hat{y}_i)\) prediction error \[w_{t+1} = w_t + \epsilon \delta x_i\]
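A minimal sketch of this online delta rule (the data generation and the learning rate value are assumptions):

import torch

# Toy data: y = 3x + 2 + noise, with x_i[0] = 1 hosting the bias
N = 30
x = torch.rand(N)
X = torch.stack([torch.ones(N), x], dim=1)    # (N, 2), one row per sample
y = 3 * x + 2 + 0.1 * (2 * torch.rand(N) - 1)

w, eps = torch.zeros(2), 0.1
for epoch in range(100):
    for xi, yi in zip(X, y):                  # online: one sample after the other
        delta = yi - w @ xi                   # prediction error
        w = w + eps * delta * xi              # delta rule update
print(w)                                      # approximately [2, 3]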
\[J(w,x,y) = \frac{1}{N} \sum_{i=1}^N L(w,x_i,y_i)\] e.g. \(L(w,x_i,y_i) = ||y_i - w^T x_i||^2\)
Batch gradient descent
compute the gradient of the loss \(J(w)\) over the whole training set
performs one step in direction of \(-\nabla_w J(w,x,y)\) \[w_{t+1} = w_t - \epsilon_t \textcolor{red}{\nabla_w J(w,x,y)}\]
\(\epsilon\) : learning rate
\[J(w,x,y) = \frac{1}{N} \sum_{i=1}^N L(w,x_i,y_i)\] e.g. \(L(w,x_i,y_i) = ||y_i - w^T x_i||^2\)
Stochastic gradient descent (SGD)
one sample at a time, noisy estimate of \(\nabla_w J\)
performs one step in direction of \(-\nabla_w L(w,x_i,y_i)\) \[w_{t+1} = w_t - \epsilon_t \textcolor{red}{\nabla_w L(w,x_i,y_i)}\]
faster to converge than gradient descent
\[J(w,x,y) = \frac{1}{N} \sum_{i=1}^N L(w,x_i,y_i)\] e.g. \(L(w,x_i,y_i) = ||y_i - w^T x_i||^2\)
Minibatch
\[ w_{t+1} = w_t - \epsilon_t \textcolor{red}{\frac{1}{M} \sum_{j \in \mathcal{J}} \nabla_w L(w,x_j,y_j)} \]
If the batch size is too large, there is a generalization gap (LeCun, Bottou, Orr, & Müller, 1998), maybe due to sharp minimum (Keskar, Mudigere, Nocedal, Smelyanskiy, & Tang, 2017); see also (Hoffer, Hubara, & Soudry, 2017)
Convex function A function \(f: \mathbb{R}^n \mapsto \mathbb{R}\) is convex :
\(\iff \forall x_1, x_2 \in \mathbb{R}^n, \forall t \in [0,1]\) \(f(t x_1 + (1-t)x_2) \leq t f(x_1) + (1-t) f(x_2)\)
with \(f\) twice diff.,
\(\iff \forall x \in \mathbb{R}^n, H = \nabla^2 f(x)\) is positive semidefinite
i.e. \(\forall x \in \mathbb{R}^n, x^T H x \geq 0\)
For a convex function \(f\), all local minima are global minima. Our losses are lower bounded, so these minima exist. Under mild conditions, gradient descent and stochastic gradient descent converge, typically \(\sum \epsilon_t =\infty, \sum \epsilon_t^2 < \infty\) (cf lectures on convex optimization).
Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \mathbb{R}\)
Other choices may also be considered (Huber loss, MAE, …).
Possibly regularized (but more on regularization later).
Indeed,
\[ \begin{align*} \nabla_w L &= (w^T x_i - y_i) x_i\\ \nabla_w^2 L &= x_i x_i^T\\ \forall x \in \mathbb{R}^n x^T x_i x_i^T x &= (x_i^T x)^2 \geq 0 \end{align*} \]
Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n}, y_i \in \{0, 1\}\)
Assume that \(P(y=1 | x) = p(x; w)\), parametrized by \(w\), and our samples to be independent, the conditional likelihood of the labels is:
\[ \mathcal{L}(w) = \prod_i P(y=y_i | x_i) = \prod_i p(x_i; w)^{y_i} (1- p(x_i; w))^{1-y_i} \]
With maximum likelihood estimation, we rather equivalently minimize the averaged negative log-likelihood :
\[ J(w) = -\frac{1}{N} \log(\mathcal{L}(w)) = \frac{1}{N} \sum_i -y_i\log(p(x_i; w))-(1-y_i)\log(1-p(x_i; w)) \]
Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \{0, 1\}\)
Linear logit model : \(o(x) = w^Tx\) (we still assume \(x[0] = 1\) for the bias)
Sigmoid transfer function : \(\hat{y}(x) = p(x; w) = \sigma(o(x)) = \sigma(w^T x)\)
Following maximum likelihood estimation, we minimize : \[ J(w) = \frac{1}{N} \sum_i -y_i\log(p(x_i; w))-(1-y_i)\log(1-p(x_i; w)) \]
The loss \(L(\hat{y}, y) = -y \log(\hat{y}) - (1-y)\log(1 - \hat{y})\) is called the cross entropy loss, or negative log-likelihood
The gradient of the cross entropy loss with \(\hat{y}(x) = \sigma(x)\) is : \[ \nabla_w L(w,x_i,y_i) = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w} = -(y_i - \hat{y}_i) x_i \]
Indeed,
Compute the gradient to see why
Take L2 loss \(L(\hat{y}, y) = \frac{1}{2}||\hat{y} - y||^2\)
With a cross entropy loss, \(\nabla_w L(w,x_i,y_i)\) is proportional to the error
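A minimal sketch of logistic regression trained this way in PyTorch (the toy data is an assumption; BCEWithLogitsLoss fuses the sigmoid and the cross entropy and operates directly on the logits):

import torch
import torch.nn as nn

# Toy, roughly linearly separable binary labels
X = torch.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).float()

model = nn.Linear(2, 1)                       # the linear logit model o(x) = w^T x + b
loss_fn = nn.BCEWithLogitsLoss()              # sigmoid + cross entropy, numerically stable
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(200):
    optimizer.zero_grad()
    logits = model(X).squeeze()
    loss = loss_fn(logits, y)
    loss.backward()
    optimizer.step()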
Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in [|0, K-1|]\)
Assume that \(P(y=c | x) = \frac{e^{w_c^T x}}{\sum_k e^{w_k^T x}}\), parametrized by \(w_0, w_1, w_2, ..\), and our samples to be independent, the conditional likelihood of the labels is:
\[ \mathcal{L}(w) = \prod_i P(y=y_i | x_i) \]
With maximum likelihood estimation, we rather equivalently minimize the averaged negative log-likelihood:
\[ J(w) = -\frac{1}{N} \log(\mathcal{L}(w)) = -\frac{1}{N} \sum_i \log(P(y=y_i | x_i)) \]
With a one-hot encoding of the target class (i.e. \(y_i = [0, ..., 0, 1, 0, .. ]\)), it can be written as :
\[ J(w) = -\frac{1}{N} \log(\mathcal{L}(w)) = -\frac{1}{N} \sum_i \sum_c y_{i,c} \log(P(y=c | x_i)) \]
Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in [|0, K-1|]\)
Softmax regression is convex.
Large exponentials
If you compute naïvely the softmax, you would have \(\exp(..)\) which is quickly large.
Fortunately:
\[ softmax_i(o_1, o_2, o_3, \ldots) = softmax_i(o_1 - o^\star, o_2 - o^\star, o_3 - o^\star, \ldots) = \frac{\exp(o_i - o^\star)}{\sum_j \exp(o_j - o^\star)}, \quad o^\star = \max_j o_j \]
You always compute \(\exp(z)\) with \(z \leq 0\).
Avoiding some exponentials with the log-sum-exp trick \(\log(\sum_j \exp(o_j)) = o^\star + \log(\sum_j \exp(o_j-o^\star))\)
You do not really need to compute the \(\log(\hat{y}_j) = \log(softmax_j(x)))\) since :
\[ \log(\hat{y}_i) = \log(\frac{\exp(o_i-o^\star)}{\sum_j \exp(o_j - o^\star)}) = o_i - o^\star - \log(\sum_j \exp(o_j - o^\star)) \]
In practice, that explains why we use the cross entropy loss applied directly to the logits rather than Softmax + negative log-likelihood, or even LogSoftmax + NLLLoss (NLLLoss expects log-probabilities and does not apply the log itself… yes, the naming is confusing).
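A small sketch of the numerical issue and of the stable computation (the logit values are deliberately large and arbitrary):

import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 1001.0, 1002.0]])   # a naive softmax would overflow here
target = torch.tensor([2])

# log-softmax applies the log-sum-exp trick internally
print(F.log_softmax(logits, dim=1))
# CrossEntropyLoss = LogSoftmax + NLLLoss, applied directly to the logits
print(nn.CrossEntropyLoss()(logits, target))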
Perceptrons and logistic regression perform linear separation in a predefined, fixed feature space.
What about learning these features \(\phi_j(x)\)?
\[ \begin{eqnarray*} \phi(x) = \begin{pmatrix} 1 \\ \exp{\frac{-||x-\mu_0||^2}{2\sigma_0^2}} \\ \vdots\\ \exp{\frac{-||x-\mu_{N_a-1}||^2}{2\sigma_{N_a-1}^2}} \\ \end{pmatrix} \end{eqnarray*} \]
Regression
Binary classification
Multi classification
We know how to learn the weights \(w\) : minibatch gradient descent (or a variant thereof)
What about the centers and variances ? (Schwenker, Kestler, & Palm, 2001)
place them uniformly, randomly, by vector quantization (K-means++(Arthur & Vassilvitskii, 2007), GNG (Fritzke, 1994))
two phases : fix the centers/variances, fit the weights
three phases : fix the centers/variances, fit the weights, fit everything (\(\nabla_{\mu} L, \nabla_{\sigma} L, \nabla_w L\))
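As a minimal sketch of the two-phase approach (fixed, hand-placed centers and a trainable linear read-out; the centers, \(\sigma\) and sizes below are illustrative assumptions):

import torch
import torch.nn as nn

class RBFNet(nn.Module):
    """Gaussian RBF features followed by a linear read-out."""
    def __init__(self, centers, sigma, n_out):
        super().__init__()
        self.centers = nn.Parameter(centers, requires_grad=False)  # centers kept fixed here
        self.sigma = sigma
        self.linear = nn.Linear(centers.shape[0], n_out)           # only the weights are trained

    def forward(self, x):
        # x: (B, d), centers: (K, d) -> phi: (B, K)
        dist2 = torch.cdist(x, self.centers) ** 2
        phi = torch.exp(-dist2 / (2 * self.sigma ** 2))
        return self.linear(phi)

model = RBFNet(centers=torch.randn(10, 2), sigma=1.0, n_out=1)
out = model(torch.randn(5, 2))   # (5, 1)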
Theorem : Universal approximation (Park & Sandberg, 1991)
Denote \(\mathcal{S}\) the family of functions based on RBF in \(\mathbb{R}^d\): \[\mathcal{S} = \{g \in \mathbb{R}^d \to \mathbb{R}, g(x) = \sum_i w_i K(\frac{x-\mu_i}{\sigma}), w \in \mathbb{R}^N\}\] with \(K : \mathbb{R}^d \rightarrow \mathbb{R}\) continuous almost everywhere and \(\int_{\mathbb{R}^d}K(x)dx \neq 0\),
Then \(\mathcal{S}\) is dense in \(L^p(\mathbb{R}^d)\) for every \(p \in [1, \infty)\)
In particular, it applies to the gaussian kernel introduced before.
Vocabulary
ReLu are more favorable for the gradient flow than the saturating functions (more on that later when discussing computational graphs and gradient computation).
Relu (Nair & Hinton, 2010)
\[\begin{equation*} \scriptstyle f(x) = \begin{cases} x & \mbox{ if } x \geq 0\\ 0 & \mbox { if } x < 0 \end{cases} \end{equation*}\]
Leaky Relu
(Maas, Hannun, & Ng, 2013)
Parametric ReLu
(He, Zhang, Ren, & Sun, 2015)
\[\begin{equation*} \scriptstyle f(x) = \begin{cases} x & \mbox{ if } x \geq 0\\ \alpha x & \mbox { if } x < 0 \end{cases} \end{equation*}\]
Exponential Linear Unit
(Clevert, Unterthiner, & Hochreiter, 2016)
Exactly as when we discussed the RBF, this is task dependent.
Regression
Binary classification
Multi classification
Any well behaved function can be arbitrarily well approximated with a single hidden layer FNN (Cybenko, 1989), (Hornik, 1991)
Intuition
At that point, you may wonder why we bother about deep learning, right ?
Training is performed by gradient descent which was popularized by (Rumelhart et al., 1986) who called it error backpropagation (but (Werbos, 1981) already introduced the idea, see (Schmidhuber, 2015)).
Gradient descent is an iterative algorithm :
Remember : by minibatch gradient descent (see Lecture 1)
The question is : how do you compute \(\frac{\partial J}{\partial w_i}\) ??
But let us first see pytorch in action.
Overall steps :
Training
0- Imports
1- Loading the data
2- Define the network
3- Define the loss, optimizer, callbacks, …
4- Iterate and monitor
Testing
0- Imports
1- Loading the data
2- Define the network and load the trained parameters
3- Define the loss
4- Iterate
0- Imports
import torch
import torch.nn as nn
import torch.optim as optim
import sklearn
import sklearn.datasets
import tqdm
import matplotlib.pyplot as plt
1- Loading the data
# Load the data and build up our dataloader
data = sklearn.datasets.fetch_california_housing()
# X is (20640, 8), y is (20640, )
X, y = data.data, data.target
# At least normalize the input for an easier optimization
mean, std = X.mean(axis=0), X.std(axis=0)
X = (X - mean)/std
X_train = torch.tensor(X).float()
y_train = torch.tensor(y).float()
# A mapable dataset defines __len__ and __getitem__ (see also iterable datasets)
# it can also be an iterable dataset
train_dataset = torch.utils.data.TensorDataset(X_train, y_train)
# A dataloader will create the minibatches
train_dataloader = torch.utils.data.DataLoader(train_dataset,
batch_size=64,
shuffle=True,
pin_memory=True)
Doc: Dataset, DataLoader. Pin memory
Iterating over train_dataloader gives a pair of tensors of shape \((64, 8)\) and \((64,)\).
2- Define the network
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
# Build up the model
Nh = 64
model = nn.Sequential(
nn.Linear(8, Nh), nn.ReLU(),
nn.Linear(Nh, Nh), nn.ReLU(),
nn.Linear(Nh, Nh), nn.ReLU(),
nn.Linear(Nh, 1)
)
model.to(device)
Doc: Linear, Sequential
3- Define the loss, optimizer, callbacks, …
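The code of this step is not reproduced here; a minimal sketch consistent with the names (loss, optimizer, scheduler, num_epochs) used in the loop below, with Adam and a step decay as arbitrary choices:

# Step 3 (sketch): loss, optimizer and learning rate scheduler
loss = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
# Halve the learning rate every 10 epochs (arbitrary schedule)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
num_epochs = 50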
4- Iterate and monitor
for e in range(num_epochs):
    # Switch the network in train mode
    model.train()
    print(f"Epoch {e}")
    for X, y in tqdm.tqdm(train_dataloader):
        # Send the data to the GPU if necessary
        X, y = X.to(device), y.to(device)
        # Reset the gradient accumulator
        optimizer.zero_grad()
        # Forward pass
        y_pred = model(X).squeeze()
        loss_value = loss(y_pred, y)
        # Backward pass
        loss_value.backward()
        # Weight update
        optimizer.step()
    print(f"MSELoss on the training set : {mseloss(train_dataloader)}")
    # Update the learning rate after one epoch
    scheduler.step()
Evaluation
def mseloss(loader):
    # After every epoch, compute the risk
    # on the loader
    cum_loss = 0.0
    n_samples = 0
    # Switch the network in eval mode
    model.eval()
    with torch.no_grad():
        for X, y in tqdm.tqdm(loader):
            X, y = X.to(device), y.to(device)
            # Forward pass
            y_pred = model(X).squeeze()
            loss_value = loss(y_pred, y)
            # The loss is 'mean' reduced so be careful when accumulating it
            cum_loss += loss_value * y_pred.size()[0]
            n_samples += y_pred.size()[0]
    return cum_loss / n_samples
A computational graph is a directed acyclic graph where nodes are :
Example graph for a linear regression \(\mathbb{R}^8 \mapsto \mathbb{R}\) with minibatch \((X, y)\)
\[ J = \frac{1}{M} \sum_{i=0}^{63} (w_1^T x_i + b_1 - y_i)^2 \]
Problem computing the partial derivatives with respect to the variables \(\frac{\partial J}{\partial var}\).
You just need to provide the local derivatives of the output w.r.t the inputs.
And then apply the chain rule.
ex : \(\frac{\partial J}{\partial w_1} \in \mathcal{M}_{1, 8}(\mathbb{R})\), assuming numerator layout
Numerator layout convention (otherwise, we transpose and reverse the jacobian product order):
The derivative of a scalar with respect to a vector is a row vector : \[ y \in \mathbb{R}, x \in \mathbb{R}^n, \frac{dy}{dx} \in \mathcal{M}_{1, n}(\mathbb{R}) \]
More generally, the derivative of a vector valued function \(y : \mathbb{R}^{n_x} \mapsto \mathbb{R}^{n_y}\) with respect to its input (the Jacobian) is a \(n_y \times n_x\) matrix :
\[ x \in \mathbb{R}^{n_x}, y(x) \in \mathbb{R}^{n_y}, \frac{dy}{dx}(x) = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_{1}}{\partial x_{n_x}} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_{2}}{\partial x_{n_x}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_{n_y}}{\partial x_1} & \frac{\partial y_{n_y}}{\partial x_2} & \cdots & \frac{\partial y_{n_y}}{\partial x_{n_x}} \end{bmatrix}(x) \]
For a (single-path) chain \(y_1 \rightarrow y_2 = f_1(y_1) \rightarrow y_3 = f_2(y_2) \cdots y_n = f_{n-1}(y_{n-1})\), of vector valued functions \(y_1 \in \mathbb{R}^{n_1}, y_2\in\mathbb{R}^{n_2}, \cdots y_n \in \mathbb{R}^{n_n}\),
\[ \frac{\partial y_n}{\partial y_1} = \frac{\partial y_n}{\partial y_{n-1}}\frac{\partial y_{n-1}}{\partial y_{n-2}}\cdots\frac{\partial y_2}{\partial y_1} \]
ex : \(\frac{\partial J}{\partial w_1} = \frac{\partial J}{\partial y_1} \frac{\partial y_1}{\partial w_1} = \frac{2}{M} (y_1-y)^T X \in \mathcal{M}_{1, 8}(\mathbb{R})\), assuming numerator layout
For matrix variables, we should be introducing tensors. See also this and this
But this can be computationally (too) expensive :
Let us be more efficient : forward mode differentiation
Idea: To compute \(\frac{\partial y}{\partial x}\), forward propagate \(\frac{\partial }{\partial x}\)
e.g. \(\frac{\partial y}{\partial x} = y_3 e^{y_1} \left[ w_2^T + y_2 w_1^T\right] + y_4 e^{y_2}\left[ w_1^T + y_1 w_2^T\right]\)
Welcome to the field of automatic differentiation (AD). For more, see (Griewank, 2012), (Griewank & Walther, 2008) (see also (Olah, 2015), (Paszke et al., 2017))
Let us be (sometimes) even more efficient : reverse mode differentiation
Idea: To compute \(\frac{\partial y}{\partial x}\), backward propagate \(\frac{\partial y}{\partial }\) (compute the adjoint)
e.g. \(\frac{\partial y}{\partial x} = (y_4y_1e^{y_2} + y_3 e^{y_1})w_2^T + (y_3y_2e^{y_1} + y_4 e^{y_2})w_1^T\)
Oh ! We also got \(\frac{\partial y}{\partial w_2} = \frac{\partial y}{\partial y_2}\frac{\partial y_2}{\partial w_2}\), \(\frac{\partial y}{\partial b_2} = \frac{\partial y}{\partial y_2}\frac{\partial y_2}{\partial b_2}\), …
This is more efficient than forward mode when we have much more inputs (\(n\)) than outputs (\(m\)) for \(f : \mathbb{R}^n \rightarrow \mathbb{R}^m\), computing \(\frac{df}{dx}(x)\)
A Review of Automatic Differentiation and its Efficient Implementation
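In PyTorch, reverse-mode AD is what .backward() performs; a tiny sketch on an arbitrary scalar function of two leaf variables:

import torch

w = torch.randn(3, requires_grad=True)
x = torch.randn(3, requires_grad=True)
y = torch.exp(w @ x) + (w @ x) ** 2        # scalar output
y.backward()                               # one reverse pass ...
print(w.grad, x.grad)                      # ... gives the gradient w.r.t. every leaf variable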
In (Rumelhart et al., 1986), the algorithm was called “error backpropagation” : why ?
Suppose a 2-layer multi-layer feedforward network and propagating one sample, with a scalar loss : \[ L = g( y_i, \begin{bmatrix} & & \\ & W_2 (n_2 \times n_1) & \\ & & \end{bmatrix} f( \begin{bmatrix} & & \\ & W_1 (n_1 \times n_x) & \\ & & \end{bmatrix} \begin{bmatrix} \\ x_i \\ \phantom{} \end{bmatrix} )) \in \mathbb{R} \]
\(g\) could be a squared loss for regression (with \(n_2=1\)), or CrossEntropyLoss (with logits and \(n_2=n_{class}\)) for multiclass classification.
We denote \(z_1 = W_1 x_i, z_2 = W_2 f(z_1)\) and \(\delta_i = \frac{\partial L}{\partial z_i} \in \mathbb{R}^{n_i}\). Then : \[ \begin{align} \delta_2 &= \frac{\partial L}{\partial z_2} = \frac{\partial g(x_1, x_2)}{\partial x_2}(y_i, z_2) \\ \delta_1 &= \frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial z_2} \frac{\partial z_2}{\partial z_1} = \begin{bmatrix} & & \delta_2 & & \end{bmatrix} \begin{bmatrix} & & \\ & W_2(n_2 \times n_1) & \\ & & \phantom{}\end{bmatrix} \text{diag}(f'(z_1)) \end{align} \] The errors of \(\delta_2\) are integrated back through the weight matrix that was used for the forward pass. (See also (Nielsen, 2015), chap 2).
Training in two phases
warning The reverse-mode differentiation uses the variables computed in the forward pass
\(\rightarrow\) we can apply efficiently stochastic gradient descent to optimize the parameters of our neural networks !
Note the computational graph can be extended to encompass the operations of the backward pass.
optimizer = optim.Adam(model.parameters())
for e in range(epochs):
    for X, y in train_dataloader:
        optimizer.zero_grad()
        ...
        loss.backward()
        optimizer.step()
The computational graph is a central notion in modern neural networks/deep learning. Broaden the scope with differential programming.
In recent years, “fancier” differentiable blocks other than \(f(W f(W..))\) have been introduced, and the graphs can be built dynamically (eager mode vs static graph).
Spatial Transformer Networks
(Jaderberg, Simonyan, & Zisserman, 2015)
Content/Location based addressing
Neural Turing Machine / Differentiable Neural Computer (Graves et al., 2016)
Indeed :
\[ y_1 = W_1 x, \qquad y_2 = W_2 f(y_1) \]
But empirically, most local minima are close (in performance) to the global minimum, especially with large/deep networks. See (Dauphin et al., 2014), (Pascanu, Dauphin, Ganguli, & Bengio, 2014), (Choromanska, Henaff, Mathieu, Arous, & LeCun, 2015). Saddle points seem to be more critical.
Algorithm
Rationale (Taylor expansion) : \(L(\theta_{t+1}) \approx L(\theta_{t}) + (\theta_{t+1} - \theta_{t})^T \nabla_{\theta} L(\theta_{t})\)
The choice of the batch size :
The optimization may converge slowly or even diverge if the learning rate \(\epsilon\) is not appropriate.
Bengio: “The optimal learning rate is usually close to the largest learning rate that does not cause divergence of the training criterion” (Bengio, 2012)
Karpathy “\(0.0003\) is the best learning rate for Adam, hands down.” (Twitter, 2016)
(Note: Adam will be discussed in few slides)
See also :
- Practical Recommendations for gradient-based training of deep architectures (Bengio, 2012)
- Efficient Backprop (LeCun et al., 1998)
Setup
Parameters : \(\epsilon=0.005\), \(\theta_0 = \begin{bmatrix} 10 \\ 5\end{bmatrix}\)
Converges to \(\theta_{\infty} = \begin{bmatrix} 1.9882 \\ 2.9975\end{bmatrix}\)
Algorithm : Let us damp the oscillations with a low pass on \(\nabla_{\theta}\)
Usually \(\mu \approx 0.9\) or \(0.99\).
See also distill.pub. Note the frameworks may implement subtle variations.
Parameters : \(\epsilon=0.005\), \(\mu=0.6\), \(\theta_0 = \begin{bmatrix} 10 \\ 5\end{bmatrix}\)
Converges to \(\theta_{\infty} = \begin{bmatrix} 1.9837 \\ 2.9933\end{bmatrix}\)
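Since the update equations are not reproduced above, here is a sketch of a common momentum formulation (close to what torch.optim.SGD with momentum implements, up to details); the function and its default values are illustrative:

import torch

def momentum_step(theta, grad, v, lr=0.005, mu=0.9):
    """One gradient step with momentum: v is the low-pass filtered gradient."""
    v = mu * v - lr * grad
    return theta + v, v

# In practice: optim.SGD(model.parameters(), lr=0.005, momentum=0.9)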
Idea Look ahead to potentially correct the update. Based on Nesterov Accelerated Gradient. Formulation of (Sutskever et al., 2013)
Algorithm
Parameters : \(\epsilon=0.005\), \(\mu=0.8\), \(\theta_0 = \begin{bmatrix} 10 \\ 5\end{bmatrix}\)
Converges to \(\theta_{\infty} = \begin{bmatrix} 1.9738 \\ 2.9914\end{bmatrix}\)
On our simple regression problem
You should always adapt your learning rate with a learning rate scheduler
Some more recent approaches are changing the picture of “decreasing learning rate” (“Robbins Monro conditions”)
See (Smith, 2018), The 1cycle policy - S. Gugger
Stochastic Gradient Descent with Warm Restart (Loshchilov & Hutter, 2017)
The improved performances may be linked to reaching flatter minima (i.e. minima whose predictions are less sensitive to parameter perturbations than sharp minima). The models reached before the warm restarts can be averaged (see Snapshot ensemble).
It seems also that initial large learning rates tend to lead to better models on the long run (Y. Li et al., 2019)
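Warm restarts are available as a learning rate scheduler in PyTorch; a minimal sketch (the model, optimizer and the period T_0 are arbitrary):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Cosine decay of the learning rate, reset every T_0 epochs (SGDR-like)
scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)

for epoch in range(30):
    # ... one training epoch of minibatch updates would go here ...
    scheduler.step()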
Adagrad Adaptive Gradient (Duchi, Hazan, & Singer, 2011)
The \(\sqrt{.}\) is experimentally critical ; \(\delta \approx [1e-8, 1e-4]\) for numerical stability.
Small gradients \(\rightarrow\) bigger learning rate for moving fast along flat directions
Big gradients \(\rightarrow\) smaller learning rate to calm down on high curvature.
But accumulation from the beginning is too aggressive. Learning rates decrease too fast.
RMSprop Hinton(unpublished, Coursera)
Idea: we should be using an exponential moving average when accumulating the gradient.
Adaptive Moments (ADAM) (Kingma & Ba, 2015)
\[ \theta(t+1) = \theta(t) - \frac{\epsilon}{\delta + \sqrt{\hat{v}(t+1)}} \hat{m}(t+1) \]
and some others : Adadelta (Zeiler, 2012), … , YellowFin (Zhang & Mitliagkas, 2018).
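All of these adaptive schemes are available in torch.optim; a minimal sketch of how they are instantiated (the hyperparameter values are illustrative):

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 1)
opt_adagrad = optim.Adagrad(model.parameters(), lr=0.01)
opt_rmsprop = optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)
opt_adam = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))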
(Goodfellow, Bengio, & Courville, 2016) There is currently no consensus […] no single best algorithm has emerged […] the most popular and actively in use include SGD, SGD with momentum, RMSprop, RMSprop with momentum, Adadelta and Adam.
See also Chap. 8 of (Goodfellow et al., 2016)
Rationale : \[ J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla_{\theta}J(\theta_0) + \frac{1}{2}(\theta - \theta_0)^T \nabla^2_{\theta} J(\theta_0) (\theta - \theta_0) \] with \(H = \nabla^2J\) the Hessian matrix, a \(n_\theta \times n_\theta\) matrix hosting the second derivatives of \(J\).
The second derivatives are much noisier than the first derivative (gradient); a larger batch size is usually required to prevent instabilities.
XOR is easy right ?
But it fails miserably (6/20 fails). Tmax=1000
XOR is easy right ?
Now it is better (0/20 fails). Tmax=1000
Historically, training deep FNN was known to be hard, i.e. bad generalization errors.
The starting point of a gradient descent has a dramatic impact :
Pretraining is no longer used (thanks to the ReLU family of activations, initialization schemes, ..)
Gradient descent converges faster if your data are normalized and decorrelated. Denote by \(x_i \in \mathbb{R}^d\) your input data, and by \(\hat{x}_i\) its normalized version.
Z-score normalization (goal: \(\hat{\mu}_j = 0, \hat{\sigma}_j = 1\)) \[ \forall i,j, \hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j + \epsilon} \]
ZCA whitening (goal: \(\hat{\mu}_j = 0, \hat{\sigma}_j = 1\), \(\frac{1}{n-1} \hat{X} \hat{X}^T = I\))
\[ \hat{X} = W X, \quad W = \left(\frac{1}{n-1} XX^T\right)^{-1/2} \]
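A minimal sketch of ZCA whitening through the eigendecomposition of the covariance (the data and shapes are assumptions; \(X\) is \(d \times n\), one column per sample):

import torch

n = 1000
X = torch.randn(5, n)
X = X - X.mean(dim=1, keepdim=True)             # center the data

cov = X @ X.T / (n - 1)                         # (d, d) covariance matrix
U, S, _ = torch.linalg.svd(cov)
W = U @ torch.diag(1.0 / torch.sqrt(S)) @ U.T   # cov^{-1/2}
X_hat = W @ X                                   # whitened: X_hat X_hat^T / (n-1) ≈ I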
Remember our linear regression : \(y = 3x+2+\mathcal{U}(-0.1, 0.1)\), L2 loss, 30 1D samples
A good initialization should break the symmetry : constant initialization schemes make units learning all the same thing
A good initialization should start optimization in a region of low capacity : linear neural network
The Fundamental Deep Learning Problem first observed by (Josef Hochreiter, 1991) for RNN, the gradient can either vanish or explode, especially in deep networks (RNN are very deep).
Remember that the backpropagated gradient involves : \[ \frac{\partial J}{\partial x_l} = \frac{\partial J}{\partial y_L} W_L f'(y_l) W_{L-1} f'(y_{l-1}) \cdots \] with \(y_l = W_l x_l + b, x_l = f(y_{l-1})\).
We see a pattern like \((W.f')^L\) which can diverge or vanish for large \(L\).
especially, with the sigmoid :\(f' < 1\).
With a ReLu, the positive part has \(f' = 1\).
In (LeCun et al., 1998), Y. LeCun provided some guidelines on the design:
Aim Initialize the weights/biases to keep \(f\) in its linear part through multiple layers:
set the biases to \(0\)
initialize randomly and independently from \(\mathcal{N}(\mu=0, \sigma^2=\frac{1}{fan_{in}})\).
If \(x \in \mathbb{R}^n\) is \(\mathcal{N}(0, \Sigma = I)\), \(w \in \mathbb{R}^n\) is \(\mathcal{N}(\mu=0, \Sigma=\frac{1}{n}I)\), then :
\[ \begin{align*} E[w^T x + b] &= E[w^T x] = \sum_i E[w_i x_i] = \sum_i E[w_i] E[x_i] = 0\\ var[w^T x + b] &= var[w^T x] \\ & = \sum_i \sigma^2_{w_i}\sigma^2_{x_i} + \sigma^2_{w_i}\mu^2_{x_i} + \mu^2_{w_i}\sigma^2_{x_i}\\ &= \sum_i \sigma^2_{w_i}\sigma^2_{x_i} = \frac{1}{n}\sum \sigma^2_{x_i} = 1 \end{align*} \]
\(x_i, w_i\) are all pairwise independent.
Idea we must preserve the same distribution along the forward and backward pass (Glorot & Bengio, 2010).
This prevents:
Glorot (Xavier) initialization scheme for a feedforward network \(f(W_{n}..f(W^1f(W_0 x+b_0)+b_1)...+b_n)\) with layer sizes \(n_i\):
Assuming the linear regime \(f'() = 1\) of the network : \[
\begin{align*}
\mbox{Forward propagation variance constraint :} \forall i, fan_{in_i} \sigma^2_{W_i} &= 1\\
\mbox{Backward propagation variance constraint :} \forall i, fan_{out_i} \sigma^2_{W_i} &= 1
\end{align*}
\] Compromise : \(\forall i, \frac{1}{\sigma^2_{W_i}} = \frac{fan_{in} + fan_{out}}{2}\)
- Glorot (Xavier) uniform : \(\mathcal{U}(-\frac{\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}}, \frac{\sqrt{6}}{\sqrt{fan_{in}+fan_{out}}})\), b=0
- Glorot (Xavier) normal : \(\mathcal{N}(0, \frac{\sqrt{2}}{\sqrt{fan_{in}+fan_{out}}})\), b=0
Idea we must preserve the same distribution along the forward and backward pass for rectifiers (He et al., 2015). For a feedforward network \(f(W_{n}..f(W^1f(W_0 x+b_0)+b_1)...+b_n)\) with layer sizes \(n_i\):
So, \(\sigma^2_{y_l} = \frac{1}{2}n_l \sigma^2_{w_l} \sigma^2_{y_{l-1}}\). To preserve the variance, we must guarantee \(\frac{1}{2} n_l \sigma^2_{w_l} = 1\).
We used : if \(X\) and \(Y\) are independent : \(\sigma^2_{X.Y} = \mu_{X^2}\mu_{Y^2} - \mu_X^2 \mu_Y^2\)
Idea we must preserve the same distribution along the forward and backward pass for rectifiers (He et al., 2015). For a feedforward network \(f(W_{n}..f(W^1f(W_0 x+b_0)+b_1)...+b_n)\) with layer sizes \(n_i\):
\[ \begin{align*} \mbox{Forward propagation variance constraint :} \forall i, \frac{1}{2}fan_{in_i} \sigma^2_{W_i} &= 1\\ \mbox{Backward propagation variance constraint :} \forall i, \frac{1}{2}fan_{out_i} \sigma^2_{W_i} &= 1 \end{align*} \]
He suggests to use either one or the other, e.g. \(\sigma^2_{W_i} = \frac{2}{fan_{in}}\)
- He uniform : \(\mathcal{U}(-\frac{\sqrt{6}}{\sqrt{fan_{in}}}, \frac{\sqrt{6}}{\sqrt{fan_{in}}})\), b=0
- He normal : \(\mathcal{N}(0, \frac{\sqrt{2}}{\sqrt{fan_{in}}})\), b=0
Note: for PreLu : \(\frac{1}{2} (1 + a^2) fan_{in} \sigma^2_{W_i} = 1\)
By default, the parameters are initialized randomly. e.g. in torch.nn.Linear :
class Linear(torch.nn.Module):
    def __init__(self):
        ...
        self.reset_parameters()

    def reset_parameters(self) -> None:
        # Setting a=sqrt(5) in kaiming_uniform is the same as initializing with
        # uniform(-1/sqrt(in_features), 1/sqrt(in_features)). For details, see
        # https://github.com/pytorch/pytorch/issues/57109
        torch.nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        fan_in, _ = torch.nn.init._calculate_fan_in_and_fan_out(self.weight)
        bound = 1 / math.sqrt(fan_in)
        torch.nn.init.uniform_(self.bias, -bound, bound)
Oh, but that’s not what we should use for ReLu ?! Indeed, you are right, see this issue. This is to avoid breaking compatibility with the way (Lua)Torch was initializing.
import torch.nn.init as init
class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        # linear_relu(n_in, n_out) is assumed to return the pair (nn.Linear(n_in, n_out), nn.ReLU())
        self.classifier = nn.Sequential(
            *linear_relu(input_size, 256),
            *linear_relu(256, 256),
            nn.Linear(256, num_classes)
        )
        self.init()

    def init(self):
        @torch.no_grad()
        def finit(m):
            if type(m) == nn.Linear:
                init.kaiming_uniform_(m.weight,
                                      a=0,
                                      mode='fan_in',
                                      nonlinearity='relu')
                m.bias.fill_(0.0)
        self.apply(finit)
(Ioffe & Szegedy, 2015) observed the change in distribution of network activations due to the change in network parameters during training.
Experiment 3 fully connected layers (100 units), sigmoid, softmax output, MNIST dataset
Idea standardize the activations of every layers to keep the same distributions during training (Ioffe & Szegedy, 2015)
The gradient must be aware of this normalization, otherwise we may get parameter explosion (see (Ioffe & Szegedy, 2015)) \(\rightarrow\) we need a differentiable normalization layer
introduces a differentiable Batch Normalization layer : \[ z = g(W x + b) \rightarrow z = g(BN(W x)) \]
BN operates element-wise : \[ \begin{align*} y_i &= BN_{\gamma,\beta} (x_i) = \gamma \hat{x}_i + \beta\\ \hat{x}_i &= \frac{x_i - \mu_{\mathcal{B}, i} }{\sqrt{\sigma^2_{\mathcal{B}, i} + \epsilon}} \end{align*} \] with \(\mu_{\mathcal{B},i}\) and \(\sigma_{\mathcal{B},i}\) statistics computed on the mini batch during training.
Learning faster, with better generalization.
During training
During inference (test) :
warning Do not forget to switch to test mode :
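In PyTorch, switching between the two behaviours is the role of model.train() and model.eval(); a minimal sketch with a single BatchNorm layer:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(8)
x = torch.randn(16, 8)

bn.train()       # training mode: minibatch statistics, running estimates updated
y_train = bn(x)

bn.eval()        # inference mode: the accumulated running mean/variance is used
y_test = bn(x)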
Some recent works challenge the idea of covariate shift (Santurkar, Tsipras, Ilyas, & Ma, 2018), (Bjorck, Gomes, Selman, & Weinberger, 2018). The loss seems smoother allowing larger learning rates, better generalization, robustness to hyperparameters.
Add a L2 penalty on the weights, \(\alpha > 0\)
\[ \begin{align*} J(\theta) &= L(\theta) + \frac{\alpha}{2} \|\theta\|^2_2 = L(\theta) + \frac{\alpha}{2}\theta^T \theta\\ \nabla_\theta J &= \nabla_\theta L + \alpha \theta\\ \theta &\leftarrow \theta - \epsilon \nabla_\theta J = (1 - \alpha \epsilon) \theta - \epsilon \nabla_\theta L \end{align*} \] Called L2 regularization, Tikhonov regularization, weight decay
Example RBF, 1 kernel per sample, \(N=30\), noisy inputs,
See chap 7 of (Goodfellow et al., 2016) for a geometrical interpretation
Intuition : for linear layers, the gradient of the function equals the weights. Small weights \(\rightarrow\) small gradient \(\rightarrow\) smooth function.
In theory, regularizing the bias will cause underfitting
Example
\[ \begin{align*} J(w, b) &= \frac{1}{N} \sum_{i=1}^N \| y_i - b - w^T x_i\|_2^2\\ \nabla_b J(w,b) &\implies b = (\frac{1}{N} \sum_i y_i) - w^T (\frac{1}{N} \sum_i x_i) \end{align*} \]
If your data are centered (as they should), the optimal bias is the mean of the targets.
Add a L1 penalty to the weights : \[ \begin{align*} J(\theta) &= L(\theta) + \alpha \|\theta\|_1 = L(\theta) + \alpha \sum_i |\theta_i|\\ \nabla_\theta J &= \nabla_\theta L + \alpha \mbox{sign}(\theta) \end{align*} \]
Example RBF, 1 kernel per sample, \(N=30\), noisy inputs,
See chap 7 of (Goodfellow et al., 2016) for a mathematical explanation in a specific case. Sparsity used for feature selection with LASSO (filter/wrapper/embedded).
Introduced in (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014):
Idea 1 : preventing co-adaptation. A pattern is robust by itself not because of others doing part of the job.
Idea 2 : average of all the sub-networks (ensemble learning)
How :
for every minibatch, zero the hidden and input activations with probability \(p\) (\(p=0.5\) for hidden, \(p=0.2\) for input). At test time, multiply every activation by the keep probability \(1-p\)
“Inverted” dropout : divide the kept activations by \(1-p\) at train time. At test time, just do a normal forward pass.
Can be interpreted as if training/averaging all the possible subnetworks.
L1/L2
Dropout
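A sketch of how these regularizers are typically specified in PyTorch (the L2 penalty via the optimizer's weight_decay argument, dropout as a layer, and an explicit L1 term added to the loss by hand; the values are arbitrary):

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Dropout(p=0.5),                 # drops hidden activations with probability 0.5
    nn.Linear(64, 1)
)
# L2 penalty (weight decay) handled by the optimizer
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=0.0025)

# An L1 penalty can be added to the loss manually
l1_penalty = sum(p.abs().sum() for p in model.parameters())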
Split your data in three sets :
Everything can be placed in a cross validation loop.
Early stopping is about keeping the model with the lowest validation loss.
The best regularizer you may find is data. The more you have, the better you learn.
you can use pretrained models on some tasks as an initialization for learning your task (but may fail due to domain shift) : check the Pytorch Hub
you can use unlabeled data for pretraining your networks (as done in 2006s) with auto-encoders / RBM : unsupervised/semi-supervised learning
you can apply random transformations to your data : dataset augmentation, see for example albumentations.ai
Introduced in (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2015) in the context of Convolutional Neural Networks.
Idea : Preventing the network to be over confident on its predictions on the training set.
Recipe : in a \(k\)-class problem, instead of using hard targets \(\in \{0, 1\}\), use soft targets \(\in \{\frac{\alpha}{k}, 1-\alpha\frac{k-1}{k}\}\) (weighted average between the hard targets and uniform target). \(\alpha \approx 0.1\).
See also (Müller, Kornblith, & Hinton, 2020) for several experiments.
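Recent PyTorch versions expose label smoothing directly in the cross entropy loss; a minimal sketch (the tensor shapes are arbitrary):

import torch
import torch.nn as nn

logits = torch.randn(4, 10)                 # (batch, k) class scores
targets = torch.randint(0, 10, (4,))
loss = nn.CrossEntropyLoss(label_smoothing=0.1)(logits, targets)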
From data that have a spatial structure (locally correlated), features can be extracted with convolutions.
On Images
That also makes sense for temporal series that have a structure in time.
What is a convolution : Example in 2D
Seen as a matrix multiplication
Given two 1D-vectors \(f, k\), say \(k = [c, b, a]\) \[ (f * k) = \begin{bmatrix} b & c & 0 & 0 & \cdots & 0 & 0 \\ a & b & c & 0 & \cdots & 0 & 0 \\ 0 & a & b & c & \cdots & 0 & 0 \\ 0 & 0 & a & b & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & b & c \\ 0 & 0 & 0 & 0 & \cdots & a & b \\ \end{bmatrix} . \begin{bmatrix} \\ f \\ \phantom{}\end{bmatrix} \]
Local features can be combined to learn higher level features.
Let us build a house detector
Ideas Using the structure of the inputs to limit the number of parameters without limiting the expressiveness of the network
For inputs with spatial (or temporal) correlations, features can be extracted with convolutions of local kernels
\(\rightarrow\) strongly regularized !
The architecture of LeNet-5 (LeCun et al., 1989), let’s call it the Vanilla CNN
Architecture
Two main parts :
- convolutional part : C1 -> C5 : convolution - non-linearity - subsampling
- fully connected part : linear - non-linearity
Specificities :
- Weighted sub-sampling
- Gaussian connections (RBF output layer)
- connectivity pattern \(S_2 - C_3\) to reduce the number of weights
Number of parameters :
Layer | Parameters |
---|---|
\(C_1\) | \(156\) |
\(S_2\) | \(12\) |
\(C_3\) | \(1.516\) |
\(S_4\) | \(32\) |
\(C_5\) | \(48.120\) |
\(F_6\) | \(10.164\) |
Convolution :
- size (e.g. \(3 \times 3\), \(5\times 5\))
- padding (e.g. \(1\), \(2\))
- stride (e.g. \(1\))
Pooling (max/average):
- size (e.g. \(2\times 2\))
- padding (e.g. \(0\))
- stride (e.g. \(2\))
We work with 4D tensors for 2D images, 3D tensors for nD temporal series (e.g. multiple simultaneous recordings), 2D tensors for 1D temporal series
In Pytorch, the tensors follow the Batch-Channel-Height-Width (BCHW, channel-first) convention. Other frameworks, like TensorFlow or CNTK, use BHWC (channel-last).
Pytorch code for implementing a CNN : Conv1D Conv2D, MaxPool1D MaxPool2D, AveragePooling, etc…
How can I get the feature dimensions of conv_model output ?
All of these should fit into a nn.Module subclass :
class MyModel(torch.nn.Module):
    def __init__(self, ....):
        super(MyModel, self).__init__()
        self.conv_model = nn.Sequential(...)
        output_size = ...
        self.fc_model = nn.Sequential(...)

    def forward(self, inputs):
        conv_features = self.conv_model(inputs)
        conv_features = conv_features.view(inputs.shape[0], -1)
        return self.fc_model(conv_features)
You can also use the recently introduced nn.Flatten layer.
Given two 1D-vectors \(x_1, k\), say \(k = [c, b, a]\) \[ y_1 = (x_1 * k) = \begin{bmatrix} b & c & 0 & 0 & \cdots & 0 & 0 \\ a & b & c & 0 & \cdots & 0 & 0 \\ 0 & a & b & c & \cdots & 0 & 0 \\ 0 & 0 & a & b & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & b & c \\ 0 & 0 & 0 & 0 & \cdots & a & b \\ \end{bmatrix} . \begin{bmatrix} \\ x_1 \\ \phantom{}\end{bmatrix} = W_k x_1 \]
If we compute the gradient of the loss, in denominator layout: \[ \frac{\partial L}{\partial x_1} = \frac{\partial y_1}{\partial x_1}\frac{\partial L}{\partial y_1} = W_k^T \frac{\partial L}{\partial y_1} \]
Hence, it is coined the term transposed convolution or backward convolution. This will pop up again when speaking about deconvolution.
Introduced in (Ciresan, Meier, & Schmidhuber, 2012), ensemble of CNNs trained with dataset augmentation
Introduced in (Krizhevsky et al., 2012), the “spark” giving birth to the revival of neural networks.
The first layer learned to extract meaningful features
ILSVRC’13 winner. Introduced in (Zeiler & Fergus, 2014)
Ablation studies on AlexNet : the FC layers are not that important
Introduced the idea of supervised pretraining (pretraining on ImageNet, finetune the softmax for Caltech-101, Caltech-256, Pascal 2012)
SGD minibatch(128), momentum(0.9), learning rate (0.01) manual schedule,
Deconvnet computes approximately the gradient of the loss w.r.t. the input (Simonyan, Vedaldi, & Zisserman, 2014). It differs in the way the ReLu is integrated.
ILSVRC’14 1st runner up. Introduced by (Simonyan & Zisserman, 2015).
Introduced in (Springenberg, Dosovitskiy, Brox, & Riedmiller, 2015).
ILSVR’14 winner. Introduced by (Szegedy et al., 2014).
Idea Multi-scale feature detection and dimensionality reduction
ILSVRC’15 winner. Introduced in (He et al., 2016a)
Highway Networks (Srivastava, Greff, & Schmidhuber, 2015)
\[ y = T(x).H(x) + C(x).x \]
DenseNets
Fitnet [Romero(2015)], Wideresnet(2017), Mobilenetv1, v2, v3 [Howard(2019)] : searching for the best architecture, EfficientNet (Tan & Le, 2020)
See also :
You should increase the number of filters throughout the network :
Examples :
EfficientNet (Tan & Le, 2020) studies the scaling strategies of conv. models.
For calculating the effective receptive field size, see this guide on conv arithmetic.
Your effective receptive field can grow faster with a-trou convolutions (or dilated convolutions) (Yu & Koltun, 2016):
Illustrations from this guide on conv arithmetic. The Conv2D object’s constructor accepts a dilation argument.
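A minimal sketch of a dilated convolution in PyTorch (arbitrary sizes): a \(3\times3\) kernel with dilation 2 covers a \(5\times5\) receptive field while keeping the resolution.

import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)
conv = nn.Conv2d(16, 32, kernel_size=3, dilation=2, padding=2)
print(conv(x).shape)          # torch.Size([1, 32, 32, 32]), resolution preserved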
Introduced in Inception v3 (Szegedy et al., 2015)
\(n\) input filters,\(\alpha n\) output filters :
\(\alpha=2 \Rightarrow -24\%\) (\(\sqrt{\alpha}\) is critical!)
\(n\) input filters,\(\alpha n\) output filters :
\(\alpha=2 \Rightarrow -30\%\)
See also the recent work on “Rethinking Model scaling for convolutional neural networks” (Tan & Le, 2020)
Inception and Xception, MobileNets. It separates :
See also the Feature Pyramid Networks for multi-scale features.
Trainable non-linear transformation of the channels. Network in network (Lin, Chen, & Yan, 2014)
You can check the norm of the gradient w.r.t. the first layers’ parameters to diagnose vanishing gradients
Recent architectures remove the max pooling layers and replace them by conv(stride=2) for downsampling
All the competitors in ImageNet do perform model averaging.
Model averaging
Weight averaging
If you worry about the increased computational complexity, see knowledge distillation (Hinton, Vinyals, & Dean, 2015) : training a light model with the soft targets (vs. the labels, i.e. the hard targets) of a computationally intensive one.
All the frameworks provide you with a model zoo of pre-trained networks. E.g. in PyTorch, for image classification. You can cut the head and finetune the softmax only.
import torchvision.models as models
resnet18 = models.resnet18(pretrained=True)
alexnet = models.alexnet(pretrained=True)
vgg16 = models.vgg16(pretrained=True)
inception = models.inception_v3(pretrained=True)
googlenet = models.googlenet(pretrained=True)
mobilenet = models.mobilenet_v2(pretrained=True)
resnext50_32x4d = models.resnext50_32x4d(pretrained=True)
[...]
warning Do not forget the input normalization !
Have a look in the torchvision doc, there are pretrained for classification, detection, segmentation … See also pytorch hub and timm for very up to date image models.
You can oversample around your training samples by applying transforms on the inputs that make predictable changes on the targets.
Libraries for augmentation : albumentations, imgaug
Operator | Resolution | RF size | #Channels |
---|---|---|---|
ConvBlock | \(32\times32\) | \(5\times5\) | 32 |
ConvBlock | \(32\times32\) | \(9\times9\) | 32 |
Sub | \(16\times16\) | \(15\times15\) | 32 |
ConvBlock | \(16\times16\) | \(15\times15\) | 128 |
ConvBlock | \(16\times16\) | \(23\times23\) | 128 |
Sub | \(8\times8\) | \(31\times31\) | 128 |
AvgPool | \(1\times1\) | | 128 |
Linear | | | \(100\) |
ConvBlock: 2x [Conv(1x3)-(BN)-Relu-Conv(3x1)-(BN)-Relu]
Sub : Conv(3x3, stride=2)-(BN)-Relu
Common settings :
Different configurations :
Number of parameters: \(\simeq 2M\)
Time per epoch (1080Ti) : 17s. , 42min training time
If applied, only the weights of the convolution and linear layers are regularized (not the bias, nor the coefficients of the Batch Norm)
No regularization (either L2, Dropout, Label smoothing, data augmentation), No BatchNorm
With batchnorm after every convolution (Note it is also regularizing the network)
With dataset augmentation (HFlip, Scale, Trans)
With regularization : L2 (0.0025), Dropout(0.5), Label smoothing(0.1)
Given :
Examples from ImageNet (see here)
Bounding boxes given, in the datasets (the predictor parametrization may differ), by : \([x, y, w, h]\), \([x_{min},y_{min},x_{max},y_{max}]\), …
Datasets : Coco, ImageNet, Open Images Dataset
Recent survey : Object detection in 20 years: a survey
Open image evaluation:
Suppose you have a single object to detect, can you localize it into the image ?
How can we proceed with multiple objects ? (Girshick, Donahue, Darrell, & Malik, 2014) proposed to :
Revolution in the object detection community (vs. “traditional” HOG like features).
Drawback :
Notes : pretrained on ImageNet, finetuned on the considered classes with warped images. Hard negative mining (boosting).
Introduced in (Girshick, 2015). Idea:
Drawbacks:
Github repository. CVPR’15 slides
Notes : pretrained VGG16 on ImageNet. Fast training with multiple ROIs per image to build the \(128\) mini batch from \(N=2\) images, using \(64\) proposals : \(25\%\) with IoU>0.5 and \(75\%\) with \(IoU \in [0.1, 0.5[\). Data augmentation : horizontal flip. Per layer learning rate, SGD with momentum, etc..
Multi task loss : \[ L(p, u, t, v) = -\log(p_u) + \lambda \mbox{smooth L1}(t, v) \]
The bbox is parameterized as in (Girshick et al., 2014). Single scale is more efficient than multi-scale.
Introduced in (Ren, He, Girshick, & Sun, 2016). The first end-to-end trainable network. Introducing the Region Proposal Network (RPN). A RPN is a sliding Conv(\(3\times3\)) - Conv(\(1\times1\), k + 4k) network (see here). It also introduces anchor boxes of predefined aspect ratios learned by vector quantization.
Check the paper for a lot of quantitative results. Small objects may not have a lot of features.
Bbox parametrization identical to (Girshick et al., 2014), with smooth L1 loss. Multi-task loss for the RPN. Momentum(0.9), weight decay(0.0005), learning rate (0.001) for 60k minibatches, 0.0001 for 20k.
Multi-step training. Gradient is non-trivial due to the coordinate snapping of the boxes (see ROI align for a more continuous version)
With VGG-16, the conv5 layer is \(H/16,W/16\). For an image \(1000 \times 600\), there are \(60 \times 40 = 2400\) anchor boxes centers.
Introduced in (Lin et al., 2017)
Upsampling is performed by using nearest neighbors.
For object detection, a RPN is run on every scale of the pyramid \(P_2, P_3, P_4, P_5\).
ROIPooling/Align is fed with the feature map at a scale depending on ROI size. Large ROI on small/coarse feature maps, Small ROI on large/fine feature maps
The first one-stage detector. Introduced in (Redmon, Divvala, Girshick, & Farhadi, 2016). It outputs:
Bounding box encoding:
In YoLo v3, the network is Feature Pyramid Network (FPN) like with a downsampling and an upsampling paths, with predictions at 3 stages.
The loss is multi-task with :
\[\begin{align*} \mathcal{L} &= \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} [(t_x-t_x^*)^2+(t_y-t_y^*)^2+(t_w-t_w^*)^2+(t_h-t_h^*)^2] \\ & -\sum_{i=0}^{S^2} \sum_{j=0}^{B} BCE(\mathbb{1}_{ij}^{obj}, \mbox{has_obj}_{ij}) \\ & -\sum_{i=0}^{S^2} \sum_{j=0}^{B} \sum_{k=0}^{K} BCE(\mbox{has_class}_{ijk}, p_{ijk}) \end{align*}\]
In v1 and v2, the prediction losses were L2 losses.
Multi labelling can occur in coco (e.g. women, person)
The object detectors may output multiple overlapping bounding boxes for the same object
NMS algorithm :
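NMS greedily keeps the highest-scoring box and suppresses the remaining boxes that overlap it beyond an IoU threshold; a minimal sketch using torchvision's implementation (the boxes and scores are made up):

import torch
from torchvision.ops import nms

boxes = torch.tensor([[0., 0., 10., 10.],
                      [1., 1., 11., 11.],
                      [50., 50., 60., 60.]])    # (x1, y1, x2, y2)
scores = torch.tensor([0.9, 0.8, 0.7])

keep = nms(boxes, scores, iou_threshold=0.5)    # indices of the kept boxes
print(keep)                                     # tensor([0, 2]) here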
NMS may suppress one of two “overlapped” objects. It hard resets the scores of overlapping bboxes.
SoftNMS (Bodla, Singh, Chellappa, & Davis, 2017):
Given an image,
Semantic segmentation : predict the class of every single pixel. We also call dense prediction/dense labelling.
Example image from MS Coco
Instance segmentation : classify all the pixels belonging to the same countable objects
Example image from MS Coco
More recently, panoptic segmentation refers to instance segmentation for countable objects (e.g. people, animals, tools) and semantic segmentation for amorphous regions (grass, sky, road).
Metrics : see Coco panotpic evaluation
Some example networks : PSP-Net, U-Net, Dilated Net, ParseNet, DeepLab, Mask RCNN, …
Graph convolutions :
Processing 3D point clouds :
Time delay neural networks as introduced in (Waibel, Hanazawa, Hinton, Shikano, & Lang, 1989) spatializes the time:
But : which size of the time window ? Must the history size always be the same ? Do we need the data over the whole time span ? How to share computations in time instead of using distinct weights per time instant?
Feedforward neural networks can still be efficient for processing sequential data, e.g. Gated ConvNet (Dauphin, Fan, Auli, & Grangier, 2017), Transformers, …
Introduced by (Elman, 1990).
Weight matrices :
\[ \begin{align*} h(t) &= f(W^{in} x(t) + W^{h} h(t-1) + W^{back} y(t-1) + b_h)\\ y(t) &= g(W^{out} h(t) + b_y) \end{align*} \]
The hidden-to-hidden weight matrix \(W^{h}\) is repeatedly applied.
Named Elman networks if \(W^{back} = 0\), and Jordan networks if \(W^h = 0\). Elman networks with a random fixed \(W^h\) are called Echo State networks.
The inputs and outputs can be of variable (\(1 \rightarrow T_x\), \(1 \rightarrow T_y\)) and arbitrary sizes (\(T_x \neq T_y\)).
Many to one example : language model, sentiment analysis : multiclass sequence classification \(T_y=1\):
Many to many example : Neural Machine Translation
One to many example: image captioning, language model with probabilistic sampling
start : ‘LA JUMENT ET’
LA JUMENT ET LE RAT
ET L’huiller craignait les gens d’une mise un vers atteint:
Va c’est d’être indigne de Vénus d’aller pressez l’ame
D’une mais, dit-il, un plongeant l’avertion :
Son échangé vous refusiez-vous
start : ‘LA JUMENT ET’
LA JUMENT ET LE BULÉE
[Ésope]
Comme à part craindre déjà cet honneur à couvrir jamais
Et ses mélonces, condition tempérament.
L’autre honne alla vie.
Je ne saurais pas que d’un moutons.
Que ce choix, coquet, g
Idea: unfold in time the computational graph and perform reverse mode differentiation (Werbos, 1990).
You usually train on truncated sequences to keep the computational burden manageable.
You can also perform forward mode differentiation (Real Time Recurrent Learning, RTRL (Williams & Peng, 1990)) with online adaptation as the inputs/targets come in, but this is computationally expensive.
Unrolled in time, RNN appears as very deep networks \(\rightarrow\) vanishing/exploding gradient
Initialization strategies :
Architecture :
Training :
Regularization:
RNNs have difficulties learning long range dependencies. The LSTM (Hochreiter & Schmidhuber, 1997) introduces memory cells to address that problem.
Peepholes may connect the \(c_t\) to their gates.
Equations:
\[ \begin{eqnarray*} I_t &=& \sigma(W^x_i x_t + W^h_i h_{t-1} + b_i) &\in [0,1], \mbox{Input gate}\\ F_t &=& \sigma(W^x_f x_t + W^h_f h_{t-1} + b_f)&\in [0,1], \mbox{Forget gate}\\ O_t &=& \sigma(W^x_o x_t + W^h_o h_{t-1} + b_o) &\in [0,1], \mbox{Output gate}\\ n_t &=& \tanh(W^x_n x_t + W^h_n h_{t-1} + b_n)& \mbox{unit's input}\\ c_t &=& F_t \odot c_{t-1} + I_t \odot n_t& \mbox{cell update}\\ h_t &=& O_t \odot \tanh(c_t) & \mbox{unit's output} \end{eqnarray*} \] The next layers integrate what is exposed by the cells, i.e. the unit’s output \(h_t\), not \(c_t\).
If \(F_t=1, I_t=0\), the cell state \(c_t\) is unmodified. This is called the constant error carrousel.
The forget gate is introduced in (Gers et al., 2000). Variants have been investigated in a search space odyssey (Greff, Srivastava, Koutnı́k, Steunebrink, & Schmidhuber, 2017).
See also (Le et al., 2015), which reconsiders using ReLU in LSTM, given an appropriate initialization of the recurrent weights to the identity so that copying is the default mode.
The GRU is introduced as an alternative, simpler model than LSTM. Introduced in (Cho et al., 2014).
Equations:
\[ \begin{eqnarray*} R_t &=& \sigma(W^x_r x_t + W^h_r h_{t-1} + b_r) \mbox{ Reset gate}\\ Z_t &=& \sigma(W^x_z x_t + W^h_z h_{t-1} + b_z) \mbox{ Update gate}\\ n_t &=& \tanh(W^x_{n} x_t + b_{nx} + R_t \odot (W^h_nh_{t-1}+b_{nh}))\\ h_t &=& Z_t \odot h_{t-1} + (1-Z_t) \odot n_t \end{eqnarray*} \]
If \(Z_{t} = 1\), the cell state \(h_t\) is not modified. If \(Z_t = 0\) and \(R_t=1\), it is updated in one step.
Compared to LSTM, a GRU cell :
Idea Both past and future contexts can sometimes be required for classification at the current time step; e.g. when you speak, past and future phonemes influence the way you pronounce the current one. Introduced in (Schuster & Paliwal, 1997)
While RNN are fundamentally deep neural networks, they can still benefit from being stacked : this allows the layers to operate at increasing time scales. The lower layers can change their content at a higher rate than the higher layers.
(Graves et al., 2013): Phoneme classification with stacked bidirectionnal LSTMs
(Sutskever, Vinyals, & Le, 2014) : Machine translation with stacked unidirectionnal LSTMs (Seq2Seq)
In a stacked RNN, you can concatenate consecutive hidden states before feeding in the next RNN layer, e.g. Listen, Attend and Spell encoder (\(\rightarrow\) downscale time)
Other variants for introducing depth in RNN is explored in (Pascanu et al., 2014). For example, the transition function from \(h_{t-1}\) to \(h_t\) is not deep, even in stacked RNNs but is deep in DT-RNN.
Stacked bidirectional LSTM, documentation
with discrete inputs (words, characters, …) of the same time length, one prediction per time step.
import torch
import torch.nn as nn
seq_len = 51
batch_size = 32
vocab_size = 10
embedding_dim = 128
hidden_size = 256
rnn_model = nn.Sequential(
nn.Embedding(num_embeddings=vocab_size,
embedding_dim=embedding_dim),
nn.LSTM(input_size=embedding_dim,
hidden_size=hidden_size,
num_layers=3,
bidirectional=True)
)
out_model = nn.Sequential(
nn.Linear(2*hidden_size, 10)
)
# Forward propagation
# An embedding layer takes as input a LongTensor
rand_input = torch.randint(low=0, high=vocab_size,
size=(seq_len, batch_size))
# out_rnn is (T, B, 2*hidden_size) since the LSTM is bidirectional
# state_n is the tuple (h_n, c_n) of final hidden and cell states
out_rnn, state_n = rnn_model(rand_input)
# out is (T, B, num_out)
out = out_model(out_rnn)
You can provide an initial state to the call function of the LSTM, in which case you must take the LSTM out of the nn.Sequential; by default \(\overrightarrow{h}_0 = \overleftarrow{h}_0 = \overrightarrow{c}_0 = \overleftarrow{c}_0 = 0\). You could learn these initial hidden states (to bias the first operations of your RNN).
All the weights and biases are initialized with a LeCun-like initialization \(\mathcal{U}(-\sqrt{k}, \sqrt{k})\), \(k=\frac{1}{\mbox{hidden\_size}}\)
Some authors (Gers et al., 2000) suggest to favor either long-term or short-term dependencies by setting the bias of the forget gate accordingly (“Learning to forget” : a positive forget gate bias so that \(F_t \approx 1\) at the beginning of training, i.e. remember everything by default).
See the lab work for specificities on representing variable sized sequences with pytorch PackedSequences.
How do you know how to access these weights ? See the doc
Note several redundant biases \(b_{i.}\) and \(b_{h.}\) (for CuDNN compatibility).
Task: initialize in the “Learning to forget” regime of (Gers et al., 2000)
num_layers = 3
rnn = nn.LSTM(input_size=embedding_dim,
              hidden_size=hidden_size,
              num_layers=num_layers,
              bidirectional=True)
# Initialize to a high forget gate bias
with torch.no_grad():
    for i in range(num_layers):
        forw_bias = getattr(rnn, f'bias_ih_l{i}').chunk(4, dim=0)[1]
        forw_bias.fill_(1)
        rev_bias = getattr(rnn, f'bias_ih_l{i}_reverse').chunk(4, dim=0)[1]
        rev_bias.fill_(1)
The ordering of the weights/biases is : input gate / forget gate / cell / output gate.
Problem Given fixed-length chunks of sentences, predict the next word/character : \(p(x_T | x_0, x_1, ..., x_{T-1})\)
Many to many during training (teacher forcing (Williams & Peng, 1990)) but many to one for inference.
A language model can be used, e.g., to constrain the decoding of a network outputting sentences (e.g. in speech-to-text or captioning tasks)
See also The unreasonable effectiveness of recurrent neural networks and (Sutskever, Martens, & Hinton, 2011).
Example on “Les fabulistes”
Dataset size : \(10\,048\) non-overlapping chunks of length \(60\).
Example samples :
Input : [2,71,67,54,70,57,10,2,56,95,71…65,57,66,72,2,70,57,55,60,57]
" sobre, dé…ment reche"
Output [71,67,54,70,57,10,2,56,95,71,61…57,66,72,2,70,57,55,60,57,70]
“sobre, dés…ent recher”
Note we use a unidirectional LSTM : with a bidirectional LSTM, the problem would be trivially solved by the backward LSTM alone, which already sees the characters to predict.
Loss : cross-entropy averaged over batch_size \(\times\) seq_len
Training: Adam(0.01), learning rate halved every 10 steps, gradient clipping (5) (not sure it helped though)
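For illustration, one training step could look like the following sketch, where model, inputs (T, B) and targets (T, B) are hypothetical placeholders :
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()    # averages over the batch_size x seq_len predictions
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

logits = model(inputs)                                     # (T, B, vocab_size)
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)    # gradient clipping
optimizer.step()
scheduler.step()    # halves the learning rate every 10 steps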
After \(30\) epochs, validation loss of \(1.45\) and validation accuracy of \(56\%\).
To sample from the language model, you can provide it with some context sentence, e.g.
[‘L’, ‘A’, ’ ‘, ’G’, ‘R’,‘E’,‘N’,‘O’,‘U’, ‘I’, ‘L’, ‘L’, ‘E’, ’ ’]
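A sampling loop could be sketched as follows, where model maps a (T, 1) LongTensor of character indices to next-character logits, and char2idx / idx2char are hypothetical vocabulary mappings :
import torch

context = [char2idx[c] for c in "LA GRENOUILLE "]
generated = list(context)
for _ in range(200):
    x = torch.tensor(generated).unsqueeze(1)         # (T, 1) : a batch of one sequence
    logits = model(x)[-1, 0]                         # logits of the next character
    probs = torch.softmax(logits, dim=0)
    next_idx = torch.multinomial(probs, 1).item()    # sample rather than argmax
    generated.append(next_idx)
print(''.join(idx2char[i] for i in generated))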
Sample of \(200\) chars after init
LA GRENOUILLE y
-Ô)asZYc5[h+IÉë?8—>Y.bèqp;ÎzÇÇ<)!|f]Lt+«-u
XûoÜ:!ïgVùb|Ceü9ùÈ«à
6)ZàÀçBJi)X:ZÛdzxQ8PcvïV]O]xPX,Înc.è’Pâs:X;ûfjBâ?X
ç’ED’fSOl*Z(È’È1SnjàvPïLoUÊêàDgùO9z8eJûRYJ?Yg
Uâp|jCbû—HxBràZBMZÛPCGuR’]ÀiÊÂSBF4D),û
Sample 1 of \(200\) chars after 30 epochs
LA GRENOUILLE ET MOURE ET LA RENARDIER
Quel Grâce tout mon ambassade est pris.
L’un pourtant rare,
D’une première
Qu’à partout tout en mon nommée et quelques fleuris ;
Vous n’oserions les Fermerois, les heurs la
Note the upper case after the line breaks, the uppercase title, the almost-real words. The text does not make much sense but it is generated character by character !
Sample 2 of \(200\) chars (from the same model as before)
LA GRENOUILLE D’INDÉTES
[Phèdre]
Tout faire force belle, commune,
Et des arts qui, derris vôtre gouverne a rond d’une partage conclut sous besort qu’il plaît du lui dit Portune comme un Heurant enlever bien homme,
More on language modeling (metrics, models, …) in the Deep NLP lecture of Joel Legrand.
Idea Use a pre-trained CNN for image embedding plugged into a RNN for generating (decoding) the sequence of words. Introduced in (Vinyals et al., 2015).
Learn a model maximizing :
\[ p(S_0S_1S_2..S_T | I, \theta) = p(S_0|I, \theta)\prod_{j=1}^{T} p(S_j|S_0S_1...S_{j-1}, I, \theta) \]
i.e. minimizing \(-\log(p(S_0S_1S_2..S_T | I, \theta)) = -\sum_j \log(p(S_j|S_0...S_{j-1}, I, \theta))\)
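A schematic sketch (heavily simplified, not the exact Show and Tell architecture) : the image feature produced by a pretrained CNN initializes the LSTM state, and the caption is teacher forced during training :
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim, vocab_size, embedding_dim, hidden_size):
        super().__init__()
        self.init_h = nn.Linear(feat_dim, hidden_size)
        self.init_c = nn.Linear(feat_dim, hidden_size)
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_size)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, img_feat, captions):
        # img_feat : (B, feat_dim) CNN embedding, captions : (T, B) teacher-forced words
        h0 = self.init_h(img_feat).unsqueeze(0)      # (1, B, hidden_size)
        c0 = self.init_c(img_feat).unsqueeze(0)
        out, _ = self.lstm(self.embedding(captions), (h0, c0))
        return self.out(out)                         # (T, B, vocab_size) word logits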
Inspired by the Seq2Seq approach successful in machine translation (more on this later), they proposed an encoder-decoder model to translate an image to a sentence
Training ingredients :
Introducing the visual convolutional features at every step did not help.
Inference :
Idea Allow the RNN to filter out/focus on CNN features during generation using an attention mechanism (Bahdanau, Cho, & Bengio, 2015). Introduced in (Xu et al., 2016). Link to theano source code
Training:
Double stochastic attention :
Inference:
Problem In tasks such as Machine Translation (MT) or Automatic Speech Recognition (ASR), input sequences get mapped to output sequences, both of which can be of arbitrary length.
Machine translation :
The proposal will not now be implemented
Les propositions ne seront pas mises en application maintenant
Automatic speech recognition
The alignment can be difficult to make explicit. Contrary to the language model, we may not know easily when to output what.
Idea Encode/Compress the input sequence to a hidden state and decode/decompress the output sequence from there. Introduced in (Cho et al., 2014) for ranking translations and (Sutskever et al., 2014) for generating translations (NMT).
Architecture :
The input sentence is fed in reverse order.
Beam search decoding. Teacher forcing for training but see also Scheduled sampling or Professor Forcing.
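A rough encoder-decoder sketch (simplified : single layer, no beam search) : the final state of the encoder initializes the decoder, which is teacher forced during training :
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, embedding_dim, hidden_size):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, embedding_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, embedding_dim)
        self.encoder = nn.LSTM(embedding_dim, hidden_size)
        self.decoder = nn.LSTM(embedding_dim, hidden_size)
        self.out = nn.Linear(hidden_size, tgt_vocab)

    def forward(self, src, tgt_in):
        # src : (Tx, B) source indices (possibly reversed), tgt_in : (Ty, B) target inputs
        _, state = self.encoder(self.src_emb(src))           # compress the input sequence
        dec, _ = self.decoder(self.tgt_emb(tgt_in), state)   # decompress, teacher forcing
        return self.out(dec)                                 # (Ty, B, tgt_vocab) logits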
To get the most likely translation, you need to estimate
\[ p(y | x) = p(y_0|x, \theta) \prod_t p(y_t | y_0...y_{t-1}, x, \theta) \]
But the probability distribution over the labels depends on the previously generated label (which is fed as input at the next step) \(\rightarrow\) approximate search by maintaining a set of \(B\) candidates.
See also the modified beam search scoring of GNMT (Wu et al., 2016).
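A minimal beam search sketch, assuming a hypothetical step(token, state) function that returns the log-probabilities of the next token (as a tensor) and the updated decoder state :
def beam_search(step, init_state, sos, eos, beam_size=5, max_len=50):
    # Each hypothesis is (score, tokens, state), the score being a sum of log-probabilities
    beams = [(0.0, [sos], init_state)]
    for _ in range(max_len):
        candidates = []
        for score, tokens, state in beams:
            if tokens[-1] == eos:                  # finished hypotheses are kept as-is
                candidates.append((score, tokens, state))
                continue
            log_probs, new_state = step(tokens[-1], state)
            topv, topi = log_probs.topk(beam_size)
            for lp, tok in zip(topv.tolist(), topi.tolist()):
                candidates.append((score + lp, tokens + [tok], new_state))
        # Keep only the B best candidates
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return max(beams, key=lambda c: c[0])[1]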
Idea For problems where the output sequence length \(T_y\) is smaller than the input sequence length \(T_x\), allow a blank character. Introduced in (Graves, Fernández, Gomez, & Schmidhuber, 2006)
The collapsing many-to-one mapping \(\mathcal{B}\) removes the duplicates and then the blanks.
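For illustration, the collapsing mapping \(\mathcal{B}\) could be sketched as follows (with the blank written '-') :
def collapse(path, blank='-'):
    out, prev = [], None
    for c in path:
        if c != prev and c != blank:   # drop repeats, then drop blanks
            out.append(c)
        prev = c
    return ''.join(out)

collapse("hh-e-ll-lo-")   # -> "hello"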
The CTC network learns from all the possible alignments of \(X\) with \(Y\) by adding the extra blank character. This allows learning from unsegmented sequences !
See also alternatives of the blank character in (Collobert, Puhrsch, & Synnaeve, 2016).
\[ \begin{align*} p(Y | X) &= \sum_{\pi \in \mathcal{B}^{-1}(Y)} p(\pi | X) \\ &= \sum_{\pi \in \mathcal{B}^{-1}(Y)} \prod_t p(\pi_t|X) \end{align*} \]
There is no need to explicitly enumerate the possibly large number of paths \(\pi\) : the sum can be computed recursively (dynamic programming).
Graphical representation from distill.pub
Recursively compute \(\alpha_{s,t}\), the probability assigned by the model at time \(t\) to the subsequence \(y_{1:s}\) (extended with the blank)
You end up with a computational graph through which the gradient can propagate.
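In practice, this recursion is provided by pytorch's nn.CTCLoss; a minimal usage sketch on random tensors (the blank is class 0, log_probs come from a log_softmax) :
import torch
import torch.nn as nn

T, B, C = 50, 16, 28     # input length, batch size, number of classes (incl. blank = 0)
S = 20                   # maximum target length
log_probs = torch.randn(T, B, C, requires_grad=True).log_softmax(dim=2)
targets = torch.randint(1, C, (B, S), dtype=torch.long)        # labels, without blanks
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.randint(10, S + 1, (B,), dtype=torch.long)

ctc_loss = nn.CTCLoss(blank=0)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()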
Problem During inference, given an input \(x\), what is the most probable collapsed labeling ? This is intractable.
Solution 1: best path decoding by selecting, at each time step, the output with the highest probability assigned by your model
\[ \hat{y}(x) = \mathcal{B}(\mbox{argmax}_\pi p(\pi|x, \theta)) = \mathcal{B}(\mbox{argmax}_{\pi}\prod_t p(\pi_t | x, \theta)) \]
But the same labeling can have many alignments and the probability can be spiky on one bad alignment.
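Best path decoding (solution 1) can be sketched as an argmax at each time step followed by the collapsing \(\mathcal{B}\) :
def best_path_decode(log_probs, blank=0):
    # log_probs : (T, C) log-probabilities for a single input sequence
    best = log_probs.argmax(dim=1).tolist()
    decoded, prev = [], None
    for p in best:
        if p != prev and p != blank:   # collapse repeats, then remove blanks
            decoded.append(p)
        prev = p
    return decoded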
Solution 2: beam search decoding taking care of the blank character (multiple paths may collapse to the same final labeling)
Possibility to introduce a language model to bias the decoding in favor of plausible words. See (Hannun et al., 2014) :
\[ \mbox{argmax}_y(p(y|x) p_{LM}(y)^\alpha \mbox{wordcount}^\beta(y)) \]
Problem Given a waveform, produce the transcript.
Example datasets : Librispeech (English, 1000 hours, Aligned), TED (English, 450 hours, Aligned), Mozilla common voice (Multi language, 2000 hours in English, 600 hours in French, unaligned)
Note: you can contribute to the open, shared Common Voice dataset in one of its 60 languages by either recording or validating (Ardila et al., 2020)!
Example model : end-to-end trainable Baidu DeepSpeech (v1,v2) (Hannun et al., 2014),(Amodei et al., 2015). See also the implementation of Mozilla DeepSpeech v2.
Note some authors introduced end-to-end trainable networks from the raw waveforms (Zeghidour, Usunier, Synnaeve, Collobert, & Dupoux, 2018).
Introduced in (Amodei et al., 2015) on English and Mandarin.
The English architecture involves :
35 M. parameters
The training :
For the full list of references, rather check the online document references.pdf
Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., … Zhu, Z. (2015). Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. arXiv:1512.02595 [Cs]. Retrieved from http://arxiv.org/abs/1512.02595
Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., … Weber, G. (2020). Common Voice: A Massively-Multilingual Speech Corpus. arXiv:1912.06670 [Cs]. Retrieved from http://arxiv.org/abs/1912.06670
Arjovsky, M., Shah, A., & Bengio, Y. (2016). Unitary Evolution Recurrent Neural Networks. arXiv:1511.06464 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1511.06464
Arthur, D., & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In In Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms.
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1607.06450
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In.
Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures. arXiv:1206.5533 [Cs]. Retrieved from http://arxiv.org/abs/1206.5533
Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2006). Greedy Layer-Wise Training of Deep Networks. In (p. 8).
Bjorck, N., Gomes, C. P., Selman, B., & Weinberger, K. Q. (2018). Understanding Batch Normalization. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018) (p. 12).
Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-NMS – Improving Object Detection With One Line of Code. arXiv:1704.04503 [Cs]. Retrieved from http://arxiv.org/abs/1704.04503
Broomhead, D., & Lowe, D. (1988). Multivariable Functional Interpolation and Adaptive Networks. Complex Systems, 2, 321–355.
Cho, K., Merrienboer, B. van, Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1406.1078
Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., & LeCun, Y. (2015). The Loss Surfaces of Multilayer Networks. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics (p. 13).
Ciresan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (pp. 3642–3649). Providence, RI: IEEE. https://doi.org/10.1109/CVPR.2012.6248110
Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2016). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). arXiv:1511.07289 [Cs]. Retrieved from http://arxiv.org/abs/1511.07289
Collobert, R., Puhrsch, C., & Synnaeve, G. (2016). Wav2Letter: An End-to-End ConvNet-based Speech Recognition System. arXiv:1609.03193 [Cs]. Retrieved from http://arxiv.org/abs/1609.03193
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314. https://doi.org/10.1007/BF02551274
Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2017). Language modeling with gated convolutional networks. Retrieved from http://arxiv.org/abs/1612.08083
Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, 27, 2933–2941. Retrieved from https://papers.nips.cc/paper/2014/hash/17e23e50bedc63b4095e3d8204ce063b-Abstract.html
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [Cs]. Retrieved from http://arxiv.org/abs/1810.04805
Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159.
Elman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14(2), 179–211. https://doi.org/10.1207/s15516709cog1402_1
Fritzke, B. (1994). A growing neural gas network learns topologies. In Proceedings of the 7th International Conference on Neural Information Processing Systems (pp. 625–632). Cambridge, MA, USA: MIT Press.
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202. https://doi.org/10.1007/BF00344251
Gers, F. A., Schmidhuber, J. A., & Cummins, F. A. (2000). Learning to forget: Continual prediction with lstm. Neural Comput., 12(10), 2451–2471. https://doi.org/10.1162/089976600300015015
Girshick, R. (2015). Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV) (pp. 1440–1448). https://doi.org/10.1109/ICCV.2015.169
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524 [Cs]. Retrieved from http://arxiv.org/abs/1311.2524
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010 (p. 8).
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.
Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on machine learning (pp. 369–376). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1143844.1143891
Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech Recognition with Deep Recurrent Neural Networks. arXiv:1303.5778 [Cs]. Retrieved from http://arxiv.org/abs/1303.5778
Graves, A., Wayne, G., Reynolds, M., Harley, T., Danihelka, I., Grabska-Barwińska, A., … Hassabis, D. (2016). Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626), 471–476. https://doi.org/10.1038/nature20101
Greff, K., Srivastava, R. K., Koutnı́k, J., Steunebrink, B. R., & Schmidhuber, J. (2017). LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222–2232. https://doi.org/10.1109/TNNLS.2016.2582924
Griewank, A. (2012). Who Invented the Reverse Mode of Differentiation? Documenta Mathematica, 12.
Griewank, A., & Walther, A. (2008). Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation (2nd edition). Philadelphia, PA: Society for Industrial and Applied Mathematics.
Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., … Ng, A. Y. (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv:1412.5567 [Cs]. Retrieved from http://arxiv.org/abs/1412.5567
Hastad, J. (1986). Almost optimal lower bounds for small depth circuits. In Proceedings of the eighteenth annual ACM symposium on Theory of computing (pp. 6–20). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/12130.12132
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852 [Cs]. Retrieved from http://arxiv.org/abs/1502.01852
He, K., Zhang, X., Ren, S., & Sun, J. (2016a). Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). Las Vegas, NV, USA: IEEE. https://doi.org/10.1109/CVPR.2016.90
He, K., Zhang, X., Ren, S., & Sun, J. (2016b). Identity Mappings in Deep Residual Networks. arXiv:1603.05027 [Cs]. Retrieved from http://arxiv.org/abs/1603.05027
Henaff, M., Szlam, A., & LeCun, Y. (2016). Recurrent Orthogonal Networks and Long-Memory Tasks, 9.
Hinton, G. E. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504–507. https://doi.org/10.1126/science.1127647
Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1503.02531
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4), 500–544. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1392413/
Hoffer, E., Hubara, I., & Soudry, D. (2017). Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. arXiv:1705.08741 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1705.08741
Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251–257. https://doi.org/10.1016/0893-6080(91)90009-T
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., … Adam, H. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861 [Cs]. Retrieved from http://arxiv.org/abs/1704.04861
Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., & Weinberger, K. Q. (2017). Snapshot ensembles: Train 1, get m for free. In (p. 14).
Huang, G., Liu, Z., Maaten, L. van der, & Weinberger, K. Q. (2018). Densely Connected Convolutional Networks. arXiv:1608.06993 [Cs]. Retrieved from http://arxiv.org/abs/1608.06993
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (pp. 448–456). PMLR. Retrieved from http://proceedings.mlr.press/v37/ioffe15.html
Jaderberg, M., Simonyan, K., & Zisserman, A. (2015). Spatial Transformer Networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems (p. 9).
Josef Hochreiter. (1991). Untersuchungen zu dynamischen neuronalen Netzen (PhD thesis). Retrieved from http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf
Kam-Chuen Jim, Giles, C. L., & Horne, B. G. (1996). An analysis of noise in recurrent neural networks: Convergence and generalization. IEEE Transactions on Neural Networks, 7(6), 1424–1438. https://doi.org/10.1109/72.548170
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv:1609.04836 [Cs, Math]. Retrieved from http://arxiv.org/abs/1609.04836
Kingma, D. P., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. In arXiv:1412.6980 [cs]. Retrieved from http://arxiv.org/abs/1412.6980
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386
Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N. R., … Pal, C. (2017). Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations, 11.
Le, Q. V., Jaitly, N., & Hinton, G. E. (2015). A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. arXiv:1504.00941 [Cs]. Retrieved from http://arxiv.org/abs/1504.00941
LeCun, Y. A., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). Efficient BackProp. In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade: Second Edition (pp. 9–48). Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-35289-8_3
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541–551. https://doi.org/10.1162/neco.1989.1.4.541
Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J. M., … Gadde, R. T. (2019). Jasper: An End-to-End Convolutional Neural Acoustic Model. arXiv:1904.03288 [Cs, Eess]. Retrieved from http://arxiv.org/abs/1904.03288
Li, Y., Wei, C., & Ma, T. (2019). Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. In NIPS 2019 (p. 12).
Lin, M., Chen, Q., & Yan, S. (2014). Network In Network. arXiv:1312.4400 [Cs]. Retrieved from http://arxiv.org/abs/1312.4400
Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature Pyramid Networks for Object Detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 936–944). Honolulu, HI: IEEE. https://doi.org/10.1109/CVPR.2017.106
Loshchilov, I., & Hutter, F. (2017). SGDR: Stochastic Gradient Descent with Warm Restarts. In arXiv:1608.03983 [cs, math]. Retrieved from http://arxiv.org/abs/1608.03983
Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning (ICML13) (p. 6).
Maclin, R., & Shavlik, J. W. (1995). Combining the predictions of multiple classifiers: Using competitive learning to initialize neural networks. In Proceedings of the 14th international joint conference on Artificial intelligence - Volume 1 (pp. 524–530). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4), 115–133. https://doi.org/10.1007/BF02478259
Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry.
Montufar, G. F., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the Number of Linear Regions of Deep Neural Networks. In Advances in Neural Information Processing Systems (Vol. 27, pp. 2924–2932). Retrieved from https://papers.nips.cc/paper/2014/hash/109d2dd3608f669ca17920c511c2a41e-Abstract.html
Moon, T., Choi, H., Lee, H., & Song, I. (2015). RnnDrop: A novel dropout for RNNs in ASR. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (pp. 65–70). https://doi.org/10.1109/ASRU.2015.7404775
Müller, R., Kornblith, S., & Hinton, G. (2020). When Does Label Smoothing Help? arXiv:1906.02629 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1906.02629
Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning (pp. 807–814). Madison, WI, USA: Omnipress.
Nielsen, M. A. (2015). Neural Networks and Deep Learning. Retrieved from http://neuralnetworksanddeeplearning.com
Olah, C. (2015). Calculus on computational graphs: Backpropagation.
Park, J., & Sandberg, I. W. (1991). Universal Approximation Using Radial-Basis-Function Networks. Neural Computation, 3(2), 246–257. https://doi.org/10.1162/neco.1991.3.2.246
Pascanu, R., Dauphin, Y. N., Ganguli, S., & Bengio, Y. (2014). On the saddle point problem for non-convex optimization. arXiv:1405.4604 [Cs]. Retrieved from http://arxiv.org/abs/1405.4604
Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (p. 9).
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., … Lerer, A. (2017). Automatic differentiation in PyTorch. In Proceedings of the 31st International Conference on Neural Information Processing Systems (p. 4).
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. arXiv:1506.02640 [Cs]. Retrieved from http://arxiv.org/abs/1506.02640
Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497 [Cs]. Retrieved from http://arxiv.org/abs/1506.01497
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408. https://doi.org/10.1037/h0042519
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). MobileNetV2: Inverted Residuals and Linear Bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4510–4520). Salt Lake City, UT: IEEE. https://doi.org/10.1109/CVPR.2018.00474
Santurkar, S., Tsipras, D., Ilyas, A., & Ma, A. (2018). How Does Batch Normalization Help Optimization? In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018) (p. 11).
Schmidhuber, J. (1992). Learning Complex, Extended Sequences Using the Principle of History Compression. Neural Computation, 4(2), 234–242. https://doi.org/10.1162/neco.1992.4.2.234
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117. https://doi.org/10.1016/j.neunet.2014.09.003
Schuster, M., & Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681. https://doi.org/10.1109/78.650093
Schwenker, F., Kestler, H. A., & Palm, G. (2001). Three learning phases for radial-basis-function networks. Neural Networks, 14(4-5), 439–458. Retrieved from http://dblp.uni-trier.de/db/journals/nn/nn14.html#SchwenkerKP01
Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv:1312.6034 [Cs]. Retrieved from http://arxiv.org/abs/1312.6034
Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. In arXiv:1409.1556 [cs]. Retrieved from http://arxiv.org/abs/1409.1556
Smith, L. N. (2018). A disciplined approach to neural network hyper-parameters: Part 1 – learning rate, batch size, momentum, and weight decay. arXiv:1803.09820 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1803.09820
Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2015). Striving for Simplicity: The All Convolutional Net. arXiv:1412.6806 [Cs]. Retrieved from http://arxiv.org/abs/1412.6806
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15(56), 1929–1958. Retrieved from http://jmlr.org/papers/v15/srivastava14a.html
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Training Very Deep Networks, 9.
Sutskever, I. (2013). Training recurrent neural networks (PhD thesis). University of Toronto, CAN.
Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. In ICML (p. 14).
Sutskever, I., Martens, J., & Hinton, G. (2011). Generating Text with Recurrent Neural Networks, 8.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. arXiv:1409.3215 [Cs]. Retrieved from http://arxiv.org/abs/1409.3215
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2014). Going Deeper with Convolutions. arXiv:1409.4842 [Cs]. Retrieved from http://arxiv.org/abs/1409.4842
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. arXiv:1512.00567 [Cs]. Retrieved from http://arxiv.org/abs/1512.00567
Tan, M., & Le, Q. V. (2020). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv:1905.11946 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1905.11946
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 652–663. https://doi.org/10.1109/TPAMI.2016.2587640
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 328–339. https://doi.org/10.1109/29.21701
Werbos, P. (1981). Application of advances in nonlinear sensitivity analysis. In Proc. Of the 10th IFIP conference (pp. 762–770).
Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560. Retrieved from https://www.bibsonomy.org/bibtex/25ea3485ce75778e802cd8466cd7ffa69/joachimagne
Widrow, B., & Hoff, M. E. (1962). Associative Storage and Retrieval of Digital Information in Networks of Adaptive “Neurons”. In E. E. Bernard & M. R. Kare (Eds.), Biological Prototypes and Synthetic Systems: Volume 1 Proceedings of the Second Annual Bionics Symposium sponsored by Cornell University and the General Electric Company, Advanced Electronics Center, held at Cornell University, August 30–September 1, 1961 (pp. 160–160). Boston, MA: Springer US. https://doi.org/10.1007/978-1-4684-1716-6_25
Williams, R. J., & Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4), 490–501. https://doi.org/10.1162/neco.1990.2.4.490
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., … Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144 [Cs]. Retrieved from http://arxiv.org/abs/1609.08144
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., … Bengio, Y. (2016). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv:1502.03044 [Cs]. Retrieved from http://arxiv.org/abs/1502.03044
Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. arXiv:1511.07122 [Cs]. Retrieved from http://arxiv.org/abs/1511.07122
Ze, H., Senior, A., & Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 7962–7966). Vancouver, BC, Canada: IEEE. https://doi.org/10.1109/ICASSP.2013.6639215
Zeghidour, N., Usunier, N., Synnaeve, G., Collobert, R., & Dupoux, E. (2018). End-to-End Speech Recognition from the Raw Waveform. In Interspeech 2018 (pp. 781–785). ISCA. https://doi.org/10.21437/Interspeech.2018-2414
Zeiler, M. D. (2012). ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701 [Cs]. Retrieved from http://arxiv.org/abs/1212.5701
Zeiler, M. D., & Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer Vision – ECCV 2014 (pp. 818–833). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-10590-1_53
Zhang, J., & Mitliagkas, I. (2018). YellowFin and the Art of Momentum Tuning. arXiv:1706.03471 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1706.03471