Deep learning

An introduction to deep learning

Jeremy Fix

September 18, 2021

Slides made with slidemaker

Introduction

Historical perspective

  • Hodgkin-Huxley (Hodgkin & Huxley, 1952) : giant squid axon
  • Formal neuron (McCulloch & Pitts, 1943) : the community gets very excited
  • Perceptron (Rosenblatt, 1958) : linear classifier
  • AdaLinE (Widrow & Hoff, 1962) : linear regressor
  • Minsky/Papert (Minsky & Papert, 1969) : first winter
  • Convolutional Neural Networks (Fukushima, 1980), (LeCun et al., 1989) : great !
  • Multilayer Perceptron and backprop (Rumelhart, Hinton, & Williams, 1986) : great !

but these networks are hard to train (except the CNNs), and SVMs come into play in the 1990s… : second winter

  • 2006 : pretraining greatly helps
  • 2012 : AlexNet on ImageNet (about 10 points better on the test set than the runner-up)
  • From then on : lots of SOTA neural networks

For an overview : (Schmidhuber, 2015)

Success stories

  • Image classification : (Krizhevsky, Sutskever, & Hinton, 2012)
  • Image segmentation : CSAIL repo
  • Depth estimation : Github
  • 3D Pose estimation : Detectron2
  • Speech synthesis/recognition : usually addressed with HMMs (Ze, Senior, & Schuster, 2013), (Li et al., 2019)
  • Automatic translation (Google Neural Machine Translation)
  • Language models (BERT, GPT) (Devlin, Chang, Lee, & Toutanova, 2019)
Wavenet
  • Deep reinforcement learning : Atari / AlphaGO / AlphaStar

Also in chemistry (AlphaChem), on graph data (e.g. pytorch geometric), point clouds (PointNet), …

Why is deep learning working

Some of the reasons for the current success :

  • GPU (speed of processing) / Data (regularizing)
  • theoretical understanding of the difficulty of training deep networks (from 2006)

Libraries make it easy to implement/test/deploy neural networks :

  • Torch (Lua) / PyTorch (Python/C++), Caffe (C++/Python), Caffe2 (RIP 2018)
  • Microsoft CNTK
  • Google Tensorflow / Keras
  • Theano/Lasagne (Python, RIP 2017)
  • Chainer, Matlab, Mathematica, …

Resources

Books

  • Introduction to the Theory of Neural Computation
  • Neural Networks for Pattern Recognition
  • Deep Learning book
  • Dive into Deep Learning
  • Deep Learning with PyTorch

People and conferences

Some of the major contributors to the field:

  • N-2 : McCulloch/Pitts, Rumelhart, Rosenblatt, Hopfield
  • N-1 : Hinton, Bengio, LeCun, Schmidhuber
  • N : Goodfellow, Dauphin, Graves, Sutskever, Karpathy, Krizhevsky, Hochreiter

Some of the most important conferences: NIPS/NeurIPS, ICLR, (ICML, ICASSP, …)
Online resources :
- distill.pub, blog posts,
- FastAI lectures, CS231n, MIT 6.S191
- Awesome deep learning, Awesome deep learning papers

Syllabus

Lecture 1 (30/11): Introduction, Linear networks
Lecture 2/3 (30/11, 01/12): Computational Graph, Fully Connected, RBF, Optimization, Initialization, Regularization

Lab work 1 (07/12) : Introduction to pytorch, tensorboard, FCN, CNNs

Lecture 4 (08/12): Convolutional Neural Networks: architectures
Lecture 5 (14/12) : Convolutional Neural Networks : applications
Lecture 6 (14/12) : Recurrent neural networks : architectures

Lab work 2 (04/01) : Convolutional neural networks (Object detection)

Lecture 7 (11/01): Recurrent neural networks : applications
Lecture 8 (11/01): Introduction to generative and probabilistic networks

Lab work 3 (18/01): Recurrent neural networks (Seq2Seq)
Lab work 4 (25/01): Generative Neural Networks (RBM, GAN, DBN)

Lab works : on our GPU clusters, in pairs, remotely with VNC

What is a neural network ?

Definition

A neural network is a directed graph :

  • nodes : computational units
  • edges : weighted connections
Feedforward neural network
Recurrent neural network

There are two types of graphs :

  • no cycle : feedforward neural network
  • with at least one cycle : recurrent neural networks

But why do we care about convolutional neural networks with a softmax output, ReLU hidden activations, a cross entropy loss, batch normalization layers, trained with RMSprop with Nesterov momentum, regularized with dropout, exactly ?

Linear Neural networks

Perceptron (Rosenblatt, 1958)

Perceptron (Rosenblatt, 1958)

  • Classification : given \((x_i, y_i) \in \mathbb{R}^n \times \{-1, 1\}\)
  • Sensory - Associative - Response architecture, \(\phi_j(x)\) with \(\phi_0(x) = 1\)
  • Algorithm and geometrical interpretation
SAR Architecture

Architecture of the classifier

Given fixed, predefined feature functions \(\phi_j\), with \(\phi_0(x) = 1, \forall x \in \mathbb{R}^n\), the perceptron classifies \(x\) as :

\[\begin{align} y &= g(w^T \Phi(x))\\ g(x) &= \begin{cases}-1 &\text{if }\quad x < 0 \\ +1 & \text{if }\quad x \geq 0 \end{cases} \end{align}\]

with \(\phi(x) \in \mathbb{R}^{n_a+1}\), \(\phi(x) = \begin{bmatrix} 1 \\ \phi_1(x) \\ \phi_2(x) \\ \vdots \end{bmatrix}\)

SAR Architecture

Online training algorithm

Given \((x_i, y_i)\), \(y_i \in \{-1,1\}\), the perceptron learning rule operates online: \[\begin{align} w = \begin{cases} w &\text{ if the input is correctly classified}\\ w + \phi(x_i) &\text{ if the input is incorrectly classified as -1}\\ w - \phi(x_i) &\text{ if the input is incorrectly classified as +1} \end{cases} \end{align}\]
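A minimal NumPy sketch of this online rule; the feature map, data layout and stopping criterion below are illustrative choices, not prescribed by the slide:

```python
import numpy as np

def phi(x):
    # Hypothetical feature map: prepend the constant feature phi_0(x) = 1
    return np.concatenate(([1.0], x))

def perceptron_train(X, y, n_epochs=100):
    """Online perceptron rule; X has one sample per row, y in {-1, +1}."""
    w = np.zeros(X.shape[1] + 1)
    for _ in range(n_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = 1 if w @ phi(xi) >= 0 else -1
            if y_hat != yi:               # misclassified sample
                w += yi * phi(xi)         # w + phi(x_i) if y_i=+1, w - phi(x_i) if y_i=-1
                errors += 1
        if errors == 0:                   # no mistakes: stop (possible if linearly separable)
            break
    return w
```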

Geometrical interpretation : correct classification

Decision rule : \(y = g(w^T \Phi(x))\)
Algorithm:
\[\begin{align} w = \begin{cases} w &\text{ if the input is correctly classified}\\ w + \phi(x_i) &\text{ if the input is incorrectly classified as -1}\\ w - \phi(x_i) &\text{ if the input is incorrectly classified as +1} \end{cases} \end{align}\]

A correctly classified sample either positive or negative

Geometrical interpretation : misclassification

Decision rule : \(y = g(w^T \Phi(x))\)
Algorithm:
\[\begin{align} w = \begin{cases} w &\text{ if the input is correctly classified}\\ w + \phi(x_i) &\text{ if the input is incorrectly classified as -1}\\ w - \phi(x_i) &\text{ if the input is incorrectly classified as +1} \end{cases} \end{align}\]

An incorrectly classified sample either positive or negative

Geometrical interpretation : multiple samples

Decision rule : \(y = g(w^T \Phi(x))\)

The intersection of the valid halfspaces is called the cone of feasibility (it may be empty).

Consider two samples \(x_1, x_2\) with \(y_1=+1\), \(y_2=-1\)

The cone of feasibility for \(y_1=+1\) and \(y_2=-1\)

Toward a canonical learning rule

Given \((x_i, y_i)\), \(y_i \in \{-1,1\}\), the perceptron learning rule operates online : \[\begin{align} w = \begin{cases} w &\text{ if the input is correctly classified}\\ w + \phi(x_i) &\text{ if the input is incorrectly classified as -1}\\ w - \phi(x_i) &\text{ if the input is incorrectly classified as +1} \end{cases} \end{align}\]

\[\begin{align} w = \begin{cases} w &\text{ if } g(w^T\phi(x_i)) = y_i\\ w + \phi(x_i) &\text{ if } g(w^T \phi(x_i)) = -1 \text{ and } y_i = +1\\ w - \phi(x_i) &\text{ if } g(w^T \phi(x_i)) = +1 \text{ and } y_i = -1 \end{cases} \end{align}\]

\[\begin{align*} w = \begin{cases} w &\text{ if } g(w^T\phi(x_i)) = y_i\\ w + y_i \phi(x_i) &\text{ if } g(w^T \phi(x_i)) \neq y_i \end{cases} \end{align*}\]

Toward a canonical learning rule

Given \((x_i, y_i)\), \(y_i \in \{-1,1\}\), the perceptron learning rule operates online :

\[\begin{align*} w = \begin{cases} w &\text{ if } g(w^T\phi(x_i)) = y_i\\ w + y_i \phi(x_i) &\text{ if } g(w^T \phi(x_i)) \neq y_i \end{cases} \end{align*}\]

\[\begin{align*} w = w + \frac{1}{2} (y_i - \hat{y}_i) \phi(x_i) \end{align*}\]

with \(\hat{y}_i = g(w^T \phi(x_i))\). This is called the delta rule.

Perceptron convergence theorem

Definition (Linear separability)

A binary classification problem \((x_i, y_i) \in \mathbb{R}^d \times \{-1,1\}, i \in [1..N]\) is said to be linearly separable if there exists \(\textbf{w} \in \mathbb{R}^d\) such that :

\[\begin{align*} \forall i, \mbox{sign}(\textbf{w}^T x_i) = y_i \end{align*}\]

with \(\forall x < 0, \mbox{sign}(x) = -1, \forall x \geq 0, \mbox{sign}(x) = +1\).

Theorem (Perceptron convergence theorem)

A classification problem \((x_i, y_i) \in \mathbb{R}^d \times \{-1,1\}, i \in [1..N]\) is linearly separable if and only if the perceptron learning rule converges to an optimal solution in a finite number of steps.

\(\Leftarrow\) : easy; \(\Rightarrow\) : we upper/lower bound \(\|w(t)\|_2^2\)

Various facts

  • \(w_t = w_0 + \sum_{i \in \mathcal{I}(t)} y_i \phi(x_i)\), with \(\mathcal{I}(t)\) the set of misclassified samples
  • it minimizes a loss : \(J(w) = \frac{1}{N} \sum_i \max(0, -y_i w^T \phi(x_i))\)
  • the solution can be written as

\[\begin{equation} w_t = w_0 + \sum_{i}\frac{1}{2} (y_i - \hat{y}_i) \phi(x_i) \end{equation}\]

\((y_i - \hat{y}_i)\) is the prediction error

Kernel perceptron

Any linear predictor involving only scalar products can be kernelized (kernel trick, cf. SVM).

Decision rule : \(\mbox{sign}(<w, x>)\)

Given \(w(t) = w_0 + \sum_{i \in \mathcal{I}} y_i x_i\)

\[\begin{align*} <w,x> &= <w_0,x> + \sum_{i \in \mathcal{I}} y_i <x_i, x> \\ \Rightarrow k(w,x) &= k(w_0, x) + \sum_{i \in \mathcal{I}} y_i k(x_i, x) \end{align*}\]
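A condensed sketch of the resulting decision rule, assuming \(w_0 = 0\) so that only the stored (misclassified) training samples contribute, and using the polynomial kernel of the next slide:

```python
import numpy as np

def poly_kernel(x, z, degree=3):
    # k(x, z) = (1 + <x, z>)^d
    return (1.0 + x @ z) ** degree

def kernel_perceptron_predict(x, support_x, support_y, kernel=poly_kernel):
    """Decision rule sign(sum_i y_i k(x_i, x)) over the stored misclassified samples."""
    s = sum(yi * kernel(xi, x) for xi, yi in zip(support_x, support_y))
    return 1 if s >= 0 else -1
```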

Kernel perceptron

Polynomial kernel of degree \(d=3\) :

\[k(x, y) = (1 + <x, y>)^3\]

Training set : 50 samples

Real risk : \(92\%\)

Code : https://github.com/rougier/ML-Recipes/blob/master/recipes/ANN/kernel-perceptron.py

AdaLinE (Widrow & Hoff, 1962)

Linear regression analytically

Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \mathbb{R}\), minimize

\[ J(w) = \frac{1}{N} \sum_i ||y_i - w^T x_i||^2 \]

We assume that \(x_i[0] = 1 \forall i\) so that \(w[0]\) hosts the bias term.

Analytically : introduce \(X = [x_0 | x_1 | ... ]\) (samples as columns), so that \(J(w) = \frac{1}{N}\|y-X^Tw\|^2\)

\[ \nabla_w J(w) = 0 \Rightarrow X X^T w = X y \]

  • \(XX^T\) non singular : \(w = (X X^T)^{-1} X y\)
  • \(XX^T\) singular (e.g. points along a line in 2D) : infinitely many solutions
    • One solution can be found with regularized least squares : \[ \min_w G(w) = J(w) + \alpha w^T w \]
    • \(\nabla_w G(w) = 0 \Rightarrow (X X^T + \alpha I) w = X y\)
    • as soon as \(\alpha > 0\), \((X X^T + \alpha I)\) is non singular

This requires computing \(XX^T\), i.e. a pass over the whole training set…
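As a sketch, the (regularized) normal equations can be solved directly in NumPy, following the slide's convention that \(X\) stacks the samples as columns:

```python
import numpy as np

def ridge_fit(X, y, alpha=0.0):
    """Least squares with the slide's convention: X has shape (n+1, N), samples as
    columns with first row all ones; y has shape (N,).
    alpha > 0 gives the regularized solution (X X^T + alpha I) w = X y;
    with alpha = 0 and X X^T singular, np.linalg.solve raises an error."""
    n_plus_1 = X.shape[0]
    A = X @ X.T + alpha * np.eye(n_plus_1)
    # Solving the linear system is preferred over forming an explicit inverse
    return np.linalg.solve(A, X @ y)
```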

Linear regression with stochastic gradient descent

Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \mathbb{R}\), minimize

\[ J(w) = \frac{1}{N} \sum_i ||y_i - w^T x_i||^2 \]

We assume that \(x_i[0] = 1 \forall i\) so that \(w[0]\) hosts the bias term.

  • start at \(w_0\)

  • take each sample one after the other (online) \(x_i, y_i\)

  • denote \(\hat{y}_i = w^T x_i\) the prediction

  • update \[w_{t+1}= w_t - \epsilon \nabla_w L(w_t,x_i,y_i) = w_t + \epsilon (y_i - \hat{y}_i) x_i\]

  • this is the delta rule, with \(\delta = (y_i - \hat{y}_i)\) the prediction error \[w_{t+1} = w_t + \epsilon \delta x_i\]
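A minimal NumPy version of this online delta rule; the learning rate and number of epochs are arbitrary illustrative values:

```python
import numpy as np

def adaline_sgd(X, y, lr=0.01, n_epochs=50):
    """Online delta rule; X has one sample per row with x[0] = 1, y is real-valued."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            y_hat = w @ xi                  # prediction
            w += lr * (yi - y_hat) * xi     # w <- w + eps * delta * x
    return w
```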

Gradient descent

Batch gradient descent

\[J(w,x,y) = \frac{1}{N} \sum_{i=1}^N L(w,x_i,y_i)\] e.g. \(L(w,x_i,y_i) = ||y_i - w^T x_i||^2\)

Batch gradient descent

  • compute the gradient of the loss \(J(w)\) over the whole training set

  • perform one step in the direction of \(-\nabla_w J(w,x,y)\) \[w_{t+1} = w_t - \epsilon_t \textcolor{red}{\nabla_w J(w,x,y)}\]

  • \(\epsilon\) : learning rate

Stochastic gradient descent

\[J(w,x,y) = \frac{1}{N} \sum_{i=1}^N L(w,x_i,y_i)\] e.g. \(L(w,x_i,y_i) = ||y_i - w^T x_i||^2\)

Stochastic gradient descent (SGD)

  • one sample at a time, noisy estimate of \(\nabla_w J\)

  • perform one step in the direction of \(-\nabla_w L(w,x_i,y_i)\) \[w_{t+1} = w_t - \epsilon_t \textcolor{red}{\nabla_w L(w,x_i,y_i)}\]

  • often faster to converge than batch gradient descent

Minibatch gradient descent

\[J(w,x,y) = \frac{1}{N} \sum_{i=1}^N L(w,x_i,y_i)\] e.g. \(L(w,x_i,y_i) = ||y_i - w^T x_i||^2\)

Minibatch

  • noisy estimate of the true gradient with \(M\) samples (e.g. \(M=64, 128\)); \(M\) is the minibatch size
  • draw a random subset of indices \(\mathcal{J}\) with \(|\mathcal{J}| = M\), one minibatch at a time (see the sketch after this list)

\[ w_{t+1} = w_t - \epsilon_t \textcolor{red}{\frac{1}{M} \sum_{j \in \mathcal{J}} \nabla_w L(w,x_j,y_j)} \]

  • smoother estimate than SGD
  • great for parallel architectures (GPU)
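A generic minibatch loop, sketched in NumPy; the gradient function, batch size and learning rate below are illustrative placeholders:

```python
import numpy as np

def minibatch_sgd(X, y, grad_L, w0, lr=0.01, batch_size=64, n_epochs=10):
    """grad_L(w, Xb, yb) returns the gradient of the loss averaged over the minibatch."""
    w = w0.copy()
    N = X.shape[0]
    for _ in range(n_epochs):
        perm = np.random.permutation(N)          # randomize the minibatches
        for start in range(0, N, batch_size):
            idx = perm[start:start + batch_size]
            w -= lr * grad_L(w, X[idx], y[idx])  # step along the averaged gradient
    return w

# Example gradient for the L2 loss above (factor 2 dropped, as on the slides)
def grad_l2(w, Xb, yb):
    return -(Xb.T @ (yb - Xb @ w)) / len(yb)
```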

If the batch size is too large, there is a generalization gap (LeCun, Bottou, Orr, & Müller, 1998), maybe due to sharp minima (Keskar, Mudigere, Nocedal, Smelyanskiy, & Tang, 2017); see also (Hoffer, Hubara, & Soudry, 2017)

Does it make sense to use gradient descent ?

Definition (Convex function). A function \(f: \mathbb{R}^n \mapsto \mathbb{R}\) is convex :

  1. \(\iff \forall x_1, x_2 \in \mathbb{R}^n, \forall t \in [0,1]\) \(f(t x_1 + (1-t)x_2) \leq t f(x_1) + (1-t) f(x_2)\)

  2. if \(f\) is twice differentiable,
    \(\iff \forall x \in \mathbb{R}^n, H = \nabla^2 f(x)\) is positive semidefinite
    i.e. \(\forall x \in \mathbb{R}^n, x^T H x \geq 0\)

For a convex function \(f\), all local minima are global minima. Our losses are lower bounded, so these minima exist. Under mild conditions, gradient descent and stochastic gradient descent converge, typically \(\sum \epsilon_t =\infty, \sum \epsilon_t^2 < \infty\) (cf lectures on convex optimization).

Linear regression

Summary

Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \mathbb{R}\)

  • We assume that \(x[0] = 1\) to encompass the bias
  • Linear model : \(\hat{y} = w^T x\)
  • L2 loss : \(L(\hat{y}, y) = \|\hat{y} - y \|^2\)
  • by gradient descent \[ \nabla_w L(w,x_i,y_i) = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w} = -(y_i - \hat{y}_i) x_i \]

Other choices may also be considered (Huber loss, MAE, …).

Possibly regularized (but more on regularization later).
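Since the labs use PyTorch, the same model and L2 loss can also be sketched with the library's built-in blocks; the data here are random placeholders, and nn.Linear handles the bias itself instead of the \(x[0]=1\) convention:

```python
import torch
import torch.nn as nn

# Hypothetical data: N samples of dimension n
N, n = 128, 5
X, y = torch.randn(N, n), torch.randn(N, 1)

model = nn.Linear(n, 1)                      # y_hat = w^T x + b
loss_fn = nn.MSELoss()                       # L2 loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)              # J(w) over the (full) batch
    loss.backward()                          # gradients by backprop
    optimizer.step()                         # w <- w - eps * grad
```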

Linear regression with L2 loss is convex

Indeed,

  • Given \(x_i, y_i\), \(L(w) = \frac{1}{2}(w^T x_i - y_i)^2\) is convex:

\[ \begin{align*} \nabla_w L &= (w^T x_i - y_i) x_i\\ \nabla_w^2 L &= x_i x_i^T\\ \forall x \in \mathbb{R}^n, \ x^T x_i x_i^T x &= (x_i^T x)^2 \geq 0 \end{align*} \]

  • a non negative weighted sum of convex functions is convex

Linear classification

Maximum likelihood (binary classification)

Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n}, y_i \in \{0, 1\}\)

Assuming that \(P(y=1 | x) = p(x; w)\), parametrized by \(w\), and that the samples are independent, the conditional likelihood of the labels is :

\[ \mathcal{L}(w) = \prod_i P(y=y_i | x_i) = \prod_i p(x_i; w)^{y_i} (1- p(x_i; w))^{1-y_i} \]

With maximum likelihood estimation, we equivalently minimize the averaged negative log-likelihood :

\[ J(w) = -\frac{1}{N} \log(\mathcal{L}(w)) = \frac{1}{N} \sum_i -y_i\log(p(x_i; w))-(1-y_i)\log(1-p(x_i; w)) \]

Logistic regression (binary classification)

Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in \{0, 1\}\)

  • Linear logit model : \(o(x) = w^Tx\) (we still assume \(x[0] = 1\) for the bias)

  • Sigmoid transfer function : \(\hat{y}(x) = p(x; w) = \sigma(o(x)) = \sigma(w^T x)\)

    • \(\sigma(x) = \frac{1}{1 + \exp(-x)}\), \(\sigma(x) \in [0, 1]\)
    • \(\frac{d}{dx}\sigma(x) = \sigma(x) (1 - \sigma(x))\)
  • Following maximum likelihood estimation, we minimize : \[ J(w) = \frac{1}{N} \sum_i -y_i\log(p(x_i; w))-(1-y_i)\log(1-p(x_i; w)) \]

  • The loss \(L(\hat{y}, y) = -y \log(\hat{y}) - (1-y)\log(1 - \hat{y})\) is called the cross entropy loss, or negative log-likelihood

  • The gradient of the cross entropy loss with \(\hat{y}(x) = \sigma(x)\) is : \[ \nabla_w L(w,x_i,y_i) = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w} = -(y_i - \hat{y}_i) x_i \]
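A NumPy sketch of this stochastic gradient step for logistic regression; the learning rate and number of epochs are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_sgd(X, y, lr=0.1, n_epochs=100):
    """X has one sample per row with x[0] = 1, y in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            y_hat = sigmoid(w @ xi)          # p(x; w)
            w += lr * (yi - y_hat) * xi      # negative gradient of the cross entropy loss
    return w
```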

Logistic regression is convex

Indeed,

  • Given \(x_i, y_i=1\), \(L_1(w) = -\log(\sigma(w^T x_i)) = \log(1 + \exp(-w^Tx_i))\),
    \(\nabla_w L_1 = -(1 - \sigma(w^Tx_i)) x_i\)
    \(\nabla_w^2 L_1 = \underbrace{\sigma(w^T x_i) (1-\sigma(w^Tx_i))}_{>0} x_i x_i^T\)
  • Given \(x_i, y_i=0\), \(L_2(w) = -\log(1-\sigma(w^T x_i))\)
    \(\nabla_w L_2 = \sigma(w^T x_i) x_i\)
    \(\nabla_w^2 L_2 = \underbrace{\sigma(w^T x_i) (1-\sigma(w^Tx_i))}_{>0} x_i x_i^T\)
  • a non negative weighted sum of convex functions is convex

Do not use an L2 loss

Compute the gradient to see why

Take L2 loss \(L(\hat{y}, y) = \frac{1}{2}||\hat{y} - y||^2\)

  • Take the “linear” model : \(\hat{y}_i = \sigma(w^T x_i)\)
  • Check that \(\frac{d}{dx} \sigma(x) = \sigma(x) (1 - \sigma(x))\)
  • Compute the gradient wrt \(w\):
    \[ \nabla_w L(w,x_i,y_i) = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w} = (\hat{y}_i - y_i) \textcolor{red}{\sigma(w^T x_i) (1 - \sigma(w^T x_i))} x_i \]
  • If \(x_i\) is strongly misclassified (e.g. \(y_i=1\), \(w^T x_i \to -\infty\)), then \(\sigma(w^T x_i) (1 - \sigma(w^T x_i)) \approx 0\), i.e. \(\nabla_w L(w,x_i,y_i) \approx 0\) \(\Rightarrow\) the update is vanishingly small even though the sample is misclassified

With a cross entropy loss, \(\nabla_w L(w,x_i,y_i)\) is proportional to the error
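A quick numerical check of this saturation, with made-up logit values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A strongly misclassified sample: y = 1 but the logit w^T x is very negative
for z in [-1.0, -5.0, -20.0]:
    y, y_hat = 1.0, sigmoid(z)
    grad_l2_factor = (y_hat - y) * y_hat * (1 - y_hat)    # L2 loss: vanishes with |z|
    grad_ce_factor = (y_hat - y)                          # cross entropy: stays close to -1
    print(z, grad_l2_factor, grad_ce_factor)
```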

Softmax regression (multiclass classification)

Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in [|0, K-1|]\)

Assuming that \(P(y=c | x) = \frac{e^{w_c^T x}}{\sum_k e^{w_k^T x}}\), parametrized by \(w_0, w_1, w_2, \ldots\), and that the samples are independent, the conditional likelihood of the labels is :

\[ \mathcal{L}(w) = \prod_i P(y=y_i | x_i) \]

With maximum likelihood estimation, we equivalently minimize the averaged negative log-likelihood :

\[ J(w) = -\frac{1}{N} \log(\mathcal{L}(w)) = -\frac{1}{N} \sum_i \log(P(y=y_i | x_i)) \]

With a one-hot encoding of the target class (i.e. \(y_i = [0, ..., 0, 1, 0, .. ]\)), it can be written as :

\[ J(w) = -\frac{1}{N} \log(\mathcal{L}(w)) = -\frac{1}{N} \sum_i \sum_c y_{i,c} \log(P(y=c | x_i)) \]

Softmax regression (multiclass classification)

Problem : Given \((x_i, y_i)\), \(x_i\in\mathbb{R}^{n+1}, y_i \in [|0, K-1|]\)

  • Linear models for each class \(o_j(x) = w_j^T x\) (we still assume \(x[0] = 1\))
  • Softmax transfer function : \(P[y=j|x] = \hat{y}_j = \frac{\exp(o_j(x))}{\sum_k \exp(o_k(x))}\)
  • Generalization of the sigmoid for a vectorial output
  • Following maximum likelihood estimation, we minimize \[ J(w) = -\frac{1}{N} \log(\mathcal{L}(w)) = -\frac{1}{N} \sum_i \log(P(y=y_i | x_i)) \]
  • The loss \(L(\hat{y}, y) = -\log(\hat{y}_y)\) is called the cross-entropy loss
  • by gradient descent: \[\nabla_{w_j} L(w,x,y) = \sum_k \frac{\partial L}{\partial \hat{y}_k} \frac{\partial \hat{y}_k}{\partial w_j} = -(\delta_{j,y} - \hat{y}_j) x\]

Softmax regression is convex.
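In PyTorch (as used in the labs), the softmax and the cross entropy loss are fused in nn.CrossEntropyLoss, which takes the raw logits; a minimal sketch with placeholder data:

```python
import torch
import torch.nn as nn

N, n, K = 128, 10, 3                          # hypothetical sizes
X = torch.randn(N, n)
y = torch.randint(0, K, (N,))                 # class indices in [0, K-1]

model = nn.Linear(n, K)                       # one logit o_j(x) = w_j^T x per class
loss_fn = nn.CrossEntropyLoss()               # log-softmax + negative log-likelihood
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)               # J(w), averaged over the batch
    loss.backward()
    optimizer.step()
```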

Numerical issues with the softmax and CE loss

Large exponentials

If you compute the softmax naïvely, you evaluate \(\exp(o_i)\) on the raw logits, which quickly overflows.

Fortunately:

\[ softmax(o_1, o_2, o_3, ..) = softmax(o_1 - o^\star, o_2 - o^\star, o_3 - o^\star, ..), \quad \hat{y}_i = \frac{\exp(o_i - o^\star)}{\sum_j \exp(o_j - o^\star)} \]

With \(o^\star = \max_j o_j\), you always compute \(\exp(z)\) with \(z \leq 0\).

Avoiding some exponentials with the log-sum-exp trick

You do not really need to compute \(\hat{y}_i = softmax_i(x)\) before taking its log, since :

\[ \log(\hat{y}_i) = \log(\frac{\exp(o_i-o^\star)}{\sum_j \exp(o_j - o^\star)}) = o_i - o^\star - \log(\sum_j \exp(o_j - o^\star)) \]
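Both tricks, sketched in NumPy (deep learning frameworks typically apply them internally in their log-softmax and cross entropy implementations):

```python
import numpy as np

def log_softmax(o):
    """Numerically stable log-softmax of a vector of logits o."""
    o_star = np.max(o)                        # subtract the max: exp(z) with z <= 0
    shifted = o - o_star
    return shifted - np.log(np.sum(np.exp(shifted)))

def softmax(o):
    return np.exp(log_softmax(o))

print(softmax(np.array([1000.0, 1001.0, 1002.0])))   # no overflow
```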

Toward non linear networks

Limits of linear classification

Perceptrons and logistic regression perform linear separation in a predefined, fixed feature space.

The XOR and its transformation

What about learning these features \(\phi_j(x)\)?

Bibliography

References

Rather check the full online document references.pdf

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805 [Cs]. Retrieved from http://arxiv.org/abs/1810.04805

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202. https://doi.org/10.1007/BF00344251

Hodgkin, A. L., & Huxley, A. F. (1952). A quantitative description of membrane current and its application to conduction and excitation in nerve. The Journal of Physiology, 117(4), 500–544. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1392413/

Hoffer, E., Hubara, I., & Soudry, D. (2017). Train longer, generalize better: Closing the generalization gap in large batch training of neural networks. arXiv:1705.08741 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1705.08741

Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., & Tang, P. T. P. (2017). On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. arXiv:1609.04836 [Cs, Math]. Retrieved from http://arxiv.org/abs/1609.04836

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386

LeCun, Y. A., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). Efficient BackProp. In G. Montavon, G. B. Orr, & K.-R. Müller (Eds.), Neural Networks: Tricks of the Trade: Second Edition (pp. 9–48). Berlin, Heidelberg: Springer. https://doi.org/10.1007/978-3-642-35289-8_3

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541–551. https://doi.org/10.1162/neco.1989.1.4.541

Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J. M., … Gadde, R. T. (2019). Jasper: An End-to-End Convolutional Neural Acoustic Model. arXiv:1904.03288 [Cs, Eess]. Retrieved from http://arxiv.org/abs/1904.03288

McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4), 115–133. https://doi.org/10.1007/BF02478259

Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry.

Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408. https://doi.org/10.1037/h0042519

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. https://doi.org/10.1038/323533a0

Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117. https://doi.org/10.1016/j.neunet.2014.09.003

Widrow, B., & Hoff, M. E. (1962). Associative Storage and Retrieval of Digital Information in Networks of Adaptive “Neurons”. In E. E. Bernard & M. R. Kare (Eds.), Biological Prototypes and Synthetic Systems: Volume 1 Proceedings of the Second Annual Bionics Symposium sponsored by Cornell University and the General Electric Company, Advanced Electronics Center, held at Cornell University, August 30–September 1, 1961 (pp. 160–160). Boston, MA: Springer US. https://doi.org/10.1007/978-1-4684-1716-6_25

Ze, H., Senior, A., & Schuster, M. (2013). Statistical parametric speech synthesis using deep neural networks. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 7962–7966). Vancouver, BC, Canada: IEEE. https://doi.org/10.1109/ICASSP.2013.6639215