Deep learning

An introduction to deep learning

Jeremy Fix

February 15, 2024

Dealing with sequential data

Spatialising the time : TDNN

Time delay neural networks as introduced in (Waibel, Hanazawa, Hinton, Shikano, & Lang, 1989) spatializes the time:

But : which size of the time window ? Must the history size always be the same ? Do we need the data over the whole time span ? How to share computations in time instead of using distinct weights per time instant?

Feedforward neural networks can still be efficient for processing sequential data, e.g. Gated ConvNet (Dauphin, Fan, Auli, & Grangier, 2017), Transformers, …

Architecture of a RNN

Introduced by (Elman, 1990).

Weight matrices :

$W^{in}$ input to hidden
$W^h$ hidden to hidden
$W^{out}$ hidden to output
$W^{back}$ outputs to hidden

\[ \begin{align*} h(t) &= f(W^{in} x(t) + W^{h} h(t-1) + W^{back} y(t-1) + b_h)\\ y(t) &= g(W^{out} h(t) + b_y) \end{align*} \]

The hidden to hidden weight matrix $W_{h}$ is repeatedly applied.

Named Elman networks if $W^{back} = 0$, and Jordan networks if $W^h = 0$. Elman networks with a random fixed $W^h$ are called Echo State networks.

Many to One, One to Many, Many to Many

The inputs and outputs can be of variable ($1 \rightarrow T_x$, $1 \rightarrow T_y$) and arbitrary sizes ($T_x \neq T_y$).

Many to one example : language model, sentiment analysis : multiclass sequence classification $T_y=1$:
- $['this', 'movie', 'was', 'fantastic'] \mapsto 1$
- $['you', 'should', 'not', 'spend', 'even', 'a', 'single', 'penny', 'watching', 'this', 'movie'] \mapsto 0$
Many to many example : Neural Machine Translation
- What is the most likely EN translation of “La croissance économique s’est ralentie ces dernières années.” : “Economic growth has slowed down in recent years.”
One to many example: image captioning, language model with probabilistic sampling

start : ‘LA JUMENT ET’

LA JUMENT ET LE RAT
ET L’huiller craignait les gens d’une mise un vers atteint:
Va c’est d’être indigne de Vénus d’aller pressez l’ame
D’une mais, dit-il, un plongeant l’avertion :
Son échangé vous refusiez-vous

start : ‘LA JUMENT ET’

LA JUMENT ET LE BULÉE
[Ésope]
Comme à part craindre déjà cet honneur à couvrir jamais
Et ses mélonces, condition tempérament.
L’autre honne alla vie.
Je ne saurais pas que d’un moutons.
Que ce choix, coquet, g

Training with (truncated)-BPTT

Idea: unfold in time the computational graph and perform reverse mode differentiation (Werbos, 1990).

Backpropagation through time. Image from (Sutskever, 2013)

You must be training on truncated series to prevent a computational burden.

You can also perform forward mode differentiation (Real time recurrent learning RTTL (Williams & Peng, 1990)) with online adaptation as the inputs/targets comes in but this is computationally expensive.

Training a RNN can be hard

Unrolled in time, RNN appears as very deep networks $\rightarrow$ vanishing/exploding gradient

Initialization strategies :
- LSTM : high forget bias to favor a remember by default (Gers, Schmidhuber, & Cummins, 2000)
- orthogonal weight initialization (Henaff, Szlam, & LeCun, 2016), to keep the hidden activities normalized $\|Wh\|_2^2 = h^T W^T W h = h^T h = \|h\|_2^2$, see also (Arjovsky, Shah, & Bengio, 2016)
- identity recurrent weight initialization to favor a copy as-is by default for RNN (Le, Jaitly, & Hinton, 2015)

Architecture :
- Batch Normalization (hidden-2-hidden) may not work (Laurent, Pereyra, Brakel, Zhang, & Bengio, 2016) : Time step specific statistics (Cooijmans, Ballas, Laurent, Gülçehre, & Courville, 2017), Layer normalization (Ba, Kiros, & Hinton, 2016)

Training :
- gradient clipping (Pascanu, Mikolov, & Bengio, 2013) seems to be frequently used,
- activation clipping is sometimes considered (Hannun et al., 2014),

Regularization:
- Noise (Kam-Chuen Jim, Giles, & Horne, 1996), (Graves, Mohamed, & Hinton, 2013)
- Naive dropout is not an option (memory cell states could be masked differently in time)(Gal & Ghahramani, 2016)
- ZoneOut (Krueger et al., 2017), rnnDrop (Moon, Choi, Lee, & Song, 2015), ..

Memory cells

Long-Short Term memory (LSTM)

RNNs have difficulties learning long range dependencies. The LSTM (Hochreiter & Schmidhuber, 1997) introduces memory cells to address that problem.

Peepholes may connect the $c_t$ to their gates.

Equations:

\[ \begin{eqnarray*} I_t &=& \sigma(W^x_i x_t + W^h_i h_{t-1} + b_i) &\in [0,1], \mbox{Input gate}\\ F_t &=& \sigma(W^x_f x_t + W^h_f h_{t-1} + b_f)&\in [0,1], \mbox{Forget gate}\\ O_t &=& \sigma(W^x_o x_t + W^h_o h_{t-1} + b_o) &\in [0,1], \mbox{Output gate}\\ n_t &=& \tanh(W^x_n x_t + W^h_n h_{t-1} + b_z)& \mbox{unit's input}\\ c_t &=& F_t \odot c_{t-1} + I_t \odot n_t& \mbox{cell update}\\ h_t &=& O_t \odot \tanh(c_t) & \mbox{unit's output} \end{eqnarray*} \] The next layers integrate what is exposed by the cells, i.e. the unit’s output $h_t$, not $c_t$.

If $F_t=1, I_t=0$, the cell state $c_t$ is unmodified. This is called the constant error carrousel.

The forget gate is introduced in (Gers et al., 2000). Variants have been investigated in a search space odyssey (Greff, Srivastava, Koutnı́k, Steunebrink, & Schmidhuber, 2017).

See also (Le et al., 2015) which reconsiders using ReLU in LSTM given appropriate initialization of the recurrent weights to the identity to be copy by default mode.

Gated Recurrent Units (GRU)

The GRU is introduced as an alternative, simpler model than LSTM. Introduced in (Cho et al., 2014).

Equations:

\[ \begin{eqnarray*} R_t &=& \sigma(W^x_i x_t + W^h_i h_{t-1} + b_i) \mbox{ Reset gate}\\ Z_t &=& \sigma(W^x_z x_t + W^h_z h_{t-1} + b_z) \mbox{ Update gate}\\ n_t &=& \tanh(W^x_{n} x_t + b_{nx} + R_t \odot (W^h_nh_{t-1}+b_{nh}))\\ h_t &=& Z_t \odot h_{t-1} + (1-Z_t) \odot n_t \end{eqnarray*} \]

If $Z_{t} = 1$, the cell state $h_t$ is not modified. If $Z_t = 0$ and $R_t=1$, it is updated in one step.

Compared to LSTM, a GRU cell :

unconditionally exposes its hidden state (there is no private cell state $c_t$)
the hidden state is reset by getting $R_t=0$

Bidirectional RNN/LSTM/GRU

Idea Both past and future contexts can sometimes be required for classification at the current time step; e.g. when you speak, past and future phonemes influence the way you pronounce the current one. Introduced in (Schuster & Paliwal, 1997})

Bidirectionnal RNN. Image from (Graves et al., 2013)

In practice, see bidirectional in the constructors of LSTM and GRU

Deep RNNs

While RNN are fundamentally deep neural networks, they can still benefit from being stacked : this allows the layers to operate at increasing time scales. The lower layers can change their content at a higher rate than the higher layers.

(Graves et al., 2013): Phoneme classification with stacked bidirectionnal LSTMs
(Sutskever, Vinyals, & Le, 2014) : Machine translation with stacked unidirectionnal LSTMs (Seq2Seq)

In a stacked RNN, you can concatenate consecutive hidden states before feeding in the next RNN layer, e.g. Listen, Attend and Spell encoder ($\rightarrow$ downscale time)

Deep RNN variants. Image from (Pascanu, Dauphin, Ganguli, & Bengio, 2014)

Other variants for introducing depth in RNN is explored in (Pascanu et al., 2014). For example, the transition function from $h_{t-1}$ to $h_t$ is not deep, even in stacked RNNs but is deep in DT-RNN.

Defining RNN in PyTorch

Stacked bidirectional LSTM, documentation

with discrete inputs (words, characters, …) of the same time length, one prediction per time step.


import torch
import torch.nn as nn

seq_len = 51
batch_size = 32
vocab_size = 10
embedding_dim = 128
hidden_size = 256

rnn_model = nn.Sequential(
    nn.Embedding(num_embeddings=vocab_size,
                 embedding_dim=embedding_dim),
    nn.LSTM(input_size=embedding_dim,
            hidden_size=hidden_size,
            num_layers=3,
            bidirectional=True)
)
out_model = nn.Sequential(
    nn.Linear(2*hidden_size, 10)
)

# Forward propagation
# An embedding layer takes as input a LongTensor
rand_input = torch.randint(low=0, high=vocab_size,
                           size=(seq_len, batch_size))

# out_rnn is (T, B, num_hidden)
# state_n is the state of the last hidden layer
out_rnn, state_n = rnn_model(rand_input)

# out is (T, B, num_out)
# out_model is applied to each time step !! 
# remember the documentation of nn.Linear. It considers
# T*B samples of size num_out
out = out_model(out_rnn)

You can provide an initial state to the call function of the LSTM, in which case, you must take out the LSTM from the nn.Sequential, by default $\overrightarrow{h}_0 = \overleftarrow{h}_0 = \overleftarrow{c}_0 = \overrightarrow{c}_0 = 0$). You could learn these initial hidden states (to bias the first operations of your rnn).

All the weights and biases and initialized from LeCun like initialization $\mathcal{U}(-\sqrt{k}, \sqrt{k})$, $k=\frac{1}{\mbox{hidden_size}}$

Some authors (Gers et al., 2000) suggest to favor either long-term dependencies or short-term dependencies by setting the bias of the forget gate accordingly (“Learning to forget”, $F_{t=0}=1$ to remember everything by default).

See the lab work for specificities on representing variable sized sequences with pytorch PackedSequences.

Digging into pytorch code

How do you know how to access these weights ? See the doc

Note several redundant biases $b_{i.}$ and $b_{h.}$ (for CuDNN compatibility).

Custom initialization

Task: initialize in the “Learning to forget” regime of (Gers et al., 2000)

        
num_layers=3

rnn = nn.LSTM(input_size=embedding_dim,
              hidden_size=hidden_size,
              num_layers=num_layers,
              bidirectional=True)

# Initialize to high forget gate bias
with torch.no_grad():
    for i in range(num_layers):
        forw_bias = getattr(rnn, f'bias_ih_l{i}').chunk(4, dim=0)[1]
        forw_bias.fill_(1)
        rev_bias = getattr(rnn, f'bias_ih_l{i}_reverse').chunk(4, dim=0)[1]
        rev_bias.fill_(1)

The ordering of the weights/biases are inputg/forgetg/cell/outputg.

Language model : an example of aligned sequences of identical sizes

Character level language model (Char-RNN)

Problem given fixed length chunks of sentences, predict the next word/character : $p(x_T | x_0, x_1, ..., x_{T-1})$

Many to many during training (teacher forcing (Williams & Peng, 1990)) but many to one for inference.

A language model can be used, e.g., to constrain the decoding of a network outputting sentences (e.g. in speech-to-text or captioning tasks)

See also The unreasonnable effectiveness of recurrent neural networks and (Sutskever, Martens, & Hinton, 2011).

Training and sampling from the character RNN

Example on “Les fabulistes”

Vocabulary of 105 elements : {‘\t’: 0, ‘\n’: 1, ’ ‘: 2,’!‘: 3,’“‘: 4, "’”: 5, ‘(’: 6, ‘)’: 7, ‘*’: 8, ‘+’: 9, ‘,’: 10, ‘-’: 11, ‘.’: 12, ‘/’: 13, ‘0’: 14, ‘1’: 15, ‘4’: 16, ‘5’: 17, ‘6’: 18, ‘8’: 19, ‘9’: 20, ‘:’: 21, ‘;’: 22, ‘<’: 23, ‘>’: 24, ‘?’: 25, ‘A’: 26, ‘B’: 27, ‘C’: 28, ‘D’: 29, ‘E’: 30, ‘F’: 31, ‘G’: 32, ‘H’: 33, ‘I’: 34, ‘J’: 35, ‘L’: 36, ‘M’: 37, ‘N’: 38, ‘O’: 39, ‘P’: 40, ‘Q’: 41, ‘R’: 42, ‘S’: 43, ‘T’: 44, ‘U’: 45, ‘V’: 46, ‘W’: 47, ‘X’: 48, ‘Y’: 49, ‘Z’: 50, ‘\[’: 51, …}

Dataset size : $10.048$ non overlapping chunks of length $60$.
Example samples :
- Input : [2,71,67,54,70,57,10,2,56,95,71…65,57,66,72,2,70,57,55,60,57]
  
  " sobre, dé…ment reche"
- Output [71,67,54,70,57,10,2,56,95,71,61…57,66,72,2,70,57,55,60,57,70]
  
  “sobre, dés…ent recher”

Network : Embedding(64) , 2$\times$ LSTM(64), 2$\times$LinearRelu(128), LinearSoftmax(105), 111.657 parameters

Note we use uni-directional LSTM. With bi-directionnal LSTM, the problem is easily, but incorrectly, solved by using the backward LSTM only.

Loss : cross-entropy averaged over batch_size $\times$ seq_len
Training: Adam(0.01), learning rate halved every 10 steps, gradient clipping (5) (not sure it helped though)

Sampling from the character RNN

After $30$ epochs, validation loss of $1.45$ and validation accuracy of $56\%$.
To sample from the language model, you can provide it with some context sentence, e.g.
[‘L’, ‘A’, ’ ‘, ’G’, ‘R’,‘E’,‘N’,‘O’,‘U’, ‘I’, ‘L’, ‘L’, ‘E’, ’ ’]

Sample of $200$ chars after init

LA GRENOUILLE y
-Ô)asZYc5[h+IÉë?8—>Y.bèqp;ÎzÇÇ<)!|f]Lt+«-u
XûoÜ:!ïgVùb|Ceü9ùÈ«à
6)ZàÀçBJi)X:ZÛdzxQ8PcvïV]O]xPX,Înc.è’Pâs:X;ûfjBâ?X
ç’ED’fSOl*Z(È’È1SnjàvPïLoUÊêàDgùO9z8eJûRYJ?Yg
Uâp|jCbû—HxBràZBMZÛPCGuR’]ÀiÊÂSBF4D),û

Sample 1 of $200$ chars after 30 epochs

LA GRENOUILLE ET MOURE ET LA RENARDIER
Quel Grâce tout mon ambassade est pris.
L’un pourtant rare,
D’une première
Qu’à partout tout en mon nommée et quelques fleuris ;
Vous n’oserions les Fermerois, les heurs la

Note the upper case after the line breaks, the uppercase title, the quite existing words. The text does not make much sense but it is generated character by character !

Sample 2 of $200$ chars (from the same model as before)

LA GRENOUILLE D’INDÉTES
[Phèdre]
Tout faire force belle, commune,
Et des arts qui, derris vôtre gouverne a rond d’une partage conclut sous besort qu’il plaît du lui dit Portune comme un Heurant enlever bien homme,

More on language modeling (metrics, models, …) in the Deep NLP lecture of Joel Legrand.

Generating text descriptions from images (one to many)

Image captioning

Problem Given an image, generate a textual description of it.

Example datasets : Coco captions, Flickr8k, Flickr30k

Example of captioning samples from MSCoco Image captioning 2015

Some of the first entries : (Vinyals, Toshev, Bengio, & Erhan, 2015), (Xu et al., 2016)

Difficulty: object detection with their relationship

Show and tell

Idea Use a pre-trained CNN for image embedding plugged into a RNN for generating (decoding) the sequence of words. Introduced in (Vinyals et al., 2015).

Learn a model maximizing :

\[ p(S_0S_1S_2..S_T | I, \theta) = p(S_0|I, \theta)\prod_{j>0}^{T} p(S_j|S_0S_1...S_{j-1}, I, \theta) \]

i.e. minimizing $-\log(p(S_0S_1S_2..S_T | I, \theta)) = -\sum_j \log(p(S_j|S_0...S_{j-1}, I, \theta))$

Inspired by the Seq2Seq approach successful in machine translation (more on this later), they proposed an encoder-decoder model to translate an image to a sentence

Show and tell

Training ingredients :

GoogleNet CNN pretrained on ImageNet
words embeddings randomly initialized (pretraining on a large news corpus did not help)
embedding of size 512
LSTM with 512 cells
Stochastic gradient descent, no momentum,
training the LSTM with frozen CNN then finetuning the whole. Too early training end-to-end fails
scheduled sampling (otherwise, divergence between teacher forcing training and inference performances) (Bengio, Vinyals, Jaitly, & Shazeer, 2015@)

Introducing the visual convolutional features at every step did not help.

Inference :

Decoding by beam search (beam size $20$), then reduced beam size to $3$ yielded, unexpectedly, better results

Show attend and tell

Idea Allow the RNN to filter out/focus on CNN features during generation using an attention mechanism (Bahdanau, Cho, & Bengio, 2015). Introduced in (Xu et al., 2016). Link to theano source code

Training:

VGG CNN (head off)
vocabulary of $10000$ words
Embedding (100), LSTM(1000),
RMSProp($0.1$),
dropout for $h_0, c_0$, ,
Early stopping on the BLUE score

Double stochastic attention :

by construction $\sum_{i} \alpha_{t,i} = 1$
regularization $\lambda \sum_{loc} (1 - \sum_t \alpha_{t,loc})^2$ to enforce the model to pay equal attention to all the locations, norm in time for every location.

Inference:

Decoding by beam search

Show attend and tell

Example of caption with the attentional mask. Image from this nice tutorial and pytorch implementation

Dealing with variable size unaligned input/output sequences

Where is the problem

Problem In tasks such as Machine Translation (MT) or Automatic Speech Recognition (ASR), input sequences get mapped to output sequences, both can be of arbitrary sizes.

Machine translation :

The proposal will not now be implemented

Les propositions ne seront pas mises en application maintenant

Automatic speech recognition

A mel-spectrogram with its expected transcript. The spectrogram is sampled at 66 Hz. 88 output tokens for almost 300 input samples

The alignment can be difficult to explicit. Contrary to the language model, we may not know easily when to output what.

When the alignment is missing : Seq2Seq

Encoder / Decoder architectures

Idea Encode/Compress the input sequence to a hidden state and decode/decompress the output sequence from there. Introduced in (Cho et al., 2014) for ranking translations and (Sutskever et al., 2014) for generating translations (NMT).

Architecture :

4 layers LSTM($1000$, $\mathcal{U}(-0.08, 0.08)$), Embeddings($1000$)
Vocabularies (in:$160.000$, out:$80.000$)
SGD$(0.7$), halved every half epoch after 5 epochs. Trained for $7.5$ epochs. Batch($128$)
gradient clipping $5$
$10$ days on $8$ GPUs

The input sentence is fed in reverse order.

Beam search decoding. Teacher forcing for training but see also Scheduled sampling or Professor Forcing.

See Cho’s blog post. See this implementation in pytorch

Decoding with beam search

To get the most likely translation, you need to estimate

\[ p(y | x) = p(y_0|x, \theta) \prod_t p(y_t | y_0...y_{t-1} x \theta) \]

But the probability distribution over the labels is dependent on the previously generated label (which feeds the input for the next step) $\rightarrow$ approximate search by maintaining a set of $B$ candidates.

Beam search decoding with beam size B. Image from distill.pub — Beam search decoding with beam size $B$. Image from distill.pub

See also the modified beam search scoring of GNMT (Wu et al., 2016) (length normalization). Beam search is a pruned breadth first search.

When the alignment is missing : CTC

Connectionist Temporal Classification (CTC)

Idea For problems with the output sequence length $T_y$ is smaller than the input sequence $T_x$, allow a blank character. Introduced in (Graves, Fernández, Gomez, & Schmidhuber, 2006). Assume monotonic alignments (forward in input is forward in output).

CTC collapsing. Illustration from distill.pub

The collapsing many-to-one mapping $\mathcal{B}$ removes the duplicates and then the blanks.

The CTC networks learn from all the possible alignments of $X$ with $Y$ by adding the extra-blank character. Allows to learn from unsegmented sequences !

See also alternatives of the blank character in (Collobert, Puhrsch, & Synnaeve, 2016).

CTC training : CTC Loss

Step: extend your model to output a blank character.

Use the CTC Loss which is estimating the probability of a labeling by marginalizing over all the possible alignments. Assuming conditional independence of the outputs :

\[ \begin{align*} p(Y | X) &= \sum_\pi p(\pi | x) \\ &= \sum_\pi \prod_t p(\pi_t|x, \pi_0 \pi_1 \cdots \pi_{t-1}) \end{align*} \]

A lot of alignments. But, tricky, no need to sum over the possibly large number of paths $\pi$, it can be computed recursively.

Graphical representation from distill.pub

Recursively compute $\alpha_{s,t}$ the probability assigned by the model at time $t$ to the subsequence (extended with the blank) $y_{1:s}$. Denote $l'$ the extended output sequence $l$ and, take a pen.

You end up with a computational graph through which the gradient can propagate.

CTC decoding by combining the alternatives

Problem During inference, given an input $x$, what is the most probable collapsed labeling $$\mbox{argmax}_Y p(y/x)$? Huge number of paths, just the alignments of one sequence are $\frac{(T_x+T_y)!}{(T_x-T_y)!(2T_y)!}$.

Heuristic 1: greedy best path decoding by selecting, at each time step, the output with the highest probability assigned by your model

\[ \hat{y}(x) = \mathcal{B}(\mbox{argmax}_\pi p(\pi|x, \theta)) = \mathcal{B}(\mbox{argmax}_{\pi}\prod_t p(\pi_t | x, \theta)) \approx \mathcal{B}(\mbox{argmax}_k p(y_k^0| x, \theta), \mbox{argmax}_k p(y_k^1| x, \theta), \cdots) \]

But the same labeling can have many alignments and the probability can be spiky on one bad alignment.

Heuristic 2: beam search decoding taking care of the blank character (multiple paths may collapse to the same final labeling)

Possibility to introduce a language model to bias the decoding in favor of plausible words (because CTC assumes independence between the predictions). See (Hannun et al., 2014) :

\[ \mbox{argmax}_y(p(y|x) p_{LM}(y)^\alpha \mbox{wordcount}^\beta(y)) \]

CTC example on voice recognition

Problem Given a waveform, produce the transcript.
Example datasets : Librispeech (English, 1000 hours, Aligned), TED (English, 450 hours, Aligned), Mozilla common voice (Multi language, 2383 hours in English, 927 hours in French, unaligned)

Preprocessing of the waveform with spectrogram using STFT(win_size=25ms, win_step=15ms), with the transcript “Rue Wolfgang Doeblin, zéro huit, six cents Givet” — Preprocessing of the waveform with spectrogram using STFT(win_size=$25$ms, win_step=$15$ms), with the transcript “Rue Wolfgang Doeblin, zéro huit, six cents Givet”

Note: you can contribute the open shared common voice dataset in one of the 60 languages by either recording or validating (Ardila et al., 2020)!

Example model : end-to-end trainable Baidu DeepSpeech (v1,v2) (Hannun et al., 2014),(Amodei et al., 2015). See also the implementation of Mozilla DeepSpeech v2, Nvidia NeMo

Note some authors introduced end-to-end trainable networks from the raw waveforms (Zeghidour, Usunier, Synnaeve, Collobert, & Dupoux, 2018).

Deepspeech : ASR with CONV-BiGRU-CTC

Introduced in (Amodei et al., 2015) on English and Mandarin.

The English architecture involves :

mel spectrograms
$3$ 2D-convolutions in time and frequency with cliped ReLu
$7$ bidirectional GRU$(1280)$
1 FC with Batch Norm
CTC loss, decoding with beam size $500$

35 M. parameters

The training :

dataset sizes from $120$hours up to $12.000$ hours
SGD with Nesterov momentum $(0.99)$
gradient clipping to $400$
learning rate ($\approx 1e-4$), downscaled by $0.8$ at every epoch
during the 1st epoch, the samples are ordered by increasing duration (SortaGrad)
data augmentation with noise added to the speech

CTC extension with sequence transduction (RNN-T)

Idea (Graves, 2012), (Graves et al., 2013) extended CTC to 1) cope with any $T_y$ 2) make the prediction $y_t$ dependent on previously generated outputs. Can produce from $0$ to $N$ output tokens per input time step.

Can work online (stream based) contrary to seq2seq which encodes the complete input sequence (He et al., 2018).

Attention based encoder-decoder

Global Attention

Idea Seq2Seq models are required to compress all the input sequence to a single hidden state which is challenging for long input sequences. What if the decoder could focus on part of the hidden states of the encoder ? Introduced in (Bahdanau et al., 2015).

Soft attention models the expected input alignments allowing to translate the output token at time $t$. It seeks to align the input w.r.t. the output.

Voice recognition / ASR : Listen Attend and Spell

Idea Apply the seq2seq encoder/decoder with soft attention on speech recognition (Chan, Jaitly, Le, & Vinyals, 2015),(Chiu et al., 2018)

log 80-mel spectrogram
5 (bi)-LSTM encoder
2 LSTM decoder
distributed training,
scheduled sampling
label smoothing
4-head attention (multiple queries are computed from $h_i$)
both the queries and keys are computed with a MLP

For Neural Machine Translation, see this series on attention based approaches. See also Google Neural Machine Translation (Wu et al., 2016)

Content based attention

In (Luong, Pham, & Manning, 2015), several forms of attention have been explored. Their design does not feed the result of attention into the update of the decoder state.

The decoder hidden state is used as a query to select part of the encoder hidden states. For multilayer encoder/decoder, they used only the states of the last layers.

Global attention with the Dot-Product attention.

More generally, you could score by matching a transformed query $Q = W_q^T h_i$ and a transformed key $K = W_k^T \overline{h_j}$, with $\mbox{score} = Q^T K = h_i^T W_{qk} \overline{h_j}$

All you need is attention

Idea Feedforward self-attended encoder/decoder. Every input sequence element is encoded with its own self-attended context. Introduced in (Vaswani et al., 2017).

Encoder :

feedforward
word + position embedding
self-attention : every word branch has a :
- query : to compute its context
- key : to match against the queries
- value : the content to propagate

Fixed size position encoding ($\cos$, $\sin$).

Decoder:

feedforward
word + position embedding
no time propagation of context
access to all the encoder features
input : previous token and position

Transformer network for sequence processing. Image from (Vaswani et al., 2017)

The idea of feedforward networks for sequence processing and position encoding is also used in (Gehring, Auli, Grangier, Yarats, & Dauphin, 2017).

Bibliography

References

Rather check the full online document references.pdf

Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., … Zhu, Z. (2015). Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. arXiv:1512.02595 [Cs]. Retrieved from http://arxiv.org/abs/1512.02595

Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., … Weber, G. (2020). Common Voice: A Massively-Multilingual Speech Corpus. arXiv:1912.06670 [Cs]. Retrieved from http://arxiv.org/abs/1912.06670

Arjovsky, M., Shah, A., & Bengio, Y. (2016). Unitary Evolution Recurrent Neural Networks. arXiv:1511.06464 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1511.06464

Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1607.06450

Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In.

Bengio, S., Vinyals, O., Jaitly, N., & Shazeer, N. (2015). Scheduled sampling for sequence prediction with recurrent neural networks. CoRR, abs/1506.03099. Retrieved from http://arxiv.org/abs/1506.03099

Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. (2015). Listen, Attend and Spell. arXiv:1508.01211 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1508.01211

Chiu, C.-C., Sainath, T. N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., … Bacchiani, M. (2018). State-of-the-art Speech Recognition With Sequence-to-Sequence Models. arXiv:1712.01769 [Cs, Eess, Stat]. Retrieved from http://arxiv.org/abs/1712.01769

Cho, K., Merrienboer, B. van, Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1406.1078

Collobert, R., Puhrsch, C., & Synnaeve, G. (2016). Wav2Letter: An End-to-End ConvNet-based Speech Recognition System. arXiv:1609.03193 [Cs]. Retrieved from http://arxiv.org/abs/1609.03193

Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, & Courville, A. (2017). Recurrent batch normalization. In International conference on learning representations. Retrieved from https://openreview.net/forum?id=r1VdcHcxx

Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2017). Language modeling with gated convolutional networks. Retrieved from http://arxiv.org/abs/1612.08083

Elman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14(2), 179–211. https://doi.org/10.1207/s15516709cog1402_1

Gal, Y., & Ghahramani, Z. (2016). A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of the 30th international conference on neural information processing systems (pp. 1027–1035). Red Hook, NY, USA: Curran Associates Inc.

Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017). Convolutional Sequence to Sequence Learning. arXiv:1705.03122 [Cs]. Retrieved from http://arxiv.org/abs/1705.03122

Gers, F. A., Schmidhuber, J. A., & Cummins, F. A. (2000). Learning to forget: Continual prediction with lstm. Neural Comput., 12(10), 2451–2471. https://doi.org/10.1162/089976600300015015

Graves, A. (2012). Sequence Transduction with Recurrent Neural Networks. arXiv:1211.3711 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1211.3711

Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on machine learning (pp. 369–376). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1143844.1143891

Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech Recognition with Deep Recurrent Neural Networks. arXiv:1303.5778 [Cs]. Retrieved from http://arxiv.org/abs/1303.5778

Greff, K., Srivastava, R. K., Koutnı́k, J., Steunebrink, B. R., & Schmidhuber, J. (2017). LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222–2232. https://doi.org/10.1109/TNNLS.2016.2582924

Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., … Ng, A. Y. (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv:1412.5567 [Cs]. Retrieved from http://arxiv.org/abs/1412.5567

He, Y., Sainath, T. N., Prabhavalkar, R., McGraw, I., Alvarez, R., Zhao, D., … Gruenstein, A. (2018). Streaming End-to-end Speech Recognition For Mobile Devices. arXiv:1811.06621 [Cs]. Retrieved from http://arxiv.org/abs/1811.06621

Henaff, M., Szlam, A., & LeCun, Y. (2016). Recurrent Orthogonal Networks and Long-Memory Tasks, 9.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.

Kam-Chuen Jim, Giles, C. L., & Horne, B. G. (1996). An analysis of noise in recurrent neural networks: Convergence and generalization. IEEE Transactions on Neural Networks, 7(6), 1424–1438. https://doi.org/10.1109/72.548170

Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N. R., … Pal, C. (2017). ZONEOUT: REGULARIZING RNNS BY RANDOMLY PRESERVING HIDDEN ACTIVATIONS, 11.

Laurent, C., Pereyra, G., Brakel, P., Zhang, Y., & Bengio, Y. (2016). Batch normalized recurrent neural networks. In 2016 ieee international conference on acoustics, speech and signal processing (icassp) (pp. 2657–2661). https://doi.org/10.1109/ICASSP.2016.7472159

Le, Q. V., Jaitly, N., & Hinton, G. E. (2015). A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. arXiv:1504.00941 [Cs]. Retrieved from http://arxiv.org/abs/1504.00941

Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. arXiv:1508.04025 [Cs]. Retrieved from http://arxiv.org/abs/1508.04025

Moon, T., Choi, H., Lee, H., & Song, I. (2015). RNNDROP: A novel dropout for rnns in asr. In 2015 ieee workshop on automatic speech recognition and understanding (asru) (pp. 65–70). https://doi.org/10.1109/ASRU.2015.7404775

Pascanu, R., Dauphin, Y. N., Ganguli, S., & Bengio, Y. (2014). On the saddle point problem for non-convex optimization. arXiv:1405.4604 [Cs]. Retrieved from http://arxiv.org/abs/1405.4604

Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In Proceedings of the 30thInternational Conference on Machine Learning (p. 9).

Schuster, M., & Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681. https://doi.org/10.1109/78.650093

Sutskever, I. (2013). Training recurrent neural networks (PhD thesis). University of Toronto, CAN.

Sutskever, I., Martens, J., & Hinton, G. (2011). Generating Text with Recurrent Neural Networks, 8.

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. arXiv:1409.3215 [Cs]. Retrieved from http://arxiv.org/abs/1409.3215

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is All you Need, 11.

Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 652–663. https://doi.org/10.1109/TPAMI.2016.2587640

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 328–339. https://doi.org/10.1109/29.21701

Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560. Retrieved from https://www.bibsonomy.org/bibtex/25ea3485ce75778e802cd8466cd7ffa69/joachimagne

Williams, R. J., & Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4), 490–501. https://doi.org/10.1162/neco.1990.2.4.490

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., … Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144 [Cs]. Retrieved from http://arxiv.org/abs/1609.08144

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., … Bengio, Y. (2016). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv:1502.03044 [Cs]. Retrieved from http://arxiv.org/abs/1502.03044

Zeghidour, N., Usunier, N., Synnaeve, G., Collobert, R., & Dupoux, E. (2018). End-to-End Speech Recognition from the Raw Waveform. In Interspeech 2018 (pp. 781–785). ISCA. https://doi.org/10.21437/Interspeech.2018-2414