An introduction to deep learning
Jeremy Fix
February 15, 2024
Slides made with slidemaker
Time delay neural networks, as introduced in (Waibel, Hanazawa, Hinton, Shikano, & Lang, 1989), spatialize time:
But: what size should the time window be? Must the history size always be the same? Do we need the data over the whole time span? How can computations be shared in time instead of using distinct weights per time instant?
Feedforward neural networks can still be efficient for processing sequential data, e.g. Gated ConvNet (Dauphin, Fan, Auli, & Grangier, 2017), Transformers, …
Introduced by (Elman, 1990).
Weight matrices :
\[ \begin{align*} h(t) &= f(W^{in} x(t) + W^{h} h(t-1) + W^{back} y(t-1) + b_h)\\ y(t) &= g(W^{out} h(t) + b_y) \end{align*} \]
The hidden-to-hidden weight matrix \(W^{h}\) is repeatedly applied.
Named Elman networks if \(W^{back} = 0\), and Jordan networks if \(W^h = 0\). Elman networks with a random fixed \(W^h\) are called Echo State networks.
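As a minimal sketch of the recurrence above, the step below applies the two equations once per time step. The choices \(f = \tanh\) and \(g\) = identity, the function name `rnn_step` and all sizes are assumptions made for the illustration only.

```python
import torch

def rnn_step(x_t, h_prev, y_prev, W_in, W_h, W_back, b_h, W_out, b_y):
    """One time step of the recurrence, assuming f = tanh and g = identity."""
    h_t = torch.tanh(W_in @ x_t + W_h @ h_prev + W_back @ y_prev + b_h)
    y_t = W_out @ h_t + b_y
    return h_t, y_t

torch.manual_seed(0)
n_in, n_hidden, n_out, T = 3, 5, 2, 7          # illustrative sizes
W_in = torch.randn(n_hidden, n_in)
W_h = torch.randn(n_hidden, n_hidden)
W_back = torch.zeros(n_hidden, n_out)          # W_back = 0: Elman network
b_h = torch.zeros(n_hidden)
W_out = torch.randn(n_out, n_hidden)
b_y = torch.zeros(n_out)

# The same weights are reused at every time step
h, y = torch.zeros(n_hidden), torch.zeros(n_out)
outputs = []
for x_t in torch.randn(T, n_in):
    h, y = rnn_step(x_t, h, y, W_in, W_h, W_back, b_h, W_out, b_y)
    outputs.append(y)
outputs = torch.stack(outputs)                 # (T, n_out)
```

Setting `W_h` to zero and `W_back` to a non-zero matrix would give the Jordan variant with the same code.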
The input and output sequences can have variable lengths (\(1 \rightarrow T_x\), \(1 \rightarrow T_y\)), and the two lengths can differ (\(T_x \neq T_y\)).
Many to one example : language model, sentiment analysis : multiclass sequence classification \(T_y=1\):
Many to many example : Neural Machine Translation
One to many example: image captioning, language model with probabilistic sampling
start : ‘LA JUMENT ET’
LA JUMENT ET LE RAT
ET L’huiller craignait les gens d’une mise un vers atteint:
Va c’est d’être indigne de Vénus d’aller pressez l’ame
D’une mais, dit-il, un plongeant l’avertion :
Son échangé vous refusiez-vous
start : ‘LA JUMENT ET’
LA JUMENT ET LE BULÉE
[Ésope]
Comme à part craindre déjà cet honneur à couvrir jamais
Et ses mélonces, condition tempérament.
L’autre honne alla vie.
Je ne saurais pas que d’un moutons.
Que ce choix, coquet, g
Idea: unfold in time the computational graph and perform reverse mode differentiation (Werbos, 1990).
In practice, you train on truncated sequences to limit the computational burden.
You can also perform forward mode differentiation (real-time recurrent learning, RTRL (Williams & Peng, 1990)) with online adaptation as the inputs/targets come in, but this is computationally expensive.
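Training on truncated series can be sketched as follows: the long sequence is cut into chunks, the hidden state is carried from one chunk to the next, but it is detached so that the gradient does not flow further back than one chunk. The model, the random data, the loss and `chunk_len` are assumptions chosen only to keep the example small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=8)
readout = nn.Linear(8, 1)
params = list(rnn.parameters()) + list(readout.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

series = torch.randn(100, 1, 4)     # (T, B, features): one long sequence
targets = torch.randn(100, 1, 1)
chunk_len = 20                      # truncation length (assumed value)

h = None                            # None means a zero initial state
n_updates = 0
for start in range(0, series.size(0), chunk_len):
    x = series[start:start + chunk_len]
    y = targets[start:start + chunk_len]
    out, h = rnn(x, h)
    loss = nn.functional.mse_loss(readout(out), y)
    optimizer.zero_grad()
    loss.backward()                 # backpropagates through this chunk only
    optimizer.step()
    # Keep the value of the state but cut the graph at the chunk boundary
    h = h.detach()
    n_updates += 1
```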
Unrolled in time, an RNN appears as a very deep network \(\rightarrow\) vanishing/exploding gradient
Initialization strategies :
Architecture :
Training :
Regularization:
RNNs have difficulties learning long range dependencies. The LSTM (Hochreiter & Schmidhuber, 1997) introduces memory cells to address that problem.
Peepholes may connect the \(c_t\) to their gates.
Equations:
\[ \begin{eqnarray*} I_t &=& \sigma(W^x_i x_t + W^h_i h_{t-1} + b_i) &\in [0,1], \mbox{Input gate}\\ F_t &=& \sigma(W^x_f x_t + W^h_f h_{t-1} + b_f)&\in [0,1], \mbox{Forget gate}\\ O_t &=& \sigma(W^x_o x_t + W^h_o h_{t-1} + b_o) &\in [0,1], \mbox{Output gate}\\ n_t &=& \tanh(W^x_n x_t + W^h_n h_{t-1} + b_n)& \mbox{unit's input}\\ c_t &=& F_t \odot c_{t-1} + I_t \odot n_t& \mbox{cell update}\\ h_t &=& O_t \odot \tanh(c_t) & \mbox{unit's output} \end{eqnarray*} \]
The next layers integrate what is exposed by the cells, i.e. the unit’s output \(h_t\), not \(c_t\).
If \(F_t=1, I_t=0\), the cell state \(c_t\) is unmodified. This is called the constant error carousel.
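To check the equations against an existing implementation, the sketch below rewrites one LSTM step by hand and compares it with PyTorch's `nn.LSTMCell`. The function name `lstm_step` and the sizes are illustrative. Note that PyTorch stores two bias vectors per gate (`bias_ih` and `bias_hh`), whose sum plays the role of the single bias in the equations.

```python
import torch
import torch.nn as nn

def lstm_step(x_t, h_prev, c_prev, cell):
    """One LSTM step written from the equations, using the weights of an nn.LSTMCell."""
    gates = (x_t @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    # PyTorch stacks the four blocks in the order input, forget, cell, output
    i, f, n, o = gates.chunk(4, dim=1)
    I_t, F_t, O_t = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    n_t = torch.tanh(n)                  # unit's input
    c_t = F_t * c_prev + I_t * n_t       # cell update
    h_t = O_t * torch.tanh(c_t)          # unit's output
    return h_t, c_t

torch.manual_seed(0)
cell = nn.LSTMCell(input_size=3, hidden_size=4)
x_t, h_prev, c_prev = torch.randn(2, 3), torch.randn(2, 4), torch.randn(2, 4)
with torch.no_grad():
    h_t, c_t = lstm_step(x_t, h_prev, c_prev, cell)
    h_ref, c_ref = cell(x_t, (h_prev, c_prev))
```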
The forget gate is introduced in (Gers et al., 2000). Variants have been investigated in a search space odyssey (Greff, Srivastava, Koutnı́k, Steunebrink, & Schmidhuber, 2017).
See also (Le et al., 2015), which reconsiders recurrent networks of ReLU units, with the recurrent weights initialized to the identity so that the network copies its state by default.
The GRU is introduced as an alternative, simpler model than LSTM. Introduced in (Cho et al., 2014).
Equations:
\[ \begin{eqnarray*} R_t &=& \sigma(W^x_r x_t + W^h_r h_{t-1} + b_r) \mbox{ Reset gate}\\ Z_t &=& \sigma(W^x_z x_t + W^h_z h_{t-1} + b_z) \mbox{ Update gate}\\ n_t &=& \tanh(W^x_{n} x_t + b_{nx} + R_t \odot (W^h_nh_{t-1}+b_{nh}))\\ h_t &=& Z_t \odot h_{t-1} + (1-Z_t) \odot n_t \end{eqnarray*} \]
If \(Z_{t} = 1\), the cell state \(h_t\) is not modified. If \(Z_t = 0\) and \(R_t=1\), it is updated in one step.
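The GRU equations can be checked in the same spirit against `nn.GRUCell`. This is a sketch: `gru_step` is an illustrative name, and the only facts relied upon are that PyTorch stacks the blocks in the order reset, update, new and keeps separate input and hidden biases.

```python
import torch
import torch.nn as nn

def gru_step(x_t, h_prev, cell):
    """One GRU step written from the equations, using the weights of an nn.GRUCell."""
    gi = x_t @ cell.weight_ih.T + cell.bias_ih       # input contributions
    gh = h_prev @ cell.weight_hh.T + cell.bias_hh    # hidden contributions
    i_r, i_z, i_n = gi.chunk(3, dim=1)
    h_r, h_z, h_n = gh.chunk(3, dim=1)
    R_t = torch.sigmoid(i_r + h_r)                   # reset gate
    Z_t = torch.sigmoid(i_z + h_z)                   # update gate
    n_t = torch.tanh(i_n + R_t * h_n)                # candidate state
    return Z_t * h_prev + (1 - Z_t) * n_t

torch.manual_seed(0)
cell = nn.GRUCell(input_size=3, hidden_size=4)
x_t, h_prev = torch.randn(2, 3), torch.randn(2, 4)
with torch.no_grad():
    h_t = gru_step(x_t, h_prev, cell)
    h_ref = cell(x_t, h_prev)
```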
Compared to LSTM, a GRU cell :
Idea Both past and future contexts can sometimes be required for classification at the current time step; e.g. when you speak, past and future phonemes influence the way you pronounce the current one. Introduced in (Schuster & Paliwal, 1997).
In practice, see bidirectional in the constructors of LSTM and GRU
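A short sketch of what `bidirectional=True` changes in the returned tensors, with illustrative sizes. The forward and backward outputs are concatenated along the feature axis, and the final state of the backward direction is found at the first time step.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, B, n_in, n_hidden = 6, 2, 3, 5
gru = nn.GRU(input_size=n_in, hidden_size=n_hidden, bidirectional=True)
x = torch.randn(T, B, n_in)

with torch.no_grad():
    out, h_n = gru(x)
# out: (T, B, 2 * n_hidden), forward and backward outputs concatenated
# h_n: (2, B, n_hidden), final state of each direction

forward_last = out[-1, :, :n_hidden]   # forward direction after reading x[0..T-1]
backward_last = out[0, :, n_hidden:]   # backward direction after reading x[T-1..0]
```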
While RNN are fundamentally deep neural networks, they can still benefit from being stacked : this allows the layers to operate at increasing time scales. The lower layers can change their content at a higher rate than the higher layers.
(Graves et al., 2013): Phoneme classification with stacked bidirectional LSTMs
(Sutskever, Vinyals, & Le, 2014): Machine translation with stacked unidirectional LSTMs (Seq2Seq)
In a stacked RNN, you can concatenate consecutive hidden states before feeding in the next RNN layer, e.g. Listen, Attend and Spell encoder (\(\rightarrow\) downscale time)
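The concatenation of consecutive hidden states can be sketched with a reshape between two recurrent layers. This is only an illustration of the time downscaling, not the Listen, Attend and Spell implementation: the sizes are assumptions and \(T\) is assumed to be even.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
T, B, n_in, n_hidden = 8, 2, 3, 4
layer1 = nn.LSTM(n_in, n_hidden, bidirectional=True)
# The second layer receives two concatenated bidirectional outputs
layer2 = nn.LSTM(2 * 2 * n_hidden, n_hidden, bidirectional=True)

x = torch.randn(T, B, n_in)
with torch.no_grad():
    out1, _ = layer1(x)                                  # (T, B, 2*n_hidden)
    feat = out1.size(2)
    # Concatenate steps (0,1), (2,3), ...: (T, B, feat) -> (T/2, B, 2*feat)
    pairs = (out1.transpose(0, 1)                        # (B, T, feat)
                 .reshape(B, T // 2, 2 * feat)           # merge consecutive steps
                 .transpose(0, 1)                        # (T/2, B, 2*feat)
                 .contiguous())
    out2, _ = layer2(pairs)                              # (T/2, B, 2*n_hidden)
```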
Other variants for introducing depth in RNNs are explored in (Pascanu et al., 2014). For example, the transition function from \(h_{t-1}\) to \(h_t\) is not deep, even in stacked RNNs, but is deep in DT-RNN.
Stacked bidirectional LSTM, documentation
with discrete inputs (words, characters, …) of the same time length, one prediction per time step.
import torch
import torch.nn as nn

seq_len = 51
batch_size = 32
vocab_size = 10
embedding_dim = 128
hidden_size = 256

rnn_model = nn.Sequential(
    nn.Embedding(num_embeddings=vocab_size,
                 embedding_dim=embedding_dim),
    nn.LSTM(input_size=embedding_dim,
            hidden_size=hidden_size,
            num_layers=3,
            bidirectional=True)
)
out_model = nn.Sequential(
    nn.Linear(2*hidden_size, 10)
)

# Forward propagation
# An embedding layer takes as input a LongTensor
rand_input = torch.randint(low=0, high=vocab_size,
                           size=(seq_len, batch_size))
# out_rnn is (T, B, 2*hidden_size): the outputs of the last layer,
# forward and backward directions concatenated
# state_n is the tuple (h_n, c_n) of final states, each of shape
# (num_layers*2, B, hidden_size)
out_rnn, state_n = rnn_model(rand_input)
# out is (T, B, num_out)
# out_model is applied to each time step !!
# remember the documentation of nn.Linear. It considers
# T*B samples of size 2*hidden_size
out = out_model(out_rnn)
You can provide an initial state to the call function of the LSTM, in which case you must take the LSTM out of the nn.Sequential. By default, \(\overrightarrow{h}_0 = \overleftarrow{h}_0 = \overleftarrow{c}_0 = \overrightarrow{c}_0 = 0\). You could learn these initial hidden states (to bias the first operations of your RNN).
All the weights and biases are initialized with a LeCun-like initialization \(\mathcal{U}(-\sqrt{k}, \sqrt{k})\), \(k=\frac{1}{\mbox{hidden_size}}\)
Some authors (Gers et al., 2000) suggest favoring either long-term or short-term dependencies by setting the bias of the forget gate accordingly (“Learning to forget”, \(F_{t=0}=1\) to remember everything by default).
See the lab work for specificities on representing variable sized sequences with pytorch PackedSequences.
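As a minimal sketch of what packing does (the lab work remains the reference), three sequences of different lengths are padded, packed, processed by an LSTM and unpacked. The sizes are illustrative; `enforce_sorted=False` lets PyTorch handle the sorting by length internally.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import (pad_sequence, pack_padded_sequence,
                                pad_packed_sequence)

torch.manual_seed(0)
seqs = [torch.randn(5, 3), torch.randn(2, 3), torch.randn(4, 3)]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs)                       # (T_max, B, 3), zero padded
packed = pack_padded_sequence(padded, lengths, enforce_sorted=False)

lstm = nn.LSTM(input_size=3, hidden_size=6)
with torch.no_grad():
    packed_out, (h_n, c_n) = lstm(packed)
out, out_lengths = pad_packed_sequence(packed_out)    # (T_max, B, 6)

# For each sequence, h_n holds the state at its own last valid time step,
# not at the padded end of the batch.
last_of_short_seq = out[lengths[1] - 1, 1]
```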
How do you know how to access these weights ? See the doc
Note several redundant biases \(b_{i.}\) and \(b_{h.}\) (for CuDNN compatibility).
Task: initialize in the “Learning to forget” regime of (Gers et al., 2000)
num_layers = 3
rnn = nn.LSTM(input_size=embedding_dim,
              hidden_size=hidden_size,
              num_layers=num_layers,
              bidirectional=True)
# Initialize to high forget gate bias
# chunk() returns views on the bias, so fill_ modifies the parameter in place
with torch.no_grad():
    for i in range(num_layers):
        forw_bias = getattr(rnn, f'bias_ih_l{i}').chunk(4, dim=0)[1]
        forw_bias.fill_(1)
        rev_bias = getattr(rnn, f'bias_ih_l{i}_reverse').chunk(4, dim=0)[1]
        rev_bias.fill_(1)
The ordering of the weights/biases is input gate / forget gate / cell / output gate.
Problem given fixed length chunks of sentences, predict the next word/character : \(p(x_T | x_0, x_1, ..., x_{T-1})\)
Many to many during training (teacher forcing (Williams & Peng, 1990)) but many to one for inference.
A language model can be used, e.g., to constrain the decoding of a network outputting sentences (e.g. in speech-to-text or captioning tasks)
See also The unreasonable effectiveness of recurrent neural networks and (Sutskever, Martens, & Hinton, 2011).
Example on “Les fabulistes”
Dataset size : \(10{,}048\) non-overlapping chunks of length \(60\).
Example samples :
Input : [2,71,67,54,70,57,10,2,56,95,71…65,57,66,72,2,70,57,55,60,57]
" sobre, dé…ment reche"
Output [71,67,54,70,57,10,2,56,95,71,61…57,66,72,2,70,57,55,60,57,70]
“sobre, dés…ent recher”
Note we use a unidirectional LSTM. With a bidirectional LSTM, the problem is easily, but incorrectly, solved by using the backward LSTM only.
Loss : cross-entropy averaged over batch_size \(\times\) seq_len
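The shifted input/target pairs and the averaged loss can be sketched as follows. The layer sizes and the random tokens are assumptions for the illustration; only the one-step shift and the averaging over batch_size \(\times\) seq_len come from the slides.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, seq_len, batch_size = 20, 60, 4
# One extra token per chunk so that inputs and targets both have length seq_len
chunk = torch.randint(0, vocab_size, (seq_len + 1, batch_size))
inputs, targets = chunk[:-1], chunk[1:]       # targets = inputs shifted by one

embed = nn.Embedding(vocab_size, 16)
lstm = nn.LSTM(16, 32)                        # unidirectional
head = nn.Linear(32, vocab_size)

out, _ = lstm(embed(inputs))                  # (T, B, 32)
logits = head(out)                            # (T, B, vocab_size)
# CrossEntropyLoss expects (N, C) scores and (N,) class indices; with the
# default reduction='mean' it averages over the T*B predictions.
loss = nn.CrossEntropyLoss()(logits.reshape(-1, vocab_size),
                             targets.reshape(-1))
```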
Training: Adam(0.01), learning rate halved every 10 steps, gradient clipping (5) (not sure it helped though)
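One possible way to express this training setup in PyTorch is sketched below. It assumes that "steps" means calls to the scheduler and that the clipping is applied to the gradient norm, neither of which is stated in the slides; the model and the loss are placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.LSTM(16, 32)                              # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Multiply the learning rate by 0.5 every 10 scheduler steps
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for step in range(20):
    out, _ = model(torch.randn(5, 2, 16))
    loss = out.pow(2).mean()                         # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    # Rescale the gradients if their global norm exceeds 5 (assumed variant)
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
    scheduler.step()

final_lr = optimizer.param_groups[0]["lr"]           # halved twice: 0.0025
```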
After \(30\) epochs, validation loss of \(1.45\) and validation accuracy of \(56\%\).
To sample from the language model, you can provide it with some context sentence, e.g.
['L', 'A', ' ', 'G', 'R', 'E', 'N', 'O', 'U', 'I', 'L', 'L', 'E', ' ']
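Sampling can be sketched as follows: the context is run once through the network to build the recurrent state, then characters are drawn one at a time from the predicted distribution and fed back as the next input. `CharLM` is a hypothetical, untrained stand-in for the trained model, and the token ids are arbitrary.

```python
import torch
import torch.nn as nn

class CharLM(nn.Module):
    """Hypothetical character-level language model (embedding, LSTM, linear)."""
    def __init__(self, vocab_size, hidden_size=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 16)
        self.lstm = nn.LSTM(16, hidden_size)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, state=None):
        out, state = self.lstm(self.embed(tokens), state)
        return self.head(out), state

@torch.no_grad()
def sample(model, context, n_chars):
    """context: list of token ids; returns the context followed by n_chars sampled ids."""
    tokens = list(context)
    # Run the whole context once; only the last prediction is used
    logits, state = model(torch.tensor(context).unsqueeze(1))    # (T, 1, V)
    for _ in range(n_chars):
        probs = torch.softmax(logits[-1, 0], dim=0)
        nxt = torch.multinomial(probs, num_samples=1).item()
        tokens.append(nxt)
        # Feed only the new token; the state carries the history
        logits, state = model(torch.tensor([[nxt]]), state)
    return tokens

torch.manual_seed(0)
model = CharLM(vocab_size=30)
generated = sample(model, context=[3, 7, 1, 4], n_chars=10)
```

Replacing the multinomial draw with an argmax would always produce the same continuation for a given context.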
Sample of \(200\) chars after init
LA GRENOUILLE y
-Ô)asZYc5[h+IÉë?8—>Y.bèqp;ÎzÇÇ<)!|f]Lt+«-u
XûoÜ:!ïgVùb|Ceü9ùÈ«à
6)ZàÀçBJi)X:ZÛdzxQ8PcvïV]O]xPX,Înc.è’Pâs:X;ûfjBâ?X
ç’ED’fSOl*Z(È’È1SnjàvPïLoUÊêàDgùO9z8eJûRYJ?Yg
Uâp|jCbû—HxBràZBMZÛPCGuR’]ÀiÊÂSBF4D),û
Sample 1 of \(200\) chars after 30 epochs
LA GRENOUILLE ET MOURE ET LA RENARDIER
Quel Grâce tout mon ambassade est pris.
L’un pourtant rare,
D’une première
Qu’à partout tout en mon nommée et quelques fleuris ;
Vous n’oserions les Fermerois, les heurs la
Note the upper case after the line breaks, the uppercase title, and the mostly real words. The text does not make much sense, but it is generated character by character!
Sample 2 of \(200\) chars (from the same model as before)
LA GRENOUILLE D’INDÉTES
[Phèdre]
Tout faire force belle, commune,
Et des arts qui, derris vôtre gouverne a rond d’une partage conclut sous besort qu’il plaît du lui dit Portune comme un Heurant enlever bien homme,
More on language modeling (metrics, models, …) in the Deep NLP lecture of Joel Legrand.
Idea Use a pre-trained CNN for image embedding plugged into a RNN for generating (decoding) the sequence of words. Introduced in (Vinyals et al., 2015).
Learn a model maximizing :
\[ p(S_0S_1S_2..S_T | I, \theta) = p(S_0|I, \theta)\prod_{j>0}^{T} p(S_j|S_0S_1...S_{j-1}, I, \theta) \]
i.e. minimizing \(-\log(p(S_0S_1S_2..S_T | I, \theta)) = -\sum_j \log(p(S_j|S_0...S_{j-1}, I, \theta))\)
Inspired by the Seq2Seq approach successful in machine translation (more on this later), they proposed an encoder-decoder model to translate an image to a sentence
Training ingredients :
Introducing the visual convolutional features at every step did not help.
Inference :
Idea Allow the RNN to filter out/focus on CNN features during generation using an attention mechanism (Bahdanau, Cho, & Bengio, 2015). Introduced in (Xu et al., 2016). Link to theano source code
Training:
Double stochastic attention :
Inference:
Problem In tasks such as Machine Translation (MT) or Automatic Speech Recognition (ASR), input sequences get mapped to output sequences, both can be of arbitrary sizes.
Machine translation :
The proposal will not now be implemented
Les propositions ne seront pas mises en application maintenant
Automatic speech recognition
The alignment can be difficult to make explicit. Contrary to the language model, we may not easily know when to output what.
Idea Encode/Compress the input sequence to a hidden state and decode/decompress the output sequence from there. Introduced in (Cho et al., 2014) for ranking translations and (Sutskever et al., 2014) for generating translations (NMT).
Architecture :
The input sentence is fed in reverse order.
Beam search decoding. Teacher forcing for training but see also Scheduled sampling or Professor Forcing.
To get the most likely translation, you need to estimate
\[ p(y | x) = p(y_0|x, \theta) \prod_{t>0} p(y_t | y_0 \ldots y_{t-1}, x, \theta) \]
But the probability distribution over the labels is dependent on the previously generated label (which feeds the input for the next step) \(\rightarrow\) approximate search by maintaining a set of \(B\) candidates.
See also the modified beam search scoring of GNMT (Wu et al., 2016) (length normalization). Beam search is a pruned breadth first search.
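A minimal sketch of beam search as a pruned breadth-first search, in plain Python. The scoring is the sum of log probabilities without the length normalization of GNMT; `beam_search`, `toy_model` and the probability table are hypothetical and only serve to show a case where keeping \(B=2\) candidates finds a better sequence than greedy decoding (\(B=1\)).

```python
import math

def beam_search(step_log_probs, beam_size, max_len, eos):
    """step_log_probs(prefix) -> {token: log p(token | prefix, x)}."""
    beams = [((), 0.0)]                  # (prefix, cumulative log probability)
    finished = []
    for _ in range(max_len):
        # Expand every kept prefix by every possible next token
        candidates = []
        for prefix, score in beams:
            for token, logp in step_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Prune: keep only the beam_size best candidates
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Toy conditional distributions (hypothetical values)
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"x": 0.9, "y": 0.1},
}

def toy_model(prefix):
    probs = TABLE.get(prefix, {"<eos>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

greedy = beam_search(toy_model, beam_size=1, max_len=3, eos="<eos>")
best = beam_search(toy_model, beam_size=2, max_len=3, eos="<eos>")
```

Greedy decoding commits to "a" (probability 0.6) and ends with a sequence of probability 0.3, while the beam of size 2 keeps "b" alive and finds the sequence of probability 0.36.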
Idea For problems where the output sequence length \(T_y\) is smaller than the input sequence length \(T_x\), allow a blank character. Introduced in (Graves, Fernández, Gomez, & Schmidhuber, 2006). Assumes monotonic alignments (forward in input is forward in output).
The collapsing many-to-one mapping \(\mathcal{B}\) removes the duplicates and then the blanks.
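The mapping \(\mathcal{B}\) fits in a few lines of Python. The blank symbol `"_"` and the function name `collapse` are choices made for this sketch.

```python
from itertools import groupby

BLANK = "_"   # symbol chosen for the blank in this sketch

def collapse(path, blank=BLANK):
    """CTC collapsing: merge repeated symbols first, then drop the blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return [s for s in merged if s != blank]

word = "".join(collapse("hh_e_ll_lo__"))   # "hello"
double = "".join(collapse("aa_a"))         # "aa"
single = "".join(collapse("aaa"))          # "a"
```

The order matters: a blank between two identical symbols is what allows a repeated letter to survive, as in the double "l" of "hello".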
The CTC networks learn from all the possible alignments of \(X\) with \(Y\) by adding the extra blank character. This allows learning from unsegmented sequences!
See also alternatives of the blank character in (Collobert, Puhrsch, & Synnaeve, 2016).
\[ \begin{align*} p(Y | X) &= \sum_{\pi \in \mathcal{B}^{-1}(Y)} p(\pi | X) \\ &= \sum_{\pi \in \mathcal{B}^{-1}(Y)} \prod_t p(\pi_t|X) \end{align*} \]
There are many alignments, but there is no need to enumerate the possibly large number of paths \(\pi\): the sum can be computed recursively.
Graphical representation from distill.pub
Recursively compute \(\alpha_{s,t}\) the probability assigned by the model at time \(t\) to the subsequence (extended with the blank) \(y_{1:s}\). Denote \(l'\) the extended output sequence \(l\) and, take a pen.
You end up with a computational graph through which the gradient can propagate.
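The recursion on \(\alpha_{s,t}\) can be sketched in plain Python and checked against a brute-force sum over all the paths. The per-step probabilities are made up for the example, the blank is given class index 0 by convention, and a practical implementation would work with log probabilities for numerical stability.

```python
from itertools import groupby, product

def ctc_forward(probs, label, blank=0):
    """probs[t][k] = p(pi_t = k | x); label: list of non-blank class ids.
    Returns p(label | x), the sum over all the alignments of label."""
    ext = [blank]
    for k in label:
        ext += [k, blank]                  # l': label extended with blanks
    S, T = len(ext), len(probs)
    alpha = [[0.0] * S for _ in range(T)]
    # A path starts with a blank or with the first label
    alpha[0][0] = probs[0][ext[0]]
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1][s]                # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1][s - 1]       # advance by one symbol
            # Skip the blank, only allowed between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[t - 1][s - 2]
            alpha[t][s] = total * probs[t][ext[s]]
    # A path ends on the last label or on the final blank
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)

def ctc_brute_force(probs, label, blank=0):
    """Same quantity by enumerating every path (exponential in T)."""
    T, K = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(K), repeat=T):
        if [k for k, _ in groupby(path) if k != blank] == list(label):
            p = 1.0
            for t, k in enumerate(path):
                p *= probs[t][k]
            total += p
    return total

probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.25, 0.25]]
p_dp = ctc_forward(probs, [1, 2])
p_bf = ctc_brute_force(probs, [1, 2])
```

In PyTorch, `torch.nn.CTCLoss` provides this computation (in log space, with its gradient) so that it does not have to be written by hand.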
Problem During inference, given an input \(x\), what is the most probable collapsed labeling \(\mbox{argmax}_Y p(Y|x)\)? The number of paths is huge: the alignments of a single sequence alone number \(\frac{(T_x+T_y)!}{(T_x-T_y)!(2T_y)!}\).
Heuristic 1: greedy best path decoding by selecting, at each time step, the output with the highest probability assigned by your model
\[ \hat{y}(x) = \mathcal{B}(\mbox{argmax}_\pi p(\pi|x, \theta)) = \mathcal{B}(\mbox{argmax}_{\pi}\prod_t p(\pi_t | x, \theta)) \approx \mathcal{B}(\mbox{argmax}_k p(y_k^0| x, \theta), \mbox{argmax}_k p(y_k^1| x, \theta), \cdots) \]
But the same labeling can have many alignments and the probability can be spiky on one bad alignment.
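A small worked example of this failure mode, with made-up probabilities over two time steps and two classes (blank = 0, "a" = 1). The best single path is blank-blank, so greedy decoding returns the empty labeling, although the labeling "a" gathers more probability mass over its three alignments.

```python
from itertools import groupby

def greedy_decode(probs, blank=0):
    """Best path decoding: argmax at each time step, then collapse."""
    best_path = [max(range(len(p_t)), key=p_t.__getitem__) for p_t in probs]
    return [k for k, _ in groupby(best_path) if k != blank]

probs = [[0.6, 0.4],      # t = 0: p(blank), p(a)
         [0.6, 0.4]]      # t = 1

decoded = greedy_decode(probs)

# Probability of the empty labeling: only the path (blank, blank)
p_empty = probs[0][0] * probs[1][0]
# Probability of the labeling "a": paths (a, blank), (blank, a), (a, a)
p_a = (probs[0][1] * probs[1][0]
       + probs[0][0] * probs[1][1]
       + probs[0][1] * probs[1][1])
```

Here `p_empty` is 0.36 while `p_a` is 0.64, yet the greedy decoder outputs the empty sequence.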
Heuristic 2: beam search decoding taking care of the blank character (multiple paths may collapse to the same final labeling)
Possibility to introduce a language model to bias the decoding in favor of plausible words (because CTC assumes independence between the predictions). See (Hannun et al., 2014) :
\[ \mbox{argmax}_y(p(y|x) p_{LM}(y)^\alpha \mbox{wordcount}^\beta(y)) \]
Problem Given a waveform, produce the transcript.
Example datasets : Librispeech (English, 1000 hours, Aligned), TED (English, 450 hours, Aligned), Mozilla common voice (Multi language, 2383 hours in English, 927 hours in French, unaligned)
Note: you can contribute to the open, shared Common Voice dataset in one of the 60 languages by either recording or validating (Ardila et al., 2020)!
Example model : end-to-end trainable Baidu DeepSpeech (v1,v2) (Hannun et al., 2014),(Amodei et al., 2015). See also the implementation of Mozilla DeepSpeech v2, Nvidia NeMo
Note some authors introduced end-to-end trainable networks from the raw waveforms (Zeghidour, Usunier, Synnaeve, Collobert, & Dupoux, 2018).
Introduced in (Amodei et al., 2015) on English and Mandarin.
The English architecture involves :
35 M. parameters
The training :
Idea (Graves, 2012), (Graves et al., 2013) extended CTC to 1) cope with any \(T_y\) 2) make the prediction \(y_t\) dependent on previously generated outputs. Can produce from \(0\) to \(N\) output tokens per input time step.
Can work online (stream based) contrary to seq2seq which encodes the complete input sequence (He et al., 2018).
Idea Seq2Seq models are required to compress all the input sequence to a single hidden state which is challenging for long input sequences. What if the decoder could focus on part of the hidden states of the encoder ? Introduced in (Bahdanau et al., 2015).
Soft attention models the expected input alignments allowing to translate the output token at time \(t\). It seeks to align the input w.r.t. the output.
See also distill.pub augmented-rnns
Idea Apply the seq2seq encoder/decoder with soft attention on speech recognition (Chan, Jaitly, Le, & Vinyals, 2015),(Chiu et al., 2018)
For Neural Machine Translation, see this series on attention based approaches. See also Google Neural Machine Translation (Wu et al., 2016)
In (Luong, Pham, & Manning, 2015), several forms of attention have been explored. Their design does not feed the result of attention into the update of the decoder state.
The decoder hidden state is used as a query to select part of the encoder hidden states. For multilayer encoder/decoder, they used only the states of the last layers.
More generally, you could score by matching a transformed query \(Q = W_q^T h_i\) and a transformed key \(K = W_k^T \overline{h_j}\), with \(\mbox{score} = Q^T K = h_i^T W_{qk} \overline{h_j}\)
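The scoring above can be sketched with random tensors. The dimensions are arbitrary, and the softmax normalization followed by a weighted sum of the encoder states is the usual soft attention recipe, assumed here rather than taken from this slide. The last lines check that the query/key product equals the bilinear form with \(W_{qk} = W_q W_k^T\).

```python
import torch

torch.manual_seed(0)
T_x, d_enc, d_dec, d_att = 5, 6, 4, 3     # illustrative sizes
enc_states = torch.randn(T_x, d_enc)      # encoder states, one row per input position j
dec_state = torch.randn(d_dec)            # decoder state h_i, used as the query
W_q = torch.randn(d_dec, d_att)
W_k = torch.randn(d_enc, d_att)

Q = W_q.T @ dec_state                     # (d_att,) transformed query
K = enc_states @ W_k                      # (T_x, d_att) row j is the transformed key j
scores = K @ Q                            # (T_x,) one score per input position
weights = torch.softmax(scores, dim=0)    # attention distribution over the input
context = weights @ enc_states            # (d_enc,) expected encoder state

# Equivalent bilinear form h_i^T W_qk h_j with W_qk = W_q W_k^T
W_qk = W_q @ W_k.T                        # (d_dec, d_enc)
scores_bilinear = enc_states @ (W_qk.T @ dec_state)
```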
Idea Feedforward self-attended encoder/decoder. Every input sequence element is encoded with its own self-attended context. Introduced in (Vaswani et al., 2017).
Encoder :
Fixed size position encoding (\(\cos\), \(\sin\)).
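The sinusoidal encoding of (Vaswani et al., 2017) can be sketched in plain Python: even dimensions receive a sine and odd dimensions a cosine of the position, with wavelengths growing geometrically with the dimension index. The function name and the sizes are illustrative.

```python
import math

def positional_encoding(n_positions, d_model):
    """pe[pos][2i] = sin(pos / 10000^(2i/d_model)), pe[pos][2i+1] = cos(same angle)."""
    pe = [[0.0] * d_model for _ in range(n_positions)]
    for pos in range(n_positions):
        for i in range(0, d_model, 2):          # i runs over the even indices 2i
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(n_positions=50, d_model=8)
```

The resulting vector is added to the token embedding at the corresponding position, so it has the same size as the embedding.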
Decoder:
The idea of feedforward networks for sequence processing and position encoding is also used in (Gehring, Auli, Grangier, Yarats, & Dauphin, 2017).
Rather check the full online document references.pdf
Amodei, D., Anubhai, R., Battenberg, E., Case, C., Casper, J., Catanzaro, B., … Zhu, Z. (2015). Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. arXiv:1512.02595 [Cs]. Retrieved from http://arxiv.org/abs/1512.02595
Ardila, R., Branson, M., Davis, K., Henretty, M., Kohler, M., Meyer, J., … Weber, G. (2020). Common Voice: A Massively-Multilingual Speech Corpus. arXiv:1912.06670 [Cs]. Retrieved from http://arxiv.org/abs/1912.06670
Arjovsky, M., Shah, A., & Bengio, Y. (2016). Unitary Evolution Recurrent Neural Networks. arXiv:1511.06464 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1511.06464
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1607.06450
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In.
Bengio, S., Vinyals, O., Jaitly, N., & Shazeer, N. (2015). Scheduled sampling for sequence prediction with recurrent neural networks. CoRR, abs/1506.03099. Retrieved from http://arxiv.org/abs/1506.03099
Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. (2015). Listen, Attend and Spell. arXiv:1508.01211 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1508.01211
Chiu, C.-C., Sainath, T. N., Wu, Y., Prabhavalkar, R., Nguyen, P., Chen, Z., … Bacchiani, M. (2018). State-of-the-art Speech Recognition With Sequence-to-Sequence Models. arXiv:1712.01769 [Cs, Eess, Stat]. Retrieved from http://arxiv.org/abs/1712.01769
Cho, K., Merrienboer, B. van, Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv:1406.1078 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1406.1078
Collobert, R., Puhrsch, C., & Synnaeve, G. (2016). Wav2Letter: An End-to-End ConvNet-based Speech Recognition System. arXiv:1609.03193 [Cs]. Retrieved from http://arxiv.org/abs/1609.03193
Cooijmans, T., Ballas, N., Laurent, C., Gülçehre, & Courville, A. (2017). Recurrent batch normalization. In International conference on learning representations. Retrieved from https://openreview.net/forum?id=r1VdcHcxx
Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2017). Language modeling with gated convolutional networks. Retrieved from http://arxiv.org/abs/1612.08083
Elman, J. L. (1990). Finding Structure in Time. Cognitive Science, 14(2), 179–211. https://doi.org/10.1207/s15516709cog1402_1
Gal, Y., & Ghahramani, Z. (2016). A theoretically grounded application of dropout in recurrent neural networks. In Proceedings of the 30th international conference on neural information processing systems (pp. 1027–1035). Red Hook, NY, USA: Curran Associates Inc.
Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y. N. (2017). Convolutional Sequence to Sequence Learning. arXiv:1705.03122 [Cs]. Retrieved from http://arxiv.org/abs/1705.03122
Gers, F. A., Schmidhuber, J. A., & Cummins, F. A. (2000). Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10), 2451–2471. https://doi.org/10.1162/089976600300015015
Graves, A. (2012). Sequence Transduction with Recurrent Neural Networks. arXiv:1211.3711 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1211.3711
Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd international conference on machine learning (pp. 369–376). New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/1143844.1143891
Graves, A., Mohamed, A.-r., & Hinton, G. (2013). Speech Recognition with Deep Recurrent Neural Networks. arXiv:1303.5778 [Cs]. Retrieved from http://arxiv.org/abs/1303.5778
Greff, K., Srivastava, R. K., Koutnı́k, J., Steunebrink, B. R., & Schmidhuber, J. (2017). LSTM: A search space odyssey. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2222–2232. https://doi.org/10.1109/TNNLS.2016.2582924
Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., … Ng, A. Y. (2014). Deep Speech: Scaling up end-to-end speech recognition. arXiv:1412.5567 [Cs]. Retrieved from http://arxiv.org/abs/1412.5567
He, Y., Sainath, T. N., Prabhavalkar, R., McGraw, I., Alvarez, R., Zhao, D., … Gruenstein, A. (2018). Streaming End-to-end Speech Recognition For Mobile Devices. arXiv:1811.06621 [Cs]. Retrieved from http://arxiv.org/abs/1811.06621
Henaff, M., Szlam, A., & LeCun, Y. (2016). Recurrent Orthogonal Networks and Long-Memory Tasks, 9.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Kam-Chuen Jim, Giles, C. L., & Horne, B. G. (1996). An analysis of noise in recurrent neural networks: Convergence and generalization. IEEE Transactions on Neural Networks, 7(6), 1424–1438. https://doi.org/10.1109/72.548170
Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N. R., … Pal, C. (2017). Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations, 11.
Laurent, C., Pereyra, G., Brakel, P., Zhang, Y., & Bengio, Y. (2016). Batch normalized recurrent neural networks. In 2016 ieee international conference on acoustics, speech and signal processing (icassp) (pp. 2657–2661). https://doi.org/10.1109/ICASSP.2016.7472159
Le, Q. V., Jaitly, N., & Hinton, G. E. (2015). A Simple Way to Initialize Recurrent Networks of Rectified Linear Units. arXiv:1504.00941 [Cs]. Retrieved from http://arxiv.org/abs/1504.00941
Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. arXiv:1508.04025 [Cs]. Retrieved from http://arxiv.org/abs/1508.04025
Moon, T., Choi, H., Lee, H., & Song, I. (2015). RNNDROP: A novel dropout for rnns in asr. In 2015 ieee workshop on automatic speech recognition and understanding (asru) (pp. 65–70). https://doi.org/10.1109/ASRU.2015.7404775
Pascanu, R., Dauphin, Y. N., Ganguli, S., & Bengio, Y. (2014). On the saddle point problem for non-convex optimization. arXiv:1405.4604 [Cs]. Retrieved from http://arxiv.org/abs/1405.4604
Pascanu, R., Mikolov, T., & Bengio, Y. (2013). On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (p. 9).
Schuster, M., & Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681. https://doi.org/10.1109/78.650093
Sutskever, I. (2013). Training recurrent neural networks (PhD thesis). University of Toronto, CAN.
Sutskever, I., Martens, J., & Hinton, G. (2011). Generating Text with Recurrent Neural Networks, 8.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. arXiv:1409.3215 [Cs]. Retrieved from http://arxiv.org/abs/1409.3215
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is All you Need, 11.
Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 652–663. https://doi.org/10.1109/TPAMI.2016.2587640
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 328–339. https://doi.org/10.1109/29.21701
Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560. Retrieved from https://www.bibsonomy.org/bibtex/25ea3485ce75778e802cd8466cd7ffa69/joachimagne
Williams, R. J., & Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2(4), 490–501. https://doi.org/10.1162/neco.1990.2.4.490
Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., … Dean, J. (2016). Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv:1609.08144 [Cs]. Retrieved from http://arxiv.org/abs/1609.08144
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., … Bengio, Y. (2016). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. arXiv:1502.03044 [Cs]. Retrieved from http://arxiv.org/abs/1502.03044
Zeghidour, N., Usunier, N., Synnaeve, G., Collobert, R., & Dupoux, E. (2018). End-to-End Speech Recognition from the Raw Waveform. In Interspeech 2018 (pp. 781–785). ISCA. https://doi.org/10.21437/Interspeech.2018-2414