Deep learning

An introduction to deep learning

Jeremy Fix

September 18, 2021

Slides made with slidemaker

Convolutional neural networks

Extracting features with convolutions

From data that have a spatial structure (locally correlated), features can be extracted with convolutions.

On Images

Figure panels: original image, discrete Laplacian, Gaussian blur, pattern matching, the pattern.

That also makes sense for temporal series that have a structure in time.

A convolution as a sparse matrix multiply

What is a convolution : Example in 2D

Seen as a matrix multiplication

Given two 1D-vectors \(f, k\), say \(k = [c, b, a]\) \[ (f * k) = \begin{bmatrix} b & c & 0 & 0 & \cdots & 0 & 0 \\ a & b & c & 0 & \cdots & 0 & 0 \\ 0 & a & b & c & \cdots & 0 & 0 \\ 0 & 0 & a & b & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & b & c \\ 0 & 0 & 0 & 0 & \cdots & a & b \\ \end{bmatrix} . \begin{bmatrix} \\ f \\ \phantom{}\end{bmatrix} \]
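The matrix view above can be checked numerically. The following is a minimal sketch, assuming NumPy, zero padding at the borders and a hypothetical helper `conv_matrix` (not part of the slides); it builds the banded matrix for \(k = [c, b, a]\) and compares the product with `np.convolve` in "same" mode.

```python
import numpy as np

def conv_matrix(k, n):
    """Banded n x n matrix W with W @ f == np.convolve(f, k, mode="same").

    Hypothetical helper for a length-3 kernel k = [c, b, a] and zero padding.
    """
    c, b, a = k
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = b              # main diagonal
        if i > 0:
            W[i, i - 1] = a      # sub-diagonal
        if i < n - 1:
            W[i, i + 1] = c      # super-diagonal
    return W

f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = [1.0, 2.0, 3.0]              # c = 1, b = 2, a = 3
W = conv_matrix(k, len(f))
y_matrix = W @ f
y_conv = np.convolve(f, k, mode="same")
print(np.allclose(y_matrix, y_conv))
```

Most entries of `W` are zero and the three non-zero values repeat on every row, which is the sparsity and weight sharing discussed later in the slides.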

Composition to learn higher level features


Local features can be combined to learn higher level features.

Let us build a house detector


Architecture

Idea: use the structure of the inputs to limit the number of parameters without limiting the expressiveness of the network.

  • For inputs with spatial (or temporal) correlations, features can be extracted with convolutions of local kernels

  • A convolution can be seen as a fully connected layer with :
    • a lot of weights set exactly to \(0\)
    • a lot of weights shared across positions

\(\rightarrow\) strongly regularized !
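To give an order of magnitude for this regularization, here is a small counting sketch. The \(32\times32\) single-channel map and the two helper functions are illustrative assumptions, not taken from the slides.

```python
def fc_weights(h, w):
    """Weights of a dense layer mapping an h x w map to an h x w map (no bias)."""
    return (h * w) ** 2

def conv_weights(kh, kw, c_in=1, c_out=1):
    """Weights of a convolution layer (no bias): independent of the image size."""
    return kh * kw * c_in * c_out

n_fc = fc_weights(32, 32)        # 1024 * 1024 free weights
n_conv = conv_weights(3, 3)      # 9 weights, shared across all positions
print(n_fc, n_conv)
```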

Neocognitron (Fukushima, 1980)
LeNet5 (LeCun et al., 1989)

Vanilla CNN of LeCun

The architecture of LeNet-5 (LeCun et al., 1989), let’s call it the Vanilla CNN

LeNet5 (LeCun et al., 1989)

Architecture

Two main parts :
- convolutional part : C1 -> C5 : convolution - non-linearity - subsampling
- fully connected part : F6 followed by the output layer

Specificities :
- Weighted sub-sampling
- Gaussian connections (RBF output layer)
- connectivity pattern \(S_2 - C_3\) to reduce the number of weights

Number of parameters :

Layer      Parameters
\(C_1\)    \(156\)
\(S_2\)    \(12\)
\(C_3\)    \(1{,}516\)
\(S_4\)    \(32\)
\(C_5\)    \(48{,}120\)
\(F_6\)    \(10{,}164\)
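These counts can be recomputed by hand. The sketch below assumes the layer sizes recalled from the LeNet-5 paper (6 and 16 feature maps, \(5\times5\) kernels, 120 and 84 units) and its sparse \(S_2 - C_3\) connection table; the helper names are hypothetical.

```python
def conv_params(n_out, n_in, k):
    """k x k convolution: n_in kernels and one bias per output map."""
    return n_out * (n_in * k * k + 1)

def subsampling_params(n_maps):
    """Weighted sub-sampling: one trainable coefficient and one bias per map."""
    return 2 * n_maps

def dense_params(n_out, n_in):
    """Fully connected layer with one bias per output unit."""
    return n_out * (n_in + 1)

# Assumed C3 connectivity: 6 maps see 3 S2 maps, 9 maps see 4, 1 map sees all 6
c3 = sum(count * (n_in * 5 * 5 + 1) for count, n_in in [(6, 3), (9, 4), (1, 6)])

params = {
    "C1": conv_params(6, 1, 5),
    "S2": subsampling_params(6),
    "C3": c3,
    "S4": subsampling_params(16),
    "C5": conv_params(120, 16, 5),
    "F6": dense_params(84, 120),
}
print(params)
```

With full connectivity, \(C_3\) would need `conv_params(16, 6, 5)`, that is 2,416 parameters, which is the saving the connectivity pattern buys.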

CNN Vocabulary

The building blocks of the convolutional part of a vanilla CNN

Convolution :
- size (e.g. \(3 \times 3\), \(5\times 5\))
- padding (e.g. \(1\), \(2\))
- stride (e.g. \(1\))

Pooling (max/average):
- size (e.g. \(2\times 2\))
- padding (e.g. \(0\))
- stride (e.g. \(2\))
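Size, padding and stride together determine the spatial size of the output. A minimal sketch of the usual formula, assuming no dilation and floor rounding (the function name is hypothetical):

```python
def output_size(n, kernel, padding=0, stride=1):
    """Spatial size after a convolution or pooling layer (floor rounding, no dilation)."""
    return (n + 2 * padding - kernel) // stride + 1

same_3x3 = output_size(32, kernel=3, padding=1, stride=1)   # size preserved
same_5x5 = output_size(32, kernel=5, padding=2, stride=1)   # size preserved
pooled = output_size(32, kernel=2, padding=0, stride=2)     # size halved
print(same_3x3, same_5x5, pooled)
```

The example values \(3\times3\) with padding \(1\) and \(5\times5\) with padding \(2\) both keep the input size, while \(2\times2\) pooling with stride \(2\) halves it.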

We work with 4D tensors for 2D images, 3D tensors for nD temporal series (e.g. multiple simultaneous recordings), 2D tensors for 1D temporal series

In PyTorch, the tensors follow the Batch-Channel-Height-Width (BCHW, channel-first) convention. Other frameworks, like TensorFlow or CNTK, use BHWC (channel-last).

CNN in practice

PyTorch layers for implementing a CNN : nn.Conv1d, nn.Conv2d, nn.MaxPool1d, nn.MaxPool2d, nn.AvgPool2d, etc.

CNN in practice

All of these should fit into a nn.Module subclass :
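As an illustration, here is a minimal sketch of such a subclass. The class name `SmallCNN`, the layer sizes and the \(32\times32\) RGB input are assumptions chosen for the example, not the network used in the lecture.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Hypothetical two-block CNN for 32x32 RGB inputs (illustration only)."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 16 x 32 x 32
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # 16 x 16 x 16
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 32 x 16 x 16
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),        # 32 x 8 x 8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.features(x)          # (B, 32, 8, 8)
        x = torch.flatten(x, 1)       # (B, 2048)
        return self.classifier(x)     # (B, num_classes) logits

model = SmallCNN()
x = torch.randn(4, 3, 32, 32)         # a batch in BCHW layout
logits = model(x)
print(logits.shape)
```

The module returns raw logits; the softmax is usually folded into the loss (e.g. `nn.CrossEntropyLoss`).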

Transposed convolution

Given two 1D-vectors \(x_1, k\), say \(k = [c, b, a]\) \[ y_1 = (x_1 * k) = \begin{bmatrix} b & c & 0 & 0 & \cdots & 0 & 0 \\ a & b & c & 0 & \cdots & 0 & 0 \\ 0 & a & b & c & \cdots & 0 & 0 \\ 0 & 0 & a & b & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & b & c \\ 0 & 0 & 0 & 0 & \cdots & a & b \\ \end{bmatrix} . \begin{bmatrix} \\ x_1 \\ \phantom{}\end{bmatrix} = W_k x_1 \]

If we compute the gradient of the loss, in denominator layout: \[ \frac{\partial L}{\partial x_1} = \frac{\partial y_1}{\partial x_1}\frac{\partial L}{\partial y_1} = W_k^T \frac{\partial L}{\partial y_1} \]

Hence the name transposed convolution, or backward convolution. This will pop up again when speaking about deconvolution.
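A quick numerical check of this identity, as a sketch assuming NumPy, zero padding and a hypothetical helper `conv_matrix` that builds \(W_k\): multiplying by \(W_k^T\) gives the same result as convolving the upstream gradient with the flipped kernel.

```python
import numpy as np

def conv_matrix(k, n):
    """Banded matrix W_k for k = [c, b, a] with zero padding (hypothetical helper)."""
    c, b, a = k
    return b * np.eye(n) + a * np.eye(n, k=-1) + c * np.eye(n, k=1)

k = np.array([1.0, 2.0, 3.0])                 # c = 1, b = 2, a = 3
g = np.array([0.5, -1.0, 2.0, 0.0, 1.0])      # stands for dL/dy_1
W = conv_matrix(k, len(g))

grad_x = W.T @ g                                       # dL/dx_1 = W_k^T dL/dy_1
grad_flipped = np.convolve(g, k[::-1], mode="same")    # convolution with the flipped kernel
print(np.allclose(grad_x, grad_flipped))
```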

10 years of CNN revolution

Multi-column DNN

Introduced in (Ciresan, Meier, & Schmidhuber, 2012), ensemble of CNNs trained with dataset augmentation

  • \(0.23\%\) test misclassification on MNIST.
  • 1.5 million parameters

SuperVision

Introduced in (Krizhevsky, Sutskever, & Hinton, 2012), the “spark” giving birth to the revival of neural networks.

  • Top 5 error of \(16\%\), runner-up at \(26\%\)
  • several convolutions stacked before pooling
  • trained on 2 GPUs, for a week on ImageNet (resized to \(256\times256\times3\)), 1M images. (now it’s 18 minutes)
  • 60 Million parameters, dropout, momentum, L2 penalty, dataset augmentation (rand crop \(224\times224\), translation, reflections, PCA)
  • Learning rate at \(0.01\) divided by \(10\) when validation error stalls
  • at test time, average the probabilities over \(5\) crops and their reflections
  • The conv layers hold few of the parameters but are essential

SuperVision

The first layer learned to extract meaningful features

ZFNet

ILSVRC’13 winner. Introduced in (Zeiler & Fergus, 2014)

  • Introduced visualization techniques to inspect which features are learned.
Some inputs obtained by deconvolution
  • Ablation studies on AlexNet : the FC layers are not that important

  • Introduced the idea of supervised pretraining (pretraining on ImageNet, finetune the softmax for Caltech-101, Caltech-256, Pascal 2012)

  • SGD with minibatches of 128, momentum (0.9), learning rate (0.01) with a manual schedule

Deconvnet

Deconvnet approximately computes the gradient of the loss w.r.t. the input (Simonyan, Vedaldi, & Zisserman, 2014). It differs in the way the ReLU is handled.

VGG

ILSVRC’14 1st runner up. Introduced by (Simonyan & Zisserman, 2015).


  • 16 layers : 13 convolutive, 3 fully connected
  • Only \(3\times3\) convolution, \(2\times2\) pooling
  • Two stacked \(3\times3\) convolutions have the same receptive field as one \(5\times5\) convolution, with fewer parameters
    • If \(c_{in}=K, c_{out}=K\), \(5\times5\) convolution \(\rightarrow\) \(25K^2\) parameters
    • If \(c_{in}=K, c_{out}=K\), 2 stacked \(3\times3\) convolution \(\rightarrow\) \(18K^2\) parameters
  • 140 million parameters, batch size(256), Momentum(0.9), Weight decay(\(0.0005\)), Dropout(0.5) in FC, learning rate(\(0.01\)) divided \(3\) times by \(10\)
  • Initialization of \(B,C,D,E\) from trained \(A\). Init of \(A\) random \(\mathcal{N}(0, 10^{-2}), b=0\). Noticed (Glorot & Bengio, 2010) after submission.
  • can cope with variable input sizes by changing the FC layers into conv \(7\times 7\), conv \(1\times1\).
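The \(25K^2\) versus \(18K^2\) comparison above can be reproduced with a few lines. This is a sketch with hypothetical helper names, ignoring biases and assuming stride 1.

```python
def conv_weights(k, c_in, c_out):
    """Number of weights of a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after the other."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

K = 64
single_5x5 = conv_weights(5, K, K)           # 25 K^2
stacked_3x3 = 2 * conv_weights(3, K, K)      # 18 K^2
print(single_5x5, stacked_3x3, stacked_receptive_field([3, 3]))
```

The stacked version also inserts one more non-linearity between the two convolutions.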
The VGG architectures

Striving for simplicity

Introduced in (Springenberg, Dosovitskiy, Brox, & Riedmiller, 2015).

  • uses only convolutions, with various strides, no max pooling
  • introduces “guided backpropagation” visualization
Guided backpropagation examples
Architectures

GoogLeNet (inception v1)

ILSVRC’14 winner. Introduced by (Szegedy et al., 2014).

Idea: multi-scale feature detection and dimensionality reduction

  • 22 layers, \(6.8\)M parameters
  • trained in parallel with asynchronous SGD, momentum(0.9), learning rate decreased by \(4\%\) every 8 epochs
  • at test : Polyak averaging and ensemble of \(7\) models
  • auxiliary heads to mitigate vanishing gradient

GoogleNet last layers

Residual Networks (ResNet)

ILSVRC’15 winner. Introduced in (He et al., 2016a)

Deeper is worse ?!
Residual block

Residual Networks (ResNet)

ResNet34. Dotted shortcuts and conv“/2” are stride 2 to match the spatial dimensions. Dotted shortcuts use \(1\times1\) conv to match the depth. \(21.8\)M parameters.
ResNet architectures. Conv are “Conv-BN-ReLU”. ResNet-50 has \(23\)M parameters.
Shortcut variations (He et al., 2016b)

Variations around skip layer connections

Highway Networks (Srivastava, Greff, & Schmidhuber, 2015)

  • Uses “gates” (as in LSTM, see lectures on RNN) :
    • Transform gate \(T(x) = \sigma(W_T x + b_T)\)
    • Carry gate \(C(x) = \sigma(f_c(x))\)

\[ y = T(x).H(x) + C(x).x \]
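A minimal NumPy sketch of one highway layer follows. It assumes the coupling \(C(x) = 1 - T(x)\) and a tanh transformation \(H\), two choices made here for illustration; the function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_H, b_H, W_T, b_T):
    """y = T(x) * H(x) + C(x) * x with the assumed coupling C(x) = 1 - T(x)."""
    H = np.tanh(W_H @ x + b_H)        # candidate transformation (tanh is an assumption)
    T = sigmoid(W_T @ x + b_T)        # transform gate
    C = 1.0 - T                       # carry gate
    return T * H + C * x

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)
W_H, W_T = rng.normal(size=(d, d)), rng.normal(size=(d, d))
# Strongly negative gate bias: T is close to 0, so the layer nearly copies its input
y_closed = highway_layer(x, W_H, np.zeros(d), W_T, np.full(d, -50.0))
print(np.allclose(y_closed, x, atol=1e-6))
```

Initializing \(b_T\) to a negative value biases the layer towards carrying its input, which keeps a path open for the gradient at the start of training.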

DenseNets

DenseNets (Huang, Liu, Maaten, & Weinberger, 2018)

Other networks

  • FitNet (Romero, 2015)
  • WideResNet (2017)
  • MobileNet v1, v2, v3 (Howard, 2019) : searching for the best architecture
  • EfficientNet (Tan & Le, 2020)


CNN design principles

Number of filters

You should increase the number of filters throughout the network :

  • the first layer extracts low level features
  • the higher layers compose the features from the dictionary learned by the lower layers

Examples :

  • LeNet-5 (1998) : \(6\) filters \(5\times5\), \(16\) filters \(5\times5\)
  • AlexNet (2012) : \(96\) filters \(11\times11\), \(256\) filters \(5\times5\), \(2\times\)(\(384\) filters \(3\times3\)), \(256\) filters \(3\times3\)
  • VGG (2014) : \(64-128-256-512\), all \(3\times 3\)
  • ResNet (2015) : \(64-128-256-512\), all \(3\times 3\)
  • Inception (2015) : \(32-64-80-192-288-768-1280-2048\), \(1\times1, 3\times3, "5\times5"\)

Effective receptive field (1/3)

Effective receptive field (2/3)

Effective receptive field (3/3)

For calculating the effective receptive field size, see for example this calculator or this guide on conv arithmetic.
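The recurrence behind such calculators is short. The sketch below is a hypothetical helper that assumes square kernels, no dilation, and layers given as (kernel, stride) pairs; it tracks the receptive field and the spacing (in input pixels) between two neighbouring outputs.

```python
def receptive_field(layers):
    """Receptive field of a stack of layers given as (kernel, stride) pairs."""
    rf, jump = 1, 1                  # jump: input-pixel distance between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump    # each layer widens the field by (kernel - 1) jumps
        jump *= stride               # striding spreads subsequent layers further apart
    return rf

# Hypothetical VGG-like stack: conv3, conv3, pool2/2, conv3, conv3, pool2/2
vgg_like = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]
print(receptive_field(vgg_like))
```

Padding does not change the size of the receptive field, only where it sits relative to the image borders.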

À trous convolutions

Your effective receptive field can grow faster with à trous convolutions (or dilated convolutions) (Yu & Koltun, 2016):

Conv 3, Pad 1, Stride 1
Conv 3, Pad 0, Stride 1, Dilated 1

Illustrations from this guide on conv arithmetic. In PyTorch, the nn.Conv2d constructor accepts a dilation argument.
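A short sketch of the effect of dilation on the output size, assuming PyTorch and a hypothetical \(7\times7\) single-channel input. In PyTorch, `dilation=1` means a regular convolution and `dilation=2` inserts one hole between kernel taps.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 7, 7)                       # one 7x7 single-channel image (BCHW)
dense = nn.Conv2d(1, 1, kernel_size=3, padding=0, stride=1, dilation=1)
dilated = nn.Conv2d(1, 1, kernel_size=3, padding=0, stride=1, dilation=2)

# A 3x3 kernel with dilation d spans d * (3 - 1) + 1 input pixels per side
y_dense = dense(x)         # 7 - 3 + 1 = 5 per side
y_dilated = dilated(x)     # 7 - 5 + 1 = 3 per side
print(y_dense.shape, y_dilated.shape)
```

Both layers have the same number of parameters; only the footprint of the kernel on the input changes.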

Stacking and factorizing small kernels

Introduced in Inception v3 (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2015)

Stacking \(2(3\times3)\) conv

\(n\) input filters,\(\alpha n\) output filters :

  • \((\alpha n, 5\times5)\) conv : \(25 \alpha n^2\) params
  • \((\sqrt{\alpha}n,3\times3)\)- \((\alpha n, 3\times3)\) : \(9\sqrt{\alpha}n^2+9\sqrt{\alpha}\alpha n^2\) params;

\(\alpha=2 \Rightarrow -24\%\)

\(1\times3\) and \(3\times1\) conv

\(n\) input filters,\(\alpha n\) output filters :

  • \((\alpha n, 3\times3)\) conv : \(9 \alpha n^2\) params
  • \((\sqrt{\alpha}n, 1\times3)\) - \((\alpha n, 3\times1)\) : \(3\sqrt{\alpha}n^2 + 3\alpha \sqrt{\alpha}n^2\) params

\(\alpha=2 \Rightarrow \approx -29\%\)
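The savings of both factorizations follow directly from the parameter counts given above. A small sketch to evaluate them for any \(\alpha\) (the function names are hypothetical, biases are ignored); for \(\alpha=2\) it gives roughly 24% and 29%.

```python
import math

def saving(reference, factorized):
    """Relative reduction in the number of parameters."""
    return 1.0 - factorized / reference

def stacked_3x3_vs_5x5(alpha, n=1.0):
    reference = 25 * alpha * n**2
    factorized = 9 * math.sqrt(alpha) * n**2 + 9 * math.sqrt(alpha) * alpha * n**2
    return saving(reference, factorized)

def asymmetric_vs_3x3(alpha, n=1.0):
    reference = 9 * alpha * n**2
    factorized = 3 * math.sqrt(alpha) * n**2 + 3 * alpha * math.sqrt(alpha) * n**2
    return saving(reference, factorized)

for alpha in (1.0, 2.0):
    print(alpha, round(stacked_3x3_vs_5x5(alpha), 3), round(asymmetric_vs_3x3(alpha), 3))
```

The saving does not depend on \(n\), and it shrinks as \(\alpha\) grows because the second layer of each factorization carries the \(\alpha\sqrt{\alpha}\) term.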

See also the recent work on “Rethinking Model scaling for convolutional neural networks” (Tan & Le, 2020)

Depthwise separable convolutions

Used in Inception, Xception and MobileNets. A depthwise separable convolution separates :

  • feature extraction in each channel, in space : depthwise convolution
  • feature combination between channels : pointwise convolution \(1\times1\)
Depthwise and pointwise convolutions (Howard et al., 2017)
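In PyTorch, a depthwise convolution can be written with the `groups` argument of `nn.Conv2d`. The sketch below compares parameter counts for hypothetical channel sizes (32 in, 64 out), without biases.

```python
import torch.nn as nn

def n_params(module):
    """Total number of trainable values in a module."""
    return sum(p.numel() for p in module.parameters())

c_in, c_out = 32, 64
standard = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, kernel_size=3, padding=1, groups=c_in, bias=False),  # depthwise
    nn.Conv2d(c_in, c_out, kernel_size=1, bias=False),                         # pointwise
)
print(n_params(standard), n_params(separable))
```

The standard layer needs \(9 \, c_{in} c_{out}\) weights, the separable one \(9 \, c_{in} + c_{in} c_{out}\).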

Multi-scale feature extraction

Extract features at multiple scales

See also the Feature Pyramid Networks for multi-scale features.

Dimensionality reduction

Dimensionality reduction with \(1\times1\) conv

Trainable non-linear transformation of the channels. Network in network (Lin, Chen, & Yan, 2014)
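A \(1\times1\) convolution is the same linear map over the channels applied at every pixel. A NumPy sketch, with hypothetical shapes (256 channels reduced to 64) and a ReLU as the non-linearity:

```python
import numpy as np

def conv1x1(x, weights, bias):
    """1x1 convolution on a (C_in, H, W) map: one channel-mixing matrix shared by all pixels."""
    return np.einsum("oc,chw->ohw", weights, x) + bias[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8, 8))              # 256 input channels
weights = rng.normal(size=(64, 256))          # reduce to 64 channels
bias = np.zeros(64)
y = np.maximum(conv1x1(x, weights, bias), 0.0)    # ReLU makes the channel mixing non-linear
print(y.shape)
```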

Easing the gradient flow

You can check the norm of the gradient w.r.t. the first layers’ parameters to diagnose vanishing gradients

  • Shortcut connections (e.g. ResNet, DenseNet, Highway)

  • auxiliary heads (e.g. GoogleNet)
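The gradient-norm diagnostic mentioned above can be sketched as follows in PyTorch. The deep sigmoid MLP and the random data are hypothetical stand-ins, chosen because gradients tend to vanish in that setting.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical deep sigmoid MLP: 10 hidden layers followed by a linear output
layers = []
for _ in range(10):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(32, 1))

x, target = torch.randn(16, 32), torch.randn(16, 1)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()

# Gradient norm of every weight matrix, from the first layer to the last
grad_norms = {name: p.grad.norm().item()
              for name, p in model.named_parameters() if name.endswith("weight")}
for name, norm in grad_norms.items():
    print(f"{name}: {norm:.2e}")
```

If the norms of the first layers are orders of magnitude smaller than those of the last layers, the remedies listed above are worth trying.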

Do we need max pooling ?

Recent architectures remove the max pooling layers and replace them by conv(stride=2) for downsampling

MobileNetv2 (Sandler, Howard, Zhu, Zhmoginov, & Chen, 2018). Bottleneck used also in EfficientNet(2019)
Striving for simplicity (Springenberg et al., 2015)
ResNet

Model and weight averaging

All the competitors in ImageNet do perform model averaging.

Model averaging

Model averaging performance on ImageNet’12 with multiple models and multiple crops-scale-flips

Weight averaging

Snapshot ensembles (Huang et al., 2017)

If you worry about the increased computational complexity, see knowledge distillation (Hinton, Vinyals, & Dean, 2015).

We need data !

Using pre-trained models

All the frameworks provide you with a model zoo of pre-trained networks. E.g. in PyTorch, for image classification. You can cut the head and finetune the softmax only.

Warning: do not forget the input normalization !

Have a look at the torchvision doc: there are pretrained models for classification, detection, segmentation … See also PyTorch Hub.
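As an illustration of the input normalization, here is a NumPy sketch. The per-channel mean and standard deviation are the values commonly quoted for ImageNet-pretrained torchvision models; treat them as an assumption and check the documentation of the model you actually load.

```python
import numpy as np

# Assumed per-channel (R, G, B) statistics for ImageNet-pretrained models
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def normalize(image):
    """Normalize a (3, H, W) float image with values in [0, 1], channel by channel."""
    return (image - IMAGENET_MEAN[:, None, None]) / IMAGENET_STD[:, None, None]

image = np.full((3, 224, 224), 0.5)       # hypothetical uniform grey image
normalized = normalize(image)
print(normalized[:, 0, 0])
```

The same statistics must be applied at training and at test time, otherwise the pretrained features see inputs from a different distribution.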

Dataset augmentation

You can oversample around your training samples by applying transforms on the inputs that make predictable changes on the targets.

  • color jittering, translations, reflections, rotations, PCA, …
Some images generated with imgaug. They are all Physarum polycephalum, right ? Source image from the CNRS
Warning: your augmentation transforms must be well calibrated !

See also mixup: Beyond empirical risk minimization
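A minimal NumPy sketch of the mixup idea: convex combinations of pairs of inputs and of their one-hot targets, with a mixing coefficient drawn from a Beta distribution. The function name, the single coefficient per batch and \(\alpha = 0.2\) are assumptions made for the example.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha, rng):
    """Mix every sample with a randomly chosen partner from the same batch."""
    lam = rng.beta(alpha, alpha)              # one mixing coefficient for the batch
    perm = rng.permutation(len(x))            # partner index for each sample
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3, 32, 32))
y = np.eye(100)[rng.integers(0, 100, size=8)]   # one-hot labels for 100 classes
x_mix, y_mix = mixup_batch(x, y, alpha=0.2, rng=rng)
print(x_mix.shape, y_mix.sum(axis=1))
```

Here the transform changes the targets in a predictable way: the mixed targets are still probability vectors.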

Example CNN

CIFAR-100 dataset

  • The CIFAR-100 dataset is made of \(100\) classes with \(600\) images per class.
  • The images are \(32\times 32\) RGB
Extract from CIFAR-100
  • The training set has \(500\times 100\) images, and the test set has \(100 \times 100\) images.

Model architecture and optimization setup

Operator   Resolution      RF size         #Channels
ConvBlock  \(32\times32\)  \(5\times5\)    32
ConvBlock  \(32\times32\)  \(9\times9\)    32
Sub        \(16\times16\)  \(15\times15\)  32
ConvBlock  \(16\times16\)  \(15\times15\)  128
ConvBlock  \(16\times16\)  \(23\times23\)  128
Sub        \(8\times8\)    \(31\times31\)  128
AvgPool    \(1\times1\)    -               128
Linear     -               -               \(100\)

ConvBlock: 2x [Conv(1x3)-(BN)-Relu-Conv(3x1)-(BN)-Relu]
Sub : Conv(3x3, stride=2)-(BN)-Relu

Common settings :

  • BatchSize(32),
  • SGD(lrate=0.01) with momentum(0.9)
  • learning rate halved every 50 epochs
  • validation on \(20\%\), early stopping on the val loss

Different configurations :

  • base
  • Conv-BN-Relu or Conv-Relu
  • dataset augmentation (HFlip, Trans(5pix), Scale(0.8,1.2)), CenterCrop(32)
  • Dropout, L2, label smoothing

Number of parameters: \(\simeq 2M\)
Time per epoch (1080Ti) : 17 s, total training time : 42 min

If applied, only the weights of the convolution and linear layers are regularized (not the bias, nor the coefficients of the Batch Norm)
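One way to regularize only the convolution and linear weights in PyTorch is to use optimizer parameter groups. This is a sketch on a hypothetical stand-in model, not the lecture's code; it also assumes that the L2 coefficient maps directly onto the `weight_decay` argument of `torch.optim.SGD`.

```python
import torch
import torch.nn as nn

model = nn.Sequential(                       # hypothetical stand-in for the CNN
    nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 100),
)

decay, no_decay = [], []
for module in model.modules():
    for name, param in module.named_parameters(recurse=False):
        if isinstance(module, (nn.Conv2d, nn.Linear)) and name == "weight":
            decay.append(param)              # convolution / linear weights
        else:
            no_decay.append(param)           # biases and BatchNorm coefficients

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 0.0025},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.01, momentum=0.9,
)
print(len(decay), len(no_decay))
```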

Baseline

No regularization (no L2, dropout, label smoothing or data augmentation), no BatchNorm

With BatchNorm

With batchnorm after every convolution

Note it is also regularizing the network.

With data augmentation

With dataset augmentation (HFlip, Scale, Trans)

With regularization

With regularization : L2 (0.0025), Dropout(0.5), Label smoothing(0.1)
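For reference, label smoothing with coefficient \(0.1\) replaces the one-hot targets by soft targets. A NumPy sketch of one common formulation (the smoothing mass is spread uniformly over all classes, the true one included); the helper names are hypothetical.

```python
import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    """Soft targets: (1 - eps) on the true class plus eps spread uniformly over all classes."""
    onehot = np.eye(num_classes)[labels]
    return (1.0 - eps) * onehot + eps / num_classes

def cross_entropy(logits, targets):
    """Mean cross-entropy between soft targets and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()

targets = smooth_labels(np.array([3, 42]), num_classes=100, eps=0.1)
loss_uniform = cross_entropy(np.zeros((2, 100)), targets)   # uniform predictions
print(targets[0, 3], targets[0, 0], loss_uniform)
```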

Bibliography

References

Rather check the full online document references.pdf

Ciresan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (pp. 3642–3649). Providence, RI: IEEE. https://doi.org/10.1109/CVPR.2012.6248110

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202. https://doi.org/10.1007/BF00344251

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010 (p. 8).

He, K., Zhang, X., Ren, S., & Sun, J. (2016a). Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). Las Vegas, NV, USA: IEEE. https://doi.org/10.1109/CVPR.2016.90

He, K., Zhang, X., Ren, S., & Sun, J. (2016b). Identity Mappings in Deep Residual Networks. arXiv:1603.05027 [Cs]. Retrieved from http://arxiv.org/abs/1603.05027

Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1503.02531

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., … Adam, H. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861 [Cs]. Retrieved from http://arxiv.org/abs/1704.04861

Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., & Weinberger, K. Q. (2017). Snapshot ensembles: Train 1, get m for free. In (p. 14).

Huang, G., Liu, Z., Maaten, L. van der, & Weinberger, K. Q. (2018). Densely Connected Convolutional Networks. arXiv:1608.06993 [Cs]. Retrieved from http://arxiv.org/abs/1608.06993

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541–551. https://doi.org/10.1162/neco.1989.1.4.541

Lin, M., Chen, Q., & Yan, S. (2014). Network In Network. arXiv:1312.4400 [Cs]. Retrieved from http://arxiv.org/abs/1312.4400

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). MobileNetV2: Inverted Residuals and Linear Bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4510–4520). Salt Lake City, UT: IEEE. https://doi.org/10.1109/CVPR.2018.00474

Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv:1312.6034 [Cs]. Retrieved from http://arxiv.org/abs/1312.6034

Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. In arXiv:1409.1556 [cs]. Retrieved from http://arxiv.org/abs/1409.1556

Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2015). Striving for Simplicity: The All Convolutional Net. arXiv:1412.6806 [Cs]. Retrieved from http://arxiv.org/abs/1412.6806

Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Training Very Deep Networks, 9.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2014). Going Deeper with Convolutions. arXiv:1409.4842 [Cs]. Retrieved from http://arxiv.org/abs/1409.4842

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. arXiv:1512.00567 [Cs]. Retrieved from http://arxiv.org/abs/1512.00567

Tan, M., & Le, Q. V. (2020). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv:1905.11946 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1905.11946

Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. arXiv:1511.07122 [Cs]. Retrieved from http://arxiv.org/abs/1511.07122

Zeiler, M. D., & Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer Vision – ECCV 2014 (pp. 818–833). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-10590-1_53