Deep learning

An introduction to deep learning

Jeremy Fix

February 15, 2024

Convolutional neural networks

Extracting features with convolutions

From data that have a spatial structure (locally correlated), features can be extracted with convolutions.

On Images

That also makes sense for temporal series that have a structure in time.

A convolution as a sparse matrix multiply

What is a convolution : Example in 2D

Seen as a matrix multiplication

Given two 1D-vectors \(f, k\), say \(k = [c, b, a]\) \[ (f * k) = \begin{bmatrix} b & c & 0 & 0 & \cdots & 0 & 0 \\ a & b & c & 0 & \cdots & 0 & 0 \\ 0 & a & b & c & \cdots & 0 & 0 \\ 0 & 0 & a & b & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & b & c \\ 0 & 0 & 0 & 0 & \cdots & a & b \\ \end{bmatrix} . \begin{bmatrix} \\ f \\ \phantom{}\end{bmatrix} \]
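As a quick numerical check of the matrix view above, here is a minimal NumPy sketch. The helper name `conv_matrix` and the sample values are illustrative assumptions; zero padding at the borders is assumed, as the banded matrix implies.

```python
import numpy as np


def conv_matrix(k, n):
    """Banded matrix W_k such that W_k @ f equals the 'same' convolution f * k."""
    c, b, a = k  # same ordering as in the text: k = [c, b, a]
    W = np.zeros((n, n))
    for i in range(n):
        if i - 1 >= 0:
            W[i, i - 1] = a  # sub-diagonal
        W[i, i] = b          # diagonal
        if i + 1 < n:
            W[i, i + 1] = c  # super-diagonal
    return W


f = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
k = np.array([0.5, 1.0, -1.0])  # k = [c, b, a]

W_k = conv_matrix(k, len(f))
y_matrix = W_k @ f                         # convolution as a matrix product
y_direct = np.convolve(f, k, mode="same")  # reference implementation
print(np.allclose(y_matrix, y_direct))
```

Most entries of `W_k` are zero and the three remaining values are repeated on every row, which is the sparsity and weight sharing discussed in the following slides.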

Composition to learn higher level features

Local features can be combined to learn higher level features.

Let us build a house detector

Architecture

Idea : use the structure of the inputs to limit the number of parameters without limiting the expressiveness of the network

For inputs with spatial (or temporal) correlations, features can be extracted with convolutions of local kernels
A convolution can be seen as a fully connected layer with :
- a lot of weights set exactly to \(0\)
- a lot of weights shared across positions

\(\rightarrow\) strongly regularized !
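To put numbers on this regularization, the following sketch compares the parameter count of a convolution with that of a fully connected layer producing an output of the same size. The layer sizes are illustrative assumptions, not taken from the slides.

```python
def conv_params(cin, cout, ksize, bias=True):
    """Number of trainable parameters of a 2D convolution layer."""
    return cout * (cin * ksize * ksize + (1 if bias else 0))


def dense_params(n_in, n_out, bias=True):
    """Number of trainable parameters of a fully connected layer."""
    return n_out * (n_in + (1 if bias else 0))


# Assumed setting : 32x32 RGB input, 16 output maps of the same spatial size
h = w = 32
cin, cout, ksize = 3, 16, 3

n_conv = conv_params(cin, cout, ksize)
n_dense = dense_params(cin * h * w, cout * h * w)
print(n_conv, n_dense)
```

In this setting the convolution needs a few hundred parameters while the equivalent dense layer needs tens of millions.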

Vanilla CNN of LeCun

The architecture of LeNet-5 (LeCun et al., 1989), let’s call it the Vanilla CNN

Architecture

Two main parts :
- convolutional part : C1 -> C5 : convolution - non-linearity - subsampling
- fully connected part : linear - non-linearity

Specificities :
- Weighted sub-sampling
- Gaussian connections (RBF output layer)
- connectivity pattern \(S_2 - C_3\) to reduce the number of weights

Number of parameters :

Layer	Parameters
\(C_1\)	\(156\)
\(S_2\)	\(12\)
\(C_3\)	\(1516\)
\(S_4\)	\(32\)
\(C_5\)	\(48120\)
\(F_6\)	\(10164\)
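The counts in the table can be recomputed by hand. The sketch below assumes \(5\times5\) kernels, one coefficient and one bias per sub-sampling map, and the \(S_2 - C_3\) connectivity of the original LeNet-5 paper (6 maps connected to 3 inputs, 9 maps to 4 inputs, 1 map to all 6 inputs); these details are assumptions not spelled out on the slide.

```python
def conv_params(n_maps, n_inputs_per_map, ksize=5):
    """Weights and one bias per output map of a convolution layer."""
    return n_maps * (n_inputs_per_map * ksize * ksize + 1)


c1 = conv_params(6, 1)
s2 = 6 * 2  # weighted sub-sampling : one coefficient and one bias per map
# Sparse S2 - C3 connectivity : 6 maps see 3 inputs, 9 see 4, 1 sees all 6
c3 = conv_params(6, 3) + conv_params(9, 4) + conv_params(1, 6)
s4 = 16 * 2
c5 = conv_params(120, 16)
f6 = 84 * (120 + 1)

for name, n in [("C1", c1), ("S2", s2), ("C3", c3), ("S4", s4), ("C5", c5), ("F6", f6)]:
    print(name, n)
```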

CNN Vocabulary

The building blocks of the convolutional part of a vanilla CNN

Convolution :
- size (e.g. \(3 \times 3\), \(5\times 5\))
- padding (e.g. \(1\), \(2\))
- stride (e.g. \(1\))

Pooling (max/average):
- size (e.g. \(2\times 2\))
- padding (e.g. \(0\))
- stride (e.g. \(2\))
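Size, padding and stride determine the spatial size of the output. A minimal sketch of the usual formula, assuming the floor rounding convention and no dilation (the function name `out_size` is illustrative):

```python
def out_size(n, ksize, stride=1, padding=0):
    """Spatial size after a convolution or a pooling layer (floor convention)."""
    return (n + 2 * padding - ksize) // stride + 1


n = 32
n_after_conv = out_size(n, ksize=3, stride=1, padding=1)             # 3x3 conv, padding 1
n_after_pool = out_size(n_after_conv, ksize=2, stride=2, padding=0)  # 2x2 pooling, stride 2
print(n_after_conv, n_after_pool)
```

A \(3\times3\) convolution with padding \(1\) and stride \(1\) keeps the resolution, while a \(2\times2\) pooling with stride \(2\) halves it.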

We work with 4D tensors for 2D images, 3D tensors for nD temporal series (e.g. multiple simultaneous recordings), 2D tensors for 1D temporal series

In PyTorch, the tensors follow the Batch-Channel-Height-Width (BCHW, channel-first) convention. Other frameworks, like TensorFlow or CNTK, use BHWC (channel-last).

CNN in practice

PyTorch code for implementing a CNN : nn.Conv1d, nn.Conv2d, nn.MaxPool1d, nn.MaxPool2d, nn.AvgPool2d, etc…


import torch
import torch.nn as nn


def conv_relu_maxpool(cin, cout, csize, cstride, cpad, msize, mstride, mpad):
    # One convolutional block : convolution, non-linearity, max pooling
    return [nn.Conv2d(cin, cout, csize, cstride, cpad),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(msize, mstride, mpad)
            ]


conv_model = nn.Sequential(
    *conv_relu_maxpool(cin=3, cout=32,
                       csize=3, cstride=1, cpad=1,
                       msize=2, mstride=2, mpad=0),
    # cin must match the 32 output channels of the previous block
    *conv_relu_maxpool(cin=32, cout=64,
                       csize=3, cstride=1, cpad=1,
                       msize=2, mstride=2, mpad=0),
)

e.g. Conv2d(32, 64, 3, 1, 1)


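# linear_relu is an assumed helper (not shown on the slide) : Linear followed by ReLU
def linear_relu(dim_in, dim_out):
    return [nn.Linear(dim_in, dim_out), nn.ReLU(inplace=True)]

fc_model = nn.Sequential(
    *linear_relu(output_size, 256),
    nn.Linear(256, num_classes)
)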

How can I get the feature dimensions of conv_model output ?


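# Forward a dummy input to read the shape of the convolutional features
with torch.no_grad():
    dummy_output = conv_model(torch.zeros((1, 3, height, width)))
output_size = dummy_output[0].numel()  # C * H * W, as a plain Python int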

CNN in practice

All of these should fit into a nn.Module subclass :


class MyModel(torch.nn.Module):
    def __init__(self, ....):
        super(MyModel, self).__init__()
        self.conv_model = nn.Sequential(...)
        output_size = ...
        self.fc_model = nn.Sequential(...)

    def forward(self, inputs):
        conv_features = self.conv_model(inputs)
        conv_features = conv_features.view(inputs.shape[0], -1)
        return self.fc_model(conv_features)

You can also use the nn.Flatten layer instead of reshaping with view.
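Putting the pieces together, here is a minimal runnable sketch of such a module using nn.Flatten. The class name `SmallCNN`, the input shape and the number of classes are illustrative assumptions, not the author's model.

```python
import torch
import torch.nn as nn


class SmallCNN(nn.Module):
    def __init__(self, input_shape=(3, 32, 32), num_classes=10):
        super().__init__()
        self.conv_model = nn.Sequential(
            nn.Conv2d(input_shape[0], 32, 3, 1, 1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2, 0),
            nn.Conv2d(32, 64, 3, 1, 1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2, 2, 0),
            nn.Flatten(),  # (B, C, H, W) -> (B, C * H * W)
        )
        # Probe the convolutional part with a dummy input to get the feature size
        with torch.no_grad():
            output_size = self.conv_model(torch.zeros(1, *input_shape)).shape[1]
        self.fc_model = nn.Sequential(
            nn.Linear(output_size, 256), nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, inputs):
        return self.fc_model(self.conv_model(inputs))


model = SmallCNN()
logits = model(torch.randn(4, 3, 32, 32))
print(logits.shape)
```

With the flattening inside `conv_model`, the `forward` method no longer needs an explicit `view`.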

Transposed convolution

Given two 1D-vectors \(x_1, k\), say \(k = [c, b, a]\) \[ y_1 = (x_1 * k) = \begin{bmatrix} b & c & 0 & 0 & \cdots & 0 & 0 \\ a & b & c & 0 & \cdots & 0 & 0 \\ 0 & a & b & c & \cdots & 0 & 0 \\ 0 & 0 & a & b & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & b & c \\ 0 & 0 & 0 & 0 & \cdots & a & b \\ \end{bmatrix} . \begin{bmatrix} \\ x_1 \\ \phantom{}\end{bmatrix} = W_k x_1 \]

If we compute the gradient of the loss, in denominator layout: \[ \frac{\partial L}{\partial x_1} = \frac{\partial y_1}{\partial x_1}\frac{\partial L}{\partial y_1} = W_k^T \frac{\partial L}{\partial y_1} \]

Hence the name transposed convolution, or backward convolution. This will pop up again when speaking about deconvolution. Note also that, while a convolution can downscale a signal (with stride>1), a transposed convolution can upscale it.
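The relation between the gradient and the transposed convolution can be checked numerically. The sketch below uses PyTorch's functional API on arbitrary random data (sizes and seed are assumptions): it compares the gradient computed by autograd with the output of `conv_transpose1d` applied to the upstream gradient.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 8, requires_grad=True)  # (batch, channels, length)
k = torch.randn(1, 1, 3)                      # one 1D kernel of size 3

y = F.conv1d(x, k, padding=1)  # y = W_k x
g = torch.randn_like(y)        # stands for dL/dy
y.backward(g)                  # autograd computes dL/dx = W_k^T g

grad_by_transposed_conv = F.conv_transpose1d(g, k, padding=1)
same = torch.allclose(x.grad, grad_by_transposed_conv, atol=1e-6)
print(same)

# With stride > 1 the transposed convolution upscales : 8 -> 2 * (8 - 1) + 3 = 17
up = F.conv_transpose1d(g, k, stride=2)
print(up.shape)
```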

10 years of CNN revolution

Multicolumn CDNN

Introduced in (Ciresan, Meier, & Schmidhuber, 2012), ensemble of CNNs trained with dataset augmentation

\(0.23\%\) test misclassification on MNIST.
1.5 million parameters

SuperVision

Introduced in (Krizhevsky, Sutskever, & Hinton, 2012), the “spark” giving birth to the revival of neural networks.

Top 5 error of \(16\%\), runner-up at \(26\%\)
several convolutions stacked before pooling
trained on 2 GPUs, for a week on ImageNet (resized to \(256\times256\times3\)), 1M images. (now it’s 18 minutes)
60 Million parameters, dropout, momentum, L2 penalty, dataset augmentation (rand crop \(224\times224\), translation, reflections, PCA)
Learning rate at \(0.01\) divided by \(10\) when validation error stalls
at test time, avg probabilities on \(5\) crops + reflections
The conv layers are cheap but super important

SuperVision

The first layer learned to extract meaningful features

ZFNet

ILSVRC’13 winner. Introduced in (Zeiler & Fergus, 2014)

Introduced visualization techniques to inspect which features are learned.

Ablation studies on AlexNet : the FC layers are not that important
Introduced the idea of supervised pretraining (pretraining on ImageNet, finetune the softmax for Caltech-101, Caltech-256, Pascal 2012)
SGD minibatch(128), momentum(0.9), learning rate (0.01) manual schedule,

Deconvnet computes approximately the gradient of the loss w.r.t. the input (Simonyan, Vedaldi, & Zisserman, 2014). It differs in the way the ReLU is integrated.

VGG

ILSVRC’14 1st runner up. Introduced by (Simonyan & Zisserman, 2015).

16 layers : 13 convolutive, 3 fully connected
Only \(3\times3\) convolution, \(2\times2\) pooling
Two stacked \(3\times3\) convolutions \(\equiv\) the receptive field of a \(5\times5\) convolution, with fewer parameters
- If \(c_{in}=K, c_{out}=K\), \(5\times5\) convolution \(\rightarrow\) \(25K^2\) parameters
- If \(c_{in}=K, c_{out}=K\), 2 stacked \(3\times3\) convolution \(\rightarrow\) \(18K^2\) parameters
140 million parameters, batch size(256), Momentum(0.9), Weight decay(\(0.0005\)), Dropout(0.5) in FC, learning rate(\(0.01\)) divided \(3\) times by \(10\)
Initialization of \(B,C,D,E\) from trained \(A\). Init of \(A\) random \(\mathcal{N}(0, 10^{-2}), b=0\). Noticed (Glorot & Bengio, 2010) after submission.
can cope with variable input size changing the FC layers to conv \(7\times 7\), conv\(1\times1\).
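The \(25K^2\) versus \(18K^2\) comparison above can be reproduced with a few lines. This is a sketch that ignores biases, with an arbitrary \(K=64\) chosen for illustration.

```python
def conv_params(cin, cout, ksize):
    """Weights of a ksize x ksize convolution, biases ignored."""
    return cin * cout * ksize * ksize


def stacked_receptive_field(n_layers, ksize=3):
    """Receptive field of n stacked stride-1 convolutions."""
    return 1 + n_layers * (ksize - 1)


K = 64  # illustrative number of channels (assumption)
single_5x5 = conv_params(K, K, 5)   # 25 K^2
two_3x3 = 2 * conv_params(K, K, 3)  # 18 K^2
print(single_5x5, two_3x3, stacked_receptive_field(2))
```

The stacked version also inserts an extra non-linearity between the two convolutions.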

Striving for simplicity

Introduced in (Springenberg, Dosovitskiy, Brox, & Riedmiller, 2015).

uses only convolutions, with various strides, no max pooling
introduces “guided backpropagation” visualization

GoogLeNet (inception v1)

ILSVRC’14 winner. Introduced by (Szegedy et al., 2014).

Idea : multi-scale feature detection and dimensionality reduction

22 layers, \(6.8\)M parameters
trained in parallel with asynchronous SGD, momentum(0.9), learning rate decreased by \(4\%\) every 8 epochs
at test time : Polyak averaging and ensemble of \(7\) models
auxiliary heads to mitigate vanishing gradient

Residual Networks (ResNet)

ILSVRC’15 winner. Introduced in (He et al., 2016a)

ResNet34. Dotted shortcuts and conv“/2” are stride 2 to match the spatial dimensions. Dotted shortcuts use \(1\times1\) conv to match the depth. \(0.46\)M parameters.

ResNet architectures. Conv are “Conv-BN-ReLU”. ResNet-50 has \(23\)M parameters.

Conv branch variations (He et al., 2016b): BN-ReLU-conv instead of conv-BN-ReLU

Variations around skip layer connections

Highway Networks (Srivastava, Greff, & Schmidhuber, 2015)

Uses “gates” (as in LSTM, see lectures on RNN) :
- Transform gate \(T(x) = \sigma(W_T x + b_T)\)
- Carry gate \(C(x) = \sigma(f_c(x))\)

\[ y = T(x).H(x) + C(x).x \]
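A minimal PyTorch sketch of such a gated layer follows. It makes two assumptions that are not on the slide: the carry gate is coupled to the transform gate, \(C(x) = 1 - T(x)\), and \(H\) is a linear layer followed by a ReLU. The class name and the default gate bias are illustrative.

```python
import torch
import torch.nn as nn


class HighwayLayer(nn.Module):
    def __init__(self, dim, gate_bias=-2.0):
        super().__init__()
        self.H = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.T = nn.Linear(dim, dim)
        # Negative gate bias : the layer initially mostly carries its input
        nn.init.constant_(self.T.bias, gate_bias)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))  # transform gate T(x)
        c = 1.0 - t                   # assumption : coupled carry gate C = 1 - T
        return t * self.H(x) + c * x


layer = HighwayLayer(16)
x = torch.randn(4, 16)
y = layer(x)
print(y.shape)
```

When the transform gate saturates at \(0\), the layer reduces to the identity, which eases the gradient flow through deep stacks.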

DenseNets

Other networks

FitNet [Romero(2015)], WideResNet(2017), MobileNet v1, v2, v3 [Howard(2019)] : searching for the best architecture, EfficientNet (Tan & Le, 2020)

CNN design principles

Number of filters

You should increase the number of filters throughout the network :

the first layer extracts low level features
the higher layers compose on the lower layer dictionary of features

Examples :

LeNet-5 (1998) : \(6\) filters \(5\times5\), \(16\) filters \(5\times5\)
AlexNet (2012) : \(96\) filters \(11\times11\), \(256\) filters \(5\times5\), \(2\times\) (\(384\) filters \(3\times3\)), \(256\) filters \(3\times3\)
VGG (2014) : \(64-128-256-512\), all \(3\times 3\)
ResNet (2015) : \(64-128-256-512\), all \(3\times 3\)
Inception (2015) : \(32-64-80-192-288-768-1280-2048\), \(1\times1, 3\times3, '5\times5'\)

EfficientNet (Tan & Le, 2020) studies the scaling strategies of conv. models.

Effective receptive field (1/3)

Effective receptive field (2/3)

Effective receptive field (3/3)

For calculating the effective receptive field size, see this guide on conv arithmetic.
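As a complement, the theoretical receptive field of a stack of layers can be computed with a short recurrence. This is a sketch under the assumption of undilated layers described by (kernel size, stride) pairs; the function name is illustrative.

```python
def receptive_field(layers):
    """layers : list of (kernel_size, stride) pairs, from input to output."""
    rf, jump = 1, 1  # jump : distance in input pixels between adjacent outputs
    for ksize, stride in layers:
        rf += (ksize - 1) * jump
        jump *= stride
    return rf


three_convs = [(3, 1)] * 3
with_pooling = [(3, 1), (2, 2), (3, 1)]  # conv 3x3, pool 2x2 stride 2, conv 3x3
print(receptive_field(three_convs), receptive_field(with_pooling))
```

Each layer enlarges the receptive field by \((k-1)\) times the current jump, so layers placed after a strided layer contribute more.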

À trous convolutions

Your effective receptive field can grow faster with à trous convolutions (also called dilated convolutions) (Yu & Koltun, 2016):

Illustrations from this guide on conv arithmetic. The constructor of nn.Conv2d accepts a dilation argument.
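A small sketch of the dilation argument in PyTorch, on an arbitrary \(9\times9\) input: without padding, the output shrinks by the spatial extent of the kernel minus one, which makes the larger extent of the dilated kernel visible while the number of parameters is unchanged.

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 1, 9, 9)
dense = nn.Conv2d(1, 1, kernel_size=3, dilation=1)
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)

dense_shape = tuple(dense(x).shape)      # extent 3 : 9 -> 7
dilated_shape = tuple(dilated(x).shape)  # extent 2 * (3 - 1) + 1 = 5 : 9 -> 5

# Same number of parameters (9 weights and 1 bias) in both cases
n_dense = sum(p.numel() for p in dense.parameters())
n_dilated = sum(p.numel() for p in dilated.parameters())
print(dense_shape, dilated_shape, n_dense, n_dilated)
```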

Stacking and factorizing small kernels

Introduced in Inception v3 (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2015)

Stacking \(2(3\times3)\) conv

\(n\) input filters,\(\alpha n\) output filters :

\((\alpha n, 5\times5)\) conv : \(25 \alpha n^2\) params
\((\sqrt{\alpha}n,3\times3)\)- \((\alpha n, 3\times3)\) : \(9\sqrt{\alpha}n^2+9\sqrt{\alpha}\alpha n^2\) params;

\(\alpha=2 \Rightarrow -24\%\) (\(\sqrt{\alpha}\) is critical!)

\(1\times3\) and \(3\times1\) conv

\(n\) input filters,\(\alpha n\) output filters :

\((\alpha n, 3\times3)\) conv : \(9 \alpha n^2\) params
\((\sqrt{\alpha}n, 1\times3)\) - \((\alpha n, 3\times1)\) : \(3\sqrt{\alpha}n^2 + 3\alpha \sqrt{\alpha}n^2\) params

\(\alpha=2 \Rightarrow -30\%\)
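The two savings quoted above follow directly from the parameter formulas. The sketch below evaluates them for \(\alpha=2\), ignoring biases (the factor \(n^2\) cancels in the ratios); function names are illustrative.

```python
from math import sqrt


def params_5x5(alpha, n=1):
    return 25 * alpha * n**2


def params_two_3x3(alpha, n=1):
    # (sqrt(alpha) n, 3x3) followed by (alpha n, 3x3)
    return 9 * sqrt(alpha) * n**2 + 9 * sqrt(alpha) * alpha * n**2


def params_3x3(alpha, n=1):
    return 9 * alpha * n**2


def params_1x3_3x1(alpha, n=1):
    # (sqrt(alpha) n, 1x3) followed by (alpha n, 3x1)
    return 3 * sqrt(alpha) * n**2 + 3 * alpha * sqrt(alpha) * n**2


alpha = 2
saving_stacked = 1 - params_two_3x3(alpha) / params_5x5(alpha)
saving_asym = 1 - params_1x3_3x1(alpha) / params_3x3(alpha)
print(f"{saving_stacked:.1%} {saving_asym:.1%}")
```

The computed savings are about \(24\%\) and \(29\%\), in line with the rounded figures on the slides.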

See also the recent work on “Rethinking Model scaling for convolutional neural networks” (Tan & Le, 2020)

Depthwise separable convolutions

Inception and Xception, MobileNets. It separates :

feature extraction in each channel, in space : depthwise convolution
feature combination between channels : pointwise convolution \(1\times1\)

Depthwise and pointwise convolutions (Howard et al., 2017)
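In PyTorch, a depthwise convolution can be written with the `groups` argument of nn.Conv2d. The following sketch compares a standard convolution with its depthwise separable counterpart; the channel counts are illustrative assumptions and biases are disabled to keep the counts simple.

```python
import torch
import torch.nn as nn


def count_params(module):
    return sum(p.numel() for p in module.parameters())


cin, cout, ksize = 32, 64, 3

standard = nn.Conv2d(cin, cout, ksize, padding=1, bias=False)
separable = nn.Sequential(
    # depthwise : one ksize x ksize filter per input channel (groups=cin)
    nn.Conv2d(cin, cin, ksize, padding=1, groups=cin, bias=False),
    # pointwise : 1x1 convolution combining the channels
    nn.Conv2d(cin, cout, 1, bias=False),
)

x = torch.randn(2, cin, 16, 16)
same_shape = standard(x).shape == separable(x).shape
n_standard = count_params(standard)    # cin * cout * ksize^2
n_separable = count_params(separable)  # cin * ksize^2 + cin * cout
print(same_shape, n_standard, n_separable)
```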

Multi-scale feature extraction

See also the Feature Pyramid Networks for multi-scale features.

Dimensionality reduction

Trainable non-linear transformation of the channels. Network in network (Lin, Chen, & Yan, 2014)

Easing the gradient flow

You can check the norm of the gradient w.r.t. the first layers’ parameters to diagnose vanishing gradients

Shortcut connections (e.g. ResNet, DenseNet, Highway)

auxiliary heads (e.g. GoogleNet)
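The gradient norm check mentioned above can be sketched as follows. The deep sigmoid MLP, the random data and the loss are assumptions chosen only because such a network is prone to vanishing gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Illustrative deep MLP with sigmoids, a setting prone to vanishing gradients
layers = []
for _ in range(10):
    layers += [nn.Linear(32, 32), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(32, 1))

loss = model(torch.randn(64, 32)).pow(2).mean()
loss.backward()

# Norm of the gradient w.r.t. the weights of every linear layer
grad_norms = {name: p.grad.norm().item()
              for name, p in model.named_parameters() if name.endswith("weight")}
for name, norm in grad_norms.items():
    print(f"{name:>10s} : {norm:.2e}")
```

Gradient norms that are orders of magnitude smaller in the first layers than in the last ones are a sign that shortcut connections or auxiliary heads may help.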

Do we need max pooling ?

Recent architectures remove the max pooling layers and replace them by conv(stride=2) for downsampling

MobileNetv2 (Sandler, Howard, Zhu, Zhmoginov, & Chen, 2018). Bottleneck used also in EfficientNet(2019)

Striving for simplicity (Springenberg et al., 2015)

Model and weight averaging

All the competitors in ImageNet do perform model averaging.

Model averaging

Weight averaging

If you worry about the increased computational complexity, see knowledge distillation (Hinton, Vinyals, & Dean, 2015) : training a light model with the soft targets (vs. the labels, i.e. the hard targets) of a computationally intensive one.
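A minimal sketch of a distillation loss in PyTorch, mixing soft and hard targets. The temperature `T`, the mixing weight `alpha` and the random logits are illustrative assumptions; the \(T^2\) scaling of the soft term follows the suggestion of the paper.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    """alpha weights the soft-target term, (1 - alpha) the hard-target term."""
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    log_student = F.log_softmax(student_logits / T, dim=1)
    # KL(teacher || student), scaled by T^2 to keep gradient magnitudes comparable
    soft_loss = F.kl_div(log_student, soft_targets, reduction="batchmean") * T * T
    hard_loss = F.cross_entropy(student_logits, targets)
    return alpha * soft_loss + (1 - alpha) * hard_loss


torch.manual_seed(0)
teacher_logits = torch.randn(8, 10)                      # output of the heavy model
student_logits = torch.randn(8, 10, requires_grad=True)  # output of the light model
targets = torch.randint(0, 10, (8,))

loss = distillation_loss(student_logits, teacher_logits, targets)
loss.backward()
print(loss.item())
```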

We need data !

Using pre-trained models

All the frameworks provide you with a model zoo of pre-trained networks. E.g. in PyTorch, for image classification. You can cut the head and finetune the softmax only.


import torchvision.models as models

# torchvision >= 0.13 selects the pretrained weights with the weights argument ;
# older releases used pretrained=True instead
resnet18 = models.resnet18(weights="DEFAULT")
alexnet = models.alexnet(weights="DEFAULT")
vgg16 = models.vgg16(weights="DEFAULT")
inception = models.inception_v3(weights="DEFAULT")
googlenet = models.googlenet(weights="DEFAULT")
mobilenet = models.mobilenet_v2(weights="DEFAULT")
resnext50_32x4d = models.resnext50_32x4d(weights="DEFAULT")
[...]

Do not forget the input normalization !

from torchvision import transforms

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

Have a look at the torchvision documentation : there are pretrained models for classification, detection, segmentation, … See also PyTorch Hub and timm for very up-to-date image models.

Dataset augmentation

You can oversample around your training samples by applying transforms on the inputs that make predictable changes on the targets.

color jittering, translations, reflections, rotations, PCA, …

Some images generated with imgaug. All Physarum polycephalum, right ? Source image from the CNRS

Libraries for augmentation : albumentations, imgaug

Warning : your augmentation transforms must be well calibrated : you must be able to predict the change in label given the change of input !
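As an example of a transform whose effect on the label is predictable, the sketch below flips an image horizontally together with a keypoint annotation. The helper name, the tiny image and the keypoint are assumptions made for illustration; for a classification label, the same flip would usually leave the target unchanged.

```python
import numpy as np


def hflip_with_keypoint(image, keypoint_xy):
    """Flip an (H, W, C) image left-right and move the keypoint accordingly."""
    width = image.shape[1]
    x, y = keypoint_xy
    return image[:, ::-1, :].copy(), (width - 1 - x, y)


image = np.zeros((4, 6, 3), dtype=np.uint8)
keypoint = (1, 2)                      # (x, y) in pixel coordinates
image[keypoint[1], keypoint[0]] = 255  # mark the keypoint in the image

flipped, new_keypoint = hflip_with_keypoint(image, keypoint)
print(new_keypoint)
```

The marked pixel and the transformed keypoint still coincide after the flip, which is the consistency the warning above asks for.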

Example CNN

CIFAR-100 dataset

The CIFAR-100 dataset is made of \(100\) classes with \(600\) images per class.
The images are \(32\times 32\) RGB

The training set has \(500\times 100\) images, and the test set has \(100 \times 100\) images.

Model architecture and optimization setup

Operator	Resolution	RF size	#Channels
ConvBlock	\(32\times32\)	\(5\times5\)	32
ConvBlock	\(32\times32\)	\(9\times9\)	32
Sub	\(16\times16\)	\(15\times15\)	32
ConvBlock	\(16\times16\)	\(15\times15\)	128
ConvBlock	\(16\times16\)	\(23\times23\)	128
Sub	\(8\times8\)	\(31\times31\)	128
AvgPool	\(1\times1\)		128
Linear	\(100\)

ConvBlock: 2x [Conv(1x3)-(BN)-Relu-Conv(3x1)-(BN)-Relu]
Sub : Conv(3x3, stride=2)-(BN)-Relu

Common settings :

BatchSize(32),
SGD(lrate=0.01) with momentum(0.9)
learning rate halved every 50 epochs
validation on \(20\%\), early stopping on the val loss

Different configurations :

base
Conv-BN-Relu or Conv-Relu
dataset augmentation (HFlip, Trans(5pix), Scale(0.8,1.2)), CenterCrop(32)
Dropout, L2, label smoothing

Number of parameters: \(\simeq 2M\)
Time per epoch (1080Ti) : 17s, total training time : 42min

If applied, only the weights of the convolution and linear layers are regularized (not the bias, nor the coefficients of the Batch Norm)
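One possible way to apply the penalty to the convolution and linear weights only is to use optimizer parameter groups. The tiny model below is an illustrative assumption, and mapping the L2 coefficient to the `weight_decay` argument of SGD is also an assumption about how the penalty is implemented.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 100),
)

decay, no_decay = [], []
for module in model.modules():
    for name, p in module.named_parameters(recurse=False):
        # Regularize only the weights of convolution and linear layers
        if isinstance(module, (nn.Conv2d, nn.Linear)) and name == "weight":
            decay.append(p)
        else:  # biases and BatchNorm coefficients
            no_decay.append(p)

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 0.0025},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.01, momentum=0.9,
)
print(len(decay), len(no_decay))
```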

Results

With batchnorm after every convolution (Note it is also regularizing the network)

With dataset augmentation (HFlip, Scale, Trans)

With regularization : L2 (0.0025), Dropout(0.5), Label smoothing(0.1)

Bibliography

References

Rather check the full online document references.pdf

Ciresan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (pp. 3642–3649). Providence, RI: IEEE. https://doi.org/10.1109/CVPR.2012.6248110

Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202. https://doi.org/10.1007/BF00344251

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010 (p. 8).

He, K., Zhang, X., Ren, S., & Sun, J. (2016a). Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). Las Vegas, NV, USA: IEEE. https://doi.org/10.1109/CVPR.2016.90

He, K., Zhang, X., Ren, S., & Sun, J. (2016b). Identity Mappings in Deep Residual Networks. arXiv:1603.05027 [Cs]. Retrieved from http://arxiv.org/abs/1603.05027

Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1503.02531

Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., … Adam, H. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861 [Cs]. Retrieved from http://arxiv.org/abs/1704.04861

Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., & Weinberger, K. Q. (2017). Snapshot ensembles: Train 1, get m for free. In (p. 14).

Huang, G., Liu, Z., Maaten, L. van der, & Weinberger, K. Q. (2018). Densely Connected Convolutional Networks. arXiv:1608.06993 [Cs]. Retrieved from http://arxiv.org/abs/1608.06993

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541–551. https://doi.org/10.1162/neco.1989.1.4.541

Lin, M., Chen, Q., & Yan, S. (2014). Network In Network. arXiv:1312.4400 [Cs]. Retrieved from http://arxiv.org/abs/1312.4400

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). MobileNetV2: Inverted Residuals and Linear Bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4510–4520). Salt Lake City, UT: IEEE. https://doi.org/10.1109/CVPR.2018.00474

Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv:1312.6034 [Cs]. Retrieved from http://arxiv.org/abs/1312.6034

Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. In arXiv:1409.1556 [cs]. Retrieved from http://arxiv.org/abs/1409.1556

Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2015). Striving for Simplicity: The All Convolutional Net. arXiv:1412.6806 [Cs]. Retrieved from http://arxiv.org/abs/1412.6806

Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Training Very Deep Networks, 9.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2014). Going Deeper with Convolutions. arXiv:1409.4842 [Cs]. Retrieved from http://arxiv.org/abs/1409.4842

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. arXiv:1512.00567 [Cs]. Retrieved from http://arxiv.org/abs/1512.00567

Tan, M., & Le, Q. V. (2020). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv:1905.11946 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1905.11946

Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. arXiv:1511.07122 [Cs]. Retrieved from http://arxiv.org/abs/1511.07122

Zeiler, M. D., & Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer Vision – ECCV 2014 (pp. 818–833). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-10590-1_53