# Deep learning

An introduction to deep learning

September 18, 2021

# Convolutional neural networks

## Extracting features with convolutions

From data that have a spatial structure (locally correlated), features can be extracted with convolutions.

On Images

That also makes sense for temporal series that have a structure in time.

## A convolution as a sparse matrix multiply

What is a convolution : Example in 2D

Seen as a matrix multiplication

Given two 1D-vectors $$f, k$$, say $$k = [c, b, a]$$ $(f * k) = \begin{bmatrix} b & c & 0 & 0 & \cdots & 0 & 0 \\ a & b & c & 0 & \cdots & 0 & 0 \\ 0 & a & b & c & \cdots & 0 & 0 \\ 0 & 0 & a & b & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & b & c \\ 0 & 0 & 0 & 0 & \cdots & a & b \\ \end{bmatrix} . \begin{bmatrix} \\ f \\ \phantom{}\end{bmatrix}$
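
The matrix view above can be checked numerically. The following is a minimal NumPy sketch, assuming zero padding at the borders; the helper name `conv_matrix` and the numeric values of `a`, `b`, `c` are illustrative and not part of the original slides.

```python
import numpy as np

def conv_matrix(a, b, c, n):
    """Banded (Toeplitz) matrix for the kernel k = [c, b, a] on a length-n signal."""
    return (b * np.eye(n)            # main diagonal
            + c * np.eye(n, k=1)     # superdiagonal
            + a * np.eye(n, k=-1))   # subdiagonal

a, b, c = 1.0, 2.0, 3.0
f = np.arange(6, dtype=float)
W = conv_matrix(a, b, c, len(f))

as_matmul = W @ f
as_conv = np.convolve(f, [c, b, a], mode="same")  # zero-padded convolution
nonzero_fraction = np.count_nonzero(W) / W.size   # most entries are exactly 0
```

Only three distinct values appear in `W`, and the fraction of nonzero entries shrinks as the signal gets longer.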

## Composition to learn higher level features

Local features can be combined to learn higher level features.

Let us build a house detector

## Architecture

Idea: use the structure of the inputs to limit the number of parameters without limiting the expressiveness of the network.

• For inputs with spatial (or temporal) correlations, features can be extracted with convolutions of local kernels

• A convolution can be seen as a fully connected layer with :
  • a lot of weights set exactly to $$0$$
  • a lot of weights shared across positions

$$\rightarrow$$ strongly regularized !

## Vanilla CNN of LeCun

The architecture of LeNet-5 (LeCun et al., 1998), let’s call it the Vanilla CNN

Architecture

Two main parts :
- convolutional part : C1 -> C5 : convolution - non-linearity - subsampling
- fully connected part :

Specificities :
- Weighted sub-sampling
- Gaussian connections (RBF output layer)
- connectivity pattern $$S_2 - C_3$$ to reduce the number of weights

Number of parameters :

| Layer | Parameters |
|---|---|
| $$C_1$$ | $$156$$ |
| $$S_2$$ | $$12$$ |
| $$C_3$$ | $$1{,}516$$ |
| $$S_4$$ | $$32$$ |
| $$C_5$$ | $$48{,}120$$ |
| $$F_6$$ | $$10{,}164$$ |
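
These counts can be reproduced by hand. The sketch below assumes the layer sizes of the LeNet-5 paper, which are not stated in this section: $$5\times5$$ kernels, 6 / 16 / 120 feature maps in C1 / C3 / C5, 84 units in F6, one coefficient and one bias per sub-sampling map, and a C3 connectivity where 6 maps read 3 inputs, 9 maps read 4 inputs and 1 map reads all 6.

```python
def conv_params(n_maps, n_inputs, ksize=5):
    """Weights plus one bias for n_maps feature maps, each reading n_inputs input maps."""
    return n_maps * (n_inputs * ksize * ksize + 1)

params = {
    "C1": conv_params(6, 1),
    "S2": 6 * 2,   # one trainable coefficient and one bias per map
    "C3": conv_params(6, 3) + conv_params(9, 4) + conv_params(1, 6),
    "S4": 16 * 2,
    "C5": conv_params(120, 16),
    "F6": 84 * (120 + 1),
}
```

With a full $$S_2 - C_3$$ connectivity, C3 would instead need `conv_params(16, 6)` parameters, which is what the sparse connectivity pattern avoids.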

## CNN Vocabulary

The building blocks of the convolutional part of a vanilla CNN

Convolution :
- size (e.g. $$3 \times 3$$, $$5\times 5$$)
- padding (e.g. $$1$$, $$2$$)
- stride (e.g. $$1$$)

Pooling (max/average):
- size (e.g. $$2\times 2$$)
- padding (e.g. $$0$$)
- stride (e.g. $$2$$)

We work with 4D tensors for 2D images, 3D tensors for nD temporal series (e.g. multiple simultaneous recordings), 2D tensors for 1D temporal series

In PyTorch, the tensors follow the Batch-Channel-Height-Width (BCHW, channel-first) convention. Other frameworks, like TensorFlow or CNTK, use BHWC (channel-last).
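
To make the vocabulary concrete, here is a small sketch of how size, padding and stride determine the output resolution, and of what the BCHW layout looks like. The helper `conv_output_size` and the chosen sizes are illustrative assumptions; the formula ignores dilation.

```python
import torch
import torch.nn as nn

def conv_output_size(size, ksize, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - ksize) // stride + 1

batch = torch.zeros(8, 3, 32, 48)               # B, C, H, W
conv = nn.Conv2d(3, 16, kernel_size=5, stride=1, padding=2)
pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)

features = pool(conv(batch))                    # still channel-first
channel_last = features.permute(0, 2, 3, 1)     # BCHW -> BHWC
```

A $$5\times5$$ convolution with padding $$2$$ keeps the resolution, and the $$2\times2$$ pooling with stride $$2$$ halves it.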

## CNN in practice

PyTorch code for implementing a CNN : nn.Conv1d, nn.Conv2d, nn.MaxPool1d, nn.MaxPool2d, nn.AvgPool2d, etc.


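    import torch
    import torch.nn as nn

    def conv_relu_maxpool(cin, cout, csize, cstride, cpad, msize, mstride, mpad):
        # One "conv - ReLU - max pool" stage, as a list to unpack into nn.Sequential
        return [nn.Conv2d(cin, cout, csize, cstride, cpad),  # e.g. Conv2d(32, 64, 3, 1, 1)
                nn.ReLU(inplace=True),
                nn.MaxPool2d(msize, mstride, mpad)]

    def linear_relu(dim_in, dim_out):
        return [nn.Linear(dim_in, dim_out), nn.ReLU(inplace=True)]

    # Kernel, stride and padding values are assumed (taken from the vocabulary examples)
    conv_model = nn.Sequential(
        *conv_relu_maxpool(cin=3, cout=32, csize=3, cstride=1, cpad=1, msize=2, mstride=2, mpad=0),
        *conv_relu_maxpool(cin=32, cout=64, csize=3, cstride=1, cpad=1, msize=2, mstride=2, mpad=0),
    )

    # output_size is the flattened size of the conv_model output (see the next question)
    fc_model = nn.Sequential(
        *linear_relu(output_size, 256),
        nn.Linear(256, num_classes)
    )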

How can I get the feature dimensions of conv_model output ?


    with torch.no_grad():  # height, width: size of the input images
        dummy_output = conv_model(torch.zeros((1, 3, height, width)))
    output_size = dummy_output[0].numel()  # flattened feature size per sample

All of these should fit into a nn.Module subclass :


    class MyModel(torch.nn.Module):
        def __init__(self, ...):
            super().__init__()
            self.conv_model = nn.Sequential(...)
            output_size = ...
            self.fc_model = nn.Sequential(...)

        def forward(self, inputs):
            conv_features = self.conv_model(inputs)
            # Flatten to (batch_size, features) before the fully connected part
            conv_features = conv_features.view(inputs.shape[0], -1)
            return self.fc_model(conv_features)
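
As one possible way to fill in the skeleton above, the sketch below assembles the pieces into a runnable module. The class name `SmallCNN`, the layer sizes and the `input_shape` argument are assumptions made for illustration, not the configuration used in the lecture.

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, input_shape, num_classes):
        super().__init__()
        cin, height, width = input_shape
        self.conv_model = nn.Sequential(
            nn.Conv2d(cin, 32, 3, 1, 1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, 1, 1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        # Dummy forward pass to read off the flattened feature size
        with torch.no_grad():
            dummy = self.conv_model(torch.zeros(1, cin, height, width))
        output_size = dummy[0].numel()
        self.fc_model = nn.Sequential(
            nn.Linear(output_size, 256), nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, inputs):
        conv_features = self.conv_model(inputs)
        conv_features = conv_features.view(inputs.shape[0], -1)
        return self.fc_model(conv_features)

model = SmallCNN(input_shape=(3, 32, 32), num_classes=10)
logits = model(torch.randn(4, 3, 32, 32))
```

Computing `output_size` inside the constructor means the fully connected part adapts automatically when the input resolution or the convolutional part changes.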

## Transposed convolution

Given two 1D-vectors $$x_1, k$$, say $$k = [c, b, a]$$ $y_1 = (x_1 * k) = \begin{bmatrix} b & c & 0 & 0 & \cdots & 0 & 0 \\ a & b & c & 0 & \cdots & 0 & 0 \\ 0 & a & b & c & \cdots & 0 & 0 \\ 0 & 0 & a & b & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & b & c \\ 0 & 0 & 0 & 0 & \cdots & a & b \\ \end{bmatrix} . \begin{bmatrix} \\ x_1 \\ \phantom{}\end{bmatrix} = W_k x_1$

If we compute the gradient of the loss, in denominator layout: $\frac{\partial L}{\partial x_1} = \frac{\partial y_1}{\partial x_1}\frac{\partial L}{\partial y_1} = W_k^T \frac{\partial L}{\partial y_1}$

Hence the name transposed convolution (or backward convolution). This will come up again when we discuss deconvolution.
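
This identity can be checked with autograd. The sketch below is illustrative only: the tensor sizes and the random upstream gradient are assumptions. Note that `F.conv1d` computes a cross-correlation (the kernel is not flipped), which changes the entries of $$W_k$$ but not the argument.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x1 = torch.randn(1, 1, 8, requires_grad=True)   # (batch, channels, length)
k = torch.randn(1, 1, 3)                        # (out_channels, in_channels, size)

y1 = F.conv1d(x1, k, padding=1)                 # y1 = W_k x1
dL_dy1 = torch.randn_like(y1)                   # stand-in for an upstream gradient

# Gradient w.r.t. the input, computed by autograd: W_k^T dL/dy1
(dL_dx1,) = torch.autograd.grad(y1, x1, grad_outputs=dL_dy1)

# The same quantity obtained with a transposed convolution using the same kernel
dL_dx1_transposed = F.conv_transpose1d(dL_dy1, k, padding=1)
```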

# 10 years of CNN revolution

## Multicolumn CDNN

Introduced in (Ciresan, Meier, & Schmidhuber, 2012), ensemble of CNNs trained with dataset augmentation

• $$0.23\%$$ test misclassification on MNIST.
• 1.5 million parameters

## SuperVision

Introduced in (Krizhevsky, Sutskever, & Hinton, 2012) and better known as AlexNet, the “spark” that revived neural networks.

• Top 5 error of $$16\%$$, runner-up at $$26\%$$
• several convolutions stacked before pooling
• trained on 2 GPUs, for a week on ImageNet (resized to $$256\times256\times3$$), 1M images. (now it’s 18 minutes)
• 60 Million parameters, dropout, momentum, L2 penalty, dataset augmentation (rand crop $$224\times224$$, translation, reflections, PCA)
• Learning rate at $$0.01$$ divided by $$10$$ when validation error stalls
• at test time, avg probabilities on $$5$$ crops + reflections
• The conv layers are cheap but super important

The first layer learned to extract meaningful features

## ZFNet

ILSVRC’13 winner. Introduced in (Zeiler & Fergus, 2014)

• Introduced visualization techniques to inspect which features are learned.
• Ablation studies on AlexNet : the FC layers are not that important

• Introduced the idea of supervised pretraining (pretraining on ImageNet, finetune the softmax for Caltech-101, Caltech-256, Pascal 2012)

• SGD minibatch(128), momentum(0.9), learning rate (0.01) with a manual schedule

Deconvnet computes approximately the gradient of the loss w.r.t. the input (Simonyan, Vedaldi, & Zisserman, 2014). It differs in the way the ReLU is handled.

## VGG

ILSVRC’14 1st runner up. Introduced by (Simonyan & Zisserman, 2015).

• 16 layers : 13 convolutional, 3 fully connected
• Only $$3\times3$$ convolution, $$2\times2$$ pooling
• Two stacked $$3\times3$$ convolutions $$\equiv$$ the receptive field of one $$5\times5$$ convolution, with fewer parameters
• If $$c_{in}=K, c_{out}=K$$, $$5\times5$$ convolution $$\rightarrow$$ $$25K^2$$ parameters
• If $$c_{in}=K, c_{out}=K$$, 2 stacked $$3\times3$$ convolutions $$\rightarrow$$ $$18K^2$$ parameters
• 140 million parameters, batch size(256), Momentum(0.9), Weight decay($$0.0005$$), Dropout(0.5) in FC, learning rate($$0.01$$) divided $$3$$ times by $$10$$
• Initialization of $$B,C,D,E$$ from trained $$A$$. Init of $$A$$ random $$\mathcal{N}(0, 10^{-2}), b=0$$. Noticed (Glorot & Bengio, 2010) after submission.
• can cope with variable input size by changing the FC layers to conv $$7\times 7$$, conv $$1\times1$$.
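
The $$25K^2$$ versus $$18K^2$$ comparison can be verified by counting the weights of the corresponding PyTorch layers. This is a minimal sketch with an arbitrary $$K=64$$ and without biases, so that the counts match the formulas exactly; the helper `n_params` is illustrative.

```python
import torch.nn as nn

def n_params(module):
    """Total number of trainable parameters of a module."""
    return sum(p.numel() for p in module.parameters())

K = 64
single_5x5 = nn.Conv2d(K, K, kernel_size=5, padding=2, bias=False)
stacked_3x3 = nn.Sequential(
    nn.Conv2d(K, K, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(K, K, kernel_size=3, padding=1, bias=False),
)
```

The stacked version also inserts an extra non-linearity between the two convolutions.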

## Striving for simplicity

Introduced in (Springenberg, Dosovitskiy, Brox, & Riedmiller, 2015).

• uses only convolutions, with various strides, no max pooling
• introduces “guided backpropagation” visualization

## GoogLeNet (Inception)

ILSVRC’14 winner. Introduced by (Szegedy et al., 2014).

Idea: multi-scale feature detection and dimensionality reduction

• 22 layers, $$6.8$$M parameters
• trained in parallel, asynchronous SGD, momentum(0.9), learning rate decreased by $$4\%$$ every 8 epochs
• at test : Polyak averaging and ensemble of $$7$$ models

## Residual Networks (ResNet)

ILSVRC’15 winner. Introduced in (He et al., 2016a)

ResNet34. Dotted shortcuts and conv“/2” use stride 2 to match the spatial dimensions. Dotted shortcuts use $$1\times1$$ conv to match the depth. $$0.46$$M parameters.

ResNet architectures. Conv are “Conv-BN-ReLU”. ResNet-50 has $$23$$M parameters.
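
A residual block with both kinds of shortcut can be sketched as follows. This is an illustrative reading of the description above, not the reference implementation: the class name `BasicBlock`, the placement of BatchNorm on the shortcut and the absence of biases are assumptions.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, cin, cout, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride, 1, bias=False),
            nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True),
            nn.Conv2d(cout, cout, 3, 1, 1, bias=False),
            nn.BatchNorm2d(cout),
        )
        if stride != 1 or cin != cout:
            # "Dotted" shortcut: 1x1 conv with the same stride to match depth and resolution
            self.shortcut = nn.Sequential(
                nn.Conv2d(cin, cout, 1, stride, 0, bias=False),
                nn.BatchNorm2d(cout),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        # The block learns a residual that is added to its (possibly projected) input
        return torch.relu(self.body(x) + self.shortcut(x))

x = torch.randn(2, 64, 56, 56)
same = BasicBlock(64, 64)(x)
down = BasicBlock(64, 128, stride=2)(x)
```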

## Variations around skip layer connections

Highway Networks (Srivastava, Greff, & Schmidhuber, 2015)

• Uses “gates” (as in LSTM, see lectures on RNN) :
  • Transform gate $$T(x) = \sigma(W_T x + b_T)$$
  • Carry gate $$C(x) = \sigma(f_c(x))$$

$y = T(x) \cdot H(x) + C(x) \cdot x$

DenseNets

## Other networks

• FitNet [Romero(2015)]
• WideResNet(2017)
• MobileNet v1, v2, v3 [Howard(2019)] : searching for the best architecture
• EfficientNet (Tan & Le, 2020)

# CNN design principles

## Number of filters

You should increase the number of filters throughout the network :

• the first layer extracts low level features
• the higher layers compose on the lower layer dictionary of features

Examples :

• LeNet-5 (1998) : 6 filters $$5\times5$$, 16 filters $$5\times5$$
• AlexNet (2012) : 96 filters $$11\times11$$, 256 filters $$5\times5$$, $$2\times$$(384 filters $$3\times3$$), 256 filters $$3\times3$$
• VGG (2014) : $$64-128-256-512$$, all $$3\times 3$$
• ResNet (2015) : $$64-128-256-512$$, all $$3\times 3$$
• Inception (2015) : $$32-64-80-192-288-768-1280-2048$$, $$1\times1, 3\times3, "5\times5"$$

## Effective receptive field

For calculating the effective receptive field size, see for example this calculator or this guide on conv arithmetic.
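
As a stand-in for such a calculator, the sketch below computes the theoretical receptive field of a stack of convolution and pooling layers from their kernel sizes and strides. The function name and the layer lists are illustrative assumptions; padding and dilation are ignored.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride); returns (receptive field, jump) in input pixels."""
    rf, jump = 1, 1
    for ksize, stride in layers:
        rf += (ksize - 1) * jump   # each layer widens the field by (k - 1) input-space steps
        jump *= stride             # distance, in input pixels, between adjacent outputs
    return rf, jump

two_3x3 = receptive_field([(3, 1), (3, 1)])
with_pooling = receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)])
```

Two stacked $$3\times3$$ convolutions see a $$5\times5$$ window, and every layer placed after a stride-2 operation grows the receptive field twice as fast.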

## À trous convolutions

Your effective receptive field can grow faster with à trous convolutions (also called dilated convolutions) (Yu & Koltun, 2016):

Illustrations from this guide on conv arithmetic. In PyTorch, the nn.Conv2d constructor accepts a dilation argument.
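
A short sketch of the `dilation` argument of `nn.Conv2d`: a dilated $$3\times3$$ kernel keeps 9 weights but covers a wider window. The helper `effective_kernel_size` and the input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

def effective_kernel_size(ksize, dilation):
    """Span covered by a dilated kernel: k + (k - 1)(d - 1)."""
    return ksize + (ksize - 1) * (dilation - 1)

x = torch.zeros(1, 1, 32, 32)
dense = nn.Conv2d(1, 1, kernel_size=3, dilation=1)
dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2)

n_weights = (dense.weight.numel(), dilated.weight.numel())
out_sizes = (dense(x).shape[-1], dilated(x).shape[-1])   # no padding
```

Without padding, the dilated convolution shrinks the output as a $$5\times5$$ kernel would, for the parameter cost of a $$3\times3$$ kernel.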

## Stacking and factorizing small kernels

Introduced in Inception v3 (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2015)

Stacking $$2(3\times3)$$ conv

$$n$$ input filters, $$\alpha n$$ output filters :

• $$(\alpha n, 5\times5)$$ conv : $$25 \alpha n^2$$ params
• $$(\sqrt{\alpha}n,3\times3)$$ - $$(\alpha n, 3\times3)$$ : $$9\sqrt{\alpha}n^2+9\sqrt{\alpha}\alpha n^2$$ params

$$\alpha=2 \Rightarrow -24\%$$

$$1\times3$$ and $$3\times1$$ conv

$$n$$ input filters, $$\alpha n$$ output filters :

• $$(\alpha n, 3\times3)$$ conv : $$9 \alpha n^2$$ params
• $$(\sqrt{\alpha}n, 1\times3)$$ - $$(\alpha n, 3\times1)$$ : $$3\sqrt{\alpha}n^2 + 3\alpha \sqrt{\alpha}n^2$$ params

$$\alpha=2 \Rightarrow -30\%$$
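
The two percentages follow directly from the parameter counts above. The sketch below evaluates them per unit of $$n^2$$; the function name `savings` is illustrative.

```python
from math import sqrt

def savings(alpha):
    """Relative parameter change of each factorization, per unit of n^2."""
    full_5x5 = 25 * alpha
    two_3x3 = 9 * sqrt(alpha) + 9 * sqrt(alpha) * alpha
    full_3x3 = 9 * alpha
    asym_pair = 3 * sqrt(alpha) + 3 * alpha * sqrt(alpha)
    return two_3x3 / full_5x5 - 1, asym_pair / full_3x3 - 1

stacked_change, asymmetric_change = savings(alpha=2)
```

For $$\alpha=2$$ the two values are close to $$-0.24$$ and $$-0.29$$, which the figures above round to $$-24\%$$ and $$-30\%$$.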

See also the recent work on “Rethinking Model scaling for convolutional neural networks” (Tan & Le, 2020)

## Depthwise separable convolutions

Used in Inception, Xception and MobileNets. It separates :

• feature extraction in each channel, in space : depthwise convolution
• feature combination between channels : pointwise convolution $$1\times1$$
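
In PyTorch, the depthwise step can be written with the `groups` argument of `nn.Conv2d`. The sketch below compares the parameter count with a standard convolution; the channel sizes are arbitrary and biases are omitted to keep the counts simple.

```python
import torch
import torch.nn as nn

def n_params(module):
    """Total number of trainable parameters of a module."""
    return sum(p.numel() for p in module.parameters())

cin, cout = 32, 64
standard = nn.Conv2d(cin, cout, kernel_size=3, padding=1, bias=False)
separable = nn.Sequential(
    # depthwise: one 3x3 filter per input channel (groups=cin)
    nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin, bias=False),
    # pointwise: 1x1 conv combining the channels
    nn.Conv2d(cin, cout, kernel_size=1, bias=False),
)
x = torch.randn(1, cin, 16, 16)
y_standard, y_separable = standard(x), separable(x)
```

Both produce outputs of the same shape, with $$9 c_{in} + c_{in} c_{out}$$ weights for the separable version instead of $$9 c_{in} c_{out}$$.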

## Dimensionality reduction

Dimensionality reduction with $$1\times1$$ conv

Trainable non-linear transformation of the channels. Network in network (Lin, Chen, & Yan, 2014)

You can check the norm of the gradient w.r.t. the first layers’ parameters to diagnose vanishing gradients

• Shortcut connections (e.g. ResNet, DenseNet, Highway)
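
Checking gradient norms takes only a few lines. The sketch below uses a deliberately deep stack of sigmoid layers, chosen here only because it is prone to vanishing gradients; the architecture, sizes and loss are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = []
for _ in range(10):
    layers += [nn.Linear(16, 16), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(16, 1))

loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()

# Norm of the gradient w.r.t. each weight matrix, keyed by parameter name
grad_norms = {name: p.grad.norm().item()
              for name, p in model.named_parameters() if name.endswith("weight")}
first, last = grad_norms["0.weight"], grad_norms["20.weight"]
```

A first-layer norm that is orders of magnitude below the last-layer norm is the typical signature of vanishing gradients.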

## Do we need max pooling ?

Recent architectures remove the max pooling layers and replace them with conv(stride=2) for downsampling.

MobileNetv2 (Sandler, Howard, Zhu, Zhmoginov, & Chen, 2018). Bottleneck also used in EfficientNet (2019)

## Model and weight averaging

All the competitors in ImageNet do perform model averaging.

Model averaging

Model averaging performance on ImageNet’12 with multiple models and multiple crops-scale-flips

Weight averaging

If you worry about the increased computational complexity, see knowledge distillation (Hinton, Vinyals, & Dean, 2015).

# We need data !

## Using pre-trained models

All the frameworks provide you with a model zoo of pre-trained networks. E.g. in PyTorch, for image classification. You can cut the head and finetune the softmax only.


    import torchvision.models as models

    resnet18 = models.resnet18(pretrained=True)
    alexnet = models.alexnet(pretrained=True)
    vgg16 = models.vgg16(pretrained=True)
    inception = models.inception_v3(pretrained=True)
    mobilenet = models.mobilenet_v2(pretrained=True)
    resnext50_32x4d = models.resnext50_32x4d(pretrained=True)
    [...]

Warning: do not forget the input normalization!

    from torchvision import transforms

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

Have a look at the torchvision doc: there are pretrained models for classification, detection, segmentation … See also PyTorch Hub.

## Dataset augmentation

You can oversample around your training samples by applying transforms on the inputs that make predictable changes on the targets.

• color jittering, translations, reflections, rotations, PCA, …

Some images generated with imgaug. They are all Physarum polycephalum, right ? Source image from the CNRS

# Example CNN

## CIFAR-100 dataset

• The CIFAR-100 dataset is made of $$100$$ classes with $$600$$ images per class.
• The images are $$32\times 32$$ RGB
• The training set has $$500\times 100$$ images, and the test set has $$100 \times 100$$ images.

## Model architecture and optimization setup

| Operator | Resolution | RF size | #Channels |
|---|---|---|---|
| ConvBlock | $$32\times32$$ | $$5\times5$$ | 32 |
| ConvBlock | $$32\times32$$ | $$9\times9$$ | 32 |
| Sub | $$16\times16$$ | $$15\times15$$ | 32 |
| ConvBlock | $$16\times16$$ | $$15\times15$$ | 128 |
| ConvBlock | $$16\times16$$ | $$23\times23$$ | 128 |
| Sub | $$8\times8$$ | $$31\times31$$ | 128 |
| AvgPool | $$1\times1$$ | | 128 |
| Linear | | | $$100$$ |

ConvBlock: 2x [Conv(1x3)-(BN)-Relu-Conv(3x1)-(BN)-Relu]
Sub : Conv(3x3, stride=2)-(BN)-Relu
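
The two building blocks can be sketched in PyTorch as below. This is one possible reading of the definitions above: the helper names, the "same" padding, and dropping the convolution bias when BatchNorm is used are assumptions, and the sketch does not claim to reproduce the exact model of the experiments.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, ksize, stride=1, use_bn=True):
    padding = (ksize[0] // 2, ksize[1] // 2)       # keeps the resolution when stride=1
    layers = [nn.Conv2d(cin, cout, ksize, stride, padding, bias=not use_bn)]
    if use_bn:
        layers.append(nn.BatchNorm2d(cout))
    return layers + [nn.ReLU(inplace=True)]

def conv_block(cin, cout, use_bn=True):
    # 2x [Conv(1x3)-(BN)-Relu-Conv(3x1)-(BN)-Relu]
    layers = []
    for c in (cin, cout):
        layers += conv_bn_relu(c, cout, (1, 3), use_bn=use_bn)
        layers += conv_bn_relu(cout, cout, (3, 1), use_bn=use_bn)
    return nn.Sequential(*layers)

def sub(channels, use_bn=True):
    # Conv(3x3, stride=2)-(BN)-Relu
    return nn.Sequential(*conv_bn_relu(channels, channels, (3, 3), stride=2, use_bn=use_bn))

x = torch.randn(2, 3, 32, 32)
h = conv_block(3, 32)(x)
h_sub = sub(32)(h)
```

The `use_bn` flag switches between the Conv-BN-Relu and Conv-Relu configurations.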

Common settings :

• BatchSize(32),
• SGD(lrate=0.01) with momentum(0.9)
• learning rate halved every 50 epochs
• validation on $$20\%$$, early stopping on the val loss
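
The optimizer part of these settings could be written as follows. This is a sketch: the placeholder model, the use of `StepLR` for the halving schedule and the 100-epoch loop are assumptions, and the actual training and early stopping logic is omitted.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 100)     # placeholder for the CNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Halve the learning rate every 50 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)

lrs = []
for epoch in range(100):
    lrs.append(optimizer.param_groups[0]["lr"])
    optimizer.step()           # the loop over mini-batches of 32 would go here
    scheduler.step()
```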

Different configurations :

• base
• Conv-BN-Relu or Conv-Relu
• dataset augmentation (HFlip, Trans(5pix), Scale(0.8,1.2)), CenterCrop(32)
• Dropout, L2, label smoothing

Number of parameters: $$\simeq 2M$$
Time per epoch (1080Ti) : 17 s, 42 min training time

If applied, only the weights of the convolution and linear layers are regularized (not the bias, nor the coefficients of the Batch Norm)
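
One way to regularize only the convolution and linear weights is to use optimizer parameter groups. In this sketch the toy model is a placeholder, and mapping the L2 coefficient onto the `weight_decay` argument of SGD is an assumption about how the penalty is applied.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 100),
)

decay, no_decay = [], []
for module in model.modules():
    for name, param in module.named_parameters(recurse=False):
        if isinstance(module, (nn.Conv2d, nn.Linear)) and name == "weight":
            decay.append(param)
        else:
            no_decay.append(param)     # biases and BatchNorm coefficients

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 0.0025},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=0.01, momentum=0.9,
)
```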

## Baseline

No regularization (no L2, Dropout, Label smoothing, or data augmentation), no BatchNorm

## With BatchNorm

With batchnorm after every convolution

Note it is also regularizing the network.

## With data augmentation

With dataset augmentation (HFlip, Scale, Trans)

## With regularization

With regularization : L2 (0.0025), Dropout(0.5), Label smoothing(0.1)
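
Label smoothing replaces the one-hot target by a mixture of the one-hot vector and a uniform distribution over the classes. The sketch below writes it out explicitly, under the assumption that the smoothing mass is spread over all classes, including the correct one; the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, targets, smoothing=0.1):
    """Cross entropy against targets mixed with a uniform distribution over classes."""
    num_classes = logits.shape[1]
    log_probs = F.log_softmax(logits, dim=1)
    one_hot = F.one_hot(targets, num_classes).float()
    soft_targets = (1 - smoothing) * one_hot + smoothing / num_classes
    return -(soft_targets * log_probs).sum(dim=1).mean()

torch.manual_seed(0)
logits = torch.randn(4, 100)
targets = torch.randint(0, 100, (4,))
loss = smoothed_cross_entropy(logits, targets, smoothing=0.1)
no_smoothing = smoothed_cross_entropy(logits, targets, smoothing=0.0)
```

With `smoothing=0.0` the function reduces to the usual cross entropy.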