An introduction to deep learning
Jeremy Fix
February 15, 2024
Slides made with slidemaker

From data that have a spatial structure (locally correlated), features can be extracted with convolutions.
On Images
That also makes sense for temporal series that have a structure in time.
What is a convolution : Example in 2D
Seen as a matrix multiplication
Given two 1D-vectors \(f, k\), say \(k = [c, b, a]\) \[ (f * k) = \begin{bmatrix} b & c & 0 & 0 & \cdots & 0 & 0 \\ a & b & c & 0 & \cdots & 0 & 0 \\ 0 & a & b & c & \cdots & 0 & 0 \\ 0 & 0 & a & b & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & b & c \\ 0 & 0 & 0 & 0 & \cdots & a & b \\ \end{bmatrix} . \begin{bmatrix} \\ f \\ \phantom{}\end{bmatrix} \]
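This identity is easy to check numerically. The sketch below (numpy, with an arbitrary signal and kernel) builds the banded matrix \(W\) for `k = [c, b, a]` and verifies that multiplying by it matches `np.convolve` in `'same'` mode:

```python
import numpy as np

# Check that 1D convolution is a banded matrix multiply.
# With k = [c, b, a], each row of W applies the flipped kernel:
# (f*k)[i] = a*f[i-1] + b*f[i] + c*f[i+1], with zero padding at the borders.
c, b, a = 1.0, 2.0, 3.0
k = np.array([c, b, a])
f = np.array([1.0, -1.0, 0.5, 2.0, 0.0, 1.5])
n = len(f)

# Build the banded matrix: rows [b, c, 0, ...], [a, b, c, 0, ...], ...
W = np.zeros((n, n))
for i in range(n):
    if i - 1 >= 0:
        W[i, i - 1] = a
    W[i, i] = b
    if i + 1 < n:
        W[i, i + 1] = c

# 'same' mode keeps the output length equal to len(f), like the matrix form
assert np.allclose(W @ f, np.convolve(f, k, mode="same"))
```

The matrix is never built in practice (it is huge and mostly zeros); it only serves to reason about convolution as a linear, strongly constrained operator.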
Local features can be combined to learn higher level features.
Let us build a house detector
Idea : use the structure of the inputs to limit the number of parameters without limiting the expressiveness of the network
For inputs with spatial (or temporal) correlations, features can be extracted with convolutions of local kernels
\(\rightarrow\) strongly regularized !
The architecture of LeNet-5 (LeCun et al., 1989), let’s call it the Vanilla CNN
Architecture
Two main parts :
- convolutional part : C1 -> C5 : convolution - non-linearity - subsampling
- fully connected part : linear - non-linearity
Specificities :
- Weighted sub-sampling
- Gaussian connections (RBF output layer)
- connectivity pattern \(S_2 - C_3\) to reduce the number of weights
Number of parameters :
Layer | Parameters |
---|---|
\(C_1\) | \(156\) |
\(S_2\) | \(12\) |
\(C_3\) | \(1{,}516\) |
\(S_4\) | \(32\) |
\(C_5\) | \(48{,}120\) |
\(F_6\) | \(10{,}164\) |
Convolution :
- size (e.g. \(3 \times 3\), \(5\times 5\))
- padding (e.g. \(1\), \(2\))
- stride (e.g. \(1\))
Pooling (max/average):
- size (e.g. \(2\times 2\))
- padding (e.g. \(0\))
- stride (e.g. \(2\))
We work with 4D tensors for 2D images, 3D tensors for nD temporal series (e.g. multiple simultaneous recordings), 2D tensors for 1D temporal series
In Pytorch, the tensors follow the Batch-Channel-Height-Width (BCHW, channel-first) convention. Other frameworks, like TensorFlow or CNTK, use BHWC (channel-last).
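The BCHW convention and the standard output-size formula for convolutions and poolings, \(\lfloor (n + 2\,\text{padding} - \text{size}) / \text{stride} \rfloor + 1\), can be checked with a short PyTorch sketch (layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# Output spatial size of a conv/pool layer:
# out = floor((in + 2*padding - kernel) / stride) + 1
def out_size(n, kernel, padding, stride):
    return (n + 2 * padding - kernel) // stride + 1

x = torch.randn(1, 3, 32, 32)  # BCHW: batch=1, 3 channels, 32x32 image
conv = nn.Conv2d(3, 16, kernel_size=5, padding=2, stride=1)
pool = nn.MaxPool2d(kernel_size=2, padding=0, stride=2)

y = pool(conv(x))
assert conv(x).shape[-1] == out_size(32, 5, 2, 1)  # padding=2 keeps 32x32
assert y.shape == (1, 16, 16, 16)                  # pooling halves H and W
```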
PyTorch provides the building blocks for implementing a CNN : nn.Conv1d, nn.Conv2d, nn.MaxPool1d, nn.MaxPool2d, nn.AvgPool1d, nn.AvgPool2d, etc.
How can I get the feature dimensions of conv_model output ?
All of these should fit into a nn.Module subclass :
class MyModel(torch.nn.Module):
    def __init__(self, ...):
        super().__init__()
        self.conv_model = nn.Sequential(...)
        output_size = ...
        self.fc_model = nn.Sequential(...)

    def forward(self, inputs):
        conv_features = self.conv_model(inputs)
        conv_features = conv_features.view(inputs.shape[0], -1)
        return self.fc_model(conv_features)
You can also use the nn.Flatten layer instead of the explicit view.
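One common way to fill in `output_size` is a dummy forward pass through the convolutional part at construction time. A minimal sketch (the layer sizes and class counts below are arbitrary, not from the slides):

```python
import torch
import torch.nn as nn

class MyModel(nn.Module):
    def __init__(self, input_size=(1, 28, 28), num_classes=10):
        super().__init__()
        self.conv_model = nn.Sequential(
            nn.Conv2d(input_size[0], 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Dummy forward pass (no gradient needed) to get the feature count
        with torch.no_grad():
            dummy = torch.zeros(1, *input_size)
            output_size = self.conv_model(dummy).view(1, -1).shape[1]
        self.fc_model = nn.Sequential(nn.Linear(output_size, num_classes))

    def forward(self, inputs):
        conv_features = self.conv_model(inputs)
        conv_features = conv_features.view(inputs.shape[0], -1)
        return self.fc_model(conv_features)

model = MyModel()
out = model(torch.randn(4, 1, 28, 28))
assert out.shape == (4, 10)
```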
Given two 1D-vectors \(x_1, k\), say \(k = [c, b, a]\) \[ y_1 = (x_1 * k) = \begin{bmatrix} b & c & 0 & 0 & \cdots & 0 & 0 \\ a & b & c & 0 & \cdots & 0 & 0 \\ 0 & a & b & c & \cdots & 0 & 0 \\ 0 & 0 & a & b & \cdots & 0 & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots\\ 0 & 0 & 0 & 0 & \cdots & b & c \\ 0 & 0 & 0 & 0 & \cdots & a & b \\ \end{bmatrix} . \begin{bmatrix} \\ x_1 \\ \phantom{}\end{bmatrix} = W_k x_1 \]
If we compute the gradient of the loss, in denominator layout: \[ \frac{\partial L}{\partial x_1} = \frac{\partial y_1}{\partial x_1}\frac{\partial L}{\partial y_1} = W_k^T \frac{\partial L}{\partial y_1} \]
Hence the name transposed convolution, or backward convolution. This will pop up again when speaking about deconvolution. Note also that, while a convolution can downscale a signal (with stride > 1), a transposed convolution can upscale it.
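The down/upscaling symmetry can be seen directly in PyTorch with nn.Conv1d and nn.ConvTranspose1d (arbitrary sizes; `output_padding=1` resolves the ambiguity of inverting a stride-2 shape):

```python
import torch
import torch.nn as nn

# A strided convolution downscales; a transposed convolution with the same
# stride upscales back to the original length.
x = torch.randn(1, 1, 16)
down = nn.Conv1d(1, 1, kernel_size=3, stride=2, padding=1)
up = nn.ConvTranspose1d(1, 1, kernel_size=3, stride=2,
                        padding=1, output_padding=1)

y = down(x)  # length (16 + 2 - 3)//2 + 1 = 8
z = up(y)    # length (8 - 1)*2 - 2 + 3 + 1 = 16
assert y.shape == (1, 1, 8)
assert z.shape == (1, 1, 16)
```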
Introduced in (Ciresan, Meier, & Schmidhuber, 2012): an ensemble of CNNs trained with dataset augmentation
Introduced in (Krizhevsky, Sutskever, & Hinton, 2012), the “spark” giving birth to the revival of neural networks.
The first layer learned to extract meaningful features
ILSVRC’13 winner. Introduced in (Zeiler & Fergus, 2014)
Ablation studies on AlexNet : the FC layers are not that important
Introduced the idea of supervised pretraining (pretraining on ImageNet, finetune the softmax for Caltech-101, Caltech-256, Pascal 2012)
SGD with minibatches (128), momentum (0.9), and a learning rate (0.01) following a manual schedule
Deconvnet computes approximately the gradient of the loss w.r.t. the input (Simonyan, Vedaldi, & Zisserman, 2014). It differs in the way the ReLU is integrated.
ILSVRC’14 1st runner up. Introduced by (Simonyan & Zisserman, 2015).
Introduced in (Springenberg, Dosovitskiy, Brox, & Riedmiller, 2015).
ILSVRC’14 winner. Introduced by (Szegedy et al., 2014).
Idea : multi-scale feature detection and dimensionality reduction
ILSVRC’15 winner. Introduced in (He et al., 2016a)
Highway Networks (Srivastava, Greff, & Schmidhuber, 2015)
\[ y = T(x) \cdot H(x) + C(x) \cdot x \]
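A minimal sketch of a highway layer, assuming the coupled gate \(C(x) = 1 - T(x)\) commonly used in the Highway Networks paper (layer width arbitrary):

```python
import torch
import torch.nn as nn

# Highway layer: y = T(x)*H(x) + C(x)*x, with the coupled carry gate
# C(x) = 1 - T(x). When T(x) -> 0, the layer is an identity shortcut.
class HighwayLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())     # transform
        self.T = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())  # gate

    def forward(self, x):
        t = self.T(x)
        return t * self.H(x) + (1.0 - t) * x

layer = HighwayLayer(8)
x = torch.randn(4, 8)
assert layer(x).shape == x.shape
```

With the gate near zero the input passes through unchanged, which is what lets gradients flow through very deep stacks; residual networks can be read as the special case with an ungated identity shortcut.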
DenseNets
FitNet (Romero et al., 2015), WideResNet (2017), MobileNet v1, v2, v3 (Howard et al., 2019), and, searching for the best architecture, EfficientNet (Tan & Le, 2020)
See also :
You should increase the number of filters throughout the network :
Examples :
EfficientNet (Tan & Le, 2020) studies the scaling strategies of conv. models.
For calculating the effective receptive field size, see this guide on conv arithmetic.
Your effective receptive field can grow faster with à-trous convolutions (also called dilated convolutions) (Yu & Koltun, 2016):
Illustrations from this guide on conv arithmetic. The Conv2D object’s constructor accepts a dilation argument.
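With dilation \(d\), a \(k \times k\) kernel spans \(d(k-1) + 1\) inputs, so the receptive field grows faster without adding parameters. A short check with the `dilation` argument (sizes arbitrary):

```python
import torch
import torch.nn as nn

# A 3x3 kernel with dilation=2 spans 2*(3-1)+1 = 5 inputs in each direction,
# for the same 9 weights as an ordinary 3x3 convolution.
x = torch.randn(1, 1, 32, 32)
conv = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)
y = conv(x)
assert y.shape == (1, 1, 32, 32)  # padding=dilation keeps the 'same' size
assert conv.weight.numel() == 9   # still only 3x3 = 9 weights
```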
Introduced in Inception v3 (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2015)
\(n\) input filters, \(\alpha n\) output filters :
\(\alpha=2 \Rightarrow -24\%\) (\(\sqrt{\alpha}\) is critical!)
\(n\) input filters, \(\alpha n\) output filters :
\(\alpha=2 \Rightarrow -30\%\)
See also the recent work on “Rethinking Model scaling for convolutional neural networks” (Tan & Le, 2020)
Inception and Xception, MobileNets. It separates the filtering in space (a depthwise convolution applied per channel) from the mixing across channels (a pointwise \(1\times 1\) convolution).
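A sketch of a depthwise separable convolution in PyTorch, using the `groups` argument of nn.Conv2d (channel counts arbitrary):

```python
import torch
import torch.nn as nn

# Depthwise separable convolution: a per-channel 3x3 spatial filter
# (groups = n_channels) followed by a pointwise 1x1 channel mixer.
cin, cout = 32, 64
depthwise = nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin)
pointwise = nn.Conv2d(cin, cout, kernel_size=1)

x = torch.randn(1, cin, 16, 16)
y = pointwise(depthwise(x))
assert y.shape == (1, cout, 16, 16)

# Far fewer weights than a full 3x3 convolution: 32*9 + 32*64 vs 32*64*9
n_sep = cin * 9 + cin * cout
n_full = cin * cout * 9
assert n_sep < n_full
```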
See also the Feature Pyramid Networks for multi-scale features.
Trainable non-linear transformation of the channels. Network in network (Lin, Chen, & Yan, 2014)
You can check the norm of the gradient w.r.t. the first layers’ parameters to diagnose vanishing gradients
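A minimal sketch of this diagnostic (toy model, arbitrary loss): after a backward pass, inspect the gradient norm of each parameter tensor, layer by layer.

```python
import torch
import torch.nn as nn

# Diagnose vanishing gradients: print per-parameter gradient norms after
# backpropagation; tiny norms in the first layers are the warning sign.
model = nn.Sequential(nn.Linear(10, 10), nn.Tanh(), nn.Linear(10, 1))
loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()

grad_norms = {name: p.grad.norm().item()
              for name, p in model.named_parameters()}
for name, norm in grad_norms.items():
    print(f"{name}: {norm:.3e}")
```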
Recent architectures remove the max pooling layers and replace them with stride-2 convolutions for downsampling
All the top competitors on ImageNet perform model averaging.
Model averaging
Weight averaging
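For weight averaging, PyTorch ships utilities for stochastic weight averaging; a sketch with `torch.optim.swa_utils.AveragedModel` (toy model and training loop):

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

# AveragedModel keeps a running average of the weights visited during
# training; the averaged copy is then used at inference time.
model = nn.Linear(4, 2)
swa_model = AveragedModel(model)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
for _ in range(5):
    loss = model(torch.randn(8, 4)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    swa_model.update_parameters(model)  # accumulate the running average

out = swa_model(torch.randn(3, 4))
assert out.shape == (3, 2)
```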
If you worry about the increased computational complexity, see knowledge distillation (Hinton, Vinyals, & Dean, 2015) : training a light model with the soft targets (vs. the labels, i.e. the hard targets) of a computationally intensive one.
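A sketch of the distillation loss from (Hinton, Vinyals, & Dean, 2015): the student matches the teacher's temperature-softened outputs (the soft targets), mixed with the usual cross-entropy on the hard labels. The temperature and mixing weight below are illustrative choices.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # KL between temperature-softened distributions (soft targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # the T^2 factor keeps gradient magnitudes comparable
    # Ordinary cross-entropy on the hard labels
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student, teacher, labels)
assert loss.item() >= 0  # KL and cross-entropy are both non-negative
```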
All the frameworks provide you with a model zoo of pre-trained networks. E.g. in PyTorch, for image classification. You can cut the head and finetune the softmax only.
import torchvision.models as models
resnet18 = models.resnet18(pretrained=True)
alexnet = models.alexnet(pretrained=True)
vgg16 = models.vgg16(pretrained=True)
inception = models.inception_v3(pretrained=True)
googlenet = models.googlenet(pretrained=True)
mobilenet = models.mobilenet_v2(pretrained=True)
resnext50_32x4d = models.resnext50_32x4d(pretrained=True)
[...]
Warning : do not forget the input normalization!
Have a look at the torchvision doc: there are pretrained models for classification, detection, segmentation… See also PyTorch Hub and timm for very up-to-date image models.
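The torchvision classification models expect ImageNet-normalized inputs; the mean/std values below are the ones documented by torchvision (this sketch applies the normalization with plain tensor ops, which is what `transforms.Normalize` computes):

```python
import torch

# ImageNet normalization: scale the image to [0, 1], then subtract the
# per-channel mean and divide by the per-channel std.
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

img = torch.rand(3, 224, 224)    # an image already scaled to [0, 1]
normalized = (img - mean) / std  # what transforms.Normalize computes
assert normalized.shape == (3, 224, 224)
```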
You can oversample around your training samples by applying transforms to the inputs that make predictable changes to the targets.
Libraries for augmentation : albumentations, imgaug
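Two such transforms sketched with plain tensor ops (the dedicated libraries provide many more): a horizontal flip and a random crop after zero padding, both of which leave a classification label unchanged.

```python
import random
import torch
import torch.nn.functional as F

img = torch.rand(3, 32, 32)  # a CHW image

flipped = torch.flip(img, dims=[-1])   # mirror the width axis
padded = F.pad(img, (4, 4, 4, 4))      # zero-pad H and W by 4 -> 3x40x40
i, j = random.randint(0, 8), random.randint(0, 8)
crop = padded[:, i:i + 32, j:j + 32]   # random 32x32 crop

assert flipped.shape == crop.shape == (3, 32, 32)
assert torch.equal(torch.flip(flipped, dims=[-1]), img)  # flip is an involution
```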
Operator | Resolution | RF size | #Channels |
---|---|---|---|
ConvBlock | \(32\times32\) | \(5\times5\) | 32 |
ConvBlock | \(32\times32\) | \(9\times9\) | 32 |
Sub | \(16\times16\) | \(15\times15\) | 32 |
ConvBlock | \(16\times16\) | \(15\times15\) | 128 |
ConvBlock | \(16\times16\) | \(23\times23\) | 128 |
Sub | \(8\times8\) | \(31\times31\) | 128 |
AvgPool | \(1\times1\) | | 128 |
Linear | | | 100 |
ConvBlock: 2x [Conv(1x3)-(BN)-Relu-Conv(3x1)-(BN)-Relu]
Sub : Conv(3x3, stride=2)-(BN)-Relu
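A sketch of these two blocks in PyTorch (channel counts chosen for illustration, matching the first rows of the table):

```python
import torch
import torch.nn as nn

# ConvBlock: 2x [Conv(1x3)-BN-ReLU-Conv(3x1)-BN-ReLU], resolution preserved
def conv_block(cin, cout):
    def unit(ci, co):
        return [nn.Conv2d(ci, co, (1, 3), padding=(0, 1)),
                nn.BatchNorm2d(co), nn.ReLU(),
                nn.Conv2d(co, co, (3, 1), padding=(1, 0)),
                nn.BatchNorm2d(co), nn.ReLU()]
    return nn.Sequential(*unit(cin, cout), *unit(cout, cout))

# Sub: Conv(3x3, stride=2)-BN-ReLU, resolution halved
def sub_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU())

x = torch.randn(2, 3, 32, 32)
y = sub_block(32, 32)(conv_block(3, 32)(x))
assert y.shape == (2, 32, 16, 16)  # resolution halved by Sub
```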
Common settings :
Different configurations :
Number of parameters: \(\simeq 2M\)
Time per epoch (1080Ti) : 17 s; total training time : 42 min
If weight decay is applied, only the weights of the convolutional and linear layers are regularized (neither the biases nor the batch norm coefficients)
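One way to implement this selective regularization is with optimizer parameter groups; a sketch (toy model, illustrative decay value):

```python
import torch
import torch.nn as nn

# Split the parameters into two groups: weight decay on conv/linear
# weights only; no decay on biases and BatchNorm coefficients.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU(),
                      nn.Flatten(), nn.Linear(8 * 6 * 6, 10))

decay, no_decay = [], []
for module in model.modules():
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        decay.append(module.weight)
        if module.bias is not None:
            no_decay.append(module.bias)
    elif isinstance(module, nn.BatchNorm2d):
        no_decay.extend([module.weight, module.bias])

optimizer = torch.optim.SGD(
    [{"params": decay, "weight_decay": 5e-4},
     {"params": no_decay, "weight_decay": 0.0}], lr=0.01)

assert len(decay) == 2 and len(no_decay) == 4
```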
Rather check the full online document references.pdf
Ciresan, D., Meier, U., & Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition (pp. 3642–3649). Providence, RI: IEEE. https://doi.org/10.1109/CVPR.2012.6248110
Fukushima, K. (1980). Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4), 193–202. https://doi.org/10.1007/BF00344251
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010 (p. 8).
He, K., Zhang, X., Ren, S., & Sun, J. (2016a). Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 770–778). Las Vegas, NV, USA: IEEE. https://doi.org/10.1109/CVPR.2016.90
He, K., Zhang, X., Ren, S., & Sun, J. (2016b). Identity Mappings in Deep Residual Networks. arXiv:1603.05027 [Cs]. Retrieved from http://arxiv.org/abs/1603.05027
Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1503.02531
Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., … Adam, H. (2017). MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861 [Cs]. Retrieved from http://arxiv.org/abs/1704.04861
Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., & Weinberger, K. Q. (2017). Snapshot ensembles: Train 1, get M for free. In International Conference on Learning Representations (ICLR) (p. 14).
Huang, G., Liu, Z., Maaten, L. van der, & Weinberger, K. Q. (2018). Densely Connected Convolutional Networks. arXiv:1608.06993 [Cs]. Retrieved from http://arxiv.org/abs/1608.06993
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6), 84–90. https://doi.org/10.1145/3065386
LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541–551. https://doi.org/10.1162/neco.1989.1.4.541
Lin, M., Chen, Q., & Yan, S. (2014). Network In Network. arXiv:1312.4400 [Cs]. Retrieved from http://arxiv.org/abs/1312.4400
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L.-C. (2018). MobileNetV2: Inverted Residuals and Linear Bottlenecks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4510–4520). Salt Lake City, UT: IEEE. https://doi.org/10.1109/CVPR.2018.00474
Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv:1312.6034 [Cs]. Retrieved from http://arxiv.org/abs/1312.6034
Simonyan, K., & Zisserman, A. (2015). Very Deep Convolutional Networks for Large-Scale Image Recognition. In arXiv:1409.1556 [cs]. Retrieved from http://arxiv.org/abs/1409.1556
Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2015). Striving for Simplicity: The All Convolutional Net. arXiv:1412.6806 [Cs]. Retrieved from http://arxiv.org/abs/1412.6806
Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Training Very Deep Networks. In Advances in Neural Information Processing Systems 28 (NIPS 2015).
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2014). Going Deeper with Convolutions. arXiv:1409.4842 [Cs]. Retrieved from http://arxiv.org/abs/1409.4842
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2015). Rethinking the Inception Architecture for Computer Vision. arXiv:1512.00567 [Cs]. Retrieved from http://arxiv.org/abs/1512.00567
Tan, M., & Le, Q. V. (2020). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv:1905.11946 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1905.11946
Yu, F., & Koltun, V. (2016). Multi-Scale Context Aggregation by Dilated Convolutions. arXiv:1511.07122 [Cs]. Retrieved from http://arxiv.org/abs/1511.07122
Zeiler, M. D., & Fergus, R. (2014). Visualizing and Understanding Convolutional Networks. In D. Fleet, T. Pajdla, B. Schiele, & T. Tuytelaars (Eds.), Computer Vision – ECCV 2014 (pp. 818–833). Cham: Springer International Publishing. https://doi.org/10.1007/978-3-319-10590-1_53