An introduction to deep learning
Jeremy Fix
February 15, 2024
Slides made with slidemaker
You can interpret the kernels of the first layers, e.g. with AlexNet:
But what about the hidden/output layers? Why is this network telling me my tumor is malignant?
From a dataset:
\(\hat{x} = \mbox{argmax}_{x\in \mathcal{D}} h_{i,j}(\theta, x)\)
(Regularized) gradient ascent:
\(\hat{x} = \mbox{argmax}_{x \in \mathbb{R}^n} h_{i,j}(\theta, x) - \lambda \mathcal{R}(x)\)
You need regularization; otherwise the generated images contain artifacts (e.g. high-frequency patterns) such as:
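To make the gradient-ascent recipe concrete, here is a minimal PyTorch sketch (not the course's code); the choice of AlexNet, the unit index and all hyperparameters are illustrative assumptions:

```python
import torch
import torchvision.models as models

# Regularized gradient ascent on the input image.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the input is optimized

x = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
optimizer = torch.optim.Adam([x], lr=0.05)
lam = 1e-4  # weight of the L2 regularizer R(x)

for _ in range(200):
    optimizer.zero_grad()
    activation = model.features(x)[0, 42].mean()  # unit 42: arbitrary choice
    loss = -activation + lam * x.norm() ** 2      # ascent on h - lambda*R(x)
    loss.backward()
    optimizer.step()
```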
See Distill.pub: Feature visualization. The images come from (Olah et al., 2017). This opened the way toward adversarial examples.
Given an input, we can compute how important its pixels are to the activation of a unit, producing saliency maps (Simonyan, Vedaldi, & Zisserman, 2014)
e.g. compute the gradient of the class logits (before the softmax)
Images produced for AlexNet with this pytorch implementation of CNN visualizations.
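As a minimal sketch of this gradient-based saliency (the function name and shapes are ours, not the paper's code):

```python
import torch

def saliency_map(model, x, class_idx):
    """Gradient of the class logit (pre-softmax) w.r.t. the input pixels."""
    x = x.clone().requires_grad_(True)
    logits = model(x)                  # (1, num_classes), no softmax
    logits[0, class_idx].backward()
    # Max of absolute gradients over channels, as in Simonyan et al., 2014
    return x.grad.abs().max(dim=1)[0]  # (1, H, W)
```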
Guided backpropagation, introduced in (Springenberg et al., 2015), applies to their “no max pooling” all-convolutional architecture.
Grad-CAM, introduced in (Selvaraju et al., 2020) for CNNs:
Given:
Examples from ImageNet (see here)
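A possible PyTorch sketch of Grad-CAM using forward/backward hooks (the hook-based plumbing and names are our assumptions, not the authors' code):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, x, class_idx):
    """Weight the feature maps of a conv layer by the spatially-averaged
    gradients of the class score, sum, then ReLU."""
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(a=go[0]))
    logits = model(x)
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()
    weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # alpha_k
    cam = F.relu((weights * feats["a"]).sum(dim=1))      # (1, h, w)
    return F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear")[0]
```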
Bounding boxes are given in the datasets (the predictor parametrization may differ) as \([x, y, w, h]\), \([x_{min},y_{min},x_{max},y_{max}]\), …
Datasets: COCO, ImageNet, Open Images Dataset
Recent survey: Object detection in 20 years: a survey
The metrics should ideally capture:
Given the “true” bounding boxes:
Quantify the quality of these predictions
A predictor should output labeled bounding boxes with a confidence score. Your metric should evaluate both the fraction of labeled boxes you correctly detect (TP, hence recall) and the fraction of your detections that are incorrect (FP, hence precision)
For every class individually, every prediction of every image is considered:
Examples from the Object detection metrics repository.
Rank the predictions by decreasing confidence and compute the average of the (interpolated) precision:
\[ \mbox{precision} = \frac{TP}{TP+FP} \] The fraction of your detections that are actually correct.
\[ \mbox{recall} = \frac{TP}{TP+FN} = \frac{TP}{\#\mbox{gt bbox}} \] The fraction of labeled objects you detect (it can only increase as the confidence threshold decreases)
AP is the average precision for different levels of recall. Average AP over all the classes to get the mAP.
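A rough sketch of this computation, assuming predictions were already matched to ground-truth boxes at some IoU threshold (the matching step is omitted):

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in [xmin, ymin, xmax, ymax] parametrization."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union

def average_precision(scores, is_tp, num_gt):
    """AP for one class: rank by confidence, accumulate precision/recall."""
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(is_tp)[order])
    fp = np.cumsum(~np.asarray(is_tp)[order])
    recall, precision = tp / num_gt, tp / (tp + fp)
    # Interpolated precision at recall r: max precision over recalls >= r
    ap = 0.0
    for r in np.linspace(0, 1, 101):  # 101-point interpolation, as in COCO
        p = precision[recall >= r]
        ap += p.max() / 101 if p.size else 0.0
    return ap
```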
MS-COCO detection challenge (see here):
ImageNet (see (Russakovsky et al., 2015)) :
ImageNet now on Kaggle :
Open Images evaluation:
Suppose you have a single object to detect: can you localize it in the image?
How can we proceed with multiple objects? (Girshick, Donahue, Darrell, & Malik, 2014) proposed to:
A revolution in the object detection community (vs. “traditional” HOG-like features).
Drawback:
Notes: pretrained on ImageNet, finetuned on the considered classes with warped images. Hard negative mining (boosting).
Introduced in (Girshick, 2015). Idea:
Drawbacks:
Github repository. CVPR’15 slides
Notes: pretrained VGG16 on ImageNet. Fast training with multiple ROIs per image to build the \(128\) mini-batch from \(N=2\) images, using \(64\) proposals: \(25\%\) with \(IoU > 0.5\) and \(75\%\) with \(IoU \in [0.1, 0.5)\). Data augmentation: horizontal flip. Per-layer learning rates, SGD with momentum, etc.
Multi-task loss: \[ L(p, u, t, v) = -\log(p_u) + \lambda\, \mbox{smooth L1}(t, v) \]
The bbox is parameterized as in (Girshick et al., 2014). Single scale is more efficient than multi-scale.
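A minimal sketch of this multi-task loss in PyTorch (it assumes the box offsets were already gathered for the ground-truth class of each ROI; names are ours):

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(class_logits, bbox_deltas, labels, reg_targets, lam=1.0):
    """Fast R-CNN multi-task loss, in the slide's notation: p = class
    probabilities, u = true class, t = predicted offsets for class u,
    v = regression targets."""
    cls_loss = F.cross_entropy(class_logits, labels)  # -log(p_u)
    fg = labels > 0                                   # regression for u >= 1
    reg_loss = F.smooth_l1_loss(bbox_deltas[fg], reg_targets[fg])
    return cls_loss + lam * reg_loss
```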
Introduced in (Ren, He, Girshick, & Sun, 2016). The first end-to-end trainable detection network, introducing the Region Proposal Network (RPN). An RPN is a sliding Conv(\(3\times3\)) - Conv(\(1\times1\), k + 4k) network (see here). It also introduces anchor boxes of predefined scales and aspect ratios (later detectors, e.g. YOLO v2, learn anchor priors by vector quantization of the ground-truth boxes).
Check the paper for a lot of quantitative results. Small objects may not have a lot of features.
Bbox parametrization identical to (Girshick et al., 2014), with a smooth L1 loss. Multi-task loss for the RPN. Momentum (0.9), weight decay (0.0005), learning rate 0.001 for 60k minibatches, then 0.0001 for 20k more.
Multi-step training. The gradient is non-trivial due to the coordinate snapping of the boxes (see ROI-Align for a more continuous version)
With VGG-16, the conv5 feature map is \(H/16 \times W/16\). For a \(1000 \times 600\) image, there are \(60 \times 40 = 2400\) anchor box centers.
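A sketch of the RPN sliding head; the slide's single Conv(\(1\times1\), k + 4k) is written here as two \(1\times1\) branches, which is equivalent up to a channel split (class name and defaults are ours):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sliding RPN head: a 3x3 conv followed by two 1x1 convs predicting,
    for each of the k anchors per location, an objectness score and 4 box
    offsets."""
    def __init__(self, in_channels=512, k=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, in_channels, 3, padding=1)
        self.cls = nn.Conv2d(in_channels, k, 1)       # k objectness scores
        self.reg = nn.Conv2d(in_channels, 4 * k, 1)   # 4k box offsets

    def forward(self, x):
        h = torch.relu(self.conv(x))
        return self.cls(h), self.reg(h)

# With VGG-16 (stride 16), a 1000x600 image yields roughly 60x40 spatial
# locations, i.e. the 2400 anchor centers mentioned above.
```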
Introduced in (Lin et al., 2017)
Upsampling is performed using nearest-neighbor interpolation.
For object detection, a RPN is run on every scale of the pyramid \(P_2, P_3, P_4, P_5\).
ROIPooling/Align is fed with the feature map at a scale depending on the ROI size: large ROIs are pooled from small/coarse feature maps, small ROIs from large/fine feature maps.
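A sketch of one top-down FPN step and of the ROI-to-level heuristic from the FPN paper (function and parameter names are illustrative):

```python
import math
import torch.nn.functional as F

def fpn_merge(top_down, lateral, lateral_conv, smooth_conv):
    """One FPN top-down step: upsample the coarser map (nearest neighbor),
    add the 1x1-projected lateral feature, then smooth with a 3x3 conv."""
    up = F.interpolate(top_down, scale_factor=2, mode="nearest")
    return smooth_conv(up + lateral_conv(lateral))

def roi_level(w, h, k0=4):
    """Map an ROI of size (w, h) to a pyramid level P2..P5: large ROIs go
    to coarse maps, small ROIs to fine maps."""
    return max(2, min(5, math.floor(k0 + math.log2(math.sqrt(w * h) / 224))))
```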
Pretrained Faster RCNN + FPN models available in torchvision hub.
The first one-stage detector. Introduced in (Redmon, Divvala, Girshick, & Farhadi, 2016). It outputs:
Bounding box encoding:
In YOLO v3, the network is Feature Pyramid Network (FPN)-like, with a downsampling and an upsampling path, and predictions at 3 stages.
The loss is multi-task with:
\[\begin{align*} \mathcal{L} &= \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} [(t_x-t_x^*)^2+(t_y-t_y^*)^2+(t_w-t_w^*)^2+(t_h-t_h^*)^2] \\ & + \sum_{i=0}^{S^2} \sum_{j=0}^{B} BCE(\mathbb{1}_{ij}^{obj}, \mbox{has\_obj}_{ij}) \\ & + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \sum_{k=0}^{K} BCE(\mbox{has\_class}_{ijk}, p_{ijk}) \end{align*}\]
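A minimal sketch of this loss at one scale, assuming targets were already built per (cell, anchor) pair (shapes and names are ours, not the YOLO code):

```python
import torch
import torch.nn.functional as F

def yolo_loss(t, t_star, obj_logits, obj_mask, cls_logits, cls_targets,
              lambda_coord=5.0):
    """Multi-task YOLO-style loss at one scale.
    t, t_star:               (S*S*B, 4) predicted / target box offsets
    obj_logits, obj_mask:    (S*S*B,)   objectness logits / responsibility
    cls_logits, cls_targets: (S*S*B, K) class logits / binary targets."""
    obj_mask = obj_mask.bool()
    coord = F.mse_loss(t[obj_mask], t_star[obj_mask], reduction="sum")
    objness = F.binary_cross_entropy_with_logits(
        obj_logits, obj_mask.float(), reduction="sum")
    classes = F.binary_cross_entropy_with_logits(
        cls_logits[obj_mask], cls_targets[obj_mask], reduction="sum")
    return lambda_coord * coord + objness + classes
```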
Latest release: YOLO v8 (2023)
In v1 and v2, the prediction losses were L2 losses.
Multi-labelling can occur in COCO (e.g. woman, person)
The object detectors may output multiple overlapping bounding boxes for the same object
NMS algorithm:
NMS may suppress one of two “overlapped” objects. It hard-resets the scores of overlapping bboxes.
SoftNMS (Bodla, Singh, Chellappa, & Davis, 2017):
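A sketch of greedy NMS with an optional Soft-NMS Gaussian rescoring (this is our illustration, not the authors' code; the score floor 1e-3 is arbitrary):

```python
import torch
from torchvision.ops import box_iou

def nms(boxes, scores, iou_threshold=0.5, soft=False, sigma=0.5):
    """Greedy NMS on [xmin, ymin, xmax, ymax] boxes. With soft=True,
    neighbours are rescored with a Gaussian penalty (Soft-NMS) instead of
    being hard-suppressed."""
    scores = scores.clone()
    idxs = scores.argsort(descending=True)
    keep = []
    while idxs.numel() > 0:
        best = idxs[0]
        keep.append(best.item())
        if idxs.numel() == 1:
            break
        rest = idxs[1:]
        ious = box_iou(boxes[best].unsqueeze(0), boxes[rest]).squeeze(0)
        if soft:
            scores[rest] *= torch.exp(-ious ** 2 / sigma)  # decay, no reset
            rest = rest[scores[rest] > 1e-3]               # drop tiny scores
            idxs = rest[scores[rest].argsort(descending=True)]
        else:
            idxs = rest[ious <= iou_threshold]             # hard suppression
    return keep
```

In practice, torchvision.ops.nms provides a tested implementation of the hard variant.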
Given an image,
Semantic segmentation: predict the class of every single pixel, also called dense prediction/dense labelling.
Example image from MS COCO
Instance segmentation: classify all the pixels belonging to the same countable object
Example image from MS COCO
More recently, panoptic segmentation refers to instance segmentation for countable objects (e.g. people, animals, tools) and semantic segmentation for amorphous regions (grass, sky, road).
Metrics: see the COCO panoptic evaluation
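For reference, the panoptic quality (PQ) used there matches predicted and ground-truth segments when their IoU exceeds 0.5, averages the IoU of the matches, and penalizes unmatched segments:

\[ PQ = \frac{\sum_{(p,g) \in TP} IoU(p,g)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|} \]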
Some example networks: PSP-Net, U-Net, Dilated Net, ParseNet, DeepLab, Mask RCNN, …
Introduced in (Ciresan, Giusti, Gambardella, & Schmidhuber, 2012).
Drawbacks:
(on deep neural network calibration, see also (Guo, Pleiss, Sun, & Weinberger, 2017))
Introduced in (Long, Shelhamer, & Darrell, 2015). First end-to-end convolutional network for dense labeling with pretrained networks.
The upsampling can be:
Traditional approaches involve bilinear, bicubic, etc. interpolation.
For upsampling in a learnable way, we can use fractionally strided convolutions. That's one ingredient behind super-resolution (Shi, Caballero, Huszár, et al., 2016).
You can initialize the upsampling kernels with a bilinear interpolation kernel. For other equivalences, see (Shi, Caballero, Theis, et al., 2016). See ConvTranspose2d.
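A minimal sketch of a learnable \(\times 2\) upsampling initialized as bilinear interpolation (the channel count 21 is an illustrative choice, e.g. the PASCAL VOC classes used by FCN):

```python
import torch
import torch.nn as nn

def bilinear_kernel(in_channels, out_channels, kernel_size):
    """Build a (in, out, k, k) ConvTranspose2d weight that performs
    channel-wise bilinear interpolation."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size).float()
    filt = 1 - torch.abs(og - center) / factor
    filt = filt[:, None] * filt[None, :]
    weight = torch.zeros(in_channels, out_channels, kernel_size, kernel_size)
    for i in range(min(in_channels, out_channels)):
        weight[i, i] = filt  # each channel upsampled independently
    return weight

# Exact x2 upsampling: kernel 4, stride 2, padding 1
up = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, padding=1, bias=False)
with torch.no_grad():
    up.weight.copy_(bilinear_kernel(21, 21, 4))
```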
This can introduce artifacts, check (Odena, Dumoulin, & Olah, 2016). Some prefer a bilinear upsampling followed by convolutions.
Several models follow the same architecture: U-Net, SegNet. The encoder-decoder architecture was introduced in (Ronneberger, Fischer, & Brox, 2015)
There is:
Data augmentation: rotation, “color” variations, elastic deformations.
Introduced in (He, Gkioxari, Dollár, & Girshick, 2018) as an extension of Faster RCNN. It outputs a binary mask in addition to the class labels and bbox regressions.
It addresses instance segmentation by predicting a mask for individualised object proposals.
It proposes to use ROI-Align (with bilinear interpolation) rather than ROI-Pool.
There is no competition between the classes in the masks. Different objects may use different kernels to compute their masks.
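A sketch of why classes do not compete: the head predicts one mask per class, and only the mask of the ground-truth class is penalized with a per-pixel sigmoid/BCE (shapes and names are ours):

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_masks, gt_classes):
    """mask_logits: (N, num_classes, 28, 28) per-ROI, per-class mask logits
    gt_masks:     (N, 28, 28) binary targets
    gt_classes:   (N,) ground-truth class of each ROI."""
    n = mask_logits.shape[0]
    selected = mask_logits[torch.arange(n), gt_classes]  # GT-class mask only
    return F.binary_cross_entropy_with_logits(selected, gt_masks.float())
```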
Can be extended to keypoint detection, outputting a \(K\) depth mask for predicting the \(K\) joints.
Introduced in (Kirillov, Girshick, He, & Dollár, 2019), unifies instance segmentation (countable) and semantic segmentation (amorphous) in a single network with two heads on top of a FPN backbone:
Careful design of:
To give it a try, check detectron2. The single-stage detector YOLO v8 also allows instance segmentation; check their instance segmentation page
Graph convolutions:
Processing 3D point clouds:
Rather, check the full online document references.pdf
Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-NMS – Improving Object Detection With One Line of Code. arXiv:1704.04503 [Cs]. Retrieved from http://arxiv.org/abs/1704.04503
Ciresan, D., Giusti, A., Gambardella, L. M., & Schmidhuber, J. (2012). Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images. In Advances in Neural Information Processing Systems (NIPS).
Erhan, D., Bengio, Y., Courville, A., & Vincent, P. (2009). Visualizing higher-layer features of a deep network (No. 1341). University of Montreal.
Girshick, R. (2015). Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV) (pp. 1440–1448). https://doi.org/10.1109/ICCV.2015.169
Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524 [Cs]. Retrieved from http://arxiv.org/abs/1311.2524
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On Calibration of Modern Neural Networks. arXiv:1706.04599 [Cs]. Retrieved from http://arxiv.org/abs/1706.04599
He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2018). Mask R-CNN. arXiv:1703.06870 [Cs]. Retrieved from http://arxiv.org/abs/1703.06870
Howard, A. G. (2013). Some Improvements on Deep Convolutional Neural Network Based Image Classification.
Kirillov, A., Girshick, R., He, K., & Dollár, P. (2019). Panoptic Feature Pyramid Networks. arXiv:1901.02446 [Cs]. Retrieved from http://arxiv.org/abs/1901.02446
Lin, T.-Y., Dollar, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017). Feature Pyramid Networks for Object Detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 936–944). Honolulu, HI: IEEE. https://doi.org/10.1109/CVPR.2017.106
Long, J., Shelhamer, E., & Darrell, T. (2015). Fully Convolutional Networks for Semantic Segmentation. arXiv:1411.4038 [Cs]. Retrieved from http://arxiv.org/abs/1411.4038
Odena, A., Dumoulin, V., & Olah, C. (2016). Deconvolution and Checkerboard Artifacts. Distill, 1(10), e3. https://doi.org/10.23915/distill.00003
Olah, C., Mordvintsev, A., & Schubert, L. (2017). Feature visualization. Distill. https://doi.org/10.23915/distill.00007
Ras, G., Xie, N., Gerven, M. van, & Doran, D. (2022). Explainable deep learning: A field guide for the uninitiated. J. Artif. Int. Res., 73. https://doi.org/10.1613/jair.1.13200
Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. arXiv:1506.02640 [Cs]. Retrieved from http://arxiv.org/abs/1506.02640
Ren, S., He, K., Girshick, R., & Sun, J. (2016). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv:1506.01497 [Cs]. Retrieved from http://arxiv.org/abs/1506.01497
Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597 [Cs]. Retrieved from http://arxiv.org/abs/1505.04597
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3), 211–252. https://doi.org/10.1007/s11263-015-0816-y
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2020). Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. International Journal of Computer Vision, 128(2), 336–359. https://doi.org/10.1007/s11263-019-01228-7
Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P., Bishop, R., … Wang, Z. (2016). Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. arXiv:1609.05158 [Cs, Stat]. Retrieved from http://arxiv.org/abs/1609.05158
Shi, W., Caballero, J., Theis, L., Huszar, F., Aitken, A., Tejani, A., … Wang, Z. (2016). Is the deconvolution layer the same as a convolutional layer?
Simonyan, K., Vedaldi, A., & Zisserman, A. (2014). Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv:1312.6034 [Cs]. Retrieved from http://arxiv.org/abs/1312.6034
Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2015). Striving for Simplicity: The All Convolutional Net. arXiv:1412.6806 [Cs]. Retrieved from http://arxiv.org/abs/1412.6806