Classification (Recognition): classify the object, i.e. identify what it is (cat vs dog)
Detection: a bounding box along with the classification; where the object is, in addition to what it is
Compared to V1: BatchNorm, skip connections, fully convolutional, high-res classifier/detector, etc.
Uses anchor boxes: anchor boxes are a set of fixed-size boxes which are scaled to fit the object. After scaling the anchor box to the object, the final box is called the bounding box
Strategy used to train faster:
Start with low res: for example, if your high-res images are 448x448, start training at 56x56, i.e. scale down the image and then train. This is 64 times faster!! (448/56 * 448/56 = 8 * 8)
With the initial low-res training, we train the early layers on low-level features (E/G/T/P)
Because the network is fully convolutional, we can input images of different dimensions
We already saw that we divide the box coordinates by the image dimensions, essentially normalising all values to 0-1. This way YOLO works on proportions rather than actual sizes, and we can find clusters of anchor boxes of similar proportions using k-means clustering.
Output is 13x13x5x25:
Image is divided into 13x13 blocks. Each block's top-left corner coordinate is (0,0) & bottom-right coordinate is (1,1)
5 = 5 anchor boxes
25 = 5 + 20:
5 = 4 + 1 = x,y,h,w + o
h = p_h · e^(t_h), w = p_w · e^(t_w) — the predicted offsets scale the anchor (prior) dimensions exponentially
20 = 20 classes
Anchor dimensions are picked using k-means clustering on the dimensions of the original bounding boxes. The final anchor boxes are: (0.57273, 0.677385), (1.87446, 2.06253), (3.33843, 5.47434), (7.88282, 3.52778), (9.77052, 9.16828). What are these values? They are widths and heights measured in grid-cell units on the 13x13 output grid.
If the cell is offset from the top-left corner of the image by (cx, cy) and the bounding-box prior (anchor) has width and height pw, ph, then the predictions correspond to:
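A minimal sketch of how such anchors can be clustered, assuming ground-truth (w, h) pairs expressed in grid-cell units and the paper's 1 − IoU distance; the helper names here are made up for illustration, not the official darknet code:

```python
# k-means on (w, h) pairs with d = 1 - IoU as the distance (YOLOv2-style anchors)
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between boxes and centroids given only (w, h), both anchored at the origin."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(wh, k=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(wh, centroids), axis=1)      # nearest = highest IoU
        new = np.array([wh[assign == i].mean(axis=0) if np.any(assign == i) else centroids[i]
                        for i in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids

# wh: ground-truth (w, h) pairs; dummy random data here just to show the call
wh = np.random.rand(1000, 2) * 13
print(kmeans_anchors(wh, k=5))
```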
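From the YOLOv2 paper, with (tx, ty, tw, th, to) being the raw network outputs for a box: bx = σ(tx) + cx, by = σ(ty) + cy, bw = pw · e^(tw), bh = ph · e^(th), and objectness = σ(to).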
The 13x13 grid works well for large objects. But what about small objects? For this they add a skip connection (passthrough) from the 26x26-resolution layer to the second-to-last layer:
Above, we can see that in the passthrough, the input dim is 26x26x512 and the output is 13x13x2048. How did that happen?
And below the passthrough, the main branch's calculation is given, where the output is 13x13x1024
Both are stacked and we get 13x13x3072 (2048 + 1024)
The pass-through layer concatenates the higher-resolution features with the low-resolution features by stacking adjacent features into different channels: each 2x2 spatial block becomes 4 channels, hence 26x26x512 → 13x13x(512*4) = 13x13x2048
Every 10 batches, the network chooses a random new image dimension (a multiple of 32) from 320x320 to 608x608. The anchor box dimensions scale up/down accordingly.
The final model, called Darknet-19, has 19 convolution layers and 5 max-pooling layers. 1x1 convolutions are used to compress the feature representations between the 3x3 convolutions.
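A toy sketch of that stacking (space-to-depth); the exact channel ordering in darknet's reorg differs, but the shape arithmetic is the same:

```python
# space-to-depth: each 2x2 spatial block becomes 4 channels (channels-last layout)
import numpy as np

def space_to_depth(x, block=2):
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)                 # group each 2x2 block together
    return x.reshape(h // block, w // block, c * block * block)

x = np.zeros((26, 26, 512))
print(space_to_depth(x).shape)   # (13, 13, 2048)
```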
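A sketch of that multi-scale schedule, with a made-up loop bound:

```python
# every 10 batches pick a new input size, a multiple of 32 between 320 and 608
import random

sizes = list(range(320, 608 + 1, 32))      # [320, 352, ..., 608]
for batch_idx in range(100):
    if batch_idx % 10 == 0:
        input_size = random.choice(sizes)
    # resize the batch to (input_size, input_size) before the forward pass
```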
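A tiny sketch of that compression pattern; the 1024/512 channel sizes mirror the last Darknet-19 block, with BatchNorm + LeakyReLU as darknet uses:

```python
# 3x3 -> 1x1 (squeeze) -> 3x3, the Darknet-19 bottleneck pattern
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(512, 1024, kernel_size=3, padding=1), nn.BatchNorm2d(1024), nn.LeakyReLU(0.1),
    nn.Conv2d(1024, 512, kernel_size=1),            nn.BatchNorm2d(512),  nn.LeakyReLU(0.1),
    nn.Conv2d(512, 1024, kernel_size=3, padding=1), nn.BatchNorm2d(1024), nn.LeakyReLU(0.1),
)
print(block(torch.zeros(1, 512, 13, 13)).shape)   # torch.Size([1, 1024, 13, 13])
```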
The network is first trained on classification for 160 epochs.
After classification training, the last convolution layer is removed, and three 3x3 convolution layers with 1024 filters each followed by the final 1x1 convolution layer are added. The network is again trained for 160 epochs.
During joint training, detection and classification datasets are mixed. When the network sees an image with a detection label, full back-propagation is performed; otherwise only the classification part is back-propagated.
DarkNet-19:
Loss Function:
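For reference, this is the YOLO loss from the original paper (v2 keeps the same structure, with S = 13 cells per side and B = 5 anchor boxes); the points below unpack its terms:

$$
\begin{aligned}
\mathcal{L} = {} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 + (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} (C_i - \hat{C}_i)^2
  + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \big(p_i(c) - \hat{p}_i(c)\big)^2
\end{aligned}
$$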
We need to compute losses for each anchor box (5 in total): the sum over B represents this part.
We need to do this for each of the 13x13 cells, where S = 13: the sum over S² represents this part.
p_i(c) => the class probabilities
C_i is the objectness/confidence score; this term trains the network to predict C_i correctly
1_ij^obj is 1 when there is an object in cell i and anchor box j is responsible for it, else 0.
1_ij^noobj is 1 when there is no object in cell i (for box j), else 0. We need this term to push the confidence down when there is no object as well.
1_i^obj is 1 when an object appears in cell i (the class-probability term is only applied to cells that contain an object), else 0.
λs are constants. λ is highest for coordinates in order to focus more on detection (remember, we have already trained the network for recognition!)
We can also notice that w_i, h_i are under a square root. This makes the same absolute error count relatively more for smaller bounding boxes (which need finer adjustment) and less for larger ones.
Check out this table:
var1 | var2 | (var1 - var2)^2 | (sqrt(var1) - sqrt(var2))^2 |
---|---|---|---|
0.0300 | 0.020 | 9.99e-05 | 0.001 |
0.0330 | 0.022 | 0.00012 | 0.0011 |
0.0693 | 0.046 | 0.000533 | 0.00233 |
0.2148 | 0.143 | 0.00512 | 0.00723 |
0.8808 | 0.587 | 0.0862 | 0.0296 |
4.4920 | 2.994 | 2.2421 | 0.1512 |
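The table can be reproduced in a couple of lines; note the crossover: for small boxes the square-root version penalises more, for large boxes less.

```python
# squared error vs squared error on square roots, for a few rows of the table
from math import sqrt

pairs = [(0.03, 0.02), (0.2148, 0.143), (4.492, 2.994)]
for a, b in pairs:
    plain = (a - b) ** 2
    with_sqrt = (sqrt(a) - sqrt(b)) ** 2
    print(f"{a:>7} {b:>6}  plain={plain:.3g}  sqrt={with_sqrt:.3g}")
```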
For the first 160 epochs, λ_coord is set to 0; once the network is trained for classification, we train it for detection
In YOLOv2, they added the passthrough and did the stacking to handle smaller objects. In YOLOv3, this idea is taken to the next level
The feature extractor here is Darknet-53, which is ResNet-like (it uses residual/skip connections)
V3 uses only convolutional layers, not even a pooling layer! How can we avoid pooling?
A 3x3 convolution with a stride of 2.
If we start at 416 and end at 13, we have taken a total stride of 32 (416/13).
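A sketch of this, with made-up channel counts: five stride-2 3x3 convolutions give a total stride of 2^5 = 32, taking a 416x416 input down to 13x13:

```python
# downsampling with stride-2 convolutions instead of pooling
import torch
import torch.nn as nn

layers, c = [], 3
for out_c in (32, 64, 128, 256, 512):
    layers += [nn.Conv2d(c, out_c, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.1)]
    c = out_c

net = nn.Sequential(*layers)
print(net(torch.zeros(1, 3, 416, 416)).shape)   # torch.Size([1, 512, 13, 13])
```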
COCO Dataset has 80 classes. So final output shall be: 13x13x3x(4+1+80) = 13x13x255
YOLOv3 has 9 anchor boxes in total: 3 scales × 3 anchors per scale.
As seen above, there are no pooling layers; dimension reduction is done using convolution layers with stride = 2
Starting from Scale-3: Post convolution, output size is 13x13x255
At Scale-2, here's the flow:
At Scale-1, here's the flow:
Upsampling is done as shown below:
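A minimal sketch of the upsample-and-concatenate step, assuming nearest-neighbour 2x upsampling (as the Darknet implementation uses) on the 13x13 → 26x26 route; the tensors and channel counts are illustrative:

```python
# upsample the deeper feature map and concatenate it with an earlier, higher-res one
import torch
import torch.nn as nn

deep = torch.zeros(1, 256, 13, 13)       # route from the 13x13 scale
skip = torch.zeros(1, 512, 26, 26)       # earlier feature map at 26x26

up = nn.Upsample(scale_factor=2, mode="nearest")(deep)   # 1x256x26x26
merged = torch.cat([up, skip], dim=1)                    # 1x768x26x26
print(merged.shape)
```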
Class Confidence:
YoloV3 has an interesting take on Class probabilities. Normally you'd take a SoftMax of the output vector. This is based on the assumption that classes are mutually exclusive. If it is a Dog, it cannot be a Cat!
YOLOv3 asks a question: what if we have classes which are not mutually exclusive? If it is a Person, it may be a Man as well! So instead of SoftMax, v3 uses a sigmoid per class.
v3 makes predictions at 52x52, 26x26 and 13x13; equivalently, at strides of 8 (416/52), 16 (416/26) and 32 (416/13).
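A quick illustration of the difference, with made-up logits:

```python
# softmax forces classes to compete; per-class sigmoids allow multi-label outputs
import torch

logits = torch.tensor([2.0, 1.5, -3.0])   # e.g. [Person, Man, Cat]
print(torch.softmax(logits, dim=0))       # sums to 1: classes are mutually exclusive
print(torch.sigmoid(logits))              # independent probability per class
```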
v3 in total now predicts (52x52 + 26x26 + 13x13)*3 = 10647 bounding boxes
Output processing:
Object confidence threshold: we filter the boxes based on their objectness score, keeping only those boxes whose score is greater than some threshold
Get all the proposed anchor boxes for a class
Calculate Non-Maximum Suppression (NMS): a technique that helps select the best bounding box among overlapping proposals (a small sketch follows after this list)
NMS fails if there is too much overlap between two objects of the same class. For example, two people standing very close together: their boxes overlap heavily, and NMS may suppress one of the correct detections.
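A minimal per-class NMS sketch in plain NumPy (not any particular library's implementation): keep the highest-scoring box, drop boxes that overlap it above a threshold, repeat.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.45):
    order = np.argsort(scores)[::-1]          # indices sorted by score, descending
    keep = []
    while len(order):
        best = order[0]
        keep.append(int(best))
        order = order[1:][iou(boxes[best], boxes[order[1:]]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # [0, 2] -- the near-duplicate of box 0 is suppressed
```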
YOLOv4 has improved again in terms of accuracy (average precision) and speed (FPS), the two metrics we generally use to evaluate an object detection algorithm.
There are 4 apparent blocks, after the input image:
Backbone: Cross-Stage-Partial connections. The idea here is to separate the current layer into 2 parts, one that will go through a block of convolutions, and one that won’t. Then, we aggregate the results. Here’s an example with DenseNet:
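Separately from the DenseNet example, a rough generic sketch of the split-transform-concatenate idea (channel counts are made up; real CSPDarknet blocks add transition convs and residuals on top of this):

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.convs = nn.Sequential(
            nn.Conv2d(half, half, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(half, half, 3, padding=1), nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        part1, part2 = torch.chunk(x, 2, dim=1)    # split along channels
        return torch.cat([part1, self.convs(part2)], dim=1)

print(CSPBlock(64)(torch.zeros(1, 64, 52, 52)).shape)   # torch.Size([1, 64, 52, 52])
```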
Neck: The purpose of the neck block is to add extra layers between the backbone and the head (dense prediction block). You might see that different feature maps from the different layers used.
YoloV4 used a modified version of the PANet (Path Aggregation Network). The idea is again to aggregate information to get higher accuracy. Rather than addition, it does concatenation
Another technique used is the Spatial Attention Module (SAM). Attention mechanisms have been widely used in deep learning, especially in recurrent neural networks. It is similar in spirit to SENet (Squeeze-and-Excitation Network)
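A CBAM-style spatial attention sketch; YOLOv4 actually uses a modified point-wise variant, but the idea is the same: learn a mask and use it to reweight the feature map.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)          # average over channels
        mx, _ = x.max(dim=1, keepdim=True)         # max over channels
        mask = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * mask                            # reweight spatial positions

print(SpatialAttention()(torch.zeros(1, 128, 13, 13)).shape)   # torch.Size([1, 128, 13, 13])
```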
Finally, Spatial Pyramid Pooling (SPP), used in R-CNN networks and numerous other algorithms, is also used here.
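A sketch of the SPP block as used in YOLO-style detectors (kernel sizes 5/9/13 with stride 1, so spatial dimensions are preserved), concatenated with the input:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList([nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels])

    def forward(self, x):
        # parallel max-pools at several receptive-field sizes, stacked along channels
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)

print(SPP()(torch.zeros(1, 512, 13, 13)).shape)   # torch.Size([1, 2048, 13, 13])
```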