YOLO

YOLO: A Brief History

YOLO (You Only Look Once), a popular object detection and image segmentation model, was developed by Joseph Redmon and Ali Farhadi at the University of Washington. Launched in 2015, YOLO quickly gained popularity for its high speed and accuracy.

  • YOLOv2, released in 2016, improved the original model by incorporating batch normalization, anchor boxes, and dimension clusters.

  • YOLOv3, launched in 2018, further enhanced the model's performance using a more efficient backbone network, multiple anchors and spatial pyramid pooling.

  • YOLOv4 was released in 2020, introducing innovations like Mosaic data augmentation, a new anchor-free detection head, and a new loss function.

  • YOLOv5 further improved the model's performance and added new features such as hyperparameter optimization, integrated experiment tracking, and automatic export to popular formats.

  • YOLOv6 was open-sourced by Meituan in 2022 and is in use in many of the company's autonomous delivery robots.

  • YOLOv7 added additional tasks such as pose estimation on the COCO keypoints dataset.

  • YOLOv8 is the latest version of YOLO by Ultralytics. As a cutting-edge, state-of-the-art (SOTA) model, YOLOv8 builds on the success of previous versions, introducing new features and improvements for enhanced performance, flexibility, and efficiency. YOLOv8 supports a full range of vision AI tasks, including detection, segmentation, pose estimation, tracking, and classification. This versatility allows users to leverage YOLOv8's capabilities across diverse applications and domains.

Accuracy Evaluation

  1. Intersection over Union (IoU): IoU measures the overlap between the predicted bounding box and the ground truth bounding box. It is calculated as the area of intersection divided by the area of union.

    IoU = \frac{\text{Area of Intersection}}{\text{Area of Union}}

    Typically, a threshold (e.g., IoU > 0.5) is set to determine whether a detection is considered correct.

  2. Precision and Recall: Precision is the ratio of true positive detections to the total number of positive detections (true positives + false positives). Recall is the ratio of true positives to the total number of actual positive instances (true positives + false negatives).

    Precision = \frac{\text{True Positives}}{\text{True Positives + False Positives}}

    Recall = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}

  3. Average Precision (AP): AP is commonly used in object detection. It involves calculating precision-recall curves for different confidence thresholds and then computing the area under the curve (AUC). The mean AP (mAP) is the average AP across multiple object classes.

  4. F1 Score: F1 score is the harmonic mean of precision and recall, providing a balance between the two metrics.

    F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}

  5. mAP (mean Average Precision): mAP is the average of the AP values across all object classes. It provides a comprehensive evaluation of the model's performance on multiple classes.

  6. False Positive Rate (FPR): FPR measures the ratio of false positive detections to the total number of actual negative instances.

    FPR = \frac{\text{False Positives}}{\text{False Positives + True Negatives}}

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

Accuracy = (TP + TN) / (TP + FP + FN + TN)

F1-score = (2 × Precision × Recall) / (Precision + Recall)
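
As a worked illustration of the formulas above, here is a minimal sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) format and already-counted TP/FP/FN/TN values; the function names and the sample numbers are purely illustrative:

def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); the intersection is clipped to zero when the boxes do not overlap.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def detection_metrics(tp, fp, fn, tn=0):
    # Count-based metrics from the confusion-matrix entries.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1


print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ~ 0.143
print(detection_metrics(tp=80, fp=20, fn=10))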

Speed Evaluation

  1. Inference Time:

    • Measure the time it takes for the model to process a single image or a batch of images. This is often referred to as the inference time per image or batch.

  2. Frames Per Second (FPS):

    • Calculate the frames per second by taking the reciprocal of the inference time. It gives you an idea of how many frames the model can process in one second.

    \text{FPS} = \frac{1}{\text{Inference Time per Image}}

  3. Latency:

    • Latency measures the delay between sending an input to the model and receiving the output. It includes both processing time and any additional overhead.

  4. Throughput:

    • Throughput is the number of images processed per unit of time. For single-image, single-stream processing it can be approximated as the inverse of latency.

    \text{Throughput} = \frac{1}{\text{Latency}}

  5. Model Size:

    • Consider the size of the model, as larger models may take longer to load into memory and execute.

  6. Hardware Acceleration:

    • Evaluate the impact of hardware acceleration (e.g., GPU, TPU) on the speed of inference. Different hardware platforms can significantly affect the performance.

  7. Optimization Techniques:

    • Apply model optimization techniques such as quantization, pruning, and model compression to reduce the model size and improve inference speed.

  8. Batch Size Analysis:

    • Evaluate the effect of different batch sizes on the model's inference speed. Larger batch sizes may lead to better parallelization and improved throughput.
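
A minimal timing sketch for the first two metrics and the batch-size analysis; the placeholder model, input resolution, and iteration counts are illustrative assumptions, not a prescribed benchmark protocol:

import time

import torch
import torch.nn as nn

# Placeholder model standing in for a detector; any nn.Module works here.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 16, 3, padding=1))
model.eval()


def benchmark(batch_size, iters=50, warmup=5):
    x = torch.rand(batch_size, 3, 224, 224)
    with torch.no_grad():
        for _ in range(warmup):        # warm-up iterations are excluded from timing
            model(x)
        t0 = time.time()
        for _ in range(iters):
            model(x)
        elapsed = time.time() - t0
    per_batch = elapsed / iters        # inference time per batch
    per_image = per_batch / batch_size # inference time per image
    fps = 1.0 / per_image              # frames per second
    return per_image, fps


for bs in (1, 8, 32):                  # batch size analysis
    per_image, fps = benchmark(bs)
    print(f"batch={bs}: {per_image * 1000:.2f} ms/image, {fps:.1f} FPS")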

Loss Function

The YOLO loss function consists of three components:

  1. Objectness Loss: This component measures how well the model predicts whether an object is present in a grid cell or not. It uses binary cross-entropy loss to compare the predicted objectness score with the ground truth (whether an object is present or not).

  2. Localization Loss: YOLO predicts bounding boxes for detected objects. The localization loss measures the difference between the predicted and ground truth bounding box coordinates, covering both the box center position and the box width and height.

  3. Classification Loss: YOLO performs object classification for each bounding box. The classification loss measures the difference between the predicted class probabilities and the ground truth class probabilities. It uses categorical cross-entropy loss for this purpose.

The overall YOLO loss is a combination of these three components. The model aims to minimize this composite loss function during training to improve both object localization and classification accuracy.

YOLOv5

YOLOv5's architecture consists of three main parts:

  • Backbone: This is the main body of the network. For YOLOv5, the backbone is designed using the New CSP-Darknet53 structure, a modification of the Darknet architecture used in previous versions.

  • Neck: This part connects the backbone and the head. In YOLOv5, SPPF and New CSP-PAN structures are utilized.

  • Head: This part is responsible for generating the final output. YOLOv5 uses the YOLOv3 Head for this purpose.

The structure of the model is depicted in the image below. The model structure details can be found in yolov5l.yaml.

YOLOv5 introduces some minor changes compared to its predecessors:

  1. The Focus structure, found in earlier versions, is replaced with a 6x6 Conv2d structure. This change boosts efficiency (#4825).

  2. The SPP structure is replaced with SPPF. This alteration more than doubles the speed of processing.

To test the speed of SPP and SPPF, the following code can be used:

import time
import torch
import torch.nn as nn


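# SPP: three parallel max-pooling branches (5x5, 9x9, 13x13), concatenated with the input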
class SPP(nn.Module):
    def __init__(self):
        super().__init__()
        self.maxpool1 = nn.MaxPool2d(5, 1, padding=2)
        self.maxpool2 = nn.MaxPool2d(9, 1, padding=4)
        self.maxpool3 = nn.MaxPool2d(13, 1, padding=6)

    def forward(self, x):
        o1 = self.maxpool1(x)
        o2 = self.maxpool2(x)
        o3 = self.maxpool3(x)
        return torch.cat([x, o1, o2, o3], dim=1)


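# SPPF: three sequential 5x5 max-poolings; equivalent output to SPP above, but faster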
class SPPF(nn.Module):
    def __init__(self):
        super().__init__()
        self.maxpool = nn.MaxPool2d(5, 1, padding=2)

    def forward(self, x):
        o1 = self.maxpool(x)
        o2 = self.maxpool(o1)
        o3 = self.maxpool(o2)
        return torch.cat([x, o1, o2, o3], dim=1)


def main():
    input_tensor = torch.rand(8, 32, 16, 16)
    spp = SPP()
    sppf = SPPF()
    output1 = spp(input_tensor)
    output2 = sppf(input_tensor)

    print(torch.equal(output1, output2))

    t_start = time.time()
    for _ in range(100):
        spp(input_tensor)
    print(f"SPP time: {time.time() - t_start}")

    t_start = time.time()
    for _ in range(100):
        sppf(input_tensor)
    print(f"SPPF time: {time.time() - t_start}")


if __name__ == '__main__':
    main()

Data Augmentation Techniques

YOLOv5 employs various data augmentation techniques to improve the model's ability to generalize and reduce overfitting. These techniques include:

  • Mosaic Augmentation: An image processing technique that combines four training images into one in ways that encourage object detection models to better handle various object scales and translations.

  • Copy-Paste Augmentation: An innovative data augmentation method that copies random patches from an image and pastes them onto another randomly chosen image, effectively generating a new training sample.

  • Random Affine Transformations: This includes random rotation, scaling, translation, and shearing of the images.

  • MixUp Augmentation: A method that creates composite images by taking a linear combination of two images and their associated labels (a minimal sketch follows this list).

  • Albumentations: A powerful library for image augmenting that supports a wide variety of augmentation techniques.

  • HSV Augmentation: Random changes to the Hue, Saturation, and Value of the images.

  • Random Horizontal Flip: An augmentation method that randomly flips images horizontally.
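
As an illustration of MixUp, the following minimal sketch blends two images; the tensor shapes, the simple (class, x, y, w, h) label format, and the Beta-distribution parameter are illustrative assumptions rather than YOLOv5's exact implementation. Detection-style MixUp typically keeps the box labels of both images:

import numpy as np
import torch


def mixup(img1, labels1, img2, labels2, alpha=32.0):
    # Blend the two images with a ratio drawn from a Beta distribution.
    lam = np.random.beta(alpha, alpha)
    mixed = lam * img1 + (1 - lam) * img2
    # For detection, the box labels of both images are simply concatenated.
    mixed_labels = labels1 + labels2
    return mixed, mixed_labels


img_a, img_b = torch.rand(3, 640, 640), torch.rand(3, 640, 640)
boxes_a = [(0, 0.5, 0.5, 0.2, 0.3)]  # (class, x, y, w, h), normalized - illustrative only
boxes_b = [(1, 0.3, 0.7, 0.4, 0.2)]
mixed_img, mixed_boxes = mixup(img_a, boxes_a, img_b, boxes_b)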

Training Strategies

YOLOv5 applies several sophisticated training strategies to enhance the model's performance. They include:

  • Multiscale Training: The input images are randomly rescaled within a range of 0.5 to 1.5 times their original size during the training process.

  • AutoAnchor: This strategy optimizes the prior anchor boxes to match the statistical characteristics of the ground truth boxes in your custom data.

  • Warmup and Cosine LR Scheduler: A method to adjust the learning rate to enhance model performance.

  • Exponential Moving Average (EMA): A strategy that uses the average of parameters over past steps to stabilize the training process and reduce generalization error (a minimal sketch follows this list).

  • Mixed Precision Training: A method to perform operations in half-precision format, reducing memory usage and enhancing computational speed.

  • Hyperparameter Evolution: A strategy to automatically tune hyperparameters to achieve optimal performance.
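
A minimal sketch of the EMA idea: a shadow copy of the weights is updated after every optimizer step. The decay value and the toy model are illustrative; YOLOv5's actual implementation also adjusts the decay over the course of training and tracks buffers, which are omitted here for brevity:

import copy

import torch
import torch.nn as nn


class ModelEMA:
    def __init__(self, model, decay=0.9999):
        # Keep a frozen shadow copy whose weights track an exponential moving average.
        self.ema = copy.deepcopy(model).eval()
        self.decay = decay
        for p in self.ema.parameters():
            p.requires_grad_(False)

    def update(self, model):
        with torch.no_grad():
            for ema_p, p in zip(self.ema.parameters(), model.parameters()):
                ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)


model = nn.Linear(10, 2)
ema = ModelEMA(model)
# ... after each optimizer.step() during training:
ema.update(model)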

Additional Features

Compute Losses

The loss in YOLOv5 is computed as a combination of three individual loss components:

  • Classes Loss (BCE Loss): Binary Cross-Entropy loss, measures the error for the classification task.

  • Objectness Loss (BCE Loss): Another Binary Cross-Entropy loss, calculates the error in detecting whether an object is present in a particular grid cell or not.

  • Location Loss (CIoU Loss): Complete IoU loss, measures the error in localizing the object within the grid cell.

The overall loss function is a weighted sum of these components:

Loss = \lambda_1 L_{cls} + \lambda_2 L_{obj} + \lambda_3 L_{loc}

Balance Losses

The objectness losses of the three prediction layers (P3, P4, P5) are weighted differently. The balance weights are [4.0, 1.0, 0.4] respectively. This approach ensures that the predictions at different scales contribute appropriately to the total loss.
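
A minimal sketch of how the components and the per-layer objectness balance might be combined; all loss values and the component gains are hypothetical (YOLOv5 exposes comparable gains as training hyperparameters):

# Per-layer losses for P3, P4, P5 (hypothetical values for illustration).
cls_losses = [0.30, 0.25, 0.20]  # BCE classification loss per prediction layer
obj_losses = [0.40, 0.35, 0.30]  # BCE objectness loss per prediction layer
box_losses = [0.10, 0.08, 0.06]  # CIoU localization loss per prediction layer

balance = [4.0, 1.0, 0.4]        # objectness balance weights for P3, P4, P5

l_cls = sum(cls_losses)
l_obj = sum(w * l for w, l in zip(balance, obj_losses))
l_box = sum(box_losses)

# Hypothetical component gains standing in for the lambdas above.
lambda_cls, lambda_obj, lambda_box = 0.5, 1.0, 0.05
loss = lambda_cls * l_cls + lambda_obj * l_obj + lambda_box * l_box
print(loss)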

Eliminate Grid Sensitivity

The YOLOv5 architecture makes some important changes to the box prediction strategy compared to earlier versions of YOLO. In YOLOv2 and YOLOv3, the box coordinates were directly predicted using the activation of the last layer.

However, in YOLOv5, the formula for predicting the box coordinates has been updated to reduce grid sensitivity and prevent the model from predicting unbounded box dimensions.

The revised formulas for calculating the predicted bounding box are as follows:

b_x = 2\sigma(t_x) - 0.5 + c_x

b_y = 2\sigma(t_y) - 0.5 + c_y

b_w = p_w \cdot (2\sigma(t_w))^2

b_h = p_h \cdot (2\sigma(t_h))^2

Compared with the earlier formulation, the center point offset range is expanded from (0, 1) to (-0.5, 1.5), so the predicted offset can more easily reach exactly 0 or 1, and the width and height scaling is bounded, preventing the unbounded box dimensions mentioned above.
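
A minimal sketch of this decoding step, assuming hypothetical tensor names, a single illustrative anchor, and a stride of 8; YOLOv5's actual implementation operates on its own layer outputs:

import torch


def decode(raw, grid_xy, anchor_wh, stride):
    # raw: (..., 4) raw network outputs (tx, ty, tw, th) for one prediction layer.
    y = raw.sigmoid()
    xy = (y[..., 0:2] * 2.0 - 0.5 + grid_xy) * stride  # center, offset range (-0.5, 1.5) per cell
    wh = (y[..., 2:4] * 2.0) ** 2 * anchor_wh           # width/height, bounded to (0, 4) x anchor
    return torch.cat([xy, wh], dim=-1)


raw = torch.randn(1, 3, 20, 20, 4)  # batch, anchors, grid y, grid x, (tx, ty, tw, th)
gy, gx = torch.meshgrid(torch.arange(20), torch.arange(20), indexing="ij")
grid_xy = torch.stack([gx, gy], dim=-1).float()  # (20, 20, 2) cell indices
anchor_wh = torch.tensor([10.0, 13.0])           # one illustrative anchor (w, h) in pixels
boxes = decode(raw, grid_xy, anchor_wh, stride=8)
print(boxes.shape)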

Build Targets

The build target process in YOLOv5 is critical for training efficiency and model accuracy. It involves assigning ground truth boxes to the appropriate grid cells in the output map and matching them with the appropriate anchor boxes.

This process follows these steps:

  • Calculate the ratio of the ground truth box dimensions and the dimensions of each anchor template.

  • If the calculated ratio is within the threshold, match the ground truth box with the corresponding anchor.

  • Assign the matched anchor to the appropriate cells. Because the revised center point offset range is (-0.5, 1.5) rather than (0, 1), a ground truth box can fall within the offset range of neighboring cells and can therefore be assigned to more than one anchor, as sketched below.

In this way, the build targets process ensures that each ground truth object is properly assigned and matched during training, allowing YOLOv5 to learn the task of object detection more effectively.
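
As a minimal sketch of the ratio-based matching described above (the ground truth and anchor sizes are illustrative; the 4.0 threshold mirrors YOLOv5's default anchor_t hyperparameter, stated here as an assumption):

import torch


def match_anchors(gt_wh, anchors_wh, threshold=4.0):
    # gt_wh: (N, 2) ground truth box sizes; anchors_wh: (A, 2) anchor template sizes.
    # A box matches an anchor when max(w_gt/w_a, w_a/w_gt, h_gt/h_a, h_a/h_gt) < threshold.
    ratio = gt_wh[:, None, :] / anchors_wh[None, :, :]       # (N, A, 2)
    worst = torch.max(ratio, 1.0 / ratio).max(dim=2).values  # (N, A) worst-case side ratio
    return worst < threshold                                 # (N, A) boolean match matrix


gt_wh = torch.tensor([[30.0, 60.0], [200.0, 180.0]])
anchors_wh = torch.tensor([[10.0, 13.0], [30.0, 61.0], [156.0, 198.0]])
print(match_anchors(gt_wh, anchors_wh))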
