Introduction

Object detection is the task of finding objects of interest in images and localizing them with bounding boxes while simultaneously predicting their class labels. Unlike image classification which assigns one label to an entire image, detection requires localizing multiple objects of various classes in a single image. A detector might identify "person at (x1, y1, x2, y2)," "car at (x3, y3, x4, y4)," etc., with associated confidence scores.

Object detection is fundamental for autonomous systems: self-driving cars must detect pedestrians, vehicles, and road signs; security systems must find people and threats; retail systems must identify products. The task combines localization (where objects are) with classification (what they are).

The field evolved from sliding windows over hand-crafted features (2000s), through R-CNN (2014), to modern real-time detectors running at 60+ fps with accuracy rivaling slower two-stage methods. Two paradigms emerged: region-based methods (propose regions, then classify them) and single-stage methods (detect in one pass). Today, both approaches coexist with different trade-offs.

"Object detection is the bridge between image understanding and scene comprehension. It's where computer vision meets real-world applications." -- Ross Girshick, Facebook AI Research

Problem Definition

Formal Specification

Given an image I, predict a set of detections {(b_i, c_i, s_i)} where:

  • b_i = bounding box (x, y, width, height) or (x1, y1, x2, y2)
  • c_i = class label from {1, ..., K}
  • s_i = confidence score in [0, 1]

The model must learn both where objects are (localization) and what they are (classification), in a single end-to-end system.

Key Challenges

  • Scale variation: Objects range from tiny (a few pixels) to huge (filling the frame)
  • Aspect ratio variation: Boxes from square to extreme rectangles
  • Occlusion: Overlapping objects; must separate overlapping regions
  • Background clutter: Distinguish objects from similar background
  • Localization precision: Tight bounding boxes required, not just class prediction
  • Real-time requirements: Many applications demand fast inference

Evaluation Metric: mAP

Mean Average Precision. For each class, compute the precision-recall curve and the area under it (average precision, AP); the mean across classes gives mAP. Standard variants: mAP@IoU=0.5 (a prediction matches if IoU >= 0.5) and mAP@0.5:0.95 (averaged over IoU thresholds from 0.5 to 0.95).
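The IoU test at the heart of this metric is easy to sketch; a minimal pure-Python version for boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    # Intersection rectangle (zero area if the boxes do not overlap).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A prediction counts as a match when this value exceeds the chosen threshold; identical boxes score 1.0, disjoint boxes 0.0.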

Classical Object Detection

Sliding Window + Hand-Crafted Features

Early approach: slide a window across image at multiple scales, extract hand-crafted features (SIFT, HOG), classify window with SVM or boosting.

Limitations: Computationally expensive (thousands of windows per image), requires careful feature engineering, limited by feature quality.

Deformable Part Models (DPM)

Model objects as deformable combinations of parts. Star-structured graphical model where central part connects to limbs. Significantly outperformed HOG+SVM but required complex inference.

Selective Search

Generate candidate regions hierarchically by grouping similar superpixels. More efficient than sliding window; ~2000 candidate regions per image. Forms basis for R-CNN.

Region-Based Methods (R-CNN Family)

R-CNN (2014)

Breakthrough: combine selective search regions with CNN features.

  1. Generate ~2000 region proposals via selective search
  2. Extract CNN features from each region (extract region, warp to fixed size, forward through pre-trained CNN)
  3. Classify each region with SVM
  4. Refine bounding boxes with linear regression

Accuracy: 54% mAP on PASCAL VOC 2007 (massive improvement over DPM's 33%)

Speed: ~50 seconds per image (impractical for real-time)

Fast R-CNN (2015)

Key innovation: RoI pooling. Extract features once from entire image, then crop feature maps for each region.

Image -> CNN backbone -> feature maps
For each region proposal:
    Crop the region from the feature maps (RoI pooling)
    -> classification head
    -> bounding-box regression head
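RoI pooling itself is just a per-region max-pool over a fixed grid of bins. A minimal sketch on a 2-D feature map, using integer bin boundaries for simplicity (the real layer handles fractional coordinates, and RoI Align later improved on it via bilinear sampling):

```python
def roi_pool(feature, roi, out_size):
    # feature: 2-D list (H x W); roi: (x1, y1, x2, y2) in feature-map coords;
    # out_size: (out_h, out_w) of the fixed-size pooled output.
    x1, y1, x2, y2 = roi
    oh, ow = out_size
    h, w = y2 - y1, x2 - x1
    pooled = [[float("-inf")] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            # Sub-window of the RoI assigned to output cell (i, j).
            ys, ye = y1 + i * h // oh, y1 + (i + 1) * h // oh
            xs, xe = x1 + j * w // ow, x1 + (j + 1) * w // ow
            # Guard against empty bins when the RoI is smaller than the grid.
            for y in range(ys, max(ye, ys + 1)):
                for x in range(xs, max(xe, xs + 1)):
                    pooled[i][j] = max(pooled[i][j], feature[y][x])
    return pooled
```

Whatever the region's size, the output is always `out_size`, so it can feed fixed-size classification and regression heads.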

Speedup: ~200x faster than R-CNN at test time (excluding proposal generation); training drops from 84 hours to roughly 9 hours

Accuracy: 66% mAP (even better, due to end-to-end training)

Faster R-CNN (2015)

Replace selective search with learnable Region Proposal Network (RPN).

Region Proposal Network

Slide small network over feature maps. At each position, predict:

  • Objectness score (is there an object here?)
  • Bounding box adjustments for k anchor boxes

Anchors: predefined box sizes and aspect ratios at each location. Network learns delta offsets, not absolute positions.
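The delta parameterization can be sketched as follows (the standard Faster R-CNN encoding: center offsets are scaled by the anchor size, and width/height deltas live in log space):

```python
import math

def decode_anchor(anchor, deltas):
    # anchor: (cx, cy, w, h); deltas: (dx, dy, dw, dh) predicted by the network.
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    px = cx + dx * w          # shift the center proportionally to anchor size
    py = cy + dy * h
    pw = w * math.exp(dw)     # scale width/height multiplicatively
    ph = h * math.exp(dh)
    return (px, py, pw, ph)
```

Zero deltas return the anchor unchanged, which is why anchors that already roughly fit an object are easy for the network to refine.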

Speedup: Region proposals are now learned inside the network; the full system runs about 10x faster than Fast R-CNN with selective-search proposals (~0.2 s vs ~2 s per image)

Accuracy: 69% mAP with faster inference

Faster R-CNN Architecture Summary

Component             Function                            Output
Backbone              Feature extraction (e.g., ResNet)   Feature maps (reduced resolution)
RPN                   Generate region proposals           Boxes with objectness scores
RoI Pooling           Extract features per region         Fixed-size feature vectors
Classification head   Classify each region                Class scores per region
Bbox regression head  Refine box coordinates              Delta offsets (dx, dy, dw, dh)

(The original Faster R-CNN uses RoI Pooling; RoI Align, a more precise bilinearly-sampled variant, was introduced later with Mask R-CNN.)

Cascade R-CNN

Chain several detection heads sequentially, each trained with a higher IoU threshold than the last. Early stages filter easy negatives; later stages refine the boxes from the previous stage, so they see progressively better proposals. Improves accuracy by roughly 4% AP at a cost in speed.

Single-Stage Detectors

YOLO (You Only Look Once, 2015)

Radical departure: treat detection as single regression problem. Divide image into SxS grid. Each grid cell predicts:

  • Bounding box coordinates and dimensions (4 values)
  • Objectness (probability object in cell)
  • Class probabilities

Single forward pass through network. No region proposals needed.
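A sketch of how one cell's raw outputs could be decoded into image coordinates (the names and normalization conventions here are illustrative, loosely following YOLOv1: offsets within the cell, sizes relative to the whole image):

```python
def decode_yolo_cell(row, col, pred, grid_size, img_size):
    # pred: (tx, ty, tw, th, objectness) for one box in grid cell (row, col).
    tx, ty, tw, th, objectness = pred
    cx = (col + tx) / grid_size * img_size   # box center x in pixels
    cy = (row + ty) / grid_size * img_size   # box center y in pixels
    w = tw * img_size                        # width relative to image
    h = th * img_size
    return (cx, cy, w, h, objectness)
```

Decoding every cell in one pass is what makes the single forward-pass formulation work: the grid structure replaces explicit region proposals.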

Speed: 45 fps on GPU, 155 fps on fast variant

Accuracy: 63.4% mAP on PASCAL VOC 2007 (lower than Faster R-CNN but real-time)

SSD (Single Shot MultiBox Detector, 2016)

Combines benefits of single-stage speed with multi-scale feature pyramids.

  1. Use CNN backbone (VGG, ResNet)
  2. Add multi-scale convolutional feature layers (progressively smaller)
  3. Use small convolutional filters to predict detections at each scale
  4. Combine predictions from multiple layers for multi-scale objects

Speed: 59 fps for SSD300, 74% mAP (better accuracy-speed trade-off than YOLO)

YOLOv3 and Later Versions

Version    Year       Key Improvement                               mAP
YOLOv1     2015       Single-stage detection                        63.4%
YOLOv3     2018       Multi-scale predictions, Darknet-53 backbone  76.6%
YOLOv4     2020       Advanced data augmentation, training tricks   77.8%
YOLOv5-v8  2020-2023  Attention mechanisms, improved efficiency     80%+

YOLOv3 added multi-scale predictions (detections at 3 scales), a large accuracy improvement. YOLOv4/v5+ added training optimizations (mosaic augmentation, improved NMS variants), bringing single-stage accuracy on par with region-based methods while maintaining the speed advantage.

Anchor-Free Methods

Motivation

Anchor-based methods (RPN, YOLO, SSD) require predefined anchor box sizes and aspect ratios. These are hyperparameters that need per-dataset tuning. Anchor-free methods instead predict boxes directly, e.g., a center point plus width and height.

CenterNet

Detect objects at their center point. For each location in feature map:

  • Predict whether center is at this location (heatmap)
  • Predict width and height from center

No anchor boxes needed. Inference: find peak values in heatmap, read box dimensions. Cleaner architecture, competitive accuracy.
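Peak extraction can be sketched as a 3x3 local-maximum filter over the heatmap (a hypothetical helper; real implementations run an equivalent max-pooling op on the GPU):

```python
def heatmap_peaks(heatmap, threshold):
    # Keep locations that are a 3x3 local maximum above `threshold`;
    # this replaces NMS in CenterNet-style decoding. Ties on exact
    # plateaus would yield multiple peaks (fine for a sketch).
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v < threshold:
                continue
            neighborhood = [
                heatmap[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            ]
            if v >= max(neighborhood):
                peaks.append((y, x, v))
    return peaks
```

Each surviving peak is then paired with the predicted width and height at that location to form a box.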

CornerNet

Detect the top-left and bottom-right corners of bounding boxes, avoiding anchor predefinition. Learned corner embeddings group matching corner pairs into boxes.

FCOS (Fully Convolutional One-Stage Object Detection)

For each location in feature map, predict:

  • Class score (what object is here?)
  • Distances to object boundaries (left, right, top, bottom)

Removes anchor dependence: each location inside a ground-truth box is responsible for detecting that object. Competitive with Faster R-CNN in accuracy, with faster inference.
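Decoding is direct: a location plus its four predicted distances defines the box (a trivial sketch):

```python
def decode_fcos(x, y, l, t, r, b):
    # A feature-map location (x, y) and its predicted distances to the
    # left, top, right, and bottom edges define the box with no anchors.
    return (x - l, y - t, x + r, y + b)
```

For example, a location at (50, 60) with distances (10, 20, 30, 40) yields the box (40, 40, 80, 100).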

Feature Extraction Backbones

All modern detectors use strong feature extractors (backbones). Standard choices:

Backbone      Parameters       Speed          Accuracy
ResNet-50     23.5M            Balanced       Good
ResNet-101    44.5M            Slower         Better
EfficientNet  Tunable (5-30M)  Variable       Excellent
MobileNet     3-5M             Fast (mobile)  Good

Choice depends on accuracy-speed trade-off. Mobile/embedded: MobileNet. Server deployment: ResNet-101 or EfficientNet.

Detection Loss Functions

Classification Loss

Cross-entropy loss for class prediction. Weighted to handle class imbalance (background vs. objects).

Localization Loss

Smooth L1 Loss (standard):

L_loc = sum_i smooth_L1(delta_pred,i - delta_gt,i),  where smooth_L1(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5

Quadratic near zero (stable gradients), linear for large errors (robust to outliers).

IoU Loss (modern): Directly optimize IoU metric

L_iou = 1 - IoU(pred_box, gt_box)

GIoU Loss: Generalizes IoU to account for non-overlapping boxes

L_giou = 1 - GIoU(pred_box, gt_box)

IoU-based losses directly optimize for evaluation metric (mAP uses IoU); more intuitive than coordinate-space losses.
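A sketch of the GIoU loss for axis-aligned boxes in (x1, y1, x2, y2) format (pure Python, assuming valid non-degenerate boxes):

```python
def giou(box_a, box_b):
    # Plain IoU first.
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou_val = inter / union
    # Smallest enclosing box C; the penalty term shrinks as boxes approach.
    cw = max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])
    ch = max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])
    c_area = cw * ch
    return iou_val - (c_area - union) / c_area

def giou_loss(pred_box, gt_box):
    return 1.0 - giou(pred_box, gt_box)
```

For disjoint boxes the loss exceeds 1 and still varies with distance, giving a useful gradient exactly where plain IoU loss saturates at 1.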

Objectness Loss

Binary cross-entropy for the "is there an object" score. Background anchors are down-weighted or subsampled to handle the extreme class imbalance (thousands of background anchors vs. a handful of object anchors per image).

Combined Loss

L_total = alpha * L_cls + beta * L_loc + gamma * L_obj

Typical: alpha=1, beta=1, gamma=1 for balanced training. May adjust based on dataset and requirements.

Non-Maximum Suppression (NMS)

Problem

Detectors often produce multiple overlapping predictions for same object. Need to keep best prediction, suppress others.

Standard NMS Algorithm

1. Sort detections by confidence score (descending)
2. While detections remaining:
 a. Keep detection with highest score
 b. Remove all detections with IoU > threshold (e.g., 0.5) to kept detection
3. Return kept detections

Threshold: typically 0.5 (IoU). Higher threshold keeps more overlaps; lower is more aggressive.
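The algorithm above translates directly into code; a minimal greedy NMS over (box, score) pairs:

```python
def box_iou(a, b):
    # a, b: boxes in (x1, y1, x2, y2) format.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.5):
    # detections: list of (box, score); returns kept detections, best first.
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Suppress everything that overlaps the kept detection too much.
        remaining = [d for d in remaining
                     if box_iou(best[0], d[0]) <= iou_threshold]
    return kept
```

Production code vectorizes this (e.g., per-class, on the GPU), but the logic is the same.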

Soft-NMS

Instead of hard deletion, reduce confidence score of overlapping detections based on IoU. Handles edge cases better (e.g., touching objects).

confidence_new = confidence_old * (1 - IoU)   (linear variant, applied when IoU exceeds a threshold; a Gaussian variant uses confidence_old * exp(-IoU^2 / sigma))

DIoU-NMS

Use DIoU (Distance IoU) instead of IoU. Accounts for box center distance, handles non-overlapping predictions better.

Modern detectors sometimes use more sophisticated NMS variants or differentiable alternatives that integrate into training.

Evaluation Metrics (mAP)

Average Precision (AP) Per Class

  1. Sort predictions by confidence score
  2. For each prediction, check if IoU > threshold with a ground-truth box (typically 0.5):
     - If it matches and that ground truth is not yet matched: True Positive
     - Otherwise: False Positive
  3. Compute precision and recall at each threshold
  4. Interpolate precision-recall curve, compute area under curve (AP)

AP@0.5: AP at IoU threshold 0.5 (loose matching)

AP@0.75: Stricter matching

AP@0.5:0.95: Average over IoU thresholds 0.5 to 0.95 in steps of 0.05 (strictest; the COCO standard)
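The AP computation can be sketched as follows, assuming predictions have already been matched to ground truth (steps 1-2 above) and sorted by descending confidence:

```python
def average_precision(matches, num_gt):
    # matches: per-prediction True/False (TP/FP), sorted by descending score;
    # num_gt: number of ground-truth boxes for this class.
    tp = fp = 0
    precisions, recalls = [], []
    for is_tp in matches:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Interpolate: make precision non-increasing from right to left,
    # then integrate the precision-recall curve over recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap
```

Averaging this value over classes (and, for COCO, over IoU thresholds) gives mAP.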

Mean AP (mAP)

Average AP across all classes. Most important metric.

Example: COCO Dataset Results

Detector      AP@0.5:0.95  Speed (fps)
Faster R-CNN  42.0%        7
YOLOv5        50.7%        160
EfficientDet  51.0%        95

(Indicative figures; exact values depend on model size, backbone, input resolution, and hardware.)

Note the speed-accuracy trade-off: the two-stage Faster R-CNN is slow at roughly 7 fps, while modern single-stage detectors such as YOLOv5 match or exceed its accuracy at real-time speeds.

Training Strategies and Data

Data Augmentation

  • Random cropping, flipping, scaling
  • Color jittering (brightness, contrast, saturation)
  • Mosaic augmentation: combine 4 images in grid, improve small object detection
  • MixUp: blend two images and combine labels
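MixUp is a weighted blend per pixel; a sketch with images flattened to 1-D lists for brevity (lam is sampled from Beta(alpha, alpha) as in the MixUp paper, or passed explicitly for a deterministic result):

```python
import random

def mixup(img_a, img_b, lam=None, alpha=0.2):
    # img_a, img_b: images flattened to 1-D lists of floats.
    if lam is None:
        lam = random.betavariate(alpha, alpha)
    mixed = [lam * a + (1 - lam) * b for a, b in zip(img_a, img_b)]
    return mixed, lam
```

For detection, the boxes themselves are not blended: both images' labels are kept, typically with loss weights lam and (1 - lam).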

Class Imbalance Handling

  • Hard negative mining: Prioritize background anchors that are confidently mis-classified
  • OHEM (Online Hard Example Mining): Dynamic weighting of loss per sample
  • Focal Loss: Down-weight easy background; focus on hard objects
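Focal loss is compact enough to sketch in full (binary form, with the gamma=2, alpha=0.25 defaults from the RetinaNet paper):

```python
import math

def focal_loss(p, is_foreground, gamma=2.0, alpha=0.25):
    # p: predicted probability of the foreground class.
    # (1 - p_t)^gamma down-weights examples the model already gets right.
    p_t = p if is_foreground else 1.0 - p
    alpha_t = alpha if is_foreground else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

An easy, confidently-classified background anchor contributes almost nothing, so the sea of easy negatives no longer drowns out the few hard object examples.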

Multi-Scale Training

Train on multiple input resolutions (e.g., 416x416, 512x512, 608x608). Improves robustness to scale variation. Increases training time but significantly improves accuracy.

Anchor Box Selection

For anchor-based methods, analyze dataset to determine good anchor sizes and aspect ratios. K-means clustering on ground truth boxes to find representative anchors. Custom anchors per-dataset improve accuracy.
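A sketch of IoU-based k-means on (width, height) pairs, in the style popularized by YOLOv2 (seeded from the first k boxes for determinism; real code uses random or k-means++ seeding):

```python
def iou_wh(a, b):
    # IoU of two boxes aligned at the origin, given only (w, h).
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=10):
    # boxes: list of ground-truth (w, h) pairs; distance metric is 1 - IoU,
    # so assignment picks the centroid with the highest IoU.
    centroids = list(boxes[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            i = max(range(k), key=lambda j: iou_wh(b, centroids[j]))
            clusters[i].append(b)
        for j, c in enumerate(clusters):
            if c:
                centroids[j] = (sum(b[0] for b in c) / len(c),
                                sum(b[1] for b in c) / len(c))
    return centroids
```

The resulting centroids become the anchor (width, height) pairs; using 1 - IoU rather than Euclidean distance keeps large boxes from dominating the clustering.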

Learning Rate Scheduling

Typical: start with a relatively high learning rate (e.g., 0.01-0.02 with SGD for detection), decay by 10x at fixed milestones (e.g., epochs 100 and 150 of a 200-epoch run). Cosine annealing is becoming standard: smoothly reduce the LR following a cosine curve.
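The cosine schedule is a one-line formula; a sketch:

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    # Smooth decay from lr_max (step 0) to lr_min (final step).
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

At the halfway point the LR is exactly the midpoint of lr_max and lr_min; the decay is gentle at the start and end, steepest in the middle.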

Challenges and Advanced Topics

Small Object Detection

Objects < 32x32 pixels are challenging. Solutions: higher resolution input, FPN with more levels, specialized small-object detectors.

Crowded Scenes

Dense overlapping objects hard to separate. Need dense prediction, attention mechanisms to focus on object boundaries.

Domain Shift

Models trained on COCO fail on different domains (nighttime, different sensor, etc.). Requires fine-tuning or domain adaptation.

Real-Time on Edge Devices

Mobile/embedded constraints. Use MobileNet backbones, quantization, knowledge distillation. Trade accuracy for speed.

3D Object Detection

Detect and localize objects in 3D space using monocular images, stereo, or LiDAR. Predicts 3D bounding box (center, dimensions, orientation). Standard in autonomous driving.

References

  • Girshick, R., Donahue, J., Darrell, T., and Malik, J. "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation." IEEE CVPR, 2014.
  • Girshick, R. "Fast R-CNN." IEEE ICCV, 2015.
  • Ren, S., He, K., Girshick, R., and Sun, J. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." NeurIPS, 2015.
  • Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. "You Only Look Once: Unified, Real-Time Object Detection." IEEE CVPR, 2016.
  • Liu, W., Anguelov, D., Erhan, D., et al. "SSD: Single Shot MultiBox Detector." ECCV, 2016.
  • Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. "Feature Pyramid Networks for Object Detection." IEEE CVPR, 2017.