Introduction

Object detection is the task of finding objects of interest in images and localizing them with bounding boxes while simultaneously predicting their class labels. Unlike image classification which assigns one label to an entire image, detection requires localizing multiple objects of various classes in a single image. A detector might identify "person at (x1, y1, x2, y2)," "car at (x3, y3, x4, y4)," etc., with associated confidence scores.

Object detection is fundamental for autonomous systems: self-driving cars must detect pedestrians, vehicles, and road signs; security systems must find people and threats; retail systems must identify products. The task combines localization (where objects are) with classification (what they are).

The field evolved from sliding windows over hand-crafted features (2000s), through R-CNN (2014), to modern real-time detectors running at 60+ fps with accuracy rivaling slower two-stage methods. Two paradigms emerged: region-based methods (propose regions, then classify them) and single-stage methods (detect in one pass). Today, both approaches coexist with different trade-offs.

"Object detection is the bridge between image understanding and scene comprehension. It's where computer vision meets real-world applications." -- Ross Girshick, Facebook AI Research

Problem Definition

Formal Specification

Given an image I, predict a set of detections {(b_i, c_i, s_i)} where:

  • b_i = bounding box (x, y, width, height) or (x1, y1, x2, y2)
  • c_i = class label from {1, ..., K}
  • s_i = confidence score in [0, 1]

The model must learn both where objects are (localization) and what they are (classification), in a single end-to-end system.

Key Challenges

  • Scale variation: Objects range from tiny (a few pixels) to huge (filling the frame)
  • Aspect ratio variation: Boxes from square to extreme rectangles
  • Occlusion: Overlapping objects; must separate overlapping regions
  • Background clutter: Distinguish objects from similar background
  • Localization precision: Tight bounding boxes required, not just class prediction
  • Real-time requirements: Many applications demand fast inference

Evaluation Metric: mAP

Mean Average Precision. For each class, compute the precision-recall curve and the area under it (average precision, AP); the mean across classes gives mAP. Standard variants: mAP@IoU=0.5 (a prediction matches if IoU >= 0.5) and mAP@0.5:0.95 (averaged over IoU thresholds from 0.5 to 0.95).
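The IoU test at the heart of this metric is easy to sketch; a minimal pure-Python version for boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    # Intersection rectangle (zero area if the boxes do not overlap).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A prediction counts as a match when this value exceeds the chosen threshold; identical boxes score 1.0, disjoint boxes 0.0.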

Classical Object Detection

Sliding Window + Hand-Crafted Features

Early approach: slide a window across image at multiple scales, extract hand-crafted features (SIFT, HOG), classify window with SVM or boosting.

Limitations: Computationally expensive (thousands of windows per image), requires careful feature engineering, limited by feature quality.

Deformable Part Models (DPM)

Model objects as deformable combinations of parts. Star-structured graphical model where central part connects to limbs. Significantly outperformed HOG+SVM but required complex inference.

Selective Search

Generate candidate regions hierarchically by grouping similar superpixels. More efficient than sliding window; ~2000 candidate regions per image. Forms basis for R-CNN.

Region-Based Methods (R-CNN Family)

R-CNN (2014)

Breakthrough: combine selective search regions with CNN features.

  1. Generate ~2000 region proposals via selective search
  2. Extract CNN features from each region (extract region, warp to fixed size, forward through pre-trained CNN)
  3. Classify each region with SVM
  4. Refine bounding boxes with linear regression

Accuracy: 54% mAP on PASCAL VOC 2007 (massive improvement over DPM's 33%)

Speed: ~50 seconds per image (impractical for real-time)

Fast R-CNN (2015)

Key innovation: RoI pooling. Extract features once from entire image, then crop feature maps for each region.

Image -> CNN backbone -> feature maps
For each region proposal:
    Crop the region from the feature maps (RoI pooling)
    -> classification head
    -> bounding-box regression head
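RoI pooling itself is just a per-region max-pool over a fixed grid of bins. A minimal sketch on a 2-D feature map, using integer bin boundaries for simplicity (the real layer handles fractional coordinates, and RoI Align later improved on it via bilinear sampling):

```python
def roi_pool(feature, roi, out_size):
    # feature: 2-D list (H x W); roi: (x1, y1, x2, y2) in feature-map coords;
    # out_size: (out_h, out_w) of the fixed-size pooled output.
    x1, y1, x2, y2 = roi
    oh, ow = out_size
    h, w = y2 - y1, x2 - x1
    pooled = [[float("-inf")] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            # Sub-window of the RoI assigned to output cell (i, j).
            ys, ye = y1 + i * h // oh, y1 + (i + 1) * h // oh
            xs, xe = x1 + j * w // ow, x1 + (j + 1) * w // ow
            # Guard against empty bins when the RoI is smaller than the grid.
            for y in range(ys, max(ye, ys + 1)):
                for x in range(xs, max(xe, xs + 1)):
                    pooled[i][j] = max(pooled[i][j], feature[y][x])
    return pooled
```

Whatever the region's size, the output is always `out_size`, so it can feed fixed-size classification and regression heads.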

Speedup: ~200x faster than R-CNN at test time (excluding proposal generation); training drops from 84 hours to roughly 9 hours

Accuracy: 66% mAP (even better, due to end-to-end training)

Faster R-CNN (2015)

Replace selective search with learnable Region Proposal Network (RPN).

Region Proposal Network

Slide small network over feature maps. At each position, predict:

  • Objectness score (is there an object here?)
  • Bounding box adjustments for k anchor boxes

Anchors: predefined box sizes and aspect ratios at each location. Network learns delta offsets, not absolute positions.
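The delta parameterization can be sketched as follows (the standard Faster R-CNN encoding: center offsets are scaled by the anchor size, and width/height deltas live in log space):

```python
import math

def decode_anchor(anchor, deltas):
    # anchor: (cx, cy, w, h); deltas: (dx, dy, dw, dh) predicted by the network.
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    px = cx + dx * w          # shift the center proportionally to anchor size
    py = cy + dy * h
    pw = w * math.exp(dw)     # scale width/height multiplicatively
    ph = h * math.exp(dh)
    return (px, py, pw, ph)
```

Zero deltas return the anchor unchanged, which is why anchors that already roughly fit an object are easy for the network to refine.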

Speedup: Region proposals are now learned inside the network; the full system runs about 10x faster than Fast R-CNN with selective-search proposals (~0.2 s vs ~2 s per image)

Accuracy: 69% mAP with faster inference

Faster R-CNN Architecture Summary

Component             Function                            Output
Backbone              Feature extraction (e.g., ResNet)   Feature maps (reduced resolution)
RPN                   Generate region proposals           Boxes with objectness scores
RoI Pooling           Extract features per region         Fixed-size feature vectors
Classification head   Classify each region                Class scores per region
Bbox regression head  Refine box coordinates              Delta offsets (dx, dy, dw, dh)

(The original Faster R-CNN uses RoI Pooling; RoI Align, a more precise bilinearly-sampled variant, was introduced later with Mask R-CNN.)

Cascade R-CNN

Chain several detection heads sequentially, each trained with a higher IoU threshold than the last. Early stages filter easy negatives; later stages refine the boxes from the previous stage, so they see progressively better proposals. Improves accuracy by roughly 4% AP at a cost in speed.

Single-Stage Detectors

YOLO (You Only Look Once, 2015)

Radical departure: treat detection as single regression problem. Divide image into SxS grid. Each grid cell predicts:

  • Bounding box coordinates and dimensions (4 values)
  • Objectness (probability object in cell)
  • Class probabilities

Single forward pass through network. No region proposals needed.
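A sketch of how one cell's raw outputs could be decoded into image coordinates (the names and normalization conventions here are illustrative, loosely following YOLOv1: offsets within the cell, sizes relative to the whole image):

```python
def decode_yolo_cell(row, col, pred, grid_size, img_size):
    # pred: (tx, ty, tw, th, objectness) for one box in grid cell (row, col).
    tx, ty, tw, th, objectness = pred
    cx = (col + tx) / grid_size * img_size   # box center x in pixels
    cy = (row + ty) / grid_size * img_size   # box center y in pixels
    w = tw * img_size                        # width relative to image
    h = th * img_size
    return (cx, cy, w, h, objectness)
```

Decoding every cell in one pass is what makes the single forward-pass formulation work: the grid structure replaces explicit region proposals.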

Speed: 45 fps on GPU, 155 fps on fast variant

Accuracy: 63.4% mAP on PASCAL VOC 2007 (lower than Faster R-CNN but real-time)

SSD (Single Shot MultiBox Detector, 2016)

Combines benefits of single-stage speed with multi-scale feature pyramids.

  1. Use CNN backbone (VGG, ResNet)
  2. Add multi-scale convolutional feature layers (progressively smaller)
  3. Use small convolutional filters to predict detections at each scale
  4. Combine predictions from multiple layers for multi-scale objects

Speed: 59 fps for SSD300, 74% mAP (better accuracy-speed trade-off than YOLO)

YOLOv3 and Later Versions

Version    Year       Key Improvement                               mAP
YOLOv1     2015       Single-stage detection                        63.4%
YOLOv3     2018       Multi-scale predictions, Darknet-53 backbone  76.6%
YOLOv4     2020       Advanced data augmentation, training tricks   77.8%
YOLOv5-v8  2020-2023  Attention mechanisms, improved efficiency     80%+

YOLOv3 added multi-scale predictions (detections at 3 scales), a large accuracy improvement. YOLOv4/v5+ added training optimizations (mosaic augmentation, improved NMS variants), bringing single-stage accuracy on par with region-based methods while maintaining the speed advantage.

Anchor-Free Methods

Motivation

Anchor-based methods (RPN, YOLO, SSD) require predefined anchor box sizes and aspect ratios. These are hyperparameters that need per-dataset tuning. Anchor-free methods instead predict boxes directly, e.g., a center point plus width and height.

CenterNet

Detect objects at their center point. For each location in feature map:

  • Predict whether center is at this location (heatmap)
  • Predict width and height from center

No anchor boxes needed. Inference: find peak values in heatmap, read box dimensions. Cleaner architecture, competitive accuracy.
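Peak extraction can be sketched as a 3x3 local-maximum filter over the heatmap (a hypothetical helper; real implementations run an equivalent max-pooling op on the GPU):

```python
def heatmap_peaks(heatmap, threshold):
    # Keep locations that are a 3x3 local maximum above `threshold`;
    # this replaces NMS in CenterNet-style decoding. Ties on exact
    # plateaus would yield multiple peaks (fine for a sketch).
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v < threshold:
                continue
            neighborhood = [
                heatmap[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            ]
            if v >= max(neighborhood):
                peaks.append((y, x, v))
    return peaks
```

Each surviving peak is then paired with the predicted width and height at that location to form a box.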

CornerNet

Detect the top-left and bottom-right corners of bounding boxes, avoiding anchor predefinition. Learned corner embeddings group matching corner pairs into boxes.

FCOS (Fully Convolutional One-Stage Object Detection)

For each location in feature map, predict:

  • Class score (what object is here?)
  • Distances to object boundaries (left, right, top, bottom)

Removes anchor dependence: each location inside a ground-truth box is responsible for detecting that object. Competitive with Faster R-CNN in accuracy, with faster inference.
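Decoding is direct: a location plus its four predicted distances defines the box (a trivial sketch):

```python
def decode_fcos(x, y, l, t, r, b):
    # A feature-map location (x, y) and its predicted distances to the
    # left, top, right, and bottom edges define the box with no anchors.
    return (x - l, y - t, x + r, y + b)
```

For example, a location at (50, 60) with distances (10, 20, 30, 40) yields the box (40, 40, 80, 100).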

Feature Extraction Backbones

All modern detectors use strong feature extractors (backbones). Standard choices:

Backbone      Parameters       Speed          Accuracy
ResNet-50     23.5M            Balanced       Good
ResNet-101    44.5M            Slower         Better
EfficientNet  Tunable (5-30M)  Variable       Excellent
MobileNet     3-5M             Fast (mobile)  Good

Choice depends on accuracy-speed trade-off. Mobile/embedded: MobileNet. Server deployment: ResNet-101 or EfficientNet.

Detection Loss Functions

Classification Loss

Cross-entropy loss for class prediction. Weighted to handle class imbalance (background vs. objects).

Localization Loss

Smooth L1 Loss (standard):

L_loc = sum_i smooth_L1(delta_pred,i - delta_gt,i),  where smooth_L1(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5

Quadratic near zero (stable gradients), linear for large errors (robust to outliers).

IoU Loss (modern): Directly optimize IoU metric

L_iou = 1 - IoU(pred_box, gt_box)

GIoU Loss: Generalizes IoU to account for non-overlapping boxes

L_giou = 1 - GIoU(pred_box, gt_box)

IoU-based losses directly optimize for evaluation metric (mAP uses IoU); more intuitive than coordinate-space losses.
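A sketch of the GIoU loss for axis-aligned boxes in (x1, y1, x2, y2) format (pure Python, assuming valid non-degenerate boxes):

```python
def giou(box_a, box_b):
    # Plain IoU first.
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou_val = inter / union
    # Smallest enclosing box C; the penalty term shrinks as boxes approach.
    cw = max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])
    ch = max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])
    c_area = cw * ch
    return iou_val - (c_area - union) / c_area

def giou_loss(pred_box, gt_box):
    return 1.0 - giou(pred_box, gt_box)
```

For disjoint boxes the loss exceeds 1 and still varies with distance, giving a useful gradient exactly where plain IoU loss saturates at 1.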

Objectness Loss

Binary cross-entropy for the "is there an object" score. Background anchors are down-weighted or subsampled to handle the extreme class imbalance (thousands of background anchors vs. a handful of object anchors per image).

Combined Loss

L_total = alpha * L_cls + beta * L_loc + gamma * L_obj

Typical: alpha=1, beta=1, gamma=1 for balanced training. May adjust based on dataset and requirements.

Non-Maximum Suppression (NMS)

Problem

Detectors often produce multiple overlapping predictions for same object. Need to keep best prediction, suppress others.

Standard NMS Algorithm

1. Sort detections by confidence score (descending)
2. While detections remaining:
 a. Keep detection with highest score
 b. Remove all detections with IoU > threshold (e.g., 0.5) to kept detection
3. Return kept detections

Threshold: typically 0.5 (IoU). Higher threshold keeps more overlaps; lower is more aggressive.
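The algorithm above translates directly into code; a minimal greedy NMS over (box, score) pairs:

```python
def box_iou(a, b):
    # a, b: boxes in (x1, y1, x2, y2) format.
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(detections, iou_threshold=0.5):
    # detections: list of (box, score); returns kept detections, best first.
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Suppress everything that overlaps the kept detection too much.
        remaining = [d for d in remaining
                     if box_iou(best[0], d[0]) <= iou_threshold]
    return kept
```

Production code vectorizes this (e.g., per-class, on the GPU), but the logic is the same.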

Soft-NMS

Instead of hard deletion, reduce confidence score of overlapping detections based on IoU. Handles edge cases better (e.g., touching objects).

confidence_new = confidence_old * (1 - IoU)   (linear variant, applied when IoU exceeds a threshold; a Gaussian variant uses confidence_old * exp(-IoU^2 / sigma))

DIoU-NMS

Use DIoU (Distance IoU) instead of IoU. Accounts for box center distance, handles non-overlapping predictions better.

Modern detectors sometimes use more sophisticated NMS variants or differentiable alternatives that integrate into training.

Evaluation Metrics (mAP)

Average Precision (AP) Per Class

  1. Sort predictions by confidence score
  2. For each prediction, check if IoU > threshold with a ground-truth box (typically 0.5):
     - If it matches and that ground truth is not yet matched: True Positive
     - Otherwise: False Positive
  3. Compute precision and recall at each threshold
  4. Interpolate precision-recall curve, compute area under curve (AP)

AP@0.5: AP at IoU threshold 0.5 (loose matching)

AP@0.75: Stricter matching

AP@0.5:0.95: Average over IoU thresholds 0.5 to 0.95 in steps of 0.05 (strictest; the COCO standard)
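The AP computation can be sketched as follows, assuming predictions have already been matched to ground truth (steps 1-2 above) and sorted by descending confidence:

```python
def average_precision(matches, num_gt):
    # matches: per-prediction True/False (TP/FP), sorted by descending score;
    # num_gt: number of ground-truth boxes for this class.
    tp = fp = 0
    precisions, recalls = [], []
    for is_tp in matches:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Interpolate: make precision non-increasing from right to left,
    # then integrate the precision-recall curve over recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap
```

Averaging this value over classes (and, for COCO, over IoU thresholds) gives mAP.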

Mean AP (mAP)

Average AP across all classes. Most important metric.

Example: COCO Dataset Results

Detector      AP@0.5:0.95  Speed (fps)
Faster R-CNN  42.0%        7
YOLOv5        50.7%        160
EfficientDet  51.0%        95

(Indicative figures; exact values depend on model size, backbone, input resolution, and hardware.)

Note the speed-accuracy trade-off: the two-stage Faster R-CNN is slow at roughly 7 fps, while modern single-stage detectors such as YOLOv5 match or exceed its accuracy at real-time speeds.

Training Strategies and Data

Data Augmentation

  • Random cropping, flipping, scaling
  • Color jittering (brightness, contrast, saturation)
  • Mosaic augmentation: combine 4 images in grid, improve small object detection
  • MixUp: blend two images and combine labels
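MixUp is a weighted blend per pixel; a sketch with images flattened to 1-D lists for brevity (lam is sampled from Beta(alpha, alpha) as in the MixUp paper, or passed explicitly for a deterministic result):

```python
import random

def mixup(img_a, img_b, lam=None, alpha=0.2):
    # img_a, img_b: images flattened to 1-D lists of floats.
    if lam is None:
        lam = random.betavariate(alpha, alpha)
    mixed = [lam * a + (1 - lam) * b for a, b in zip(img_a, img_b)]
    return mixed, lam
```

For detection, the boxes themselves are not blended: both images' labels are kept, typically with loss weights lam and (1 - lam).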

Class Imbalance Handling

  • Hard negative mining: Prioritize background anchors that are confidently mis-classified
  • OHEM (Online Hard Example Mining): Dynamic weighting of loss per sample
  • Focal Loss: Down-weight easy background; focus on hard objects
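Focal loss is compact enough to sketch in full (binary form, with the gamma=2, alpha=0.25 defaults from the RetinaNet paper):

```python
import math

def focal_loss(p, is_foreground, gamma=2.0, alpha=0.25):
    # p: predicted probability of the foreground class.
    # (1 - p_t)^gamma down-weights examples the model already gets right.
    p_t = p if is_foreground else 1.0 - p
    alpha_t = alpha if is_foreground else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

An easy, confidently-classified background anchor contributes almost nothing, so the sea of easy negatives no longer drowns out the few hard object examples.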

Multi-Scale Training

Train on multiple input resolutions (e.g., 416x416, 512x512, 608x608). Improves robustness to scale variation. Increases training time but significantly improves accuracy.

Anchor Box Selection

For anchor-based methods, analyze dataset to determine good anchor sizes and aspect ratios. K-means clustering on ground truth boxes to find representative anchors. Custom anchors per-dataset improve accuracy.
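A sketch of IoU-based k-means on (width, height) pairs, in the style popularized by YOLOv2 (seeded from the first k boxes for determinism; real code uses random or k-means++ seeding):

```python
def iou_wh(a, b):
    # IoU of two boxes aligned at the origin, given only (w, h).
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=10):
    # boxes: list of ground-truth (w, h) pairs; distance metric is 1 - IoU,
    # so assignment picks the centroid with the highest IoU.
    centroids = list(boxes[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            i = max(range(k), key=lambda j: iou_wh(b, centroids[j]))
            clusters[i].append(b)
        for j, c in enumerate(clusters):
            if c:
                centroids[j] = (sum(b[0] for b in c) / len(c),
                                sum(b[1] for b in c) / len(c))
    return centroids
```

The resulting centroids become the anchor (width, height) pairs; using 1 - IoU rather than Euclidean distance keeps large boxes from dominating the clustering.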

Learning Rate Scheduling

Typical: start with a relatively high learning rate (e.g., 0.01-0.02 with SGD for detection), decay by 10x at fixed milestones (e.g., epochs 100 and 150 of a 200-epoch run). Cosine annealing is becoming standard: smoothly reduce the LR following a cosine curve.
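The cosine schedule is a one-line formula; a sketch:

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    # Smooth decay from lr_max (step 0) to lr_min (final step).
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```

At the halfway point the LR is exactly the midpoint of lr_max and lr_min; the decay is gentle at the start and end, steepest in the middle.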

Challenges and Advanced Topics

Small Object Detection

Objects < 32x32 pixels are challenging. Solutions: higher resolution input, FPN with more levels, specialized small-object detectors.

Crowded Scenes

Dense overlapping objects hard to separate. Need dense prediction, attention mechanisms to focus on object boundaries.

Domain Shift

Models trained on COCO fail on different domains (nighttime, different sensor, etc.). Requires fine-tuning or domain adaptation.

Real-Time on Edge Devices

Mobile/embedded constraints. Use MobileNet backbones, quantization, knowledge distillation. Trade accuracy for speed.

3D Object Detection

Detect and localize objects in 3D space using monocular images, stereo, or LiDAR. Predicts 3D bounding box (center, dimensions, orientation). Standard in autonomous driving.

References

  • Girshick, R., Donahue, J., Darrell, T., and Malik, J. "Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation." IEEE CVPR, 2014.
  • Girshick, R. "Fast R-CNN." IEEE ICCV, 2015.
  • Ren, S., He, K., Girshick, R., and Sun, J. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." NeurIPS, 2015.
  • Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. "You Only Look Once: Unified, Real-Time Object Detection." IEEE CVPR, 2016.
  • Liu, W., Anguelov, D., Erhan, D., et al. "SSD: Single Shot MultiBox Detector." ECCV, 2016.
  • Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. "Feature Pyramid Networks for Object Detection." IEEE CVPR, 2017.