How Does the Single Shot Detector (SSD) Work?

Last Updated : 18 Jun, 2026

Single Shot Detector (SSD) is a deep learning-based object detection algorithm that identifies and localizes objects in an image using a single forward pass through the network. By performing classification and localization simultaneously, SSD achieves high detection speed while maintaining good accuracy.

  • Detects objects of different sizes using multi-scale feature maps.
  • Performs object localization and classification in a single stage.

Model Architecture

1. Base Network

SSD starts with a pre-trained Convolutional Neural Network (CNN), typically VGG16, which extracts important features from the input image and generates feature maps for object detection.

2. Extra Layers

Additional convolutional layers are added after the base network to detect objects at different scales. These layers produce feature maps of progressively smaller sizes.

3. Feature Maps and Multi-scale Detection

SSD uses feature maps at multiple resolutions to detect objects of varying sizes. Higher-resolution feature maps are effective for small objects, while lower-resolution maps are better suited for larger objects.

4. Default Boxes (Anchor Boxes)

At each location of every selected feature map, SSD places a fixed set of default boxes with different scales and aspect ratios. These boxes act as reference shapes. For each default box, the network predicts:

  • Class Scores: Probability of the object belonging to a specific class.
  • Bounding Box Offsets: Adjustments to improve the box location and size.
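As a rough illustration of how default boxes can be laid out, the sketch below generates them for one square feature map. The scale rule s_k = s_min + (s_max - s_min)(k - 1)/(m - 1) with s_min = 0.2 and s_max = 0.9 follows the original SSD paper. The function names `box_scale` and `default_boxes` and the three aspect ratios are illustrative assumptions, not part of the model built later in this article.

```python
import math

def box_scale(k, m, s_min=0.2, s_max=0.9):
    """Scale of the default boxes on the k-th of m feature maps (k is 1-based)."""
    return s_min + (s_max - s_min) * (k - 1) / (m - 1)

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) boxes, normalized to [0, 1], for a square feature map."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Center of the current feature map cell
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ar in aspect_ratios:
                # Same area for every ratio, different width/height split
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes

# Scales for six feature maps, from the finest (k=1) to the coarsest (k=6)
scales = [round(box_scale(k, 6), 2) for k in range(1, 7)]
# Default boxes for a 3x3 feature map using the fifth scale
boxes = default_boxes(fmap_size=3, scale=scales[4])
print(scales)
print(len(boxes))
```

Each cell contributes one box per aspect ratio, so a 3×3 map with three ratios yields 27 boxes. Larger scales go with coarser feature maps, which is why those maps handle larger objects.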

5. Predictions

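Predictions are produced by small 3×3 convolutional filters applied to each selected feature map. For a cell with k default boxes and c classes, these filters output k × (c + 4) values:

  • Class Confidences: c scores per default box, one for each class.
  • Bounding Box Adjustments: Four offsets per default box that refine its center position, width, and height so that it fits the detected object more precisely.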
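To make the role of the offsets concrete, here is a minimal sketch of how four predicted offsets could be applied to a default box. It assumes the parameterization from the original SSD paper (center shifts relative to the box size, log-scaled width and height). The name `decode_box` is hypothetical, and the extra "variance" scaling that many implementations apply is omitted.

```python
import math

def decode_box(default_box, offsets):
    """Apply offsets (t_cx, t_cy, t_w, t_h) to a default box (cx, cy, w, h).

    Returns the refined box in corner form (x1, y1, x2, y2).
    """
    d_cx, d_cy, d_w, d_h = default_box
    t_cx, t_cy, t_w, t_h = offsets
    # Center offsets are expressed as fractions of the default box size
    cx = d_cx + t_cx * d_w
    cy = d_cy + t_cy * d_h
    # Width and height offsets are predicted in log space
    w = d_w * math.exp(t_w)
    h = d_h * math.exp(t_h)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# Zero offsets leave the default box unchanged
box = decode_box((0.5, 0.5, 0.2, 0.4), (0.0, 0.0, 0.0, 0.0))
# A positive t_cx shifts the box to the right by half its width
shifted = decode_box((0.5, 0.5, 0.2, 0.4), (0.5, 0.0, 0.0, 0.0))
print(box)
print(shifted)
```

Predicting offsets relative to a default box, rather than absolute coordinates, keeps the regression targets small and comparable across boxes of very different sizes.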

6. Loss Function

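The SSD loss is a weighted sum of two components, averaged over the number N of default boxes matched to a ground-truth object: L = (1/N)(L_conf + α·L_loc), where α balances the two terms.

  • Localization Loss (L_loc): Smooth L1 loss between the predicted offsets and the ground-truth offsets, computed only for default boxes matched to an object.
  • Confidence Loss (L_conf): Softmax (cross-entropy) loss over the predicted class scores.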
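The two loss terms can be sketched in plain Python as follows. This is a simplified illustration rather than a training-ready implementation: it assumes label 0 means background, sums the confidence loss over all boxes (without selecting hard negatives), and uses hypothetical function names.

```python
import math

def smooth_l1(pred, target):
    """Smooth L1 loss summed over the four box offsets."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        # Quadratic near zero, linear for large errors
        total += 0.5 * diff ** 2 if diff < 1.0 else diff - 0.5
    return total

def softmax_cross_entropy(scores, label):
    """Negative log-probability of the true class under a softmax."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]

def ssd_loss(loc_preds, loc_targets, class_scores, labels, alpha=1.0):
    """Combined loss; labels[i] == 0 marks a background (unmatched) box."""
    num_matched = sum(1 for lbl in labels if lbl > 0)
    if num_matched == 0:
        return 0.0  # no matched boxes: the loss is defined as zero
    l_conf = sum(softmax_cross_entropy(s, lbl)
                 for s, lbl in zip(class_scores, labels))
    # Localization loss is computed for matched boxes only
    l_loc = sum(smooth_l1(p, t)
                for p, t, lbl in zip(loc_preds, loc_targets, labels) if lbl > 0)
    return (l_conf + alpha * l_loc) / num_matched

# Two default boxes: the first matched to class 1, the second background
loss = ssd_loss(
    loc_preds=[(0.1, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
    loc_targets=[(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
    class_scores=[(0.0, 0.0), (0.0, 0.0)],
    labels=[1, 0],
)
print(loss)
```

Smooth L1 is less sensitive to outliers than a squared error, which helps when a default box starts far from its target.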

7. Non-Maximum Suppression (NMS)

To finalize detection, SSD applies Non-Maximum Suppression (NMS). Among boxes that overlap heavily, as measured by Intersection over Union (IoU), only the one with the highest confidence score is kept, so each object ends up with a single prediction.
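The following is a minimal sketch of greedy NMS for a single class, written in plain Python for readability. The function names and the 0.5 IoU threshold are illustrative assumptions. In a PyTorch pipeline, `torchvision.ops.nms` provides the same operation on tensors.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    # Visit boxes from the most to the least confident
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.82, so it is suppressed. The third box does not overlap either of them and is kept.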

Working

  1. Input Image: The image is passed through the base network to extract feature maps.
  2. Feature Extraction: The extra layers process these maps at multiple scales.
  3. Default Boxes Assignment: Default boxes of various sizes and aspect ratios are assigned to each feature map cell.
  4. Prediction: For each default box, class scores and bounding box offsets are predicted.
  5. Loss Calculation: During training, the predictions are compared with the ground truth and the combined localization and confidence loss is computed.
  6. NMS: During inference, redundant boxes are removed to produce the final set of detections.

Implementation

The following steps build a simplified SSD model in PyTorch and run a forward pass on a dummy 300×300 input to inspect the shape of its raw predictions.

Step 1: Import Required Libraries

Importing the necessary libraries for building the SSD model.

Python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models
from torchvision.models import VGG16_Weights

Step 2: Define the SSD Model

Defining the SSD architecture using a VGG16 backbone, additional convolutional layers, and prediction layers for object localization and classification.

  • Base Network: Extracts image features using VGG16.
  • Extra Layers: Detect objects at multiple scales.
  • Prediction Layers: Generate bounding box coordinates and class scores.

Python
class SSD(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.num_classes = num_classes

        # Base network: pre-trained VGG16 layers up to the last ReLU
        # (the final max-pooling layer at index 30 is dropped)
        vgg = models.vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features
        self.features = nn.ModuleList(vgg[:30])

        # Extra layers: each block produces one additional feature map for detection
        self.extras = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(512, 1024, kernel_size=3, padding=1, dilation=1),
                nn.ReLU(inplace=True)
            ),
            nn.Sequential(
                nn.Conv2d(1024, 256, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True)
            ),
            nn.Sequential(
                nn.Conv2d(512, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
                nn.ReLU(inplace=True)
            ),
            nn.Sequential(
                nn.Conv2d(256, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3),
                nn.ReLU(inplace=True)
            ),
            nn.Sequential(
                nn.Conv2d(256, 128, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, 256, kernel_size=3),
                nn.ReLU(inplace=True)
            )
        ])

        # Localization heads, one per feature map: (boxes per cell) * 4 offsets
        self.loc = nn.ModuleList([
            nn.Conv2d(512, 4 * 4, kernel_size=3, padding=1),   # base network output
            nn.Conv2d(1024, 6 * 4, kernel_size=3, padding=1),  # extras[0]
            nn.Conv2d(512, 6 * 4, kernel_size=3, padding=1),   # extras[1]
            nn.Conv2d(256, 6 * 4, kernel_size=3, padding=1),   # extras[2]
            nn.Conv2d(256, 4 * 4, kernel_size=3, padding=1),   # extras[3]
            nn.Conv2d(256, 4 * 4, kernel_size=3, padding=1)    # extras[4]
        ])

        # Confidence heads, one per feature map: (boxes per cell) * num_classes scores
        self.conf = nn.ModuleList([
            nn.Conv2d(512, 4 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(1024, 6 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(512, 6 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(256, 6 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(256, 4 * num_classes, kernel_size=3, padding=1),
            nn.Conv2d(256, 4 * num_classes, kernel_size=3, padding=1)
        ])

Step 3: Implement the Forward Pass

The forward method, defined inside the SSD class, passes the input image through the network and generates localization and classification predictions.

  • Feature extraction using the base network.
  • Multi-scale feature generation using extra layers.
  • Bounding box and class score prediction.

Python
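    def forward(self, x):
        locs = []
        confs = []

        # Run the input through the VGG16 base network
        for layer in self.features:
            x = layer(x)

        # Predict from the base feature map; move channels last: (N, C, H, W) -> (N, H, W, C)
        locs.append(self.loc[0](x).permute(0, 2, 3, 1).contiguous())
        confs.append(self.conf[0](x).permute(0, 2, 3, 1).contiguous())

        # Each extra block yields a new feature map with its own prediction heads
        for i, layer in enumerate(self.extras):
            x = layer(x)
            locs.append(self.loc[i + 1](x).permute(0, 2, 3, 1).contiguous())
            confs.append(self.conf[i + 1](x).permute(0, 2, 3, 1).contiguous())

        # Flatten each map's predictions and concatenate them across feature maps
        locs = torch.cat([o.view(o.size(0), -1) for o in locs], 1)
        confs = torch.cat([o.view(o.size(0), -1) for o in confs], 1)

        # Reshape to (batch, num_boxes, 4) and (batch, num_boxes, num_classes)
        locs = locs.view(locs.size(0), -1, 4)
        confs = confs.view(confs.size(0), -1, self.num_classes)

        return locs, confs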

Step 4: Perform Detection

Creating an SSD model instance, passing a random 300×300 tensor (standing in for an input image) through the network, and printing the shapes of the raw predictions.

Python
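if __name__ == "__main__":
    num_classes = 21  # e.g. 20 object classes plus background, as in PASCAL VOC
    ssd = SSD(num_classes)
    ssd.eval()  # switch to inference mode

    # Random tensor standing in for a batch of one 300x300 RGB image
    x = torch.randn(1, 3, 300, 300)
    with torch.no_grad():  # gradients are not needed for a shape check
        locs, confs = ssd(x)

    print("Localization predictions:", locs.size())
    print("Confidence predictions:", confs.size())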

Output:

Localization predictions: torch.Size([1, 3916, 4])
Confidence predictions: torch.Size([1, 3916, 21])
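The figure 3916 is the total number of default boxes across all six feature maps. The short sketch below reproduces it with the standard convolution output-size formula, assuming a 300×300 input and the layer settings of the model defined above. The helper name `conv_out` is illustrative.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# vgg[:30] contains four 2x2 max-pool layers with stride 2; its 3x3
# convolutions use padding=1 and therefore keep the spatial size
size = 300
for _ in range(4):
    size = conv_out(size, kernel=2, stride=2)

fmap_sizes = [size]                                                   # base network output
fmap_sizes.append(conv_out(fmap_sizes[-1], 3, padding=1))             # extras[0]
fmap_sizes.append(conv_out(fmap_sizes[-1], 3, stride=2, padding=1))   # extras[1]
fmap_sizes.append(conv_out(fmap_sizes[-1], 3, stride=2, padding=1))   # extras[2]
fmap_sizes.append(conv_out(fmap_sizes[-1], 3))                        # extras[3]
fmap_sizes.append(conv_out(fmap_sizes[-1], 3))                        # extras[4]

# Default boxes per cell, taken from the loc/conf heads of the model
boxes_per_cell = [4, 6, 6, 6, 4, 4]
total = sum(s * s * k for s, k in zip(fmap_sizes, boxes_per_cell))
print(fmap_sizes)  # [18, 18, 9, 5, 3, 1]
print(total)       # 3916
```

For comparison, the original SSD300 configuration yields 8732 default boxes, largely because it also predicts from a higher-resolution 38×38 feature map. The model here is a simplified variant.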

Applications

  • Autonomous Driving: Detecting vehicles, pedestrians, and traffic signs in real-time.
  • Surveillance: Monitoring and identifying objects or individuals in security footage.
  • Robotics: Enabling robots to perceive and interact with their environment.
  • Augmented Reality: Detecting and tracking objects for interactive experiences.

Advantages

  • High Speed: Performs object detection in a single pass, making it faster than two-stage detectors such as Faster R-CNN.
  • Simple Architecture: Uses a straightforward single-stage design that is easier to train and implement.
  • Good Accuracy: Utilizes multi-scale feature maps to effectively detect objects of different sizes.

Challenges

  • Small Object Detection: Performance may decrease when detecting very small objects due to limited feature resolution.
  • Class Imbalance: A large number of background boxes compared to object boxes can affect training efficiency.
  • Localization Errors: Bounding box predictions may be less precise than those of some two-stage detectors.
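One common remedy for the class imbalance mentioned above is hard negative mining, which the original SSD paper uses to keep the ratio of background to object boxes at no more than 3:1. The sketch below illustrates the idea in plain Python. The function name, the convention that label 0 means background, and the sample values are assumptions made for illustration.

```python
def hard_negative_mining(conf_losses, labels, neg_pos_ratio=3):
    """Return indices of the boxes that contribute to the confidence loss.

    All positive boxes are kept, plus the background boxes with the
    highest confidence loss, up to neg_pos_ratio negatives per positive.
    """
    positives = [i for i, lbl in enumerate(labels) if lbl > 0]
    negatives = [i for i, lbl in enumerate(labels) if lbl == 0]
    num_neg = min(len(negatives), neg_pos_ratio * len(positives))
    # Sort background boxes by loss, hardest first
    hardest = sorted(negatives, key=lambda i: conf_losses[i], reverse=True)[:num_neg]
    return sorted(positives + hardest)

labels = [1, 0, 0, 0, 0, 0]                    # one object box, five background boxes
conf_losses = [0.9, 0.1, 2.0, 0.3, 1.5, 0.05]  # per-box confidence loss
selected = hard_negative_mining(conf_losses, labels)
print(selected)  # [0, 2, 3, 4]
```

Only the three hardest background boxes are kept alongside the single positive box, so easy negatives no longer dominate the loss.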