SSD (Single Shot MultiBox Detector) for real-time object detection

Jul 16, 2018

Convolutional neural networks (CNNs) outperform other neural network architectures at recognizing objects in images. Researchers soon extended CNNs to object localization and detection, and called the resulting architecture R-CNN (Region-based CNN). The output of R-CNN is the image with a rectangular box drawn around each object, together with the class of that object. R-CNN works in the following steps:

Scan the input image for possible objects using an algorithm called Selective Search, generating around 2,000 region proposals.
Run a CNN over each of these region proposals.
Take the output of each CNN and feed it into:

An SVM to classify the region
A linear regressor to tighten the bounding box of the object, if such an object exists
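To make the last step concrete, here is a minimal sketch of how a regressor's output can tighten a proposal box. It assumes the center/size parameterization commonly used for R-CNN style box regression; the function name `apply_box_deltas` and the sample numbers are my own, for illustration only.

```python
import math


def apply_box_deltas(proposal, deltas):
    """Refine a region proposal with predicted regression offsets.

    proposal: (cx, cy, w, h) center, width and height of the proposal box.
    deltas:   (tx, ty, tw, th) offsets predicted by the regressor.
    """
    px, py, pw, ph = proposal
    tx, ty, tw, th = deltas
    cx = px + pw * tx        # shift the center in proportion to the box width
    cy = py + ph * ty        # shift the center in proportion to the box height
    w = pw * math.exp(tw)    # width and height are scaled in log space
    h = ph * math.exp(th)
    return (cx, cy, w, h)


# Hypothetical proposal and offsets, just to show the call.
refined = apply_box_deltas((100.0, 80.0, 50.0, 40.0), (0.1, -0.05, 0.0, 0.2))
```

With all offsets at zero the proposal comes back unchanged, which is why the regressor only has to learn small corrections.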

Although R-CNN made a lot of progress over a plain CNN for object localization, detection, and classification, it still struggles to do all of this in real time. Some of the problems are:

Training is cumbersome and takes a very long time.
Training happens in multiple stages (for example, the CNN, the SVM classifiers, and the bounding box regressors are trained separately).
The network is slow at inference time (that is, when dealing with non-training data).

Other algorithms such as Fast R-CNN and Faster R-CNN were developed to improve on R-CNN. The latter gives more accurate results for object detection, but both are still a bit slow for real-time detection. This is where SSD comes into play: it strikes a good balance between accuracy and speed.

SSD (Single Shot MultiBox Detector) Meaning

Single Shot: Object localization and classification are done in a single forward pass of the network.

MultiBox: The name of the technique used for bounding box regression.

Detector: The network is an object detector that also classifies the objects it detects.

Architecture

The SSD architecture is built on VGG-16, with one tweak: from the Conv6 layer onwards, a set of auxiliary convolutional layers replaces the fully connected layers. VGG-16 serves as the base network because of its strong image classification performance and because it lends itself well to transfer learning. The auxiliary convolutional layers let us extract features at multiple scales, with the feature map size decreasing progressively at each subsequent layer. I discuss how this works in the sections below. For comparison, the following image shows the original VGG-16 architecture, which contains fully connected layers.
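To see how the auxiliary layers progressively shrink the feature maps, here is a small sketch that applies the standard convolution output-size formula. The layer names, kernel sizes, strides, and paddings below are assumptions based on the commonly used SSD300 configuration (a 19x19 map entering the extra layers); they are not taken from this post.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size of a square feature map after a convolution."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding) of the 3x3 convolution in each auxiliary block.
# Assumed values, following the usual SSD300 setup.
EXTRA_LAYERS = [
    ("conv8_2", 3, 2, 1),
    ("conv9_2", 3, 2, 1),
    ("conv10_2", 3, 1, 0),
    ("conv11_2", 3, 1, 0),
]


def feature_map_sizes(start_size=19):
    """Return the feature map size after each auxiliary layer."""
    sizes = {"conv7": start_size}
    size = start_size
    for name, kernel, stride, padding in EXTRA_LAYERS:
        size = conv_output_size(size, kernel, stride, padding)
        sizes[name] = size
    return sizes


sizes = feature_map_sizes()
```

Under these assumptions the maps go from 19x19 down to 10x10, 5x5, 3x3, and finally 1x1, so each later layer looks at the image at a coarser scale.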

Working Mechanism

To train the algorithm, we need a training set of images containing objects, and each object must be annotated with a bounding box. From these examples the algorithm learns both where to place a rectangle and what size it should be. We optimize the model by minimizing the error between the predicted bounding boxes and the ground truth boxes. Unlike plain image classification with a CNN, we do not only predict whether an object is present in the image; we also need to predict where in the image it is. During training the algorithm learns to adjust the position, height, and width of the rectangle around each object.
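Two quantities come up whenever predicted boxes are compared with ground truth: the overlap between two boxes (intersection over union, IoU) and a regression loss on the box coordinates. The sketch below is a simplified illustration in plain Python, assuming boxes given as (xmin, ymin, xmax, ymax) and a smooth L1 loss for the coordinates; a real SSD loss also includes a classification term, which is left out here.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def smooth_l1(predicted, target):
    """Smooth L1 loss summed over the coordinates of one box."""
    total = 0.0
    for p, t in zip(predicted, target):
        diff = abs(p - t)
        # Quadratic for small errors, linear for large ones.
        total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return total
```

IoU is typically used to decide which predicted box is responsible for which ground truth object, and the regression loss is then what training tries to push toward zero for those matched pairs.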

The image above is an example from our training set for object detection. Such datasets must contain objects with their classes labeled in the image. Using more default boxes (the candidate rectangles the network starts from) gives more accurate detection but costs speed.
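To get a feel for how many default boxes are involved, the sketch below counts them for one configuration. The feature map sizes and boxes-per-cell values are assumptions taken from the SSD300 setup described in the SSD paper, not something stated in this post.

```python
# (feature map size, default boxes per cell), assumed SSD300 configuration.
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def count_default_boxes(layers):
    """Total number of default boxes over all feature maps."""
    return sum(size * size * boxes_per_cell for size, boxes_per_cell in layers)


total = count_default_boxes(SSD300_LAYERS)  # 8732 for this configuration
```

Every one of these boxes gets a class score and a set of offsets in each forward pass, which is why adding more boxes trades speed for accuracy.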

The Pascal VOC and COCO datasets are a good starting point for beginners.

Dealing With Scale Problem

On the left we have an image with a few horses. We divide the input image into a grid of cells, and around each cell we place a couple of rectangles with different aspect ratios. We then apply convolutions to check whether each of those boxes contains an object. Here, one of the black horses is closer to the camera, so it appears much larger in the image. The rectangles we drew cover only a small part of it, so they cannot tell whether it is a horse: they do not contain enough of the features that identify a horse.
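The grid-plus-rectangles idea can be sketched in a few lines. This is a simplified illustration, assuming normalized coordinates in [0, 1] and boxes stored as (cx, cy, w, h); the function name and the default aspect ratios are my own choices.

```python
import math


def default_boxes(grid_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Place boxes of several aspect ratios at the center of every grid cell.

    Returns a list of (cx, cy, w, h) tuples in normalized image coordinates.
    """
    boxes = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) / grid_size   # center of the cell, x
            cy = (row + 0.5) / grid_size   # center of the cell, y
            for ratio in aspect_ratios:
                # Keep the area roughly constant while changing the shape.
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                boxes.append((cx, cy, w, h))
    return boxes


boxes = default_boxes(grid_size=4, scale=0.3)
```

With a fine grid and a small scale, none of these boxes is large enough to cover the horse that stands close to the camera, which is exactly the scale problem described above.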

Looking at the SSD architecture above, we can see that after the conv6 layer the feature maps shrink substantially at every step. Everything we discussed about making grids and finding objects in them is repeated at each of these steps, so the classifiers are applied at every scale. Because an object occupies fewer cells on the smaller feature maps, a large object such as the nearby horse eventually fits inside a single box and is easily identified.
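One way to assign a box size to each step is to space the scales evenly between a smallest and a largest value. The sketch below assumes six feature maps and the range 0.2 to 0.9 (relative to the image size) suggested in the SSD paper; both are assumptions, and real implementations often tweak them.

```python
def default_box_scales(num_maps=6, s_min=0.2, s_max=0.9):
    """Evenly spaced box scales, one per feature map, as a fraction of image size."""
    if num_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


scales = default_box_scales()
```

For these defaults the scales come out at roughly 0.2, 0.34, 0.48, 0.62, 0.76, and 0.9, so the early, large feature maps handle small objects and the late, small feature maps handle objects that fill most of the image.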

SSD also combines what it finds at the different layers. Every box, whichever layer predicts it, is expressed relative to the original image, so a detection made on a small feature map can be traced back to the input. For example, if the horse is recognized on one of the later, coarser layers, the algorithm still draws the rectangle around the horse in the original image.
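Whichever layer detects the horse, the rectangle is ultimately drawn on the original image. Here is a minimal sketch of that last step, assuming the detected box is stored in normalized (cx, cy, w, h) form; the function name and the 300x300 image size in the example are only illustrative.

```python
def to_pixel_corners(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel (xmin, ymin, xmax, ymax)."""
    cx, cy, w, h = box
    xmin = (cx - w / 2) * image_width
    ymin = (cy - h / 2) * image_height
    xmax = (cx + w / 2) * image_width
    ymax = (cy + h / 2) * image_height
    # Clip to the image so the rectangle never leaves the picture.
    xmin = min(max(xmin, 0.0), float(image_width))
    ymin = min(max(ymin, 0.0), float(image_height))
    xmax = min(max(xmax, 0.0), float(image_width))
    ymax = min(max(ymax, 0.0), float(image_height))
    return (xmin, ymin, xmax, ymax)


corners = to_pixel_corners((0.5, 0.5, 0.5, 0.5), 300, 300)
```

Because the box is normalized, the same conversion works no matter which feature map produced the detection.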

If you liked this post, don’t forget to clap and follow me on Medium and on Twitter.

Written by Rabin Poudyal
