An introduction to RetinaNet and how we make it easier to use

Engineering Jan 14, 2020

A simpler and faster way to use RetinaNet for your deep learning project.

Accelerating Artificial Intelligence by empowering the AI engineers of the world with tools that allow them to focus on better accuracy and efficiency is what drives everything we build at SegMind.

To that end, we are pushing our first network on the CRAL library to help Computer Vision engineers build faster with RetinaNet, a state of the art network for detection applications.

What is RetinaNet?

The genesis of RetinaNet was the search for a way to improve upon the accuracy of previous single-stage networks such as YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector) without compromising too much on inference time.

Researchers Tsung-Yi Lin et al. discovered that the poor performance of single-stage networks stems from an extreme imbalance between foreground and background classes. Networks such as SSD need to evaluate 10,000 to 100,000 candidate positions in each image, while only a few hundred boxes actually contain a target object. As a result, the many easy samples that correlate strongly with the ground truth contribute little individually but collectively dominate training.

The team at Facebook AI Research resolved this imbalance by reconfiguring the standard cross-entropy loss to assign less weight to the loss of well-classified samples. This reconfigured loss, called the “Focal Loss”, focuses training on a sparse set of “hard examples” and prevents the large number of easy negatives from overwhelming the detector during training.
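The idea can be sketched with a minimal NumPy implementation of the binary focal loss, FL(p_t) = -α_t (1 - p_t)^γ log(p_t). The defaults α = 0.25 and γ = 2 follow the paper; the function name is ours, not part of CRAL.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss per sample.
    p: predicted probability of the foreground class.
    y: ground-truth label (1 = foreground, 0 = background)."""
    p_t = np.where(y == 1, p, 1 - p)           # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    ce = -np.log(p_t)                          # standard cross entropy
    return alpha_t * (1 - p_t) ** gamma * ce   # down-weight easy samples

# An easy, well-classified sample contributes far less than a hard one:
easy = focal_loss(np.array([0.95]), np.array([1]))
hard = focal_loss(np.array([0.30]), np.array([1]))
```

The `(1 - p_t)^γ` factor is the key: for a well-classified sample (p_t close to 1) it shrinks the loss towards zero, so the rare hard examples drive the gradient.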

The results of training with this Focal Loss function were compared with other networks. In general, RetinaNet offered better results with respect to speed (ms) and accuracy (AP) on the COCO dataset.

Comparison of object detection models on speed and mAP performance. (Image source: YOLOv3 paper.)

How does RetinaNet work?

RetinaNet primarily consists of 3 parts:

  1. Backbone network: FPN (Feature Pyramid Net) built on top of ResNet (50, 101, 152).  Computes convolutional feature maps at multiple scales of the image.
  2. Regression network: Finds suitable bounding boxes using a regression head.
  3. Classification network: Assigns suitable classes to regressed bounding boxes using classification head.

RetinaNet Architecture (Image source: Focal Loss paper)

Hyperparameter Tuning

4 important hyperparameters that need tuning are:

  • Sizes: The base pixel size for an anchor box
  • Strides: The distance between the centers of two neighbouring anchor boxes
  • Scales: The coefficient to be multiplied with size
  • Ratios: The height/width ratio for a box
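The four hyperparameters above combine to produce the anchor boxes placed at each location of a feature map. Here is a minimal sketch using this post's convention that width = size × scale and ratio = h/w; the function name and example values are ours, not part of CRAL.

```python
import numpy as np

def anchors_at_location(base_size, scales, ratios):
    """Return (width, height) pairs for every scale/ratio combination
    at a single feature-map location."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = base_size * s   # width = size * scale
            h = w * r           # height follows from ratio = h / w
            anchors.append((w, h))
    return np.array(anchors)

# 2 scales x 3 ratios -> 6 anchors per location at this pyramid level.
a = anchors_at_location(32, scales=[1.0, 1.26], ratios=[0.5, 1.0, 2.0])
```

The strides then determine how far apart these anchor sets are placed across the image, one set per feature-map cell.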
Example image used for calculating hyperparameters for RetinaNet. Image source: PascalVOC 2017 

Base size

5 values indicating the base sizes used in calculating the anchor boxes. These correspond to the 5 pyramid levels that make RetinaNet work well at different scales.

Example: 16, 32, 64, 128, 256


Strides

5 values indicating the distance between the centers of two adjacent anchor boxes. The default values shown below work for most cases.

Example: 8, 16, 32, 64, 128


Ratios

List of possible ratios (h/w) for each class. Refer to the table below to see the ratios calculated for the example shown here. For this dataset, the distribution of annotation heights and widths was normal, so we used the mean height and width of each label to calculate the ratios.

Label      Mean Height (px)   Mean Width (px)   Ratio (h/w)
person     346                209               1.65
pole       90                 9                 10
ski        20                 310               0.06
ski_pole   164                107               1.53
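The ratio column above is simply mean height divided by mean width. As a quick sketch (the `mean_hw` dictionary just restates the table; rounding to two decimals is our choice):

```python
# Mean annotation dimensions (height, width) in pixels, from the table above.
mean_hw = {
    "person":   (346, 209),
    "pole":     (90, 9),
    "ski":      (20, 310),
    "ski_pole": (164, 107),
}

# ratio = h / w for each label
ratios = {label: round(h / w, 2) for label, (h, w) in mean_hw.items()}
```

In practice you would compute these means from your annotation files rather than hard-coding them.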


Scales

The coefficient multiplied with the size parameter. The numbers obtained by multiplying sizes with scales indicate the widths of the anchor boxes generated. Refer to Table 1.2 to see the calculated values. At scale = 1, the base size equals the width of the object.

Calculating scales

Taking the label person as an example, whose average width is 209 px, the closest base size is 256. Hence we choose a scale of 209/256 = 0.816, which makes the 5th FPN layer responsible for generating the anchor boxes for the label person.
Alternatively, you can choose a base size that is not the closest, e.g. 128, in which case the scale would be 209/128 = 1.6328. For the calculations in Table 1.2, we chose the closest base size for each label.

Table 1.2: Anchor box widths (scale × base size) at each FPN layer

Scale (width / base size)   FPN 1 (16)   FPN 2 (32)   FPN 3 (64)   FPN 4 (128)   FPN 5 (256)
209/256 = 0.81640625        13.0625      26.125       52.25        104.5         209
9/16   = 0.5625             9            18           36           72            144
310/256 = 1.2109375         19.375       38.75        77.5         155           310
107/128 = 0.8359375         13.375       26.75        53.5         107           214
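The scale calculation above reduces to a small helper: pick the closest base size and divide. This is a sketch of the procedure used for Table 1.2; the function name is ours, not part of CRAL.

```python
# Base sizes, one per FPN layer, from the "Base size" section.
base_sizes = [16, 32, 64, 128, 256]

def scale_for_width(mean_width, base_sizes):
    """Pick the base size closest to the mean object width and
    return (FPN layer index, scale = width / base size)."""
    base = min(base_sizes, key=lambda b: abs(b - mean_width))
    return base_sizes.index(base), mean_width / base

layer, scale = scale_for_width(209, base_sizes)   # label "person"
```

Running this for each label reproduces the scale column of Table 1.2; for person it selects the 5th FPN layer (index 4) with scale 209/256.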

To get started with RetinaNet, refer to the CRAL documentation.

If any of your peers need to use RetinaNet in their research labs, startups, and/or companies, do get in touch with us for early access to the products we are building at SegMind.