SSD (Single Shot Multibox Detector)

Detection Links

SSD300

class chainercv.links.model.ssd.SSD300(n_fg_class=None, pretrained_model=None)

Single Shot Multibox Detector with 300x300 inputs.

This is a model of Single Shot Multibox Detector [1]. This model uses VGG16Extractor300 as its feature extractor.

[1] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters

n_fg_class (int) – The number of classes excluding the background.
pretrained_model (string) –
The weight file to be loaded. This can take 'voc0712', 'imagenet', filepath or None. The default value is None.
- 'voc0712': Load weights trained on the trainval split of PASCAL VOC 2007 and 2012. The weight file is downloaded and cached automatically. n_fg_class must be 20 or None. These weights were converted from the Caffe model provided by the original implementation. The conversion code is chainercv/examples/ssd/caffe2npz.py.
- 'imagenet': Load weights of VGG-16 trained on ImageNet. The weight file is downloaded and cached automatically. This option initializes only part of the weights; the rest are initialized randomly. In this case, n_fg_class can be set to any number.
- filepath: A path to an npz file. In this case, n_fg_class must be specified properly.
- None: Do not load weights.

SSD512

class chainercv.links.model.ssd.SSD512(n_fg_class=None, pretrained_model=None)

Single Shot Multibox Detector with 512x512 inputs.

This is a model of Single Shot Multibox Detector [2]. This model uses VGG16Extractor512 as its feature extractor.

[2] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters

n_fg_class (int) – The number of classes excluding the background.
pretrained_model (string) –
The weight file to be loaded. This can take 'voc0712', 'imagenet', filepath or None. The default value is None.
- 'voc0712': Load weights trained on the trainval split of PASCAL VOC 2007 and 2012. The weight file is downloaded and cached automatically. n_fg_class must be 20 or None. These weights were converted from the Caffe model provided by the original implementation. The conversion code is chainercv/examples/ssd/caffe2npz.py.
- 'imagenet': Load weights of VGG-16 trained on ImageNet. The weight file is downloaded and cached automatically. This option initializes only part of the weights; the rest are initialized randomly. In this case, n_fg_class can be set to any number.
- filepath: A path to an npz file. In this case, n_fg_class must be specified properly.
- None: Do not load weights.

Utility

Multibox

class chainercv.links.model.ssd.Multibox(n_class, aspect_ratios, initialW=None, initial_bias=None)

Multibox head of Single Shot Multibox Detector.

This is the head part of Single Shot Multibox Detector [3]. This link computes mb_locs and mb_confs from feature maps. mb_locs contains the coordinates of bounding boxes and mb_confs contains the confidence scores for each class.

[3] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters

n_class (int) – The number of classes possibly including the background.
aspect_ratios (iterable of tuple or int) – The aspect ratios of default bounding boxes for each feature map.
initialW – An initializer used in chainer.links.Convolution2D.__init__(). The default value is chainer.initializers.LeCunUniform.
initial_bias – An initializer used in chainer.links.Convolution2D.__init__(). The default value is chainer.initializers.Zero.

forward(xs)

Compute loc and conf from feature maps.

This method computes mb_locs and mb_confs from given feature maps.

Parameters

xs (iterable of chainer.Variable) – An iterable of feature maps. The number of feature maps must be the same as the number of aspect_ratios.

Returns

This method returns two chainer.Variable: mb_locs and mb_confs.

mb_locs: A variable of float arrays of shape \((B, K, 4)\), where \(B\) is the number of samples in the batch and \(K\) is the number of default bounding boxes.
mb_confs: A variable of float arrays of shape \((B, K, n\_fg\_class + 1)\).

Return type

tuple of chainer.Variable

MultiboxCoder

class chainercv.links.model.ssd.MultiboxCoder(grids, aspect_ratios, steps, sizes, variance)

A helper class to encode/decode bounding boxes.

This class encodes (bbox, label) to (mb_loc, mb_label) and decodes (mb_loc, mb_conf) to (bbox, label, score). This encoding and decoding scheme is used in Single Shot Multibox Detector [4].

mb_loc: An array representing offsets and scales from the default bounding boxes. Its shape is \((K, 4)\), where \(K\) is the number of the default bounding boxes. The second axis is composed of \((\Delta y, \Delta x, \Delta h, \Delta w)\). These values are computed by the following formulas.
- \(\Delta y = (b_y - m_y) / (m_h * v_0)\)
- \(\Delta x = (b_x - m_x) / (m_w * v_0)\)
- \(\Delta h = \log(b_h / m_h) / v_1\)
- \(\Delta w = \log(b_w / m_w) / v_1\)
\((m_y, m_x)\) and \((m_h, m_w)\) are the center coordinates and size of a default bounding box. \((b_y, b_x)\) and \((b_h, b_w)\) are the center coordinates and size of the given bounding box that is assigned to the default bounding box. \((v_0, v_1)\) are coefficients that can be set by the argument variance.
mb_label: An array representing classes of ground truth bounding boxes. Its shape is \((K,)\).
mb_conf: An array representing classes of predicted bounding boxes. Its shape is \((K, n\_fg\_class + 1)\).

[4] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.
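
The mb_loc formulas above can be sketched in plain Python for a single pair of boxes. This is only an illustration of the arithmetic, not the library's implementation: the helper names encode_box and decode_box, the (center_y, center_x, height, width) tuple layout, and the sample box values are assumptions made for this example.

```python
import math


def encode_box(b, m, variance=(0.1, 0.2)):
    """Encode box b against default box m; both are (cy, cx, h, w)."""
    b_y, b_x, b_h, b_w = b
    m_y, m_x, m_h, m_w = m
    v0, v1 = variance
    return (
        (b_y - m_y) / (m_h * v0),     # delta y
        (b_x - m_x) / (m_w * v0),     # delta x
        math.log(b_h / m_h) / v1,     # delta h
        math.log(b_w / m_w) / v1,     # delta w
    )


def decode_box(d, m, variance=(0.1, 0.2)):
    """Invert encode_box: recover (cy, cx, h, w) from the offsets d."""
    d_y, d_x, d_h, d_w = d
    m_y, m_x, m_h, m_w = m
    v0, v1 = variance
    return (
        m_y + d_y * m_h * v0,
        m_x + d_x * m_w * v0,
        m_h * math.exp(d_h * v1),
        m_w * math.exp(d_w * v1),
    )


# Hypothetical boxes in normalized image coordinates.
default_box = (0.5, 0.5, 0.2, 0.2)
gt_box = (0.52, 0.47, 0.3, 0.1)

mb_loc = encode_box(gt_box, default_box)
restored = decode_box(mb_loc, default_box)
```

Decoding the encoded offsets with the same default box and variance recovers the original box up to floating-point error, which is the property the decoder relies on.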

Parameters

grids (iterable of ints) – An iterable of integers. Each integer indicates the size of a feature map.
aspect_ratios (iterable of tuples of ints) – An iterable of tuples of integers used to compute the default bounding boxes. Each tuple indicates the aspect ratios of the default bounding boxes for each feature map. The length of this iterable should be len(grids).
steps (iterable of floats) – The step size for each feature map. The length of this iterable should be len(grids).
sizes (iterable of floats) – The base size of default bounding boxes for each feature map. The length of this iterable should be len(grids) + 1.
variance (tuple of floats) – Two coefficients for encoding/decoding the locations of bounding boxes. The first value is used to encode/decode coordinates of the centers. The second value is used to encode/decode the sizes of bounding boxes.
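
As a rough illustration of how grids and aspect_ratios determine \(K\), the sketch below counts default bounding boxes under the layout described in the SSD paper: two square boxes per cell (which is why sizes needs len(grids) + 1 entries) plus two boxes per aspect ratio. The function name count_default_boxes is hypothetical, and the grid sizes and aspect ratios shown are the SSD300 configuration from the paper, not values stated on this page.

```python
def count_default_boxes(grids, aspect_ratios):
    """Count default boxes, assuming 2 square boxes plus 2 per aspect ratio."""
    if len(grids) != len(aspect_ratios):
        raise ValueError('aspect_ratios must have len(grids) entries')
    total = 0
    for grid, ratios in zip(grids, aspect_ratios):
        # One box of the base size, one of the geometric mean of
        # neighbouring sizes, and a wide/tall pair for every aspect ratio.
        boxes_per_cell = 2 + 2 * len(ratios)
        total += grid * grid * boxes_per_cell
    return total


# Assumed SSD300 configuration (taken from the SSD paper).
grids = (38, 19, 10, 5, 3, 1)
aspect_ratios = ((2,), (2, 3), (2, 3), (2, 3), (2,), (2,))

k = count_default_boxes(grids, aspect_ratios)
print(k)  # 8732
```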

decode(mb_loc, mb_conf, nms_thresh=0.45, score_thresh=0.6)

Decodes back to coordinates and classes of bounding boxes.

This method decodes mb_loc and mb_conf returned by an SSD network back to bbox, label and score.

Parameters

mb_loc (array) – A float array whose shape is \((K, 4)\), where \(K\) is the number of default bounding boxes.
mb_conf (array) – A float array whose shape is \((K, n\_fg\_class + 1)\).
nms_thresh (float) – The threshold value for non_maximum_suppression(). The default value is 0.45.
score_thresh (float) – The threshold value for confidence score. A bounding box whose confidence score is lower than this value is suppressed. The default value is 0.6.

Returns

This method returns a tuple of three arrays, (bbox, label, score).

bbox: A float array of shape \((R, 4)\), where \(R\) is the number of bounding boxes in an image. Each bounding box is organized by \((y_{min}, x_{min}, y_{max}, x_{max})\) in the second axis.
label: An integer array of shape \((R,)\). Each value indicates the class of the bounding box.
score: A float array of shape \((R,)\). Each value indicates how confident the prediction is.

Return type

tuple of three arrays
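
The roles of score_thresh and nms_thresh can be illustrated with a small pure-Python sketch. It handles a single class and uses a plain greedy non-maximum suppression; the helper names iou and filter_and_suppress and the sample boxes are assumptions for this example, and the library's own decode() may differ in detail.

```python
def iou(a, b):
    """IoU of two boxes given as (y_min, x_min, y_max, x_max)."""
    top, left = max(a[0], b[0]), max(a[1], b[1])
    bottom, right = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, bottom - top) * max(0.0, right - left)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_and_suppress(bbox, score, nms_thresh=0.45, score_thresh=0.6):
    """Return indices of boxes kept after score filtering and greedy NMS."""
    # Drop low-confidence boxes, then visit the rest from best to worst.
    order = sorted(
        (i for i, s in enumerate(score) if s >= score_thresh),
        key=lambda i: score[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(bbox[i], bbox[j]) <= nms_thresh for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections for one class.
bbox = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
score = [0.9, 0.8, 0.7, 0.3]

keep = filter_and_suppress(bbox, score)
print(keep)  # [0, 2]
```

Here the second box overlaps the first with an IoU of about 0.68 and is suppressed, and the last box falls below score_thresh.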

encode(bbox, label, iou_thresh=0.5)

Encodes coordinates and classes of bounding boxes.

This method encodes bbox and label to mb_loc and mb_label, which are used to compute multibox loss.

Parameters

bbox (array) – A float array of shape \((R, 4)\), where \(R\) is the number of bounding boxes in an image. Each bounding box is organized by \((y_{min}, x_{min}, y_{max}, x_{max})\) in the second axis.
label (array) – An integer array of shape \((R,)\). Each value indicates the class of the bounding box.
iou_thresh (float) – The threshold value that determines whether a default bounding box is assigned to a ground truth bounding box. The default value is 0.5.

Returns

This method returns a tuple of two arrays, (mb_loc, mb_label).

mb_loc: A float array of shape \((K, 4)\), where \(K\) is the number of default bounding boxes.
mb_label: An integer array of shape \((K,)\).

Return type

tuple of two arrays

Normalize

class chainercv.links.model.ssd.Normalize(n_channel, initial=0, eps=1e-05)

Learnable L2 normalization [5].

This link normalizes input along the channel axis and scales it. The scale factors are trained channel-wise.

[5] Wei Liu, Andrew Rabinovich, Alexander C. Berg. ParseNet: Looking Wider to See Better. ICLR 2016.

Parameters

n_channel (int) – The number of channels.
initial – A value to initialize the scale factors. It is passed to chainer.initializers._get_initializer(). The default value is 0.
eps (float) – A small value to avoid zero-division. The default value is \(1e-5\).

forward(x)

Normalize input and scale it.

Parameters: x (chainer.Variable) – A variable holding a 4-dimensional array. Its dtype is numpy.float32.
Returns: The shape and dtype are the same as those of the input.
Return type: chainer.Variable
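
A minimal sketch of the computation for one spatial position is shown below: the channel vector is divided by its L2 norm and then multiplied by a per-channel scale. The function name normalize_pixel, the placement of eps (added to the norm), and the scale value of 20 are assumptions for illustration, not details confirmed by this page.

```python
import math


def normalize_pixel(x, scale, eps=1e-5):
    """L2-normalize a channel vector x and apply per-channel scales."""
    # eps keeps the division well defined when x is all zeros.
    norm = math.sqrt(sum(v * v for v in x)) + eps
    return [s * v / norm for s, v in zip(scale, x)]


# Hypothetical two-channel activation at a single (y, x) position.
x = [3.0, 4.0]
scale = [20.0, 20.0]

y = normalize_pixel(x, scale)
```

With these values the norm is 5, so the output is close to [12, 16]; an all-zero input stays at zero instead of raising a division error.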

SSD

class chainercv.links.model.ssd.SSD(extractor, multibox, steps, sizes, variance=(0.1, 0.2), mean=0)

Base class of Single Shot Multibox Detector.

This is a base class of Single Shot Multibox Detector [6].

[6] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters

extractor –
A link which extracts feature maps. This link must have insize, grids and forward().
- insize: An integer which indicates the size of input images. Images are resized to this size before feature extraction.
- grids: An iterable of integers. Each integer indicates the size of a feature map. This value is used by MultiboxCoder.
- forward(): A method which computes feature maps. It must take a batch of images and return batched feature maps.
multibox –
A link which computes mb_locs and mb_confs from feature maps. This link must have n_class, aspect_ratios and forward().
- n_class: An integer which indicates the number of classes. This value should include the background class.
- aspect_ratios: An iterable of tuples of integers. Each tuple indicates the aspect ratios of default bounding boxes for each feature map. This value is used by MultiboxCoder.
- forward(): A method which computes mb_locs and mb_confs. It must take batched feature maps and return mb_locs and mb_confs.
steps (iterable of float) – The step size for each feature map. This value is used by MultiboxCoder.
sizes (iterable of float) – The base size of default bounding boxes for each feature map. This value is used by MultiboxCoder.
variance (tuple of floats) – Two coefficients for decoding the locations of bounding boxes. This value is used by MultiboxCoder. The default value is (0.1, 0.2).
nms_thresh (float) – The threshold value for non_maximum_suppression(). The default value is 0.45. This value can be changed directly or by using use_preset().
score_thresh (float) – The threshold value for confidence score. A bounding box whose confidence score is lower than this value is suppressed. The default value is 0.6. This value can be changed directly or by using use_preset().

forward(x)

Compute localization and classification from a batch of images.

This method computes two variables, mb_locs and mb_confs. self.coder.decode() converts these variables to bounding box coordinates and confidence scores. These variables are also used in training SSD.

Parameters

x (chainer.Variable) – A variable holding a batch of images. The images are preprocessed by _prepare().

Returns

This method returns two variables, mb_locs and mb_confs.

mb_locs: A variable of float arrays of shape \((B, K, 4)\), where \(B\) is the number of samples in the batch and \(K\) is the number of default bounding boxes.
mb_confs: A variable of float arrays of shape \((B, K, n\_fg\_class + 1)\).

Return type

tuple of chainer.Variable

predict(imgs)

Detect objects from images.

This method predicts objects for each image.

Parameters

imgs (iterable of numpy.ndarray) – Arrays holding images. All images are in CHW and RGB format and the range of their value is \([0, 255]\).

Returns

This method returns a tuple of three lists, (bboxes, labels, scores).

bboxes: A list of float arrays of shape \((R, 4)\), where \(R\) is the number of bounding boxes in an image. Each bounding box is organized by \((y_{min}, x_{min}, y_{max}, x_{max})\) in the second axis.
labels: A list of integer arrays of shape \((R,)\). Each value indicates the class of the bounding box. Values are in range \([0, L - 1]\), where \(L\) is the number of the foreground classes.
scores: A list of float arrays of shape \((R,)\). Each value indicates how confident the prediction is.

Return type

tuple of lists

to_cpu()

Copies parameter variables and persistent values to CPU.

This method does not handle non-registered attributes. If some of such attributes must be copied to CPU, the link implementation should override device_resident_accept() to do so.

Returns: self

to_gpu(device=None)

Copies parameter variables and persistent values to GPU.

This method does not handle non-registered attributes. If some of such attributes must be copied to GPU, the link implementation must override device_resident_accept() to do so.

Parameters: device – Target device specifier. If omitted, the current device is used.

Returns: self

use_preset(preset)

Use the given preset during prediction.

This method changes values of nms_thresh and score_thresh. These values are a threshold value used for non maximum suppression and a threshold value to discard low confidence proposals in predict(), respectively.

If the attributes need to be changed to something other than the values provided in the presets, please modify them by directly accessing the public attributes.

Parameters: preset ({'visualize', 'evaluate'}) – A string to determine the preset to use.

VGG16

class chainercv.links.model.ssd.VGG16

An extended VGG-16 model for SSD300 and SSD512.

This is an extended VGG-16 model proposed in [7]. The differences from the original VGG-16 [8] are shown below.

- conv5_1, conv5_2 and conv5_3 are changed from Convolution2D to DilatedConvolution2D.
- Normalize is inserted after conv4_3.
- The parameters of max pooling after conv5_3 are changed.
- fc6 and fc7 are converted to conv6 and conv7.

[7] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.
[8] Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.

VGG16Extractor300

class chainercv.links.model.ssd.VGG16Extractor300

A VGG-16 based feature extractor for SSD300.

This is a feature extractor for SSD300. This extractor is based on VGG16.

forward(x)

Compute feature maps from a batch of images.

This method extracts feature maps from conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11_2.

Parameters: x (ndarray) – An array holding a batch of images. The images should be resized to \(300\times 300\).
Returns: Each variable contains a feature map.
Return type: list of Variable

VGG16Extractor512

class chainercv.links.model.ssd.VGG16Extractor512

A VGG-16 based feature extractor for SSD512.

This is a feature extractor for SSD512. This extractor is based on VGG16.

forward(x)

Compute feature maps from a batch of images.

This method extracts feature maps from conv4_3, conv7, conv8_2, conv9_2, conv10_2, conv11_2, and conv12_2.

Parameters: x (ndarray) – An array holding a batch of images. The images should be resized to \(512\times 512\).
Returns: Each variable contains a feature map.
Return type: list of Variable

Train-only Utility

GradientScaling

class chainercv.links.model.ssd.GradientScaling(rate)

Optimizer/UpdateRule hook function for scaling gradient.

This hook function scales gradient by a constant value.

Parameters: rate (float) – Coefficient for scaling.
Variables: rate (float) – Coefficient for scaling.

multibox_loss

chainercv.links.model.ssd.multibox_loss(mb_locs, mb_confs, gt_mb_locs, gt_mb_labels, k, comm=None)

Computes multibox losses.

This is the loss function used in [9]. This function returns loc_loss and conf_loss. loc_loss is a loss for localization and conf_loss is a loss for classification. The formulas of these losses can be found in equations (2) and (3) of the original paper.

[9] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters

mb_locs (chainer.Variable or array) – The offsets and scales for predicted bounding boxes. Its shape is \((B, K, 4)\), where \(B\) is the number of samples in the batch and \(K\) is the number of default bounding boxes.
mb_confs (chainer.Variable or array) – The classes of predicted bounding boxes. Its shape is \((B, K, n\_class)\). This function assumes the first class is background (negative).
gt_mb_locs (chainer.Variable or array) – The offsets and scales for ground truth bounding boxes. Its shape is \((B, K, 4)\).
gt_mb_labels (chainer.Variable or array) – The classes of ground truth bounding boxes. Its shape is \((B, K)\).
k (float) – A coefficient which is used for hard negative mining. This value determines the ratio between the number of positives and that of mined negatives. The value used in the original paper is 3.
comm (CommunicatorBase) – A ChainerMN communicator. If it is specified, the number of positive examples is computed among all GPUs.

Returns

This function returns two chainer.Variable: loc_loss and conf_loss.

Return type

tuple of chainer.Variable
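
The effect of k in hard negative mining can be sketched as follows: among the default boxes labeled as background, only the k * n_pos boxes with the highest confidence loss contribute to conf_loss. This is a simplified single-image illustration under that assumption; the function name mine_hard_negatives and the sample losses are hypothetical, and the library's batched implementation may differ.

```python
def mine_hard_negatives(conf_loss, positive, k=3):
    """Return indices of the negatives kept for the classification loss."""
    n_pos = sum(positive)
    negatives = [i for i, p in enumerate(positive) if not p]
    # Hardest negatives first: those with the largest confidence loss.
    negatives.sort(key=lambda i: conf_loss[i], reverse=True)
    n_neg = min(len(negatives), int(k * n_pos))
    return sorted(negatives[:n_neg])


# Hypothetical per-default-box losses; only box 0 matches a ground truth.
conf_loss = [0.2, 2.5, 0.1, 1.7, 0.9, 0.05, 3.0, 0.4]
positive = [True, False, False, False, False, False, False, False]

selected = mine_hard_negatives(conf_loss, positive, k=3)
print(selected)  # [1, 3, 6]
```

With one positive and k = 3, three negatives are kept, so the ratio of negatives to positives stays at most 3:1.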

random_crop_with_bbox_constraints

chainercv.links.model.ssd.random_crop_with_bbox_constraints(img, bbox, min_scale=0.3, max_scale=1, max_aspect_ratio=2, constraints=None, max_trial=50, return_param=False)

Crop an image randomly with bounding box constraints.

This data augmentation is used in training of Single Shot Multibox Detector [10]. More details can be found in the data augmentation section of the original paper.

[10] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters

img (ndarray) – An image array to be cropped. This is in CHW format.
bbox (ndarray) – Bounding boxes used for constraints. The shape is \((R, 4)\). \(R\) is the number of bounding boxes.
min_scale (float) – The minimum ratio between a cropped region and the original image. The default value is 0.3.
max_scale (float) – The maximum ratio between a cropped region and the original image. The default value is 1.
max_aspect_ratio (float) – The maximum aspect ratio of cropped region. The default value is 2.
constraints (iterable of tuples) – An iterable of constraints. Each constraint should be in (min_iou, max_iou) format. Setting min_iou or max_iou to None means that side is not limited. If this argument is not specified, ((0.1, None), (0.3, None), (0.5, None), (0.7, None), (0.9, None), (None, 1)) will be used.
max_trial (int) – The maximum number of trials to be conducted for each constraint. If this function cannot find any region that satisfies the constraint in \(max\_trial\) trials, this function skips the constraint. The default value is 50.
return_param (bool) – If True, this function returns information of intermediate values.

Returns

If return_param = False, returns an array img that is cropped from the input array.

If return_param = True, returns a tuple whose elements are img, param. param is a dictionary of intermediate parameters whose contents are listed below with key, value-type and the description of the value.

constraint (tuple): The chosen constraint.
y_slice (slice): A slice in vertical direction used to crop the input image.
x_slice (slice): A slice in horizontal direction used to crop the input image.

Return type

ndarray or (ndarray, dict)
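
To illustrate the (min_iou, max_iou) format, the sketch below tests whether a candidate crop satisfies one constraint, treating None as an unbounded side. It assumes a crop is accepted only when the IoU with every bounding box lies in the range; the exact acceptance rule used by the library is not stated on this page, and the names iou and satisfies are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (y_min, x_min, y_max, x_max)."""
    top, left = max(a[0], b[0]), max(a[1], b[1])
    bottom, right = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, bottom - top) * max(0.0, right - left)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def satisfies(crop, bbox, constraint):
    """Check a crop region against one (min_iou, max_iou) constraint."""
    min_iou, max_iou = constraint
    for box in bbox:
        value = iou(crop, box)
        if min_iou is not None and value < min_iou:
            return False
        if max_iou is not None and value > max_iou:
            return False
    return True


# Hypothetical crop covering the left half of a single box (IoU = 0.5).
crop = (0, 0, 50, 50)
bbox = [(0, 0, 50, 100)]

loose = satisfies(crop, bbox, (0.3, None))
strict = satisfies(crop, bbox, (0.7, None))
unbounded = satisfies(crop, bbox, (None, 1))
```

A sampler built on this check would draw up to max_trial random regions per constraint and keep the first one for which satisfies() returns True.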

random_distort

chainercv.links.model.ssd.random_distort(img, brightness_delta=32, contrast_low=0.5, contrast_high=1.5, saturation_low=0.5, saturation_high=1.5, hue_delta=18)

A color related data augmentation used in SSD.

This function is a combination of four augmentation methods: brightness, contrast, saturation and hue.

brightness: Adding a random offset to the intensity of the image.
contrast: Multiplying the intensity of the image by a random scale.
saturation: Multiplying the saturation of the image by a random scale.
hue: Adding a random offset to the hue of the image.

This data augmentation is used in training of Single Shot Multibox Detector [11].

Note that this function requires cv2.

[11] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters

img (ndarray) – An image array to be augmented. This is in CHW and RGB format.
brightness_delta (float) – The offset for brightness will be drawn from \([-brightness\_delta, brightness\_delta]\). The default value is 32.
contrast_low (float) – The scale for contrast will be drawn from \([contrast\_low, contrast\_high]\). The default value is 0.5.
contrast_high (float) – See contrast_low. The default value is 1.5.
saturation_low (float) – The scale for saturation will be drawn from \([saturation\_low, saturation\_high]\). The default value is 0.5.
saturation_high (float) – See saturation_low. The default value is 1.5.
hue_delta (float) – The offset for hue will be drawn from \([-hue\_delta, hue\_delta]\). The default value is 18.

Returns

An image in CHW and RGB format.
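
The brightness and contrast steps can be sketched without cv2 on a flat list of intensities. This is an illustration only: clipping to \([0, 255]\), always applying both steps, and the helper names are assumptions for this example, and the saturation and hue steps are omitted because they need a color-space conversion.

```python
import random


def clip(value):
    """Keep an intensity inside the assumed [0, 255] range."""
    return min(255.0, max(0.0, value))


def random_brightness(pixels, delta=32, rng=random):
    """Add one random offset from [-delta, delta] to every intensity."""
    offset = rng.uniform(-delta, delta)
    return [clip(p + offset) for p in pixels]


def random_contrast(pixels, low=0.5, high=1.5, rng=random):
    """Multiply every intensity by one random scale from [low, high]."""
    scale = rng.uniform(low, high)
    return [clip(p * scale) for p in pixels]


# Seeded generator so the run is repeatable; the pixel values are made up.
rng = random.Random(0)
pixels = [0.0, 64.0, 128.0, 255.0]

out = random_contrast(random_brightness(pixels, rng=rng), rng=rng)
```

Because the same offset and scale are applied to every value, the ordering of intensities is preserved while the overall exposure and contrast change.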

resize_with_random_interpolation

chainercv.links.model.ssd.resize_with_random_interpolation(img, size, return_param=False)

Resize an image with a randomly selected interpolation method.

This function is similar to chainercv.transforms.resize(), but this chooses the interpolation method randomly.

This data augmentation is used in training of Single Shot Multibox Detector [12].

Note that this function requires cv2.

[12] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters

img (ndarray) – An array to be transformed. This is in CHW format and the type should be numpy.float32.
size (tuple) – This is a tuple of length 2. Its elements are ordered as (height, width).
return_param (bool) – If True, this function also returns information about the chosen interpolation.

Returns

If return_param = False, returns an array img that is the result of resizing.

If return_param = True, returns a tuple whose elements are img, param. param is a dictionary of intermediate parameters whose contents are listed below with key, value-type and the description of the value.

interpolation: The chosen interpolation method.

Return type

ndarray or (ndarray, dict)