SSD (Single Shot Multibox Detector)

Detection Links

SSD300

class chainercv.links.model.ssd.SSD300(n_fg_class=None, pretrained_model=None)

Single Shot Multibox Detector with 300x300 inputs.

This is a model of Single Shot Multibox Detector [1]. This model uses VGG16Extractor300 as its feature extractor.

[1]	Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters:

n_fg_class (int) – The number of classes excluding the background.
pretrained_model (str) –
The weight file to be loaded. This can take 'voc0712', 'imagenet', filepath or None. The default value is None.
- 'voc0712': Load weights trained on trainval split of PASCAL VOC 2007 and 2012. The weight file is downloaded and cached automatically. n_fg_class must be 20 or None. These weights were converted from the Caffe model provided by the original implementation. The conversion code is chainercv/examples/ssd/caffe2npz.py.
- 'imagenet': Load weights of VGG-16 trained on ImageNet. The weight file is downloaded and cached automatically. This option initializes the weights only partially; the rest are initialized randomly. In this case, n_fg_class can be set to any number.
- filepath: A path to an npz file. In this case, n_fg_class must be specified properly.
- None: Do not load weights.

SSD512

class chainercv.links.model.ssd.SSD512(n_fg_class=None, pretrained_model=None)

Single Shot Multibox Detector with 512x512 inputs.

This is a model of Single Shot Multibox Detector [2]. This model uses VGG16Extractor512 as its feature extractor.

[2]	Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters:

n_fg_class (int) – The number of classes excluding the background.
pretrained_model (str) –
The weight file to be loaded. This can take 'voc0712', 'imagenet', filepath or None. The default value is None.
- 'voc0712': Load weights trained on trainval split of PASCAL VOC 2007 and 2012. The weight file is downloaded and cached automatically. n_fg_class must be 20 or None. These weights were converted from the Caffe model provided by the original implementation. The conversion code is chainercv/examples/ssd/caffe2npz.py.
- 'imagenet': Load weights of VGG-16 trained on ImageNet. The weight file is downloaded and cached automatically. This option initializes the weights only partially; the rest are initialized randomly. In this case, n_fg_class can be set to any number.
- filepath: A path to an npz file. In this case, n_fg_class must be specified properly.
- None: Do not load weights.

Utility

Multibox

class chainercv.links.model.ssd.Multibox(n_class, aspect_ratios, initialW=None, initial_bias=None)

Multibox head of Single Shot Multibox Detector.

This is the head part of Single Shot Multibox Detector [3]. This link computes mb_locs and mb_confs from feature maps. mb_locs contains the coordinates of bounding boxes and mb_confs contains the confidence scores for each class.

[3]	Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters:

n_class (int) – The number of classes possibly including the background.
aspect_ratios (iterable of tuple or int) – The aspect ratios of default bounding boxes for each feature map.
initialW – An initializer used in chainer.links.Convolution2D.__init__(). The default value is chainer.initializers.LeCunUniform.
initial_bias – An initializer used in chainer.links.Convolution2D.__init__(). The default value is chainer.initializers.Zero.

__call__(xs)

Compute loc and conf from feature maps

This method computes mb_locs and mb_confs from given feature maps.

Parameters:	xs (iterable of chainer.Variable) – An iterable of feature maps. The number of feature maps must be the same as the length of `aspect_ratios`.
Returns:

This method returns two `chainer.Variable`: `mb_locs` and `mb_confs`.

mb_locs: A variable of float arrays of shape \((B, K, 4)\), where \(B\) is the number of samples in the batch and \(K\) is the number of default bounding boxes.
mb_confs: A variable of float arrays of shape \((B, K, n\_fg\_class + 1)\).

Return type:	tuple of chainer.Variable
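
The relation between `aspect_ratios`, the number of default boxes and the output channels of the head can be sketched as follows. This is only an illustration: it assumes the convention of the SSD paper, where every feature map cell gets two square boxes plus one wide and one tall box per aspect ratio. The helper names `boxes_per_cell` and `head_channels` are hypothetical and not part of ChainerCV.

```python
def boxes_per_cell(aspect_ratio):
    """Two square boxes plus a (wide, tall) pair per ratio (assumed)."""
    return (len(aspect_ratio) + 1) * 2


def head_channels(aspect_ratios, n_class):
    """Return (loc_channels, conf_channels) for each feature map."""
    channels = []
    for ar in aspect_ratios:
        n = boxes_per_cell(ar)
        # 4 offsets per box for localization, n_class scores per box.
        channels.append((n * 4, n * n_class))
    return channels


# Hypothetical setting: 20 foreground classes plus the background.
aspect_ratios = ((2,), (2, 3), (2, 3), (2, 3), (2,), (2,))
channels = head_channels(aspect_ratios, n_class=21)
```

Under these assumptions a feature map with aspect ratios `(2,)` needs 16 localization channels and 84 confidence channels per cell.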

MultiboxCoder

class chainercv.links.model.ssd.MultiboxCoder(grids, aspect_ratios, steps, sizes, variance)

A helper class to encode/decode bounding boxes.

This class encodes (bbox, label) to (mb_loc, mb_label) and decodes (mb_loc, mb_conf) to (bbox, label, score). These encoding/decoding are used in Single Shot Multibox Detector [4].

mb_loc: An array representing offsets and scales from the default bounding boxes. Its shape is \((K, 4)\), where \(K\) is the number of default bounding boxes. The second axis is composed of \((\Delta y, \Delta x, \Delta h, \Delta w)\). These values are computed by the following formulas.
- \(\Delta y = (b_y - m_y) / (m_h * v_0)\)
- \(\Delta x = (b_x - m_x) / (m_w * v_0)\)
- \(\Delta h = \log(b_h / m_h) / v_1\)
- \(\Delta w = \log(b_w / m_w) / v_1\)
\((m_y, m_x)\) and \((m_h, m_w)\) are the center coordinates and size of a default bounding box. \((b_y, b_x)\) and \((b_h, b_w)\) are the center coordinates and size of the given bounding box that is assigned to the default bounding box. \((v_0, v_1)\) are coefficients that can be set by the argument variance.
mb_label: An array representing classes of ground truth bounding boxes. Its shape is \((K,)\).
mb_conf: An array representing classes of predicted bounding boxes. Its shape is \((K, n\_fg\_class + 1)\).
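
The formulas for mb_loc can be checked with a small scalar sketch. It works on a single pair of boxes in `(center_y, center_x, height, width)` form rather than on arrays, and the function names `encode_box` and `decode_box` are hypothetical. The variance `(0.1, 0.2)` is the default listed for the SSD base class.

```python
import math


def encode_box(b, m, variance=(0.1, 0.2)):
    """Encode box b relative to default box m.

    Both boxes are (center_y, center_x, height, width).
    """
    b_y, b_x, b_h, b_w = b
    m_y, m_x, m_h, m_w = m
    v0, v1 = variance
    return (
        (b_y - m_y) / (m_h * v0),
        (b_x - m_x) / (m_w * v0),
        math.log(b_h / m_h) / v1,
        math.log(b_w / m_w) / v1,
    )


def decode_box(loc, m, variance=(0.1, 0.2)):
    """Invert encode_box and recover (center_y, center_x, height, width)."""
    d_y, d_x, d_h, d_w = loc
    m_y, m_x, m_h, m_w = m
    v0, v1 = variance
    return (
        m_y + d_y * m_h * v0,
        m_x + d_x * m_w * v0,
        m_h * math.exp(d_h * v1),
        m_w * math.exp(d_w * v1),
    )


default_box = (50.0, 50.0, 20.0, 40.0)
ground_truth = (52.0, 46.0, 20.0, 80.0)
mb_loc = encode_box(ground_truth, default_box)
restored = decode_box(mb_loc, default_box)
```

Decoding the encoded offsets with the same default box and variance gives back the original box, up to floating point error.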

[4]	Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters:

grids (iterable of ints) – An iterable of integers. Each integer indicates the size of a feature map.
aspect_ratios (iterable of tuples of ints) – An iterable of tuples of integers used to compute the default bounding boxes. Each tuple indicates the aspect ratios of the default bounding boxes for one feature map. The length of this iterable should be len(grids).
steps (iterable of floats) – The step size for each feature map. The length of this iterable should be len(grids).
sizes (iterable of floats) – The base size of default bounding boxes for each feature map. The length of this iterable should be len(grids) + 1.
variance (tuple of floats) – Two coefficients for encoding/decoding the locations of bounding boxes. The first value is used to encode/decode coordinates of the centers. The second value is used to encode/decode the sizes of bounding boxes.
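
One plausible way these parameters combine into default bounding boxes is sketched below, following the scheme described in the SSD paper. It is an assumption about the layout, not a copy of the class internals. The function name `default_boxes` and the tiny configuration are hypothetical. The sketch also shows why sizes needs len(grids) + 1 entries: each feature map uses its own base size and the next one.

```python
import itertools
import math


def default_boxes(grids, aspect_ratios, steps, sizes):
    """Return a list of (center_y, center_x, height, width) tuples."""
    boxes = []
    for k, grid in enumerate(grids):
        for v, u in itertools.product(range(grid), repeat=2):
            cy = (v + 0.5) * steps[k]
            cx = (u + 0.5) * steps[k]
            # Square box at the base size of this feature map.
            boxes.append((cy, cx, sizes[k], sizes[k]))
            # Square box at the geometric mean of this size and the next.
            s = math.sqrt(sizes[k] * sizes[k + 1])
            boxes.append((cy, cx, s, s))
            # One wide and one tall box per aspect ratio.
            for ar in aspect_ratios[k]:
                r = math.sqrt(ar)
                boxes.append((cy, cx, sizes[k] / r, sizes[k] * r))
                boxes.append((cy, cx, sizes[k] * r, sizes[k] / r))
    return boxes


# Small hypothetical configuration, not the settings of a real model.
boxes = default_boxes(
    grids=(2, 1),
    aspect_ratios=((2,), (2, 3)),
    steps=(50, 100),
    sizes=(20, 40, 80),
)
```

With this configuration the 2x2 map contributes 4 boxes per cell and the 1x1 map contributes 6, so \(K = 22\).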

decode(mb_loc, mb_conf, nms_thresh=0.45, score_thresh=0.6)

Decodes back to coordinates and classes of bounding boxes.

This method decodes mb_loc and mb_conf returned by a SSD network back to bbox, label and score.

Parameters:

mb_loc (array) – A float array whose shape is \((K, 4)\), \(K\) is the number of default bounding boxes.
mb_conf (array) – A float array whose shape is \((K, n\_fg\_class + 1)\).
nms_thresh (float) – The threshold value for non_maximum_suppression(). The default value is 0.45.
score_thresh (float) – The threshold value for the confidence score. A bounding box whose confidence score is lower than this value is suppressed. The default value is 0.6.

Returns:

This method returns a tuple of three arrays, (bbox, label, score).

bbox: A float array of shape \((R, 4)\), where \(R\) is the number of bounding boxes in an image. Each bounding box is organized by \((y_{min}, x_{min}, y_{max}, x_{max})\) in the second axis.
label: An integer array of shape \((R,)\). Each value indicates the class of the bounding box.
score: A float array of shape \((R,)\). Each value indicates how confident the prediction is.

Return type:

tuple of three arrays
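
The roles of score_thresh and nms_thresh can be illustrated with a plain Python sketch for a single class. It is a simplified stand-in, not the array-based implementation: boxes are `(y_min, x_min, y_max, x_max)` tuples, and the names `iou` and `filter_one_class` are hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (y_min, x_min, y_max, x_max)."""
    top, left = max(a[0], b[0]), max(a[1], b[1])
    bottom, right = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, bottom - top) * max(0.0, right - left)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def filter_one_class(bboxes, scores, nms_thresh=0.45, score_thresh=0.6):
    """Return indices kept after score thresholding and greedy NMS."""
    # Drop low-confidence boxes, then visit the rest from best to worst.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(bboxes[i], bboxes[j]) < nms_thresh for j in keep):
            keep.append(i)
    return keep


bboxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.3]
keep = filter_one_class(bboxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68 and is suppressed, and the last box falls below the score threshold.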

encode(bbox, label, iou_thresh=0.5)

Encodes coordinates and classes of bounding boxes.

This method encodes bbox and label to mb_loc and mb_label, which are used to compute multibox loss.

Parameters:

bbox (array) – A float array of shape \((R, 4)\), where \(R\) is the number of bounding boxes in an image. Each bounding box is organized by \((y_{min}, x_{min}, y_{max}, x_{max})\) in the second axis.
label (array) – An integer array of shape \((R,)\). Each value indicates the class of the bounding box.
iou_thresh (float) – The threshold value used to determine whether a default bounding box is assigned to a ground truth. The default value is 0.5.

Returns:

This method returns a tuple of two arrays, (mb_loc, mb_label).

mb_loc: A float array of shape \((K, 4)\), where \(K\) is the number of default bounding boxes.
mb_label: An integer array of shape \((K,)\).

Return type:

tuple of two arrays
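
A minimal sketch of how iou_thresh could drive the assignment of labels to default boxes is given below. It makes two assumptions that are not stated above: background is stored as 0 and foreground class `l` as `l + 1`, and each default box takes the ground truth it overlaps most. The matching in the SSD paper additionally guarantees that every ground truth gets at least one default box; that step is omitted here. The name `assign_labels` is hypothetical.

```python
def iou(a, b):
    """IoU of two boxes given as (y_min, x_min, y_max, x_max)."""
    top, left = max(a[0], b[0]), max(a[1], b[1])
    bottom, right = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, bottom - top) * max(0.0, right - left)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def assign_labels(default_boxes, gt_boxes, gt_labels, iou_thresh=0.5):
    """Return K labels: 0 for background, label + 1 otherwise (assumed)."""
    mb_label = [0] * len(default_boxes)
    if not gt_boxes:
        return mb_label
    for k, d in enumerate(default_boxes):
        overlaps = [iou(d, g) for g in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda r: overlaps[r])
        if overlaps[best] >= iou_thresh:
            mb_label[k] = gt_labels[best] + 1
    return mb_label


default_boxes = [(0, 0, 10, 10), (0, 10, 10, 20), (10, 0, 20, 10)]
gt_boxes = [(0, 1, 10, 11)]
gt_labels = [4]
mb_label = assign_labels(default_boxes, gt_boxes, gt_labels)
```

Only the first default box overlaps the ground truth enough (IoU of about 0.82), so it alone receives a foreground label.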

Normalize

class chainercv.links.model.ssd.Normalize(n_channel, initial=0, eps=1e-05)

Learnable L2 normalization [5].

This link normalizes input along the channel axis and scales it. The scale factors are trained channel-wise.

[5]	Wei Liu, Andrew Rabinovich, Alexander C. Berg. ParseNet: Looking Wider to See Better. ICLR 2016.

Parameters:

n_channel (int) – The number of channels.
initial – A value to initialize the scale factors. It is passed to `chainer.initializers._get_initializer()`. The default value is 0.
eps (float) – A small value to avoid zero-division. The default value is \(1e-5\).

__call__(x)

Normalize input and scale it.

Parameters:	x (chainer.Variable) – A variable holding 4-dimensional array. Its `dtype` is `numpy.float32`.
Returns:	The shape and `dtype` are the same as those of the input.
Return type:	chainer.Variable
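
The computation can be sketched for the channel vector at a single spatial position. This is an illustration only: it assumes eps is added to the L2 norm before dividing, and the scale value 20 is a hypothetical choice. The name `l2_normalize_and_scale` is not part of the library.

```python
import math


def l2_normalize_and_scale(channels, scale, eps=1e-5):
    """Normalize one position's channel vector to unit L2 norm, then scale.

    channels: the values of every channel at one spatial position.
    scale: one learnable factor per channel.
    """
    norm = math.sqrt(sum(c * c for c in channels)) + eps
    return [s * c / norm for c, s in zip(channels, scale)]


out = l2_normalize_and_scale([3.0, 4.0], scale=[20.0, 20.0])
```

The input vector has norm 5, so the output is close to `[12.0, 16.0]`, with a small deviation caused by eps.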

SSD

class chainercv.links.model.ssd.SSD(extractor, multibox, steps, sizes, variance=(0.1, 0.2), mean=0)

Base class of Single Shot Multibox Detector.

This is a base class of Single Shot Multibox Detector [6].

[6]	Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters:

extractor –
A link which extracts feature maps. This link must have insize, grids and __call__().
- insize: An integer which indicates the size of input images. Images are resized to this size before feature extraction.
- grids: An iterable of integers. Each integer indicates the size of a feature map. This value is used by MultiboxCoder.
- __call__(): A method which computes feature maps. It must take a batch of images and return batched feature maps.
multibox –
A link which computes mb_locs and mb_confs from feature maps. This link must have n_class, aspect_ratios and __call__().
- n_class: An integer which indicates the number of classes. This value should include the background class.
- aspect_ratios: An iterable of tuples of integers. Each tuple indicates the aspect ratios of the default bounding boxes for one feature map. This value is used by MultiboxCoder.
- __call__(): A method which computes mb_locs and mb_confs. It must take batched feature maps and return mb_locs and mb_confs.
steps (iterable of float) – The step size for each feature map. This value is used by MultiboxCoder.
sizes (iterable of float) – The base size of default bounding boxes for each feature map. This value is used by MultiboxCoder.
variance (tuple of floats) – Two coefficients for decoding the locations of bounding boxes. This value is used by MultiboxCoder. The default value is (0.1, 0.2).
nms_thresh (float) – The threshold value for non_maximum_suppression(). The default value is 0.45. This value can be changed directly or by using use_preset().
score_thresh (float) – The threshold value for the confidence score. A bounding box whose confidence score is lower than this value is suppressed. The default value is 0.6. This value can be changed directly or by using use_preset().

__call__(x)

Compute localization and classification from a batch of images.

This method computes two variables, mb_locs and mb_confs. self.coder.decode() converts these variables to bounding box coordinates and confidence scores. These variables are also used in training SSD.

Parameters:	x (chainer.Variable) – A variable holding a batch of images. The images are preprocessed by `_prepare()`.
Returns:

This method returns two variables, `mb_locs` and `mb_confs`.

mb_locs: A variable of float arrays of shape \((B, K, 4)\), where \(B\) is the number of samples in the batch and \(K\) is the number of default bounding boxes.
mb_confs: A variable of float arrays of shape \((B, K, n\_fg\_class + 1)\).

Return type:	tuple of chainer.Variable

predict(imgs)

Detect objects from images.

This method predicts objects for each image.

Parameters:	imgs (iterable of numpy.ndarray) – Arrays holding images. All images are in CHW and RGB format and the range of their value is \([0, 255]\).
Returns:

This method returns a tuple of three lists, `(bboxes, labels, scores)`.

bboxes: A list of float arrays of shape \((R, 4)\), where \(R\) is the number of bounding boxes in an image. Each bounding box is organized by \((y_{min}, x_{min}, y_{max}, x_{max})\) in the second axis.
labels: A list of integer arrays of shape \((R,)\). Each value indicates the class of the bounding box. Values are in range \([0, L - 1]\), where \(L\) is the number of foreground classes.
scores: A list of float arrays of shape \((R,)\). Each value indicates how confident the prediction is.

Return type:	tuple of lists

use_preset(preset)

Use the given preset during prediction.

This method changes values of nms_thresh and score_thresh. These values are a threshold value used for non maximum suppression and a threshold value to discard low confidence proposals in predict(), respectively.

If the attributes need to be changed to something other than the values provided in the presets, please modify them by directly accessing the public attributes.

Parameters:	preset ({'visualize', 'evaluate'}) – A string to determine the preset to use.

VGG16

class chainercv.links.model.ssd.VGG16

An extended VGG-16 model for SSD300 and SSD512.

This is an extended VGG-16 model proposed in [7]. The differences from original VGG-16 [8] are shown below.

conv5_1, conv5_2 and conv5_3 are changed from Convolution2D to DilatedConvolution2D.
Normalize is inserted after conv4_3.
The parameters of max pooling after conv5_3 are changed.
fc6 and fc7 are converted to conv6 and conv7.

[7]	Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

[8]	Karen Simonyan, Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR 2015.

VGG16Extractor300

class chainercv.links.model.ssd.VGG16Extractor300

A VGG-16 based feature extractor for SSD300.

This is a feature extractor for SSD300. This extractor is based on VGG16.

__call__(x)

Compute feature maps from a batch of images.

This method extracts feature maps from conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11_2.

Parameters:	x (ndarray) – An array holding a batch of images. The images should be resized to \(300\times 300\).
Returns:	Each variable contains a feature map.
Return type:	list of Variable

VGG16Extractor512

class chainercv.links.model.ssd.VGG16Extractor512

A VGG-16 based feature extractor for SSD512.

This is a feature extractor for SSD512. This extractor is based on VGG16.

__call__(x)

Compute feature maps from a batch of images.

This method extracts feature maps from conv4_3, conv7, conv8_2, conv9_2, conv10_2, conv11_2, and conv12_2.

Parameters:	x (ndarray) – An array holding a batch of images. The images should be resized to \(512\times 512\).
Returns:	Each variable contains a feature map.
Return type:	list of Variable

Train-only Utility

GradientScaling

class chainercv.links.model.ssd.GradientScaling(rate)

Optimizer/UpdateRule hook function for scaling gradient.

This hook function scales gradient by a constant value.

Parameters:	rate (float) – Coefficient for scaling.
Variables:	rate (float) – Coefficient for scaling.

multibox_loss

chainercv.links.model.ssd.multibox_loss(mb_locs, mb_confs, gt_mb_locs, gt_mb_labels, k, comm=None)

Computes multibox losses.

This is the loss function used in [9]. This function returns loc_loss and conf_loss. loc_loss is a loss for localization and conf_loss is a loss for classification. The formulas of these losses can be found in equations (2) and (3) of the original paper.

[9]	Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters:

mb_locs (chainer.Variable or array) – The offsets and scales for predicted bounding boxes. Its shape is \((B, K, 4)\), where \(B\) is the number of samples in the batch and \(K\) is the number of default bounding boxes.
mb_confs (chainer.Variable or array) – The classes of predicted bounding boxes. Its shape is \((B, K, n\_class)\). This function assumes the first class is background (negative).
gt_mb_locs (chainer.Variable or array) – The offsets and scales for ground truth bounding boxes. Its shape is \((B, K, 4)\).
gt_mb_labels (chainer.Variable or array) – The classes of ground truth bounding boxes. Its shape is \((B, K)\).
k (float) – A coefficient which is used for hard negative mining. This value determines the ratio between the number of positives and that of mined negatives. The value used in the original paper is `3`.
comm (CommunicatorBase) – A ChainerMN communicator. If it is specified, the number of positive examples is computed among all GPUs.

Returns:	This function returns two `chainer.Variable`: `loc_loss` and `conf_loss`.
Return type:	tuple of chainer.Variable
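
The effect of k on hard negative mining can be sketched for one image with plain lists. The sketch assumes that the negatives with the largest classification loss are selected, up to k times the number of positives, and that the summed loss is divided by the number of positives as in the original paper. The name `hard_negative_mask` and the loss values are hypothetical.

```python
def hard_negative_mask(conf_loss, positive, k=3):
    """Select the k * n_positive negatives with the largest loss.

    conf_loss: per-default-box classification loss for one image.
    positive: booleans, True where a ground truth was assigned.
    """
    n_positive = sum(positive)
    n_negative = int(k * n_positive)
    negatives = [i for i, p in enumerate(positive) if not p]
    negatives.sort(key=lambda i: conf_loss[i], reverse=True)
    chosen = set(negatives[:n_negative])
    return [i in chosen for i in range(len(positive))]


conf_loss = [0.2, 2.5, 0.1, 1.7, 0.9, 0.05]
positive = [True, False, False, False, False, False]
mask = hard_negative_mask(conf_loss, positive, k=3)

# Sum the loss over positives and mined negatives, then normalize
# by the number of positives.
selected = [
    loss for loss, p, m in zip(conf_loss, positive, mask) if p or m]
total = sum(selected) / sum(positive)
```

With one positive and `k=3`, the three negatives with the highest loss are kept and the remaining two are ignored.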

random_crop_with_bbox_constraints

chainercv.links.model.ssd.random_crop_with_bbox_constraints(img, bbox, min_scale=0.3, max_scale=1, max_aspect_ratio=2, constraints=None, max_trial=50, return_param=False)

Crop an image randomly with bounding box constraints.

This data augmentation is used in training of Single Shot Multibox Detector [10]. More details can be found in data augmentation section of the original paper.

[10]	Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters:

img (ndarray) – An image array to be cropped. This is in CHW format.
bbox (ndarray) – Bounding boxes used for constraints. The shape is \((R, 4)\). \(R\) is the number of bounding boxes.
min_scale (float) – The minimum ratio between a cropped region and the original image. The default value is 0.3.
max_scale (float) – The maximum ratio between a cropped region and the original image. The default value is 1.
max_aspect_ratio (float) – The maximum aspect ratio of cropped region. The default value is 2.
constraints (iterable of tuples) – An iterable of constraints. Each constraint should be in (min_iou, max_iou) format. If min_iou or max_iou is None, that bound is not applied. If this argument is not specified, ((0.1, None), (0.3, None), (0.5, None), (0.7, None), (0.9, None), (None, 1)) will be used.
max_trial (int) – The maximum number of trials to be conducted for each constraint. If this function cannot find any region that satisfies the constraint in \(max\_trial\) trials, this function skips the constraint. The default value is 50.
return_param (bool) – If True, this function returns information of intermediate values.

Returns:

If return_param = False, returns an array img that is cropped from the input array.

If return_param = True, returns a tuple whose elements are img, param. param is a dictionary of intermediate parameters whose contents are listed below with key, value-type and the description of the value.

constraint (tuple): The chosen constraint.
y_slice (slice): A slice in vertical direction used to crop the input image.
x_slice (slice): A slice in horizontal direction used to crop the input image.

Return type:

ndarray or (ndarray, dict)
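
How a single constraint might be sampled is sketched below in plain Python. This is a simplified illustration under stated assumptions, not the library code: it handles one constraint at a time, expects at least one bounding box, and returns only the crop region. The name `sample_crop` and the `rng` argument are hypothetical.

```python
import math
import random


def iou(a, b):
    """IoU of two boxes given as (y_min, x_min, y_max, x_max)."""
    top, left = max(a[0], b[0]), max(a[1], b[1])
    bottom, right = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, bottom - top) * max(0.0, right - left)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def sample_crop(H, W, bbox, constraint, min_scale=0.3, max_scale=1,
                max_aspect_ratio=2, max_trial=50, rng=random):
    """Try to sample a crop (y_min, x_min, y_max, x_max) for one constraint.

    Returns None when no trial satisfies the constraint.
    """
    min_iou, max_iou = constraint
    min_iou = 0 if min_iou is None else min_iou
    max_iou = 1 if max_iou is None else max_iou
    for _ in range(max_trial):
        scale = rng.uniform(min_scale, max_scale)
        # Bound the aspect ratio so that the crop fits inside the image.
        aspect_ratio = rng.uniform(
            max(1 / max_aspect_ratio, scale * scale),
            min(max_aspect_ratio, 1 / (scale * scale)))
        crop_h = int(H * scale / math.sqrt(aspect_ratio))
        crop_w = int(W * scale * math.sqrt(aspect_ratio))
        top = rng.randint(0, H - crop_h)
        left = rng.randint(0, W - crop_w)
        crop = (top, left, top + crop_h, left + crop_w)
        overlaps = [iou(b, crop) for b in bbox]
        if min_iou <= min(overlaps) and max(overlaps) <= max_iou:
            return crop
    return None


rng = random.Random(0)
bbox = [(100, 100, 200, 250)]
crop = sample_crop(300, 400, bbox, constraint=(0.3, None), rng=rng)
```

A caller would repeat this for each constraint, skip the ones that return `None`, and fall back to the uncropped image when nothing is found.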

random_distort

chainercv.links.model.ssd.random_distort(img, brightness_delta=32, contrast_low=0.5, contrast_high=1.5, saturation_low=0.5, saturation_high=1.5, hue_delta=18)

A color related data augmentation used in SSD.

This function is a combination of four augmentation methods: brightness, contrast, saturation and hue.

brightness: Adding a random offset to the intensity of the image.
contrast: Multiplying the intensity of the image by a random scale.
saturation: Multiplying the saturation of the image by a random scale.
hue: Adding a random offset to the hue of the image.

This data augmentation is used in training of Single Shot Multibox Detector [11].

Note that this function requires cv2.

[11]	Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters:

img (ndarray) – An image array to be augmented. This is in CHW and RGB format.
brightness_delta (float) – The offset for brightness will be drawn from \([-brightness\_delta, brightness\_delta]\). The default value is 32.
contrast_low (float) – The scale for contrast will be drawn from \([contrast\_low, contrast\_high]\). The default value is 0.5.
contrast_high (float) – See contrast_low. The default value is 1.5.
saturation_low (float) – The scale for saturation will be drawn from \([saturation\_low, saturation\_high]\). The default value is 0.5.
saturation_high (float) – See saturation_low. The default value is 1.5.
hue_delta (float) – The offset for hue will be drawn from \([-hue\_delta, hue\_delta]\). The default value is 18.

Returns:

An image in CHW and RGB format.
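
The brightness and contrast steps can be sketched on a flat list of intensities. This is a partial illustration: saturation and hue are omitted because they need a color space conversion, and clipping the result to \([0, 255]\) is an assumption. The helper names and the `rng` argument are hypothetical.

```python
import random


def clip(value, low=0.0, high=255.0):
    """Keep a value inside [low, high]."""
    return max(low, min(high, value))


def brightness(pixels, delta=32, rng=random):
    """Add one random offset drawn from [-delta, delta] to every value."""
    offset = rng.uniform(-delta, delta)
    return [clip(p + offset) for p in pixels]


def contrast(pixels, low=0.5, high=1.5, rng=random):
    """Multiply every value by one random scale drawn from [low, high]."""
    scale = rng.uniform(low, high)
    return [clip(p * scale) for p in pixels]


rng = random.Random(0)
pixels = [0.0, 64.0, 128.0, 255.0]  # flattened intensities in [0, 255]
out = contrast(brightness(pixels, rng=rng), rng=rng)
```

Both steps apply a single random value to the whole image, so the ordering of intensities is preserved.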

resize_with_random_interpolation

chainercv.links.model.ssd.resize_with_random_interpolation(img, size, return_param=False)

Resize an image with a randomly selected interpolation method.

This function is similar to chainercv.transforms.resize(), but this chooses the interpolation method randomly.

This data augmentation is used in training of Single Shot Multibox Detector [12].

Note that this function requires cv2.

[12]	Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. SSD: Single Shot MultiBox Detector. ECCV 2016.

Parameters:

img (ndarray) – An array to be transformed. This is in CHW format and the type should be numpy.float32.
size (tuple) – This is a tuple of length 2. Its elements are ordered as (height, width).
return_param (bool) – If True, this function also returns information about the chosen interpolation.

Returns:

If return_param = False, returns an array img that is the result of resizing.

If return_param = True, returns a tuple whose elements are img, param. param is a dictionary of intermediate parameters whose contents are listed below with key, value-type and the description of the value.

interpolation: The chosen interpolation method.

Return type:

ndarray or (ndarray, dict)