Faster R-CNN¶
Detection Link¶
FasterRCNNVGG16¶

class chainercv.links.model.faster_rcnn.FasterRCNNVGG16(n_fg_class=None, pretrained_model=None, min_size=600, max_size=1000, ratios=[0.5, 1, 2], anchor_scales=[8, 16, 32], vgg_initialW=None, rpn_initialW=None, loc_initialW=None, score_initialW=None, proposal_creator_params={})¶

Faster R-CNN based on VGG16.
When you specify the path of a pretrained chainer model serialized as an npz file in the constructor, this chain model automatically initializes all the parameters with it. When a string in a prespecified set is provided, a pretrained model is loaded from weights distributed on the Internet. The supported pretrained models are as follows:

- voc07: Loads weights trained with the trainval split of PASCAL VOC2007 Detection Dataset.
- imagenet: Loads weights trained with the ImageNet Classification task for the feature extractor and the head modules. Weights that do not have a corresponding layer in VGG16 will be randomly initialized.
For descriptions of the interface of this model, please refer to chainercv.links.model.faster_rcnn.FasterRCNN.

FasterRCNNVGG16 supports finer control on random initializations of weights by the arguments vgg_initialW, rpn_initialW, loc_initialW and score_initialW. Each accepts a callable that takes an array and edits its values. If None is passed as an initializer, the default initializer is used.

Parameters:
- n_fg_class (int) – The number of classes excluding the background.
- pretrained_model (str) – The destination of the pretrained chainer model serialized as an npz file. If this is one of the strings described above, it automatically loads weights stored under the directory $CHAINER_DATASET_ROOT/pfnet/chainercv/models/, where $CHAINER_DATASET_ROOT is set as $HOME/.chainer/dataset unless you specify another value by modifying the environment variable.
- min_size (int) – A preprocessing parameter for prepare().
- max_size (int) – A preprocessing parameter for prepare().
- ratios (list of floats) – Ratios of width to height of the anchors.
- anchor_scales (list of numbers) – Areas of the anchors. Those areas will be the product of the square of an element in anchor_scales and the original area of the reference window.
- vgg_initialW (callable) – Initializer for the layers corresponding to the VGG16 layers.
- rpn_initialW (callable) – Initializer for Region Proposal Network layers.
- loc_initialW (callable) – Initializer for the localization head.
- score_initialW (callable) – Initializer for the score head.
- proposal_creator_params (dict) – Key valued parameters for chainercv.links.model.faster_rcnn.ProposalCreator.
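A minimal usage sketch is given below; the image file name is a placeholder, and the 'voc07' weights are downloaded on first use.

```python
# Hypothetical usage sketch: detect objects in one image with the
# VOC07-pretrained model. 'sample.jpg' is a placeholder path.
from chainercv.links import FasterRCNNVGG16
from chainercv.utils import read_image

model = FasterRCNNVGG16(n_fg_class=20, pretrained_model='voc07')

img = read_image('sample.jpg')            # CHW, RGB, float32, values in [0, 255]
bboxes, labels, scores = model.predict([img])
print(bboxes[0].shape, labels[0].shape, scores[0].shape)
```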
Utility¶
bbox2loc¶

chainercv.links.model.faster_rcnn.bbox2loc(src_bbox, dst_bbox)¶

Encodes the source and the destination bounding boxes to "loc".

Given bounding boxes, this function computes offsets and scales to match the source bounding boxes to the target bounding boxes. Mathematically, given a bounding box whose center is \((y, x) = (p_y, p_x)\) and size \((p_h, p_w)\), and the target bounding box whose center is \((g_y, g_x)\) and size \((g_h, g_w)\), the offsets and scales \(t_y, t_x, t_h, t_w\) can be computed by the following formulas.
 \(t_y = \frac{(g_y - p_y)} {p_h}\)
 \(t_x = \frac{(g_x - p_x)} {p_w}\)
 \(t_h = \log(\frac{g_h} {p_h})\)
 \(t_w = \log(\frac{g_w} {p_w})\)
The output is the same type as the type of the inputs. The encoding formulas are used in works such as R-CNN [1].

[1] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.

Parameters:
- src_bbox (array) – An image coordinate array whose shape is \((R, 4)\). \(R\) is the number of bounding boxes. These coordinates are \(p_{ymin}, p_{xmin}, p_{ymax}, p_{xmax}\).
- dst_bbox (array) – An image coordinate array whose shape is \((R, 4)\). These coordinates are \(g_{ymin}, g_{xmin}, g_{ymax}, g_{xmax}\).

Returns: Bounding box offsets and scales from src_bbox to dst_bbox. This has shape \((R, 4)\). The second axis contains four values \(t_y, t_x, t_h, t_w\).

Return type: array
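As a quick illustration of the encoding, the sketch below runs bbox2loc() on a single hand-made pair of boxes; the coordinate values are arbitrary.

```python
# Sketch: encode one arbitrary source/destination box pair into (t_y, t_x, t_h, t_w).
import numpy as np
from chainercv.links.model.faster_rcnn import bbox2loc

src_bbox = np.array([[0., 0., 10., 10.]], dtype=np.float32)   # (y_min, x_min, y_max, x_max)
dst_bbox = np.array([[2., 2., 12., 14.]], dtype=np.float32)

loc = bbox2loc(src_bbox, dst_bbox)
print(loc)   # shape (1, 4): t_y, t_x, t_h, t_w
```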
FasterRCNN¶

class chainercv.links.model.faster_rcnn.FasterRCNN(extractor, rpn, head, mean, min_size=600, max_size=1000, loc_normalize_mean=(0.0, 0.0, 0.0, 0.0), loc_normalize_std=(0.1, 0.1, 0.2, 0.2))¶

Base class for Faster R-CNN.
This is a base class for Faster R-CNN links supporting the object detection API [2]. The following three stages constitute Faster R-CNN.

- Feature extraction: Images are taken and their feature maps are calculated.
- Region Proposal Networks: Given the feature maps calculated in the previous stage, produce a set of RoIs around objects.
- Localization and Classification Heads: Using feature maps that belong to the proposed RoIs, classify the categories of the objects in the RoIs and improve localizations.
Each stage is carried out by one of the callable chainer.Chain objects feature, rpn and head.

There are two functions, predict() and __call__(), to conduct object detection. predict() takes images and returns bounding boxes that are converted to image coordinates. This will be useful for a scenario when Faster R-CNN is treated as a black box function, for instance. __call__() is provided for a scenario when intermediate outputs are needed, for instance, for training and debugging.

Links that support the object detection API have the method predict() with the same interface. Please refer to FasterRCNN.predict() for further details.

[2] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Parameters:
- extractor (callable Chain) – A callable that takes a BCHW image array and returns feature maps.
- rpn (callable Chain) – A callable that has the same interface as chainercv.links.RegionProposalNetwork. Please refer to the documentation found there.
- head (callable Chain) – A callable that takes a BCHW array, RoIs and batch indices for RoIs. This returns class dependent localization parameters and class scores.
- mean (numpy.ndarray) – A value to be subtracted from an image in prepare().
- min_size (int) – A preprocessing parameter for prepare(). Please refer to the docstring of prepare().
- max_size (int) – A preprocessing parameter for prepare().
- loc_normalize_mean (tuple of four floats) – Mean values of localization estimates.
- loc_normalize_std (tuple of four floats) – Standard deviations of localization estimates.

__call__(x, scale=1.0)¶

Forward Faster R-CNN.

The scaling parameter scale is used by the RPN to determine the threshold to select small objects, which are going to be rejected irrespective of their confidence scores.

Here are the notations used.
- \(N\) is the batch size.
- \(R'\) is the total number of RoIs produced across batches. Given \(R_i\) proposed RoIs from the \(i\)-th image, \(R' = \sum_{i=1}^{N} R_i\).
- \(L\) is the number of classes excluding the background.

Classes are ordered by the background, the first class, ..., and the \(L\)-th class.
Parameters:
- x (Variable) – 4D image variable.
- scale (float) – Amount of scaling applied to the raw image during preprocessing.

Returns: A tuple of the four values listed below.
- roi_cls_locs: Offsets and scalings for the proposed RoIs. Its shape is \((R', (L + 1) \times 4)\).
- roi_scores: Class predictions for the proposed RoIs. Its shape is \((R', L + 1)\).
- rois: RoIs proposed by the RPN. Its shape is \((R', 4)\).
- roi_indices: Batch indices of RoIs. Its shape is \((R',)\).

Return type: tuple
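The sketch below forwards a dummy preprocessed image through a concrete subclass (FasterRCNNVGG16) to show the shapes of the four outputs; the input size and the randomly initialized weights are assumptions for illustration.

```python
# Sketch: raw forward pass of a concrete Faster R-CNN link on a dummy input.
import numpy as np
import chainer
from chainercv.links import FasterRCNNVGG16

model = FasterRCNNVGG16(n_fg_class=20)                 # randomly initialized
x = np.zeros((1, 3, 600, 800), dtype=np.float32)       # NCHW, already preprocessed

with chainer.using_config('train', False):
    roi_cls_locs, roi_scores, rois, roi_indices = model(x, scale=1.0)

print(roi_cls_locs.shape, roi_scores.shape, rois.shape, roi_indices.shape)
```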

predict(imgs)¶

Detect objects from images.

This method predicts objects for each image.

Parameters: imgs (iterable of numpy.ndarray) – Arrays holding images. All images are in CHW and RGB format and the range of their values is \([0, 255]\).

Returns: This method returns a tuple of three lists, (bboxes, labels, scores).
- bboxes: A list of float arrays of shape \((R, 4)\), where \(R\) is the number of bounding boxes in an image. Each bounding box is organized by (y_min, x_min, y_max, x_max) in the second axis.
- labels: A list of integer arrays of shape \((R,)\). Each value indicates the class of the bounding box. Values are in the range \([0, L - 1]\), where \(L\) is the number of the foreground classes.
- scores: A list of float arrays of shape \((R,)\). Each value indicates how confident the prediction is.

Return type: tuple of lists

prepare(img)¶

Preprocess an image for feature extraction.

The length of the shorter edge is scaled to self.min_size. After the scaling, if the length of the longer edge is longer than self.max_size, the image is scaled to fit the longer edge to self.max_size.

After resizing the image, the image is subtracted by a mean image value self.mean.

Parameters: img (ndarray) – An image. This is in CHW and RGB format. The range of its values is \([0, 255]\).

Returns: A preprocessed image.

Return type: ndarray
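The resizing rule can be summarized by the scale factor computed below; this is a standalone sketch of the described behavior, not the library code itself.

```python
# Sketch of the scaling rule described above: fit the shorter edge to
# min_size, then shrink further if the longer edge would exceed max_size.
def compute_scale(height, width, min_size=600, max_size=1000):
    scale = min_size / min(height, width)
    if scale * max(height, width) > max_size:
        scale = max_size / max(height, width)
    return scale

print(compute_scale(375, 500))    # a typical PASCAL VOC image size
```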

use_preset(preset)¶

Use the given preset during prediction.

This method changes the values of self.nms_thresh and self.score_thresh. These values are a threshold value used for non maximum suppression and a threshold value to discard low confidence proposals in predict(), respectively.

If the attributes need to be changed to something other than the values provided in the presets, please modify them by directly accessing the public attributes.

Parameters: preset ({'visualize', 'evaluate'}) – A string to determine the preset to use.
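For example, a sketch of switching presets around an evaluation run (the 'voc07' weights are an assumption):

```python
# Sketch: use lenient thresholds while computing detection metrics,
# then switch back to the default preset for visualization.
from chainercv.links import FasterRCNNVGG16

model = FasterRCNNVGG16(n_fg_class=20, pretrained_model='voc07')
model.use_preset('evaluate')     # keep low confidence detections for mAP
# ... run model.predict(...) over a test set here ...
model.use_preset('visualize')    # stricter score threshold for display
```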
generate_anchor_base¶

chainercv.links.model.faster_rcnn.generate_anchor_base(base_size=16, ratios=[0.5, 1, 2], anchor_scales=[8, 16, 32])¶

Generate anchor base windows by enumerating aspect ratios and scales.

Generate anchors that are scaled and modified to the given aspect ratios. The area of a scaled anchor is preserved when modifying it to the given aspect ratio.

R = len(ratios) * len(anchor_scales) anchors are generated by this function. The i * len(anchor_scales) + j-th anchor corresponds to an anchor generated by ratios[i] and anchor_scales[j].

For example, if the scale is \(8\) and the ratio is \(0.25\), the width and the height of the base window will be stretched by \(8\). For modifying the anchor to the given aspect ratio, the height is halved and the width is doubled.
Parameters:
- base_size (number) – The width and the height of the reference window.
- ratios (list of floats) – Ratios of width to height of the anchors.
- anchor_scales (list of numbers) – Areas of the anchors. Those areas will be the product of the square of an element in anchor_scales and the original area of the reference window.

Returns: An array of shape \((R, 4)\). Each element is a set of coordinates of a bounding box. The second axis corresponds to y_min, x_min, y_max, x_max of a bounding box.

Return type: ndarray
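A quick sketch with the default arguments; nine base anchors are produced from the three ratios and three scales.

```python
# Sketch: enumerate the 3 ratios x 3 scales = 9 base anchor windows.
from chainercv.links.model.faster_rcnn import generate_anchor_base

anchor_base = generate_anchor_base(
    base_size=16, ratios=[0.5, 1, 2], anchor_scales=[8, 16, 32])
print(anchor_base.shape)    # (9, 4): y_min, x_min, y_max, x_max
```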
loc2bbox¶

chainercv.links.model.faster_rcnn.loc2bbox(src_bbox, loc)¶

Decode bounding boxes from bounding box offsets and scales.

Given bounding box offsets and scales computed by bbox2loc(), this function decodes the representation to coordinates in 2D image coordinates.

Given scales and offsets \(t_y, t_x, t_h, t_w\) and a bounding box whose center is \((y, x) = (p_y, p_x)\) and size \((p_h, p_w)\), the decoded bounding box's center \(\hat{g}_y\), \(\hat{g}_x\) and size \(\hat{g}_h\), \(\hat{g}_w\) are calculated by the following formulas.
 \(\hat{g}_y = p_h t_y + p_y\)
 \(\hat{g}_x = p_w t_x + p_x\)
 \(\hat{g}_h = p_h \exp(t_h)\)
 \(\hat{g}_w = p_w \exp(t_w)\)
The decoding formulas are used in works such as R-CNN [3].

The output is the same type as the type of the inputs.

[3] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.

Parameters:
- src_bbox (array) – An image coordinate array whose shape is \((R, 4)\). These coordinates are \(p_{ymin}, p_{xmin}, p_{ymax}, p_{xmax}\).
- loc (array) – An array with offsets and scales, whose shape is \((R, 4)\). The second axis contains four values \(t_y, t_x, t_h, t_w\).

Returns: Decoded bounding box coordinates. Its shape is \((R, 4)\). The second axis contains four values \(\hat{g}_{ymin}, \hat{g}_{xmin}, \hat{g}_{ymax}, \hat{g}_{xmax}\).

Return type: array
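Since loc2bbox() inverts bbox2loc(), a round trip reproduces the destination boxes; the sketch below checks this on an arbitrary pair of boxes.

```python
# Sketch: bbox2loc followed by loc2bbox recovers the destination box.
import numpy as np
from chainercv.links.model.faster_rcnn import bbox2loc, loc2bbox

src_bbox = np.array([[0., 0., 10., 10.]], dtype=np.float32)
dst_bbox = np.array([[2., 2., 12., 14.]], dtype=np.float32)

loc = bbox2loc(src_bbox, dst_bbox)
decoded = loc2bbox(src_bbox, loc)
print(np.allclose(decoded, dst_bbox, atol=1e-4))    # True
```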
ProposalCreator¶

class chainercv.links.model.faster_rcnn.ProposalCreator(nms_thresh=0.7, n_train_pre_nms=12000, n_train_post_nms=2000, n_test_pre_nms=6000, n_test_post_nms=300, force_cpu_nms=False, min_size=16)¶

Proposal regions are generated by calling this object.

The __call__() of this object outputs object detection proposals by applying estimated bounding box offsets to a set of anchors.

This class takes parameters to control the number of bounding boxes to pass to NMS and to keep after NMS. If the parameters are negative, it uses all the bounding boxes supplied or keeps all the bounding boxes returned by NMS.

This class is used for Region Proposal Networks introduced in Faster R-CNN [4].
[4] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Parameters:
- nms_thresh (float) – Threshold value used when calling NMS.
- n_train_pre_nms (int) – Number of top scored bounding boxes to keep before passing to NMS in train mode.
- n_train_post_nms (int) – Number of top scored bounding boxes to keep after passing to NMS in train mode.
- n_test_pre_nms (int) – Number of top scored bounding boxes to keep before passing to NMS in test mode.
- n_test_post_nms (int) – Number of top scored bounding boxes to keep after passing to NMS in test mode.
- force_cpu_nms (bool) – If this is True, always use NMS in CPU mode. If False, the NMS mode is selected based on the type of inputs.
- min_size (int) – A parameter to determine the threshold for discarding bounding boxes based on their sizes.

__call__(loc, score, anchor, img_size, scale=1.0)¶

Propose RoIs.

The inputs loc, score, anchor refer to the same anchor when indexed by the same index.

On notations, \(R\) is the total number of anchors. This is equal to the product of the height and the width of an image and the number of anchor bases per pixel.

The type of the output is the same as the inputs.

Parameters:
- loc (array) – Predicted offsets and scaling to anchors. Its shape is \((R, 4)\).
- score (array) – Predicted foreground probability for anchors. Its shape is \((R,)\).
- anchor (array) – Coordinates of anchors. Its shape is \((R, 4)\).
- img_size (tuple of ints) – A tuple height, width, which contains the image size after scaling.
- scale (float) – The scaling factor used to scale an image after reading it from a file.

Returns: An array of coordinates of proposal boxes. Its shape is \((S, 4)\). \(S\) is less than self.n_test_post_nms in test time and less than self.n_train_post_nms in train time. \(S\) depends on the size of the predicted bounding boxes and the number of bounding boxes discarded by NMS.

Return type: array
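The sketch below shows the calling convention with randomly generated anchors, offsets and scores; in practice these values come from the RPN.

```python
# Sketch with arbitrary random inputs, only to illustrate the interface.
import numpy as np
from chainercv.links.model.faster_rcnn import ProposalCreator

R = 5000
centers = np.random.rand(R, 2) * np.array([600., 800.])      # anchor centers (y, x)
sizes = 16. + np.random.rand(R, 2) * 100.                    # anchor sizes (h, w)
anchor = np.concatenate(
    (centers - sizes / 2., centers + sizes / 2.), axis=1).astype(np.float32)

loc = (np.random.randn(R, 4) * 0.1).astype(np.float32)       # predicted offsets
score = np.random.rand(R).astype(np.float32)                 # foreground scores

proposal_creator = ProposalCreator()
rois = proposal_creator(loc, score, anchor, img_size=(600, 800), scale=1.0)
print(rois.shape)    # (S, 4)
```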
RegionProposalNetwork¶

class chainercv.links.model.faster_rcnn.RegionProposalNetwork(in_channels=512, mid_channels=512, ratios=[0.5, 1, 2], anchor_scales=[8, 16, 32], feat_stride=16, initialW=None, proposal_creator_params={})¶

Region Proposal Network introduced in Faster R-CNN.

This is the Region Proposal Network introduced in Faster R-CNN [5]. This takes features extracted from images and proposes class agnostic bounding boxes around "objects".
[5] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Parameters:
- in_channels (int) – The channel size of the input.
- mid_channels (int) – The channel size of the intermediate tensor.
- ratios (list of floats) – Ratios of width to height of the anchors.
- anchor_scales (list of numbers) – Areas of the anchors. Those areas will be the product of the square of an element in anchor_scales and the original area of the reference window.
- feat_stride (int) – Stride size after extracting features from an image.
- initialW (callable) – Initial weight value. If None, this function uses a Gaussian distribution scaled by 0.1 to initialize weights. May also be a callable that takes an array and edits its values.
- proposal_creator_params (dict) – Key valued parameters for chainercv.links.model.faster_rcnn.ProposalCreator.

__call__(x, img_size, scale=1.0)¶

Forward Region Proposal Network.

Here are the notations.

- \(N\) is the batch size.
- \(C\) is the channel size of the input.
- \(H\) and \(W\) are the height and width of the input feature.
- \(A\) is the number of anchors assigned to each pixel.

Parameters:
- x (Variable) – The features extracted from images. Its shape is \((N, C, H, W)\).
- img_size (tuple of ints) – A tuple height, width, which contains the image size after scaling.
- scale (float) – The amount of scaling done to the input images after reading them from files.
Returns: A tuple of the five following values.
- rpn_locs: Predicted bounding box offsets and scales for anchors. Its shape is \((N, H W A, 4)\).
- rpn_scores: Predicted foreground scores for anchors. Its shape is \((N, H W A, 2)\).
- rois: A bounding box array containing coordinates of proposal boxes. This is a concatenation of bounding box arrays from multiple images in the batch. Its shape is \((R', 4)\). Given \(R_i\) predicted bounding boxes from the \(i\)-th image, \(R' = \sum_{i=1}^{N} R_i\).
- roi_indices: An array containing indices of the images to which the RoIs correspond. Its shape is \((R',)\).
- anchor: Coordinates of enumerated shifted anchors. Its shape is \((H W A, 4)\).

Return type: tuple
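A forward pass on a dummy feature map is sketched below; the 600x800 input image and the VGG16-like stride of 16 are assumptions.

```python
# Sketch: forward an RPN over a dummy 37x50 feature map.
import numpy as np
from chainercv.links.model.faster_rcnn import RegionProposalNetwork

rpn = RegionProposalNetwork(in_channels=512, mid_channels=512, feat_stride=16)
x = np.zeros((1, 512, 37, 50), dtype=np.float32)       # (N, C, H, W)

rpn_locs, rpn_scores, rois, roi_indices, anchor = rpn(
    x, img_size=(600, 800), scale=1.0)
print(rpn_locs.shape, rpn_scores.shape, rois.shape)    # (1, HWA, 4), (1, HWA, 2), (R', 4)
```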
VGG16RoIHead¶

class chainercv.links.model.faster_rcnn.VGG16RoIHead(n_class, roi_size, spatial_scale, vgg_initialW=None, loc_initialW=None, score_initialW=None)¶

Faster R-CNN Head for the VGG16 based implementation.

This class is used as a head for Faster R-CNN. This outputs class-wise localizations and classification based on feature maps in the given RoIs.
Parameters:
- n_class (int) – The number of classes, possibly including the background.
- roi_size (int) – Height and width of the feature maps after RoI-pooling.
- spatial_scale (float) – Scale by which the RoIs are resized to match the feature map.
- vgg_initialW (callable) – Initializer for the layers corresponding to the VGG16 layers.
- loc_initialW (callable) – Initializer for the localization head.
- score_initialW (callable) – Initializer for the score head.
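For reference, a construction sketch is shown below; n_class=21 (20 VOC foreground classes plus background), roi_size=7 and spatial_scale=1/16 mirror the usual VGG16 settings and are assumptions here.

```python
# Sketch: construct the VGG16 RoI head with commonly used settings.
from chainercv.links.model.faster_rcnn import VGG16RoIHead

head = VGG16RoIHead(n_class=21, roi_size=7, spatial_scale=1. / 16.)
```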
Train-only Utility¶
AnchorTargetCreator¶

class chainercv.links.model.faster_rcnn.AnchorTargetCreator(n_sample=256, pos_iou_thresh=0.7, neg_iou_thresh=0.3, pos_ratio=0.5)¶

Assign the ground truth bounding boxes to anchors.

Assigns the ground truth bounding boxes to anchors for training Region Proposal Networks introduced in Faster R-CNN [6].

Offsets and scales to match anchors to the ground truth are calculated using the encoding scheme of chainercv.links.model.faster_rcnn.bbox2loc.

[6] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Parameters:
- n_sample (int) – The number of regions to produce.
- pos_iou_thresh (float) – Anchors with IoU above this threshold will be assigned as positive.
- neg_iou_thresh (float) – Anchors with IoU below this threshold will be assigned as negative.
- pos_ratio (float) – Ratio of positive regions in the sampled regions.

__call__(bbox, anchor, img_size)¶

Assign ground truth supervision to a sampled subset of anchors.

Types of the input arrays and the output arrays are the same.

Here are the notations.

- \(S\) is the number of anchors.
- \(R\) is the number of bounding boxes.

Parameters:
- bbox (array) – Coordinates of the ground truth bounding boxes. Its shape is \((R, 4)\).
- anchor (array) – Coordinates of anchors. Its shape is \((S, 4)\).
- img_size (tuple of ints) – A tuple height, width, which contains the image size.

Returns:
- loc: Offsets and scales to match the anchors to the ground truth bounding boxes. Its shape is \((S, 4)\).
- label: Labels of anchors with values (1=positive, 0=negative, -1=ignore). Its shape is \((S,)\).

Return type: (array, array)
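A sketch of the calling convention with one ground truth box and arbitrary anchors; all array values are made up for illustration.

```python
# Sketch: assign RPN training targets to arbitrary anchors.
import numpy as np
from chainercv.links.model.faster_rcnn import AnchorTargetCreator

bbox = np.array([[10., 10., 200., 300.]], dtype=np.float32)   # one ground truth box

S = 3000
centers = np.random.rand(S, 2) * np.array([600., 800.])
sizes = 16. + np.random.rand(S, 2) * 200.
anchor = np.concatenate(
    (centers - sizes / 2., centers + sizes / 2.), axis=1).astype(np.float32)

anchor_target_creator = AnchorTargetCreator()
loc, label = anchor_target_creator(bbox, anchor, (600, 800))
print(loc.shape, label.shape)    # (S, 4), (S,)
```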
FasterRCNNTrainChain¶

class chainercv.links.model.faster_rcnn.FasterRCNNTrainChain(faster_rcnn, rpn_sigma=3.0, roi_sigma=1.0, anchor_target_creator=<chainercv.links.model.faster_rcnn.utils.anchor_target_creator.AnchorTargetCreator object>, proposal_target_creator=<chainercv.links.model.faster_rcnn.utils.proposal_target_creator.ProposalTargetCreator object>)¶

Calculate losses for Faster R-CNN and report them.

This is used to train Faster R-CNN in the joint training scheme [7].

The losses include:

- rpn_loc_loss: The localization loss for the Region Proposal Network (RPN).
- rpn_cls_loss: The classification loss for the RPN.
- roi_loc_loss: The localization loss for the head module.
- roi_cls_loss: The classification loss for the head module.
[7] (1, 2, 3) Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Parameters:
- faster_rcnn (FasterRCNN) – A Faster R-CNN model that is going to be trained.
- rpn_sigma (float) – Sigma parameter for the localization loss of the Region Proposal Network (RPN). The default value is 3, which is the value used in [7].
- roi_sigma (float) – Sigma parameter for the localization loss of the head. The default value is 1, which is the value used in [7].
- anchor_target_creator – An instantiation of chainercv.links.model.faster_rcnn.AnchorTargetCreator.
- proposal_target_creator – An instantiation of chainercv.links.model.faster_rcnn.ProposalTargetCreator.

__call__(imgs, bboxes, labels, scale)¶

Forward Faster R-CNN and calculate losses.

Here are the notations used.

- \(N\) is the batch size.
- \(R\) is the number of bounding boxes per image.

Currently, only \(N=1\) is supported.

Parameters:
- imgs (Variable) – A variable with a batch of images.
- bboxes (Variable) – A batch of bounding boxes. Its shape is \((N, R, 4)\).
- labels (Variable) – A batch of labels. Its shape is \((N, R)\). The background is excluded from the definition, which means that the range of the values is \([0, L - 1]\). \(L\) is the number of foreground classes.
- scale (float or Variable) – Amount of scaling applied to the raw image during preprocessing.

Returns: Scalar loss variable. This is the sum of the losses for the Region Proposal Network and the head module.

Return type: chainer.Variable
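A single training step in the joint scheme might look like the sketch below; the dummy image, the single ground truth box and the ImageNet-initialized extractor are assumptions, and the dataset/transform pipeline is omitted.

```python
# Sketch of one (dummy) training iteration with FasterRCNNTrainChain.
import numpy as np
import chainer
from chainercv.links import FasterRCNNVGG16
from chainercv.links.model.faster_rcnn import FasterRCNNTrainChain

faster_rcnn = FasterRCNNVGG16(n_fg_class=20, pretrained_model='imagenet')
train_chain = FasterRCNNTrainChain(faster_rcnn)

optimizer = chainer.optimizers.MomentumSGD(lr=1e-3, momentum=0.9)
optimizer.setup(train_chain)

# Dummy batch; N = 1 is the only supported batch size.
imgs = np.zeros((1, 3, 600, 800), dtype=np.float32)
bboxes = np.array([[[10., 10., 200., 300.]]], dtype=np.float32)   # (N, R, 4)
labels = np.array([[0]], dtype=np.int32)                          # (N, R)
scale = np.array(1.0, dtype=np.float32)   # scaling applied during preprocessing

loss = train_chain(imgs, bboxes, labels, scale)
train_chain.cleargrads()
loss.backward()
optimizer.update()
```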
ProposalTargetCreator¶

class chainercv.links.model.faster_rcnn.ProposalTargetCreator(n_sample=128, pos_ratio=0.25, pos_iou_thresh=0.5, neg_iou_thresh_hi=0.5, neg_iou_thresh_lo=0.0)¶

Assign ground truth bounding boxes to given RoIs.

The __call__() of this class generates training targets for each object proposal. This is used to train Faster R-CNN [8].

[8] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015.

Parameters:
- n_sample (int) – The number of sampled regions.
- pos_ratio (float) – Fraction of regions that are labeled as foreground.
- pos_iou_thresh (float) – IoU threshold for a RoI to be considered as foreground.
- neg_iou_thresh_hi (float) – A RoI is considered to be background if its IoU is in [neg_iou_thresh_lo, neg_iou_thresh_hi).
- neg_iou_thresh_lo (float) – See above.

__call__(roi, bbox, label, loc_normalize_mean=(0.0, 0.0, 0.0, 0.0), loc_normalize_std=(0.1, 0.1, 0.2, 0.2))¶

Assigns ground truth to sampled proposals.

This function samples a total of self.n_sample RoIs from the combination of roi and bbox. The RoIs are assigned with the ground truth class labels as well as bounding box offsets and scales to match the ground truth bounding boxes. As many as pos_ratio * self.n_sample RoIs are sampled as foregrounds.

Offsets and scales of bounding boxes are calculated using chainercv.links.model.faster_rcnn.bbox2loc(). Also, types of the input arrays and the output arrays are the same.

Here are the notations.

- \(S\) is the total number of sampled RoIs, which equals self.n_sample.
- \(L\) is the number of object classes possibly including the background.
Parameters:
- roi (array) – Region of Interests (RoIs) from which we sample. Its shape is \((R, 4)\).
- bbox (array) – The coordinates of ground truth bounding boxes. Its shape is \((R', 4)\).
- label (array) – Ground truth bounding box labels. Its shape is \((R',)\). Its range is \([0, L - 1]\), where \(L\) is the number of foreground classes.
- loc_normalize_mean (tuple of four floats) – Mean values to normalize the coordinates of bounding boxes.
- loc_normalize_std (tuple of four floats) – Standard deviations of the coordinates of bounding boxes.

Returns:
- sample_roi: Regions of interests that are sampled. Its shape is \((S, 4)\).
- gt_roi_loc: Offsets and scales to match the sampled RoIs to the ground truth bounding boxes. Its shape is \((S, 4)\).
- gt_roi_label: Labels assigned to sampled RoIs. Its shape is \((S,)\). Its range is \([0, L]\). The label with value 0 is the background.

Return type: (array, array, array)
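The sketch below calls the creator on randomly generated RoIs and a single ground truth box; all values are arbitrary and only illustrate the shapes involved.

```python
# Sketch: sample training targets for the head from arbitrary RoIs.
import numpy as np
from chainercv.links.model.faster_rcnn import ProposalTargetCreator

raw = np.random.rand(600, 4).astype(np.float32) * 300.
roi = np.concatenate((np.minimum(raw[:, :2], raw[:, 2:]),
                      np.maximum(raw[:, :2], raw[:, 2:]) + 1.), axis=1)

bbox = np.array([[10., 10., 200., 300.]], dtype=np.float32)   # one ground truth box
label = np.array([5], dtype=np.int32)                         # its class, in [0, L - 1]

proposal_target_creator = ProposalTargetCreator()
sample_roi, gt_roi_loc, gt_roi_label = proposal_target_creator(roi, bbox, label)
print(sample_roi.shape, gt_roi_loc.shape, gt_roi_label.shape)  # (S, 4), (S, 4), (S,)
```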