Sparse R-CNN: End-to-End Object Detection with Learnable Proposals
We present Sparse R-CNN, a purely sparse method for object detection in images. Existing works on object detection heavily rely on dense object candidates, such as anchor boxes pre-defined on all grids of an image feature map of size H×W. In our method, however, a fixed sparse set of learned object proposals, N in total, is provided to the object recognition head to perform classification and localization. By reducing H·W·k (up to hundreds of thousands) hand-designed object candidates to N (e.g., 100) learnable proposals, Sparse R-CNN completely avoids all effort related to object candidate design and many-to-one label assignment. More importantly, final predictions are output directly without the non-maximum suppression post-procedure. Sparse R-CNN demonstrates accuracy, run-time and training convergence performance on par with well-established detector baselines on the challenging COCO dataset, e.g., achieving 44.5 AP with the standard 3× training schedule and running at 22 fps using a ResNet-50 FPN model. We hope our work could inspire re-thinking the convention of the dense prior in object detectors.
The code is available at: https://github.com/PeizeSun/SparseR-CNN.
Object detection aims at localizing a set of objects and recognizing their categories in an image.
Dense priors have always been the cornerstone of success in detectors.
In classic computer vision, the sliding-window paradigm, in which a classifier is applied on a dense image grid, was the leading detection method for decades [6, 9, 38].
Modern mainstream one-stage detectors pre-define marks on a dense feature map grid, such as anchor boxes [23, 29], shown in Figure 1a, or reference points [34, 44], and predict the relative scaling and offsets to object bounding boxes, as well as the corresponding categories. Although two-stage pipelines work on a sparse set of proposal boxes, their proposal generation algorithms are still built on dense candidates [10, 30], shown in Figure 1b.
These well-established methods are conceptually intuitive and offer robust performance [8, 24], together with fast training and inference time. Besides their great success, it is important to note that dense-prior detectors suffer from some limitations:
1) Such pipelines usually produce redundant and near-duplicate results, making non-maximum suppression (NMS) [1, 39] post-processing a necessary component. 2) The many-to-one label assignment problem [2, 42, 43] in training makes the network sensitive to heuristic assignment rules.
3) The final performance is largely affected by sizes, aspect ratios and number of anchor boxes [23, 29], density of reference points [19, 34, 44] and proposal generation algorithm [10, 30].
Although the dense convention is widely recognized among object detectors, a natural question to ask is: Is it possible to design a sparse detector? Recently, DETR proposed to reformulate object detection as a direct and sparse set prediction problem, whose input is merely 100 learned object queries.
The final set of predictions is output directly without any hand-designed post-processing. Despite its simple and effective framework, DETR requires each object query to interact with the global image context. This dense property not only slows down its training convergence, but also prevents it from establishing a thoroughly sparse pipeline for object detection.
We believe the sparse property should be in two aspects: sparse boxes and sparse features.
Sparse boxes mean that a small number of starting boxes (e.g., 100) is enough to predict all objects in an image.
Sparse features indicate that the feature of each box does not need to interact with all other features over the full image.
From this perspective, DETR is not a purely sparse method, since each object query must interact with dense features over the full image.
In this paper, we propose Sparse R-CNN, a purely sparse method, with neither object positional candidates enumerated on all (dense) image grids nor object queries interacting with global (dense) image features.
As shown in Figure 1d, object candidates are given as a fixed small set of learnable bounding boxes represented by 4-d coordinates. For the COCO dataset, for example, 100 boxes with 400 parameters in total are needed, rather than the boxes predicted from hundreds of thousands of candidates by a Region Proposal Network (RPN).
These sparse candidates are used as proposal boxes to extract Region of Interest (RoI) features by RoIPool or RoIAlign.
The learnable proposal boxes are the statistics of potential object locations in the image. However, a 4-d coordinate is merely a rough representation of an object and lacks informative details such as pose and shape.
Here we introduce another concept termed proposal feature, a high-dimensional (e.g., 256) latent vector. Compared with the rough bounding box, it is expected to encode rich instance characteristics. Specifically, the proposal feature generates a series of customized parameters for its exclusive object recognition head. We call this operation the Dynamic Instance Interactive Head, since it shares similarities with recent dynamic schemes [18, 35].
Compared to the shared 2-fc layers used in previous work, our head is more flexible and holds a significant lead in accuracy. We show in our experiments that formulating the head conditioned on a unique proposal feature, instead of fixed parameters, is actually the key to Sparse R-CNN's success.
Both proposal boxes and proposal features are randomly initialized and optimized together with other parameters in the whole network.
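As a minimal NumPy sketch (not the released implementation), the two learnable inputs can be represented as plain parameter arrays; N = 100 and d = 256 follow the paper's defaults, while the initialization distributions here are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

N, d = 100, 256  # number of proposals, proposal-feature dimension

# Learnable proposal boxes: normalized (cx, cy, h, w) in [0, 1].
# In a real framework these would be nn.Parameter-style tensors,
# updated by back-propagation together with the rest of the network.
proposal_boxes = rng.uniform(0.0, 1.0, size=(N, 4))

# Learnable proposal features: high-dimensional latent vectors that
# encode instance characteristics beyond the 4-d box.
proposal_features = rng.normal(0.0, 0.02, size=(N, d))

print(proposal_boxes.shape, proposal_features.shape)  # (100, 4) (100, 256)
```

Both arrays are image-independent: the same learned values are used for every input image at inference time.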
The most remarkable property of our Sparse R-CNN is its sparse-in, sparse-out paradigm throughout. The initial input is a sparse set of proposal boxes and proposal features, together with one-to-one dynamic instance interaction. Neither dense candidates [23, 30] nor interaction with global (dense) features exists in the pipeline.
This pure sparsity makes Sparse R-CNN a brand-new member of the R-CNN family.
Sparse R-CNN demonstrates accuracy, run-time and training convergence performance on par with well-established detectors [2, 30, 34] on the challenging COCO dataset, e.g., achieving 44.5 AP with the standard 3× training schedule and running at 22 fps using a ResNet-50 FPN model.
To the best of our knowledge, the proposed Sparse R-CNN is the first work to demonstrate that a considerably sparse design can achieve competitive performance. We hope our work could inspire re-thinking the necessity of the dense prior in object detection and exploring the next generation of object detectors.
2 Related Work
Dense method. The sliding-window paradigm has been popular in object detection for many years.
Limited by classical feature extraction techniques [6, 38], performance plateaued for decades and application scenarios were limited. The development of deep convolutional neural networks (CNNs) [14, 17, 20] has driven general object detection to significant improvements in performance [8, 24].
One of the mainstream pipelines is the one-stage detector, which directly predicts, in a single-shot way, the category and location of anchor boxes densely covering spatial positions, scales and aspect ratios, such as OverFeat, YOLO, SSD and RetinaNet. Recently, anchor-free algorithms [16, 21, 34, 44] have been proposed to make this pipeline much simpler by replacing hand-crafted anchor boxes with reference points. All of the above methods are built on dense candidates, and each candidate is directly classified and regressed. These candidates are assigned to ground-truth object boxes at training time based on a pre-defined principle, e.g., whether the anchor's intersection-over-union (IoU) with its corresponding ground truth exceeds a threshold, or whether the reference point falls inside one of the object boxes. Moreover, NMS post-processing [1, 39] is needed to remove redundant predictions at inference time.
Dense-to-sparse method. The two-stage detector is another mainstream pipeline and has dominated modern object detection for years [2, 4, 11, 10, 13, 30].
This paradigm can be viewed as an extension of the dense detector. It first obtains a sparse set of foreground proposal boxes from dense region candidates, and then refines the location of each proposal and predicts its specific category. The region proposal algorithm plays an important role in the first stage of these two-stage methods, such as Selective Search in R-CNN and the Region Proposal Network (RPN) in Faster R-CNN. Similar to the dense pipeline, it also needs NMS post-processing and hand-crafted label assignment. Only a few foreground proposals survive from hundreds of thousands of candidates, so these detectors can be characterized as dense-to-sparse methods.
Recently, DETR was proposed to directly output predictions without any hand-crafted components, achieving very competitive performance. DETR utilizes a sparse set of object queries that interact with the global (dense) image feature; in this view, it can be seen as another dense-to-sparse formulation.
Sparse method. Sparse object detection has the potential to eliminate the effort of designing dense candidates, but has trailed the accuracy of the above detectors. G-CNN can be viewed as a precursor to this group of algorithms. It starts with a multi-scale regular grid over the image and iteratively updates the boxes to cover and classify objects. This hand-designed regular prior is obviously sub-optimal and fails to achieve top performance. Instead, our Sparse R-CNN applies learnable proposals and achieves better performance. Concurrently, Deformable-DETR was introduced to restrict each object query to attend to a small set of key sampling points around the reference points, instead of all points in the feature map.
We hope sparse methods can serve as a solid baseline and help ease future research in the object detection community.
3 Sparse R-CNN
The central idea of the Sparse R-CNN framework is to replace hundreds of thousands of candidates from a Region Proposal Network (RPN) with a small set of proposal boxes (e.g., 100).
In this section, we first briefly introduce the overall architecture of the proposed method. Then we describe each component in detail.
Sparse R-CNN is a simple, unified network composed of a backbone network, a dynamic instance interactive head and two task-specific prediction layers (Figure 2).
There are three inputs in total: an image, a set of proposal boxes and proposal features. The latter two are learnable and optimized together with other parameters in the network.
A Feature Pyramid Network (FPN) based on the ResNet architecture [14, 22] is adopted as the backbone network to produce multi-scale feature maps from the input image.
Following [22], we construct the pyramid with levels P2 through P5, where l indicates the pyramid level and Pl has resolution 2^l lower than the input. All pyramid levels have C = 256 channels. Please refer to [22] for more details. In fact, Sparse R-CNN has the potential to benefit from more complex designs to further improve its performance, such as stacked encoder layers and deformable convolution networks, on which the recent Deformable-DETR is built. However, we align the setting with Faster R-CNN to show the simplicity and effectiveness of our method.
Learnable proposal box.
A fixed small set of learnable proposal boxes (N×4) is used as region proposals, instead of predictions from a Region Proposal Network (RPN).
These proposal boxes are represented by 4-d parameters ranging from 0 to 1, denoting normalized center coordinates, height and width.
The parameters of the proposal boxes are updated with the back-propagation algorithm during training.
Thanks to this learnable property, we find in our experiments that the effect of initialization is minimal, making the framework much more flexible.
Conceptually, these learned proposal boxes are the statistics of potential object locations in the training set and can be seen as an initial guess of the regions most likely to contain objects, regardless of the input.
In contrast, the proposals from RPN are strongly correlated with the current image and provide coarse object locations. We argue that such first-stage localization is a luxury in the presence of later stages that refine box locations; a reasonable statistic can already serve as qualified candidates.
In this view, Sparse R-CNN can be categorized as an extension of the object detector paradigm from thoroughly dense [23, 25, 28, 34] to dense-to-sparse [2, 4, 10, 30] to thoroughly sparse, as shown in Figure 1.
Learnable proposal feature.
Though the 4-d proposal box is a concise and explicit expression for describing objects, it provides only a coarse localization, and many informative details, such as object pose and shape, are lost.
Here we introduce another concept termed proposal feature (N×d), a high-dimensional (e.g., 256) latent vector expected to encode rich instance characteristics. The number of proposal features is the same as that of proposal boxes, and we discuss how they are used next.
Dynamic instance interactive head.
Given N proposal boxes, Sparse R-CNN first utilizes the RoIAlign operation to extract a feature for each box. Each box feature is then used to generate the final predictions by our prediction head.
Figure 3 illustrates the prediction head, termed the Dynamic Instance Interactive Module, motivated by recent dynamic algorithms [18, 35].
Each RoI feature is fed into its own exclusive head for object localization and classification, where each head is conditioned on a specific proposal feature. In our design, proposal features and proposal boxes are in one-to-one correspondence: for N proposal boxes, N proposal features are employed. Each RoI feature f_i (S×S, C) interacts with the corresponding proposal feature p_i (C) to filter out ineffective bins and output the final object feature (C).
The final regression prediction is computed by a 3-layer perceptron with ReLU activation and hidden dimension C, and the classification prediction by a linear projection layer.
For a light design, we carry out two consecutive 1×1 convolutions with ReLU activation to implement the interaction process: each proposal feature is convolved with its RoI feature to obtain a more discriminative feature. For more details, please refer to our code.
We note that the implementation details of the interactive head are not crucial, as long as parallel operation is supported for efficiency.
Our proposal feature can be seen as an implementation of an attention mechanism, attending to the bins within an RoI of size S×S.
The proposal feature generates the kernel parameters of the convolutions, and the RoI feature is then processed by the generated convolutions to obtain the final feature.
In this way, the bins with the most foreground information take effect on the final object localization and classification.
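The interaction above can be sketched in NumPy under the assumption that the proposal feature generates the weights of two consecutive 1×1 convolutions (toy dimensions; `W_gen` and `W_out` stand in for learned layers, and details of the released code may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)

S, d, d_h = 7, 16, 4        # RoI size, feature dim, hidden dim (toy sizes)

# Static weights (learned in a real network, random stand-ins here):
W_gen = rng.normal(0, 0.1, size=(d, 2 * d * d_h))   # proposal feature -> conv params
W_out = rng.normal(0, 0.1, size=(S * S * d, d))     # flattened bins -> object feature

def dynamic_interaction(roi_feat, p):
    """One proposal's dynamic instance interaction (a sketch): the proposal
    feature p generates the parameters of two consecutive 1x1 convolutions
    that filter the RoI feature bin by bin."""
    params = p @ W_gen                               # (2*d*d_h,)
    A = params[: d * d_h].reshape(d, d_h)            # first 1x1 conv
    B = params[d * d_h :].reshape(d_h, d)            # second 1x1 conv
    x = relu(roi_feat.reshape(S * S, d) @ A)         # (S*S, d_h)
    x = relu(x @ B)                                  # (S*S, d)
    return x.reshape(-1) @ W_out                     # object feature (d,)

roi = rng.normal(size=(S, S, d))                     # RoIAlign output for one box
p = rng.normal(size=(d,))                            # its exclusive proposal feature
obj_feat = dynamic_interaction(roi, p)
print(obj_feat.shape)                                # (16,)
```

Because each proposal carries its own generated parameters, the N interactions are independent and can run in parallel as one batched operation.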
We also adopt an iterative structure to further improve performance.
The newly generated object boxes and object features serve as the proposal boxes and proposal features of the next stage in the iterative process.
Thanks to the sparse property and the light dynamic head, this introduces only a marginal computation overhead.
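The iterative process can be sketched as follows, where `head` is a hypothetical stub for one stage (a real stage runs RoIAlign plus the dynamic interaction, then predicts box deltas and refined features):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 16
num_stages = 6

def head(boxes, feats):
    """Stub for one dynamic head stage (placeholder refinement only)."""
    deltas = 0.01 * rng.normal(size=boxes.shape)        # pretend box refinement
    new_feats = feats + 0.01 * rng.normal(size=feats.shape)
    return np.clip(boxes + deltas, 0.0, 1.0), new_feats

boxes = rng.uniform(size=(N, 4))     # learnable proposal boxes
feats = rng.normal(size=(N, d))      # learnable proposal features

# Outputs of stage k become the proposal boxes/features of stage k+1.
for _ in range(num_stages):
    boxes, feats = head(boxes, feats)

print(boxes.shape, feats.shape)      # final predictions come from the last stage
```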
A self-attention module is embedded into the dynamic head to reason about relations between objects. We note that Relation Network also utilizes an attention module; however, it demands geometry attributes and a complex rank feature in addition to the object feature, whereas our module is much simpler and only takes the object feature as input.
The object query proposed in DETR shares a similar design with our proposal feature.
However, an object query is a learned positional encoding: the feature map must be augmented with a spatial positional encoding when interacting with object queries, otherwise performance drops significantly.
Our proposal feature is irrelevant to position, and we demonstrate that our framework works well without any positional encoding. We provide further comparisons in the experimental section.
| Method | Feature | Epochs | AP | AP50 | AP75 | APs | APm | APl | FPS |
| Faster R-CNN-R50 | FPN | 36 | 40.2 | 61.0 | 43.8 | 24.2 | 43.5 | 52.0 | 26 |
| Faster R-CNN-R101 | FPN | 36 | 42.0 | 62.5 | 45.9 | 25.2 | 45.6 | 54.6 | 20 |
| Cascade R-CNN-R50 | FPN | 36 | 44.3 | 62.2 | 48.0 | 26.6 | 47.7 | 57.7 | 19 |
| Deformable DETR-R50 | DeformEncoder | 50 | 43.8 | 62.6 | 47.7 | 26.4 | 47.1 | 58.0 | 19 |
Here “*” indicates that the model uses 300 learnable proposal boxes and random-crop training augmentation, similar to Deformable DETR.
Run time is evaluated on an NVIDIA Tesla V100 GPU.
Set prediction loss.
Sparse R-CNN applies set prediction loss [3, 33, 41] on the fixed-size set of predictions of classification and box coordinates. Set-based loss produces an optimal bipartite matching between predictions and ground truth objects.
The matching cost is defined as follows:

L = λ_cls · L_cls + λ_L1 · L_L1 + λ_giou · L_giou

where L_cls is the focal loss between predicted classifications and ground-truth category labels, L_L1 and L_giou are the L1 loss and generalized IoU loss between the normalized center coordinates, height and width of the predicted and ground-truth boxes, respectively, and λ_cls, λ_L1 and λ_giou are the coefficients of each component.
The training loss is the same as the matching cost, except that it is only computed on matched pairs. The final loss is the sum over all pairs, normalized by the number of objects in the training batch.
R-CNN families [2, 43] have always been puzzled by the label assignment problem, since many-to-one matching remains. Here we provide a new possibility: directly bypassing many-to-one matching and introducing one-to-one matching with a set-based loss. This is an attempt toward exploring end-to-end object detection.
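A toy illustration of the one-to-one matching (brute force over permutations stands in for the Hungarian algorithm used in practice, and a plain L1 box cost stands in for the full focal/L1/GIoU matching cost):

```python
import itertools
import numpy as np

# Toy setup: 3 predictions, 2 ground-truth objects, boxes as
# normalized (cx, cy, h, w).
pred_boxes = np.array([[0.5, 0.5, 0.2, 0.2],
                       [0.1, 0.1, 0.1, 0.1],
                       [0.8, 0.8, 0.3, 0.3]])
gt_boxes = np.array([[0.48, 0.52, 0.2, 0.2],
                     [0.82, 0.79, 0.3, 0.3]])

# Pairwise cost matrix (L1 distance as a simplified stand-in):
cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)  # (3, 2)

# Optimal one-to-one assignment: each ground truth gets exactly one
# prediction; unmatched predictions are treated as "no object".
n_pred, n_gt = cost.shape
best = min(itertools.permutations(range(n_pred), n_gt),
           key=lambda perm: sum(cost[p, g] for g, p in enumerate(perm)))
print(best)  # -> (0, 2): prediction 0 matches gt 0, prediction 2 matches gt 1
```

The training loss is then computed only on these matched pairs, which is what removes the need for heuristic many-to-one assignment rules.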
4 Experiments

Our experiments are conducted on the challenging MS COCO benchmark using the standard metrics for object detection.
All models are trained on the COCO train2017 split (118k images) and evaluated with val2017 (5k images).
| Cascade | Feature reuse | Dynamic head | AP | AP50 | AP75 | APs | APm | APl |
| ✓ | ✓ | | 32.2 (+13.7) | 47.5 (+12.5) | 34.4 (+16.7) | 18.2 (+9.9) | 35.2 (+13.5) | 41.7 (+15.3) |
| ✓ | ✓ | ✓ | 42.3 (+10.1) | 61.2 (+13.7) | 45.7 (+11.3) | 26.7 (+8.5) | 44.6 (+9.4) | 57.6 (+15.9) |
ResNet-50 is used as the backbone network unless otherwise specified. The optimizer is AdamW with weight decay 0.0001.
The mini-batch is 16 images and all models are trained with 8 GPUs.
The default training schedule is 36 epochs; the initial learning rate is set to 2.5 × 10^-5 and divided by 10 at epochs 27 and 33, respectively.
The backbone is initialized with weights pre-trained on ImageNet, and other newly added layers are initialized with Xavier initialization.
Data augmentation includes random horizontal flip and scale jitter, resizing the input images such that the shortest side is at least 480 and at most 800 pixels while the longest side is at most 1333. Following [3, 45], λ_cls = 2, λ_L1 = 5 and λ_giou = 2. The default numbers of proposal boxes, proposal features and iteration stages are 100, 100 and 6, respectively.
The inference process is quite simple in Sparse R-CNN.
Given an input image,
Sparse R-CNN directly predicts 100 bounding boxes associated with their scores.
The scores indicate the probability of boxes containing an object.
For evaluation, we directly use these 100 boxes without any post-processing.
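The inference procedure can be sketched as follows (hypothetical stand-in outputs; the per-class sigmoid scoring is an assumption borrowed from focal-loss-style classifiers, not a quote of the released code):

```python
import numpy as np

rng = np.random.default_rng(0)
N, num_classes = 100, 80

# Raw head outputs for one image (random stand-ins):
logits = rng.normal(size=(N, num_classes))
boxes = rng.uniform(size=(N, 4))

# Per-box class probabilities. No NMS and no score thresholding are
# applied: all N boxes with their scores are used directly.
scores = 1.0 / (1.0 + np.exp(-logits))
best_class = scores.argmax(axis=1)
best_score = scores.max(axis=1)

# Sort detections by confidence for COCO-style evaluation.
order = np.argsort(-best_score)
detections = [(boxes[i], int(best_class[i]), float(best_score[i]))
              for i in order]
print(len(detections))  # 100 detections, output directly
```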
4.1 Main Result
We provide two versions of Sparse R-CNN for fair comparison with different detectors.
The first adopts 100 learnable proposal boxes without random-crop data augmentation, and is used for comparison with mainstream object detectors, e.g., Faster R-CNN and RetinaNet.
The second leverages 300 learnable proposal boxes with random-crop data augmentation, and is used for comparison with DETR-series models [3, 45].
As shown in Table 1,
Sparse R-CNN outperforms well-established mainstream detectors, such as RetinaNet and Faster R-CNN, by a large margin.
Surprisingly, Sparse R-CNN based on ResNet-50 achieves 42.3 AP, already competitive with Faster R-CNN on ResNet-101 in accuracy.
We note that DETR and Deformable DETR usually employ stronger feature extraction methods, such as stacked encoder layers and deformable convolutions.
The stronger implementation of Sparse R-CNN is used for a fairer comparison with these detectors.
Sparse R-CNN exhibits higher accuracy even with simple FPN as the feature extraction method.
Moreover, Sparse R-CNN achieves much better detection performance on small objects than DETR (26.9 AP vs. 22.5 AP).
The training convergence speed of Sparse R-CNN is faster than that of DETR, as shown in Figure 1. Since it was proposed, DETR has suffered from slow convergence, which motivated Deformable DETR. Compared with Deformable DETR, Sparse R-CNN exhibits better accuracy (44.5 AP vs. 43.8 AP) and shorter running time (22 FPS vs. 19 FPS), with a shorter training schedule (36 epochs vs. 50 epochs).
4.2 Module Analysis
In this section, we analyze each component of Sparse R-CNN. All models are based on the ResNet-50 FPN backbone with 100 proposals and the 3× training schedule, unless otherwise noted.
Learnable proposal box.
Starting from Faster R-CNN, we naively replace RPN with a sparse set of learnable proposal boxes.
The performance drops from 40.2 AP (Table 1, line 3) to 18.5 AP (Table 2).
We find no noticeable improvement even when more fully-connected layers are stacked.
Iteratively updating the boxes is an intuitive idea to improve performance.
However, we find that a simple cascade architecture does not make a big difference, as shown in Table 3.
We attribute this to the fact that, compared with the refined proposal boxes in cascade detectors, which mainly locate around the objects, the candidates in our case are much coarser, making optimization hard. We observe that the target object for one proposal box is usually consistent throughout the iterative process.
Therefore, the object feature from the previous stage can be reused as a strong cue for the next stage, since it encodes rich information such as object pose and location.
This minor change of feature reuse yields a large gain of 11.7 AP on top of the original cascade architecture.
Finally, the iterative architecture brings a 13.7 AP improvement, as shown in the second row of Table 2.
The dynamic head uses the object feature of the previous stage in a different way from the iterative architecture discussed above.
Instead of simply concatenating, the object feature of the previous stage is first processed by a self-attention module, and then used as the proposal feature to implement the instance interaction of the current stage.
The self-attention module is applied to the set of object features to reason about the relations between objects.
Table 4 shows the benefit of both self-attention and dynamic instance interaction. Finally, Sparse R-CNN achieves 42.3 AP.
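The reasoning step can be sketched as single-head self-attention over the N object features (the paper embeds a standard multi-head module; this toy NumPy version, with stand-in weights, only shows the mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 16     # toy sizes; the paper uses d = 256

# Stand-in projection weights (learned in a real network):
Wq = rng.normal(0, 0.1, size=(d, d))
Wk = rng.normal(0, 0.1, size=(d, d))
Wv = rng.normal(0, 0.1, size=(d, d))

def self_attention(feats):
    """Single-head self-attention over the set of N object features,
    letting each object attend to all the others."""
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    logits = Q @ K.T / np.sqrt(d)                     # (N, N) pairwise relations
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)           # row-wise softmax
    return attn @ V                                   # (N, d)

obj_feats = rng.normal(size=(N, d))
out = self_attention(obj_feats)
print(out.shape)  # (100, 16)
```

Each output row mixes information from all N objects, which is what allows duplicate predictions to suppress each other without NMS.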
Initialization of proposal boxes.
Dense detectors always heavily depend on the design of object candidates, whereas object candidates in Sparse R-CNN are learnable, so all efforts related to designing hand-crafted anchors are avoided.
However, one may be concerned that the initialization of proposal boxes plays a key role in Sparse R-CNN. Here we study the effect of different methods for initializing proposal boxes:
“Center” means all proposal boxes are located at the center of the image at the beginning, with height and width set to 0.1 of the image size.
“Image” means all proposal boxes are initialized to the whole image size.
“Grid” means proposal boxes are initialized as a regular grid over the image, which is exactly the initial boxes in G-CNN.
“Random” means the center, height and width of proposal boxes are randomly initialized from a Gaussian distribution.
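The four schemes can be sketched in NumPy as follows (boxes as normalized (cx, cy, h, w); the exact Gaussian parameters of "Random" are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100

def init_center(n):
    # "Center": all boxes at the image center, 0.1 of the image size.
    return np.tile([0.5, 0.5, 0.1, 0.1], (n, 1))

def init_image(n):
    # "Image": all boxes cover the whole image.
    return np.tile([0.5, 0.5, 1.0, 1.0], (n, 1))

def init_grid(n):
    # "Grid": a regular grid over the image (G-CNN-like).
    side = int(round(n ** 0.5))
    cs = (np.arange(side) + 0.5) / side
    cx, cy = np.meshgrid(cs, cs)
    hw = np.full((side * side, 2), 1.0 / side)
    return np.column_stack([cx.ravel(), cy.ravel(), hw])

def init_random(n):
    # "Random": Gaussian around the center, clipped to valid range.
    return np.clip(0.5 + 0.1 * rng.normal(size=(n, 4)), 0.0, 1.0)

for f in (init_center, init_image, init_grid, init_random):
    print(f.__name__, f(N).shape)
```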
Number of proposals.
The number of proposals largely affects both dense and sparse detectors.
The original Faster R-CNN uses 300 proposals; later this was increased to 2000, obtaining better performance.
We also study the effect of proposal numbers on Sparse R-CNN in Table 6.
Increasing the number of proposals from 100 to 500 leads to continuous improvement,
indicating that our framework is easy to apply in various circumstances. However, 500 proposals require much more training time, so we choose 100 and 300 as the main configurations.
Number of stages in iterative architecture.
Iterative architecture is a widely-used technique for improving object detection performance [2, 3, 38], and it is especially important for Sparse R-CNN. Table 7 shows the effect of the number of stages in the iterative architecture. Without the iterative architecture, performance is merely 21.7 AP.
Considering that the input proposals of the first stage are only a guess of possible object positions, this result is not surprising.
Increasing to 2 stages brings a gain of 14.5 AP, up to a competitive 36.2 AP.
Gradually increasing the number of stages, performance saturates at 6 stages, which we choose as the default configuration.
| Method | AP | AP50 | AP75 |
| Multi-head Attention | 35.7 | 54.9 | 37.7 |

| Method | AP | AP50 | AP75 |
| DETR | 32.8 (-7.8) | 55.2 | – |
Dynamic head vs. Multi-head Attention. As discussed in Section 3, the dynamic head uses the proposal feature to filter the RoI feature and finally outputs the object feature.
We find that the multi-head attention module provides another possible implementation of the instance interaction.
We carry out comparison experiments in Table 8; its performance falls behind by 6.6 AP.
Compared with linear multi-head attention, our dynamic head is much more flexible: its parameters are conditioned on the specific proposal feature, and more non-linear capacity can easily be introduced.
Proposal feature vs. Object query.
Here we compare the object query proposed in DETR with our proposal feature.
As discussed in DETR, an object query is a learned positional encoding that guides the decoder to interact with the summation of the image feature map and a spatial positional encoding; using only the image feature map leads to a significant drop.
In contrast, our proposal feature can be seen as a feature filter, which is irrelevant to position.
The comparisons are shown in Table 9: DETR drops 7.8 AP when the spatial positional encoding is removed, whereas positional encoding gives no gain in Sparse R-CNN.
4.3 The Proposal Boxes Behavior
Figure 4 shows the learned proposal boxes of a converged model.
These boxes are randomly distributed over the image to cover the whole image area, which guarantees recall performance under the condition of sparse candidates.
Further, each stage of the cascading heads gradually refines box positions and removes duplicates, resulting in high precision performance.
Figure 4 also shows that Sparse R-CNN presents robust performance in both rare and crowd scenarios.
For an object in a rare scenario, its duplicate boxes are removed within a few stages.
Crowded scenarios require more stages to refine, but eventually each object is detected precisely and uniquely.
5 Conclusion

We present Sparse R-CNN, a purely sparse method for object detection in images. A fixed sparse set of learned object proposals is provided to perform classification and localization via dynamic heads. Final predictions are output directly without the non-maximum suppression post-procedure. Sparse R-CNN demonstrates accuracy, run-time and training convergence performance on par with well-established detectors. We hope our work could inspire re-thinking the convention of the dense prior and exploring the next generation of object detectors.
* Equal contribution.
References

P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. T-PAMI 32(9), pp. 1627–1645.
X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics.
L. Huang, Y. Yang, Y. Deng, and Y. Yu. DenseBox: unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874.
T. Kong, F. Sun, H. Liu, Y. Jiang, L. Li, and J. Shi. FoveaBox: beyond anchor-based object detection. IEEE Transactions on Image Processing 29, pp. 7389–7398.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks.
P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Vol. 1, pp. I–I.