# DeepMOT: A Differentiable Framework for Training Multiple Object Trackers

###### Abstract

Multiple Object Tracking accuracy and precision (MOTA and MOTP) are two standard and widely-used metrics to assess the quality of multiple object trackers. They are specifically designed to encode the challenges and difficulties of tracking multiple objects. To directly optimize a tracker based on MOTA and MOTP is difficult, since both the metrics are strongly rely on the Hungarian algorithm, which are non-differentiable. We propose a differentiable proxy for the MOTA and MOTP, thus allowing to train a deep multiple-object tracker by directly optimizing (a proxy of) the standard MOT metrics. The proposed approximation is based on a bidirectional recurrent network that inputs the object-to-hypothesis distance matrix and outputs the optimal hypothesis-to-object association, thus emulating the Hungarian algorithm. Followed by a differentiable module, the estimated association is used to compute the MOTA and MOTP. The experimental study demonstrates the benefits of this differentiable framework on two recent deep trackers over the MOT17 dataset. Moreover, the code is publicly available from https://gitlab.inria.fr/yixu/deepmot.

- DHN
- deep Hungarian network

## 1 Introduction

Object tracking is one of the core scientific challenges of computer vision. In the recent past, thanks to the advances of neural networks, great progress has

been achieved in object tracking [42, 28, 46, 27, 11, 10]. Most of the

research in visual tracking deals with single object tracking (SOT),^{1}^{1}1In this paper SOT may refer to single object tracking or to single

object tracker depending on the context. The same holds for MOT. where the main difficulties are (i) the properly modeling of the target dynamics and (ii) the

learning of a robust appearance model. In addition to the challenges of the single-object tracking, the complexity of multiple object tracking is further

characterized by the data-to-track assignment problem [5]. Indeed, when dealing with tracking of multiple objects, one needs to

associate data points (i.e. detections) to each of the tracks. Data association is essential not only at inference time, where the detections must be associated to different tracks, but also at evaluation time (training and performance evalution), where the inferred tracks have to be associated to the ground-truth. We will

rather focus in the latter. Furthermore, these assignment problems are combinatorial and global. Combinatorial, because there is a number of assignment

possibilities exponentially growing with the number of elements to be associated (tracks) and polynomially growing with the number of elements to be associated

to (ground-truth objects). Global, because the decision has to be taken, considering the distance between all inferred tracks and all ground-truth objects.

Moreover, both the number of inferred tracks and ground-truth objects vary over time, and therefore any assignment method must be able to cope with input

distance matrices of variable size.

Traditionally, this generic assignment problem has been addressed with the Hungarian or Munkres algorithm [22]. In the case of MOT, a

modified version of the Hungarian algorithm is used to assign tracks to ground-truth objects [6]. On one side, there can be a

different number of tracks and ground-truth objects. On the other side, when the distance between a track and a ground-truth object exceeds a threshold, we

should avoid this assignment (see [6] for more details). Therefore, not all tracks will be associated to a ground-truth object, and

not all ground-truth objects will have an associated track. In an extreme case, there could be no associations at all at a given frame. What the standard and

the modified Hungarian algorithms have in common is that the optimal assignment cannot be expressed as a differentiable function of the input distance matrix.

This is problematic when one wishes to use MOTA or MOTP as training loss for a deep multiple object tracker. Current deep trackers are trained with ad-hoc

differentiable losses that have little or nothing to do with MOTA or MOTP.

In this paper, we introduce a differentiable operator that approximates the Hungarian algorithm like in [6]. To that aim, we use recurrent neural networks, and we

call it deep Hungarian network (DHN). Once trained, the DHN can be used in two different ways. Firstly, to provide an approximation of the optimal

track-to-ground-truth assignment, that is differentiable w.r.t. the distance matrix (and thus to approximate MOTA and MOTP to directly train deep MOT).

Secondly, to convert any fully trainable deep single-object tracker, into a deep multiple object tracker. Indeed, the tracks can be inferred from a purely MOT

method, from several SOT methods running in parallel, or from any combination of MOT and SOT methods working in parallel.

The rest of the paper is structured as follows. We first discuss works related to ours. We then present the structure of the overall pipeline, including the

deep Hungarian network (DHN) and the operators to compute the MOTA and MOTP metrics. After, we evaluate the usefulness of the proposed pipeline in several ways. First, the

advantages of the proposed training framework compared with standard losses are shown and discussed. Second, we show that thanks to the proposed training

framework, a single-object tracker can achieve state-of-the-art multiple object online-tracking performance.

## 2 Related Work

### 2.1 Single Object Tracking

Single object tracking (SOT) methods [11, 10] aim to learn discriminative models to track one object and separate it from the background.

A simple feed-forward regression network learning a generic relationship between object motion and appearance was proposed in [15]. Moreover, siamese trackers

[42, 28, 46, 27] have recently achieved state-of-the-art performance. The original siamese tracker was proposed in [42],

which is based on a correlation filter learner. It is able to extract deep features that are tightly coupled with the correlation filter. An extension of the siamese tracker using region proposal

networks (RPN) was introduced in [28]. It employs the siamese subnetwork for feature extraction and the RPN to perform classification and regression. Most of recent SOT trackers can

be trained end-to-end, but their performance significantly drops when applied directly to MOT.

### 2.2 Multiple Object Tracking

Multiple object tracking often follows the tracking-by-detection paradigm. Unlike single object tracking, the goal of a MOT tracker is to solve the data association problem. Standard benchmarks are

proposed in [25, 31] for pedestrians tracking. Based on whether the algorithms use future information, MOT methods can be split into online and offline tracking.

##### Offline MOT

[39] formulates the multi-person tracking by a multi-cut problem and use a pair-wise feature which is robust to occlusions.

Person re-identification methods have been combined into a tracking framework in [40].

Moreover, [17] solves the problem by co-clustering the low-level feature point motions from optical flow and the

high-level bounding-box trajectories. Quadruplet convolutional neural networks are used in [38]. They perform metric learning

for target appearances together with the temporal adjacencies, which is then used for data association. In addition, [19]

proposes a bilinear LSTM to learn the long-term appearance models, where the memory and the input have a linear relationship. An iterative

multiple hypothesis tracking (MHT) is proposed in [37], which includes the prior association information from the

previous frames. [16] fuses both the head and full-body detection into one tracking framework. Another approach is

proposed in [24], which introduces a Siamese network. It encodes both the appearance information from the RGB image and

the motion information from the optical-flow map. The obtained features are then processed by a linear programming based tracker.

##### Online MOT

[1, 4] formulate the problem in a probabilistic framework, then use a variational expectation-maximization algorithm to find the track solution.

Moreover, [9] proposes an aggregated local flow descriptor which encodes the relative motion pattern and then performs tracking. Another solution is proposed in [43],

which uses Markov Decision Processes and reinforcement learning for the best data association. Alternatively, [35] presents a framework based on Recurrent Neural Networks (RNN).

The dynamics of the appearance change, motion, and people interactions are modeled independently by a RNN. Then different information are fused together with a convolutional network.

Besides, a model with dual matching attention networks is introduced by [45], which uses both spatial and temporal attention mechanisms. The estimation of the detection confidence is

realized in [2], where the detections with different scores are then clustered into different classes and they are processed separately. [32] proposes a

recurrent neural network (RNN) based method for both motion dynamics and detection-to-track data association and further explored it to NP-hard problems [33]. Finally, in [36] an optical-flow based approach is proposed.

### 2.3 Single vs. Multiple Object Trackers

Multiple object tracking frameworks have the clear advantage to be able to address the MOT problem, but most of the existing trackers are not trainable end-to-end. On the contrary, the literature on end-to-end

trainable single object trackers is much more mature, but adapting to multiple-object tracking is still unclear and tracker-dependent. In this paper, we propose a deep

proxy for the standard MOT metrics: a deep recurrent network that emulates the Hungarian algorithm. Interestingly, this allows us to directly use – and train – end-to-end trainable single object

trackers, for multiple object tracking. Therefore, our formulation allows us to take full advantage of the maturity of the single object tracking literature and to train these trackers in an end-to-end fashion, to tackle the multiple object tracking problem.

## 3 Problem Formulation

The objective of MOT is to predict the trajectories of all objects at each time step, including their bounding boxes and associated

identities. First, the trajectory should be precise in the sense that each bounding box should enclose its associated object well. Second, the trajectories

should be accurate, meaning that only real objects and all of them should be captured (no clutter objects or missed objects) by the bounding boxes; and each

trajectory should always have a unique and consistent identity through time. These properties are respectively measured by MOTP and MOTA.

With the emergence of deep learning, people address the MOT problem with (deep) neural networks, mainly by constructing robust models that capture

information about motion, appearance and/or object interactions [35, 45]. However, an end-to-end training

framework with a dedicated loss function for MOT remains undiscovered. As our first contribution, we propose a differentiable approximation of the two common

MOT performance metrics, MOTA and MOTP, so that any trainable MOT method can be optimised to maximise these (now differentiable) metrics.

As an intermediate component for calculating the MOT metrics, the Hungarian algorithm of time complexity give an approximated solution to the linear sum assignment problem (LSAP) by minimizing the sum

distance of assigning ground-truth bounding boxes to the predicted ones. Given the algorithmic nature of this operation, no gradient of the metrics with

respect to the predicted bounding boxes can be computed. And therefore the standard MOTA and MOTP cannot be used as optimisation criteria. To solve this

problem, we propose a Deep Hungarian Network, or DHN, which can be seen as a differentiable function that approximates the Hungarian

algorithm. In this way, the full pipeline is differentiable and the gradient of the (approximated) MOT loss can be back-propagated to any trainable MOT

methods.

## 4 Methodology

Figure 1 shows the overview of the proposed method. Taking tracking pedestrians in a video sequence as an example, at time , several single object trackers are used to track multiple

people, providing estimated bounding boxes (e-bb), , ,

. These estimates are then compared to the ground-truth bounding boxes (gt-bb),

, , . The pair-wise distance between e-bb and gt-bb is stored in

a distance matrix . The proposed deep Hungarian network is then used to estimate the optimal

soft-assignment matrix