
ActiveMoCap: Optimized Drone Flight for Active Human Motion Capture



Abstract

The accuracy of monocular 3D human pose estimation depends on the viewpoint from which the image is captured. While camera-equipped drones provide control over this viewpoint, automatically positioning them at the location which will yield the highest accuracy remains an open problem. This is the problem that we address in this paper. Specifically, given a short video sequence, we introduce an algorithm that predicts where a drone should go in future frames so as to maximize 3D human pose estimation accuracy. A key idea underlying our approach is a method to estimate the uncertainty of the 3D body pose estimates. We integrate several sources of uncertainty, originating from deep-learning-based regressors and temporal smoothness. The resulting motion planner leads to improved 3D body pose estimates and outperforms or matches existing planners that are based on person following and orbiting.


1 Introduction

Monocular solutions to 3D human pose estimation have become increasingly competent in recent years, but their accuracy remains relatively low. In this paper, we explore the use of a moving camera whose motion we can control to resolve ambiguities inherent to monocular 3D reconstruction and to increase accuracy. This is known as active vision and has received surprisingly little attention in the context of modern approaches to body pose estimation. Fig. 1 depicts our approach for a camera carried by a drone, a key use case for it.

Figure 1: Method overview. The 2D and 3D human pose is inferred from the current frame of the drone footage, using off-the-shelf CNNs. The 2D poses and relative 3D poses of the last frames are then used to optimize for the global 3D human motion. The next view of the drone is chosen so that the uncertainty of the human pose estimate from that view is minimized, which improves reconstruction accuracy.

In this paper, we introduce an algorithm designed to continuously position a moving camera, possibly mounted on a drone, at optimal locations to maximize the 3D pose estimation accuracy for a freely moving subject. We achieve this by moving the camera in 6D pose space to locations that maximize a utility function designed to predict reconstruction accuracy. This is non-trivial because reconstruction accuracy cannot be directly used in our utility function. Estimating it while planning the motion would require knowing the true person and drone position, a chicken-and-egg problem. Instead, we use prediction uncertainty as a surrogate for accuracy. This is a common strategy to handle rigid scenes in robotics, for example to send the robot to locations where its internal map is most incomplete [57]. However, in our situation, estimating uncertainty is much more difficult since multiple sources of uncertainty need to be considered. They include uncertainties about what the subject will do next, the reliability of the pose estimation algorithm, and the accuracy of distance estimation along the camera’s line of sight.

Our key contribution is therefore a formal model that provides an estimate of the posterior variance and probabilistically fuses these sources of uncertainty with appropriate prior distributions. This has enabled us to develop an active motion capture technique that takes raw video footage as input from a moving aerial camera and continuously computes future target locations for positioning the camera, in a way that is optimized for human motion capture. We demonstrate our algorithm in two different scenarios and compare it against standard heuristics, such as constantly rotating around the subject and remaining at a constant angle with respect to the subject. We find that when allowed to choose the next location without constraints, our algorithm outperforms the baselines consistently. For simulated drone flight, our results are on par with constant rotation, which we conclude is the best trajectory to choose in the case of no obstacles blocking the circular flight path.

2 Related work

Most recent approaches to markerless motion capture rely on deep networks that regress 3D pose from single images [53, 54, 58, 73, 62, 68, 59, 76, 71, 70, 75]. While a few try to increase robustness by enforcing temporal consistency [60], none considers the effect that actively controlling the camera may have on accuracy. The methods most closely related to what we do are therefore those that optimize camera placement in multi-camera setups and those that are used to guide robots in a previously-unknown environment.

Optimal Camera Placement for Motion Capture. Optimal camera placement is a well-studied problem in the context of static multi-view setups. Existing solutions rely on maximizing image resolution while minimizing self-occlusion of body parts [43, 40] or target point occlusion and triangulation errors [64]. However, these methods operate offline and on pre-recorded exemplar motions. This makes them unsuitable for motion capture using a single moving camera that films a priori unknown motions in a much larger scene where estimation noise can be high.

Concurrent work [61] optimizes multiple camera poses for the triangulation of joints in a dome environment using a self-supervised reinforcement learning approach. In our case, we consider the monocular problem. Our method is not learning-based; we obtain the next best view from the loss function itself.

View Planning for Static Scene and People Reconstruction. There has been much robotics work on active reconstruction and view planning. This usually involves moving so as to maximize information gain while minimizing motion cost, for example by discretizing space into a volumetric grid and counting previously unseen voxels [52, 46] or by accumulating estimation uncertainty [57]. When a coarse scene model is available, an optimal trajectory can be found using offline optimization [67, 51]. This has also been done to achieve desired aesthetic properties in cinematography [49]. Another approach is to use reinforcement learning to define policies [45] or to learn a metric [50] for later online path planning. These methods deal with rigid, unchanging scenes, except the one in [44], which performs volumetric scanning of people during information gain maximization. However, this approach can only deal with very slowly moving people who stay where they are.

Human Motion Capture on Drones. Drones can be viewed as flying cameras and are therefore natural targets for our approach. One problem, however, is that the drone must keep the person in its field of view. To achieve this, the algorithm of [77] uses 2D human pose estimation in a monocular video and non-rigid structure from motion to reconstruct the articulated 3D pose of a subject, while that of [55] reacts online to the subject’s motion to keep them in view and to optimize for screen-space framing objectives.
In [56], this was integrated into an autonomous system that actively directs a swarm of drones and simultaneously reconstructs 3D human and drone poses from onboard cameras. This strategy implements a pre-defined policy to stay at a constant distance to the subject and uses pre-defined view angles between two drones to maximize triangulation accuracy. This enables mobile large-scale motion capture, but relies on markers for accurate 2D pose estimation. In [74], three drones are used for markerless motion capture, using an RGBD video input for tracking the subject.

In short, existing methods either optimize for drone placement but for mostly rigid scenes, or estimate 3D human pose but without optimizing the camera placement. [61] performs optimal camera placement for multiple cameras. Here, we propose an approach that aims to find the best next drone location for a monocular view so as to maximize 3D human pose estimation accuracy.

3 Active Human Motion Capture

Our goal is to continually pre-position the camera in 6D pose space so that the images it acquires can be used to achieve the best overall human pose estimation accuracy. What makes this problem challenging is that, when we decide where to send the camera, we do not yet know where the subject will be and in what position exactly. We therefore have to guess. To this end, we propose the following three-step approach depicted by Fig. 1:

  1. Estimate the 3D human pose up to the current time instant.

  2. Predict the person’s location and pose by the time the camera acquires the next image, including an uncertainty estimate.

  3. Cause the camera to assume the optimal 6D pose given that estimate.

We will consider two ways the camera can move. In the first, the camera can teleport from one location to the next without restriction. This can be simulated using a multi-camera setup and allows us to explore the theoretical limits of this approach. In the second, more realistic scenario, the camera is carried by a drone, and we must take into account physical limits about the motion it can undertake.

3.1 3D Pose Estimation

The 3D pose estimation step takes as input the video feed from the on-board camera over the past frames and outputs, for each frame j, the 3D human pose P_j, represented as 15 3D joint positions, and the drone pose, given as a 3D position d_j and rotation angles.
The drone pose is computed relative to the static background with monocular structure from motion. Our focus is on estimating the human pose, on the basis of the deep-learning-based real-time method of [41], which detects the 2D locations of the human’s major joints in the image plane, denoted \hat{p}^{2D}_j, and the subsequent use of [71], which lifts these 2D predictions to a hip-relative 3D pose, \hat{P}^{3D}_j. However, these per-frame estimates are error-prone and expressed relative to the camera.
To remedy this, we fuse the 2D and 3D predictions with temporal smoothness and bone-length constraints in a space-time optimization. This exploits the fact that the drone is constantly moving, which disambiguates the individual estimates. The bone lengths, b, of the subject’s skeleton are found through a calibration mode in which the subject has to stand still for 20 seconds. This is performed only once for each subject. Formally, we optimize for the global 3D human pose by minimizing an objective function E, which we detail below.

Formulation

Our primary goal is to improve the global 3D human pose estimation of a subject changing position and pose. We optimize the time-varying pose trajectories across the last M frames. Let t be the last observed frame. We capture the trajectory of poses P_{t-M+1} to P_t in the pose matrix P = [P_{t-M+1}, ..., P_t].
We then write an energy function

E(\mathbf{P}) = E_{proj}(\mathbf{P}) + E_{lift}(\mathbf{P}) + E_{smooth}(\mathbf{P}) + E_{bone}(\mathbf{P}) .   (1)

The individual terms are defined as follows.
The lift term, E_lift, leverages the 3D pose estimates, \hat{P}^{3D}_j, from LiftNet [71]. Because these are relative to the hip and without absolute scale, we subtract the hip position from our absolute 3D pose, P_j, and apply a scale factor s to \hat{P}^{3D}_j to match the bone lengths in the least-squares sense. We write

E_{lift}(\mathbf{P}) = \lambda_{lift} \sum_{j=t-M+1}^{t} \left\| \left( \mathbf{P}_j - \mathbf{P}^{hip}_j \right) - s \, \hat{\mathbf{P}}^{3D}_j \right\|^2 ,   (2)

with λ_lift its relative weight.

The projection term, E_proj, measures the difference between the detected 2D joint locations and the projection of the estimated 3D pose in the least-squares sense. We write it as

E_{proj}(\mathbf{P}) = \lambda_{proj} \sum_{j=t-M+1}^{t} \left\| \Pi\!\left( \mathbf{K} \left[ \mathbf{R}_j \mid \mathbf{t}_j \right] \mathbf{P}_j \right) - \hat{\mathbf{p}}^{2D}_j \right\|^2 ,   (3)

where Π is the perspective projection function, K is the matrix of camera intrinsic parameters, [R_j | t_j] are the camera extrinsics given by the drone pose at frame j, and λ_proj is a weight that controls the influence of this term.

The smoothness term, E_smooth, exploits the fact that we are using a continuous video feed and that the motion is smooth, by penalizing the joint velocities computed by finite differences, as

E_{smooth}(\mathbf{P}) = \lambda_{smooth} \sum_{j=t-M+1}^{t-1} \left\| \mathbf{P}_{j+1} - \mathbf{P}_j \right\|^2 ,   (4)

with λ_smooth as its weight.

To further constrain the solution space, we use our knowledge of the bone lengths found during calibration and penalize deviations in length. The length of each bone (m, n) in the set of all bones B is computed as the distance between the corresponding joints of P_j for frame j. The bone length term is then defined as

E_{bone}(\mathbf{P}) = \lambda_{bone} \sum_{j=t-M+1}^{t} \sum_{(m,n) \in B} \left( \left\| \mathbf{P}^{m}_j - \mathbf{P}^{n}_j \right\| - b_{m,n} \right)^2 ,   (5)

with λ_bone as its weight.

The complete energy is minimized by gradient descent at the beginning of each control cycle, to get a pose estimate for control. The resulting pose estimate is the maximum a posteriori estimate in a probabilistic view.
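To make the structure of Eq. (1) concrete, the following PyTorch sketch evaluates the four terms on randomly generated inputs and minimizes the total by gradient descent. The tensor shapes, weights, camera parameters, bone subset, and hip index are illustrative assumptions, not the authors' implementation.

import torch

# Illustrative sketch of the energy of Eq. (1). Shapes, weights, and the camera
# model are assumptions for the example, not the authors' implementation.
M, J = 6, 15                                   # window size and number of joints (assumed)
P = torch.randn(M, 3, J, requires_grad=True)   # global 3D poses to be optimized
p2d = torch.randn(M, 2, J)                     # 2D detections (OpenPose-like)
P3d = torch.randn(M, 3, J)                     # hip-relative 3D estimates (LiftNet-like)
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
R = torch.eye(3).repeat(M, 1, 1)               # per-frame camera rotations
t = torch.zeros(M, 3, 1); t[:, 2] = 5.0        # cameras placed 5 m away (assumed)
bones = [(0, 1), (1, 2)]                       # toy subset of the bone set B
b = torch.ones(len(bones))                     # calibrated bone lengths
w = dict(proj=1.0, lift=1.0, smooth=1.0, bone=1.0)
HIP = 0                                        # assumed hip joint index

def energy(P):
    cam = R @ P + t                            # world-to-camera transform
    uv = K @ cam
    proj = uv[:, :2] / uv[:, 2:3]              # perspective projection Pi
    E_proj = w['proj'] * ((proj - p2d) ** 2).sum()              # Eq. (3)
    rel = P - P[:, :, HIP:HIP + 1]             # subtract the hip position
    E_lift = w['lift'] * ((rel - P3d) ** 2).sum()               # Eq. (2), with scale s = 1
    E_smooth = w['smooth'] * ((P[1:] - P[:-1]) ** 2).sum()      # Eq. (4)
    lengths = torch.stack([(P[:, :, i] - P[:, :, j]).norm(dim=1)
                           for i, j in bones], dim=1)
    E_bone = w['bone'] * ((lengths - b) ** 2).sum()             # Eq. (5)
    return E_proj + E_lift + E_smooth + E_bone

optimizer = torch.optim.SGD([P], lr=1e-4)      # gradient descent, as in Sec. 3.1
for _ in range(100):
    optimizer.zero_grad()
    loss = energy(P)
    loss.backward()
    optimizer.step()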

Calibration Mode

Calibration mode only has to be run once for each subject to find the bone lengths, b. In this mode, the scene is assumed to be stationary. The situation is equivalent to having the scene observed from multiple stationary cameras, such as in [66]. We find the single static pose P_c that minimizes

E_{calib}(\mathbf{P}_c) = E^{c}_{proj}(\mathbf{P}_c) + E_{sym}(\mathbf{P}_c) .   (6)

In this objective, the projection term, E^c_proj, is akin to the one in our main formulation but acts on all calibration frames.
It can be written as

E^{c}_{proj}(\mathbf{P}_c) = \lambda^{c}_{proj} \sum_{k} \left\| \Pi\!\left( \mathbf{K} \left[ \mathbf{R}_k \mid \mathbf{t}_k \right] \mathbf{P}_c \right) - \hat{\mathbf{p}}^{2D}_k \right\|^2 ,   (7)

with λ^c_proj controlling its influence.
The symmetry term, E_sym, ensures that the left and right limbs of the estimated skeleton have the same lengths by penalizing the squared difference between the lengths of corresponding left and right bones.
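As an illustration, the symmetry penalty can be written as follows; the joint indexing, tensor shape, and default weight are assumptions, and the bone pairs must be mirrored across the body.

import torch

def symmetry_term(pose, left_bones, right_bones, weight=1.0):
    # pose: (3, 15) tensor of joint positions. left_bones and right_bones are
    # lists of (joint_a, joint_b) index pairs, ordered so that entry k on the
    # left corresponds to entry k on the right (e.g. left/right upper arm).
    len_left = torch.stack([(pose[:, a] - pose[:, b]).norm() for a, b in left_bones])
    len_right = torch.stack([(pose[:, a] - pose[:, b]).norm() for a, b in right_bones])
    # Penalize the squared difference between corresponding bone lengths.
    return weight * ((len_left - len_right) ** 2).sum()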

3.2 Best Next View Selection

Our goal is to find the best next view for the drone, d_{t+1}, at the future time step t+1.
We model the uncertainty of the pose estimate in a probabilistic setting. Let p(P) be the posterior distribution of poses. Then, E(P) is its negative logarithm, and its minimization corresponds to maximum a posteriori (MAP) estimation.
In this formalism, the sum of the individual terms in E reflects that our posterior distribution is composed of independent likelihood and prior distributions. For a purely quadratic term, λ(x − μ)², the corresponding distribution is a Gaussian with mean μ and standard deviation σ. Notably, σ is directly linked to the weight λ of the energy.
Most of our energy terms involve non-linear operations, such as the perspective projection in E_proj, and therefore induce non-Gaussian distributions, as visualized in Fig. 2.
Nevertheless, as in the simple quadratic case, the weights λ_proj and λ_lift of E_proj and E_lift can be interpreted as surrogates for the amount of measurement noise in the 2D and 3D pose estimates.
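For the purely quadratic case mentioned above, the correspondence between the energy weight and the Gaussian variance can be made explicit (a standard identity, stated here for completeness):

E(x) = \lambda (x - \mu)^2
\quad \Longrightarrow \quad
p(x) \propto \exp\!\big(-E(x)\big) = \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
\quad \text{with} \quad \sigma^2 = \frac{1}{2\lambda} .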

Figure 2: Probabilistic interpretation. Left: A quadratic energy function and its associated Gaussian error distribution. Right: A complex energy function, which is locally approximated with a Gaussian (blue) near the minimum. The curvature of the energy function is a measure of the confidence in the estimate and the variance of the associated error distribution. The energy on the right is more constrained and its error distribution has a lower variance.

A good measure of uncertainty is the sum of the eigenvalues of the covariance Σ of the underlying distribution.
The sum of the eigenvalues, i.e., the trace of Σ, summarizes the spread of a multivariate distribution in a single number, similarly to the variance in the univariate case. To exploit this uncertainty estimate for our problem, we now extend E to model not only the current and past poses but also the future ones, and condition it on the choice of the future drone position.
To determine the best next drone pose, we sample candidate positions and choose the one with the lowest uncertainty. This process is illustrated in Figure 3.

Figure 3: Uncertainty estimates for each candidate drone position, visualized on the left as 3D ellipsoids and on the right from a 2D top-down view. Each ellipse visualizes the eigenvalues of the hip location when incorporating an additional view from its displayed position. Here, the previous image was taken from the bottom (position 16) and uncertainty is minimized by moving to an orthogonal view. The complete distribution has more than three eigenvectors and cannot straightforwardly be visualized in 3D.

Future pose forecasting.
In our setting, accounting for the dynamic motion of the person is key to successfully positioning the camera. We model the motion of the person from the current frame t to the future frames linearly, i.e., we aim to keep the velocity of the joints constant across our window of frames. The future pose vectors are constrained by the smoothness and bone-length terms, but for now not by any image-based term, since the future images are not yet available at time t. Minimizing this extended E for the future poses gives the MAP estimates of the future poses, which continue the motion smoothly while maintaining the bone lengths.
As we predict only the near future, we have found this simple extrapolation to be sufficient. While more advanced methods [48] could be applied to forecast further ahead, we leave this as future work.
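A minimal sketch of this constant-velocity extrapolation, under assumed array shapes; the extrapolated estimates are subsequently refined by the smoothness and bone-length terms of the energy.

import numpy as np

def forecast_linear(poses, n_future=1):
    # poses: (M, 3, 15) array holding the last M optimized 3D poses.
    velocity = poses[-1] - poses[-2]          # finite-difference joint velocity
    future = [poses[-1] + (k + 1) * velocity for k in range(n_future)]
    return np.stack(future)                   # (n_future, 3, 15) extrapolated poses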

Future measurement forecasting.
We aim to find the future drone position, , that reduces the posterior uncertainty, but we do not have footage from future viewpoints to condition the posterior on. Instead, we use the predicted future human pose , , as a proxy for and approximate with the projection

\hat{\mathbf{p}}^{2D}_{t+1} = \Pi\!\left( \mathbf{K} \left[ \mathbf{R}_{t+1} \mid \mathbf{t}_{t+1} \right] \hat{\mathbf{P}}_{t+1} \right) ,   (8)

where \hat{P}_{t+1} is the forecasted pose and the extrinsics correspond to the candidate drone position d_{t+1}. A virtual relative 3D estimate is obtained from the forecasted pose in the same way.

At first glance, conditioning the future pose on these virtual estimates does not add anything, since the terms E_proj and E_lift are zero at the forecasted pose by construction. However, it changes the energy landscape and models how strongly a future observation would constrain the pose posterior. In particular, the projection term, E_proj, narrows down the solution space in the direction of the image plane but cannot constrain it in the depth direction, creating an elliptical uncertainty, as visualized in Fig. 3. The combined influence of all terms is conveniently modeled by the energy landscape of E and its corresponding posterior.

In our current implementation, we assume that the 2D and 3D detections are affected by pose-independent noise, whose variance is captured by λ_proj and λ_lift, respectively.
These factors could, in principle, be view-dependent and related to the person’s pose. For instance, [42] may be more accurate at reconstructing a front view than a side view.
However, while estimating the uncertainty in deep networks is an active research field [63], predicting the expected uncertainty for an unobserved view has not yet been attempted in the pose estimation literature. It is an interesting avenue for future work.

Variance estimator.
The energy E and its corresponding posterior have a complex form due to the projection and prior terms. Hence, the sought-after covariance cannot be expressed in closed form, and approximating it by sampling the space of all possible poses would be expensive. Instead, for the sake of uncertainty estimation, we approximate the posterior locally with a Gaussian distribution, such that

p(\mathbf{P}) \approx \mathcal{N}(\mathbf{P}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) ,   (9)

with μ and Σ the Gaussian's mean and covariance matrix, respectively. Such an approximation is exemplified in Figure 2.
For a Gaussian, the covariance can be computed in closed form as the inverse of the Hessian of the negative log-likelihood, Σ = H^{-1}, where H is the Hessian of E at the MAP estimate.
Under the Gaussian assumption, Σ is thereby well approximated by the inverse of the matrix of second-order gradients of E.
Our experiments show that this simplification holds well for all of the introduced error terms, except for the bone length one, which we therefore exclude from uncertainty estimation.

To select the view with minimum uncertainty among a set of candidate drone positions, we therefore proceed as follows (a sketch of steps 3 and 4 is given below the list):

  1. optimize E once to forecast the human poses for the future frames,

  2. use these forecasted poses to set the virtual 2D and 3D measurements for each candidate drone position,

  3. compute the second-order derivatives of E for each candidate, which form the Hessian H, and

  4. compute Σ = H^{-1} and sum up its eigenvalues to select the candidate with the least uncertainty.
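The following PyTorch sketch illustrates steps 3 and 4 for a single candidate view: it evaluates a simplified version of the extended energy, in which the entry for the future frame holds the virtual measurement of Eq. (8), obtains the Hessian by automatic differentiation, and scores the candidate by the eigenvalue sum of the inverse Hessian. The toy energy, tensor shapes, and all names are assumptions made for the example, not the authors' code.

import torch
from torch.autograd.functional import hessian

J = 15
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])

def project(P_cam):                           # P_cam: (3, J) camera-frame joints
    uv = K @ P_cam
    return uv[:2] / uv[2:3]

def candidate_uncertainty(P_map, R_cams, t_cams, p2d):
    # P_map: (F, 3, J) MAP poses, past frames plus the forecasted future frame.
    # R_cams, t_cams: per-frame extrinsics; for the future frame they encode the
    # candidate drone position. p2d: 2D measurements, where the future entry is
    # the projection of the forecasted pose (the virtual measurement of Eq. (8)).
    def energy(P_flat):
        P = P_flat.view(*P_map.shape)
        E = ((P[1:] - P[:-1]) ** 2).sum()     # smoothness prior
        for f in range(P.shape[0]):
            cam = R_cams[f] @ P[f] + t_cams[f]
            E = E + ((project(cam) - p2d[f]) ** 2).sum()
        return E
    H = hessian(energy, P_map.reshape(-1))    # second-order derivatives of E
    eye = torch.eye(H.shape[0])
    cov = torch.linalg.inv(H + 1e-6 * eye)    # Sigma = H^{-1}, small ridge for stability
    return torch.linalg.eigvalsh(cov).sum()   # uncertainty = sum of eigenvalues

# The candidate with the least uncertainty is then selected, e.g.
#   best = min(candidates, key=lambda c: candidate_uncertainty(P_map, c.R, c.t, c.p2d))
# where `candidates` is a hypothetical container of per-candidate extrinsics.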

Discussion.
In principle, the posterior value at the MAP estimate, i.e., the probability of the most likely pose, could also act as a measure of certainty, as implicitly used in [64] on a known motion trajectory to minimize triangulation error. However, the projection term of E is zero for the future time step t+1, because the projection of the forecasted pose is by construction equal to the virtual measurement and is therefore uninformative.
Another alternative that has been proposed in the literature is to approximate the covariance through first-order estimates [72], as a function of the Jacobian matrix. However, since the first-order gradients of E also vanish at the MAP estimate, this approximation is not applicable in our case.

3.3 Drone Control Policies and Flight Model

We control the flight of our drone by passing it a desired velocity vector and a desired yaw rotation, with the maximum speed kept constant. The drone is sent new commands at a fixed time interval.
Modeling the flight of the drone allows us to foresee the positions the drone will be able to reach when we give it various commands. By forecasting the future locations of the drone, we can predict the 2D pose estimates for each candidate direction more accurately.

We model the drone flight in the following manner. We assume that the drone moves with constant acceleration during a time step Δt. If the drone has current position p_t and velocity v_t, then with acceleration a_t, its position after Δt will be

\mathbf{p}_{t+\Delta t} = \mathbf{p}_t + \mathbf{v}_t \, \Delta t + \tfrac{1}{2} \, \mathbf{a}_t \, \Delta t^2 .   (10)

We model our input to the system as an acceleration, a_in. The direction of this acceleration is assumed to be the direction of the velocity vector we pass as the movement command to the simulator, and its magnitude is a value determined through least-squares minimization. The current acceleration at time t is found as a weighted average of the input acceleration and the acceleration of the previous step, a_{t−Δt}. This can be written as

\mathbf{a}_t = \beta \, \mathbf{a}_{in} + (1 - \beta) \, \mathbf{a}_{t-\Delta t} ,   (11)

where β is a fixed mixing weight.

By estimating the future positions of the drone, we are able to forecast more accurate future 2D pose estimates, leading to more accurate decision making. Examples of predicted trajectories are shown in Figure 4. Further details are provided in the supplementary material.
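A minimal sketch of Eqs. (10) and (11), rolled forward for several control steps; the time step, mixing weight, and number of steps are assumed values rather than the paper's.

import numpy as np

def predict_trajectory(p, v, a_prev, v_cmd, a_mag, n_steps=3, dt=0.2, beta=0.5):
    # p, v, a_prev: current position, velocity, and previous acceleration (3-vectors).
    # v_cmd: commanded velocity direction; a_mag: acceleration magnitude fitted by
    # least squares. dt, beta, and n_steps are assumed values.
    a_in = a_mag * v_cmd / (np.linalg.norm(v_cmd) + 1e-9)   # input acceleration
    positions = []
    for _ in range(n_steps):
        a = beta * a_in + (1.0 - beta) * a_prev             # Eq. (11)
        p = p + v * dt + 0.5 * a * dt ** 2                  # Eq. (10)
        v = v + a * dt                                      # integrate the velocity
        a_prev = a
        positions.append(p.copy())
    return np.stack(positions)                              # (n_steps, 3) future positions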

Figure 4: The predicted trajectories as the drone is circling the subject. The future drone positions are predicted for the future steps, represented by triangle markers on the trajectories. Red depicts the chosen trajectory.

4 Evaluation

In this section, we evaluate the improvement in 3D human pose estimation that is achieved through optimizing the drone flight.

Simulation environment. Although [65, 41, 71] run in real time, and online SLAM from a monocular camera [47] is possible, we use a drone simulator since the integration of all components onto constrained drone hardware is difficult and beyond our expertise.
We make simulation realistic by driving our characters with real motion capture data from the CMU Graphics Lab Motion Capture Database [78] and using the AirSim [69] drone simulator that builds upon the Unreal game engine and therefore produces realistic images of natural environments. An image of AirSim is shown in Figure 5. Simulation also has the advantage that the same experiment can be repeated with different parameters and be directly compared to baseline methods and ground-truth motion.

Figure 5: Image of the simulation environment, AirSim.

Simulated test set.
We test our approach on three motions from the CMU database that increase in difficulty:
Walking straight (subject 2, trial 1),
Dance with twirling (subject 5, trial 8), and
Running in a circle (subject 38, trial 3). Additionally, we use a validation set consisting of Basketball dribble (subject 6, trial 13), and
Sitting on a stool (subject 13, trial 6), to conduct a grid search for hyperparameters.

Real test set.
To show that our planner also works outside the simulator, we evaluate our approach on a section of the MPI-INF-3DHP dataset, which includes motions such as running around in a circle and waving arms in the air. The dataset provides fixed viewpoints that are at varying distances from one another and from the subject, as depicted in Figure 7. In this case, the best next view is restricted to one of the fixed viewpoints. This dataset lets us evaluate whether the object detector of [65], the 2D pose estimation method of [42], and the 3D pose regression technique of [71] are reliable enough in real environments. Since we cannot control the camera in this setting, we remove from the candidate locations those cameras for which we predict that the subject will be out of view.

Figure 6: Uncertainty estimates across potential viewpoints (left image) compared with the average error we would obtain if we were to visit these locations (right image). The star represents the location of the subject and the large triangle depicts the viewpoint chosen according to the lowest uncertainty.
                   CMU-Walk       CMU-Dance      CMU-Run        MPI-INF-3DHP   Total
Oracle             0.101±0.001    0.101±0.001    0.109±0.001    0.109±0.002    0.105±0.001
Ours (Active)      0.113±0.001    0.116±0.003    0.135±0.002    0.145±0.006    0.127±0.003
Random             0.123±0.002    0.125±0.003    0.159±0.003    0.259±0.011    0.167±0.005
Constant Rotation  0.157±0.002    0.146±0.004    0.223±0.003    0.254±0.008    0.195±0.004
Constant Angle     0.895±0.54     0.683±0.31     0.985±0.24     1.45±0.63      1.00±0.43
Table 1: 3D pose accuracy on the toy experiment, using noisy ground truth for the 2D and 3D pose estimates. We outperform all predefined baseline trajectories and approach the accuracy of the oracle, which has access to the average error of each candidate position.
Figure 7: The MPI-INF-3DHP dataset, which has images taken from viewpoints at various distances from the subject. We use this dataset to evaluate our performance on data with realistic camera positioning and real images.

Baselines.
Existing drone-based pose estimation methods use predefined policies to control the drone position relative to the human: either the human is followed from a constant angle, set externally by the user [56], or the drone undergoes a constant rotation around the human [77]. As another baseline, we use a random decision policy, where the drone picks uniformly at random among the proposed viewpoints. Finally, the oracle is obtained by moving the drone to the viewpoint where the reconstruction at the next time step will have the lowest average error, which is found by exhaustively trying all proposed viewpoints with the corresponding image of the next time frame.

Hyperparameters. The weights of the reconstruction loss terms (projection, smoothness, lift, and bone length) were found by grid search, and separate weights are used for decision making. Our reasoning is that the weights of the projection and lift terms need to be set slightly lower for reconstruction because these estimates are corrupted by large noise, introduced either by the neural networks or as additive noise. However, they do not need to be as low for the uncertainty estimation.

4.1 Analyzing Reconstruction Accuracy

We report the mean Euclidean distance per joint, in meters, at the middle frame of the temporal window we optimize over. The temporal window contains a small number of past frames and of future frames; its size differs between the toy example and the drone flight simulations.
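For reference, a sketch of the reported metric under assumed array shapes:

import numpy as np

def middle_frame_error(pred, gt):
    # pred, gt: (W, 3, 15) poses over the temporal window, in meters.
    mid = pred.shape[0] // 2                                   # middle frame of the window
    return np.linalg.norm(pred[mid] - gt[mid], axis=0).mean()  # mean per-joint distance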

Simulation Initialization.
The frames are initialized by back-projecting the 2D joint locations estimated in the first frame to a distance from the camera that is chosen such that the back-projected bone lengths match the average human height. We then refine this initialization by running the optimization without the smoothness term, as there is only one frame. All the sequences are evaluated over the same number of frames, with the animation sequences played at a fixed rate.
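A sketch of this initialization: all joints are back-projected to a common depth chosen so that the apparent height of the skeleton matches an assumed average human height. The 1.75 m value and the uniform depth are simplifying assumptions, not values taken from the paper.

import numpy as np

def backproject_first_frame(joints_2d, K, target_height_m=1.75):
    # joints_2d: (2, 15) pixel coordinates of the detected joints; K: 3x3 intrinsics.
    K_inv = np.linalg.inv(K)
    rays = K_inv @ np.vstack([joints_2d, np.ones((1, joints_2d.shape[1]))])  # (3, 15)
    # At unit depth, measure the vertical extent of the back-projected skeleton,
    # then scale the depth so that this extent matches the target height.
    height_at_unit_depth = rays[1].max() - rays[1].min()
    depth = target_height_m / height_at_unit_depth
    return rays * depth                                    # camera-frame 3D joints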

Toy Example: Simulating Teleportation.
To understand whether our uncertainty predictions for potential viewpoints coincide with the actual 3D pose errors we would incur at these locations, we run the following simulation: we sample a set of points on a ring around the person, as shown in Fig. 6, and allow the drone to teleport to any of these points. We optimize over the past frames of the window and forecast a single frame into the future. We chose this window size to emphasize the importance of the choice of the next frame.

We perform two variants of this experiment. In the first one, we simulate the 2D and 3D pose estimates by adding Gaussian noise to the ground-truth data. The mean and standard deviation of this noise are set to the errors of [41] and [71], measured on the validation set of animations. Figure 8 shows a comparison between the ground truth values, the noisy ground truth values, and the network results. The results of this experiment are reported in Table 1, where we also provide the standard deviations across 5 trials with varying noise and different starting viewpoints. As a second variant, we use [41] and [71] on the simulator images to obtain the 2D and 3D pose estimates. The results are in the supplementary material.

Figure 8: Example image from the MPI-INF-3DHP dataset along with the 2D pose detections and relative 3D pose detections obtained using the ground truth, noisy ground truth, or the networks of [41] and [71]. The noise we add to the ground truth poses is determined according to the statistics of [41] and [71], measured on our validation set.

Altogether, the results show that our active motion planner achieves consistently lower error values than the baselines, and we come closest to achieving the best possible error for these sequences and viewpoints, despite having no access to the true error. The random baseline also performs quite well in these experiments, as it takes advantage of the drone teleporting to a varied set of viewpoints. The trajectories generated by our active planner and the baselines are depicted in Figure 9. Importantly, Figure 6 evidences that our predicted uncertainties accurately reflect the true pose errors, thus making them well suited to our goal.

Figure 9: The trajectories the active planner finds along with random and constant rotation baselines. The first row depicts the trajectories for the MPI-INF-3DHP dataset, and the second row shows the trajectories for the dancing motion. We see that trajectories are quite regular and look different from the random trajectories, especially for the dancing motion case. The algorithm prefers trajectories resulting in large angular variance with respect to the subject between viewpoints.

Simulating Drone Flight.
To evaluate more realistic cases where the drone is actively controlled and constrained to only move to nearby locations, we simulate the drone flight using the AirSim environment.
While simulating drone flight, we target a fixed radius from the subject and therefore provide direction candidates that preserve this distance. We do not provide samples at different distances, as moving closer is unsafe and moving farther leads to more concentrated image projections and thus higher 3D errors.
We also restrict the drone to a fixed altitude range, so as to avoid crashing into the ground or endangering the subject by flying directly above them.

In this set of experiments, we fly the drone using the simulator’s realistic physics engine. To this end, we sample candidate directions: up, down, left, right, up-right, up-left, down-right, down-left, and center. We then predict the consecutive future locations using our simplified (closed-form) physics model to estimate where the drone will be when continuing in each of these directions. We then estimate the uncertainty at these sampled viewpoints and choose the minimum, as sketched below.
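A sketch of this candidate sampling, expressed as tangential displacements that are projected back onto the sphere of fixed radius around the subject and clamped to the allowed altitude range; the radius, step size, and altitude limits are assumed values, not the paper's.

import numpy as np

def sample_candidates(drone_pos, person_pos, radius=7.0, step=1.0, z_min=1.0, z_max=4.0):
    # Nine candidates: up, down, left, right, the four diagonals, and center (stay).
    to_drone = drone_pos - person_pos
    world_up = np.array([0.0, 0.0, 1.0])
    right = np.cross(world_up, to_drone)
    right /= np.linalg.norm(right) + 1e-9                  # horizontal tangent direction
    up = np.cross(to_drone, right)
    up /= np.linalg.norm(up) + 1e-9                        # vertical tangent direction
    candidates = []
    for du in (-1, 0, 1):
        for dr in (-1, 0, 1):
            pos = drone_pos + step * (du * up + dr * right)
            offset = pos - person_pos
            pos = person_pos + radius * offset / np.linalg.norm(offset)  # keep the radius
            pos[2] = np.clip(pos[2], z_min, z_max)          # respect the altitude range
            candidates.append(pos)
    return np.stack(candidates)                             # (9, 3) candidate positions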

                   CMU-Walk     CMU-Dance    CMU-Run      Total
Ours (Active)      0.26±0.03    0.22±0.04    0.44±0.04    0.31±0.04
Constant Rotation  0.28±0.06    0.21±0.04    0.41±0.02    0.30±0.04
Random             0.60±0.13    0.44±0.19    0.81±0.16    0.62±0.16
Constant Angle     0.41±0.07    0.63±0.06    1.26±0.17    0.77±0.10
Table 2: Results of the full drone flight simulation, using noisy ground truth as input for the 2D and 3D pose estimates. The results for constant rotation are the average of runs rotating clockwise and counter-clockwise. We find that our results are comparable to constant rotation and better than the other baselines. The trajectory our algorithm draws also results in a constant rotation, the only difference being the rotation direction.

We achieve results comparable to constant rotation on simulated drone flight. In fact, except for the first few frames where the drone starts flying, we observe the same trajectory as constant rotation; only the rotation direction varies. Constant rotation being optimal in this setting is not counter-intuitive, as constant rotation is very useful for preserving momentum. This allows the drone to sample viewpoints as far apart from one another as possible, while keeping the subject in view. Figure 10 depicts the different baseline trajectories and the active trajectory.

Figure 10: The trajectories found during flight by the active planner and the baselines. We see that the algorithm also chose to perform constant rotation. The random baseline cannot increase the distance between its camera viewpoints, as it is constrained by drone momentum.

5 Limitations and Future Work

Our primary goal was to make uncertainty estimation tractable; further improvements are needed to run it on an embedded drone system. The current implementation is not yet fast enough, as the optimization is implemented in Python using the convenient but slow automatic differentiation of PyTorch to obtain second derivatives.

Currently, we assume a constant error for the 2D and 3D pose estimates. In future work, we will investigate how to derive situation-dependent noise models for deep neural networks.

We have considered a physically plausible drone model but neglected physical obstacles and virtual no-go areas that would restrict the possible flight trajectories. In the case of complex scenes with dynamic obstacles, we expect our algorithm to outperform any simple, predefined policy.

6 Conclusion

We have proposed a theoretical framework for estimating the uncertainty of future measurements from a drone. This permits us to improve 3D human pose estimation by optimizing the drone flight to visit those locations with the lowest expected uncertainty.
We have demonstrated with increasingly complex examples, in simulation with synthetic and real footage, that this theory translates to closed-loop drone control and improves pose estimation accuracy. Key to the success of our approach is the integration of several sources of uncertainty. In future work, we would like to find new ways of estimating the uncertainty of the deployed deep learning methods and extend our work to optimize drone trajectories for different computer vision tasks.


    Appendix A Supplementary Material

    A.1 Supplementary Video

    The supplementary video provides a short overview of our work and summarizes the methodology and results. It includes video results of our active trajectories for both the teleportation and simulated flight cases. The video is available at https://youtu.be/Dqv7ZJQi28o.

    A.2 The Drone Flight Model

    As we mention in the main document, in order to accurately predict where the drone will be positioned after passing it a goal velocity, we have formulated a drone flight model.

    Ablation Study. We replace our drone flight model with uniform sampling around the drone. This is illustrated in Figure 11. We evaluate the performance of our active decision-making policy with this uniform sampling in Table 3. The trajectories found using this sampling policy are shown in Figure 12. We find that the algorithm cannot find the constant rotation policy when we remove the drone flight model and, in turn, performs worse.

                               CMU-Dribble   CMU-Sitting   CMU-Dinosaur   Total
    Active with Flight Model   0.28±0.006    0.15±0.007    0.12±0.02      0.18±0.01
    Active w/o Flight Model    0.65±0.09     0.48±0.09     0.22±0.07      0.45±0.08
    Constant Rotation          0.30±0.02     0.15±0.01     0.15±0.03      0.20±0.02
    Table 3: Ablation study on the importance of having a drone flight model. We show 3D pose accuracy on simulated drone flight, using noisy ground truth for the 2D and 3D pose estimates. We observe a large improvement when we use our flight model to predict the future locations of the drone. Using a flight model allows us to find the same trajectories as constant rotation.
    Figure 11: The predicted future positions of the drone (a) without using our flight model and (b) using our flight model.
    Figure 12: The trajectories drawn by our active decision-making policy (a) without using our flight model and (b) using our flight model. We are able to find the well-performing policy of constant rotation when using a more realistic sampling of future drone positions, obtained with our drone flight model.

    A.3 Results with OpenPose and LiftNet

    We evaluate our results on the toy example, using the networks of [41] and [71] to obtain the 2D pose detections and relative 3D pose detections. The results are reported in Table 4. We outperform the baselines significantly for the real image dataset, MPI-INF-3DHP. For the synthetic images, we are sometimes outperformed by random, but its error has a much higher standard deviation and the difference between ours and random is within 1 standard deviation.

    We outperform the baselines more clearly on the real image dataset than on the synthetic datasets because the error of the network [41] is much lower for real data than for synthetic data. We verify this by comparing the normalized 2D pose estimation errors of a synthetic sequence and a sequence taken from the MPI-INF-3DHP dataset, and find that the normalized average error of [41] is considerably higher on the synthetic sequence than on the real image sequence. Therefore, the unrealistically high noise of OpenPose on the synthetic data prevents strong conclusions from being drawn from the first three columns of Table 4.

    The oracle still performs very well for synthetic images in this case, but the oracle makes decisions knowing the results of [41] for all candidate locations, which is impossible in practice due to the inherent uncertainty.

    When our 2D pose detector is reliable, as in the experiments with noisy ground truth in our main document, we outperform random in all cases, well outside 2 standard deviations.

    For the MPI-INF-3DHP dataset, we remove the ceiling cameras in this set of experiments. Since the networks of [41] and [71] were not trained with views from such angles, they give highly noisy results, which would also add noise to the values we report.

                       CMU-Walk      CMU-Dance     CMU-Run       MPI-INF-3DHP   Total
    Oracle             0.13±0        0.15±0        0.16±0.0005   0.17±0.0005    0.15±0.0003
    Ours (Active)      0.16±0.005    0.25±0.0009   0.25±0.002    0.21±0.0008    0.22±0.002
    Random             0.17±0.004    0.24±0.01     0.24±0.005    0.28±0.03      0.23±0.01
    Constant Rotation  0.20±0.002    0.28±0.02     0.28±0.001    0.29±0.007     0.26±0.007
    Constant Angle     0.71±0.50     0.76±0.37     0.69±0.22     1.26±0.53      0.72±0.4
    Table 4: 3D pose accuracy on the toy experiment, using [41, 71] for the 2D and 3D pose estimates. We outperform all predefined baseline trajectories on the real image dataset, MPI-INF-3DHP. For the cases with synthetic input, we achieve results comparable to random, albeit with a much lower standard deviation.

    A.4 Further Details About the Simulation Environment

    To test our algorithms we use the AirSim [69] drone simulator, a plug-in built for the Unreal game engine.

    AirSim provides a Python API which can be used to control the drone realistically, since it uses the same flight controllers as actual drones. The position and orientation of the drone can be retrieved from the simulator in a world coordinate system that takes the drone's starting point as the origin. The drone can be commanded to move with a specified velocity for a specified duration. We have added functionality to the simulator to control a human character, obtain ground truth information about the character, and animate it with motions from the CMU Graphics Lab Motion Capture Database [78].
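    As a minimal usage sketch of the standard AirSim Python client (the velocity values and duration below are placeholders, and this is not the authors' code):

import airsim

client = airsim.MultirotorClient()            # connect to the simulator
client.confirmConnection()
client.enableApiControl(True)
client.armDisarm(True)
client.takeoffAsync().join()

# Command a velocity (NED frame, m/s) for a given duration in seconds.
client.moveByVelocityAsync(1.0, 0.0, 0.0, duration=0.5).join()

# Retrieve the drone pose in the world frame anchored at its starting point.
state = client.getMultirotorState()
print(state.kinematics_estimated.position)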

    For experiments requiring teleportation we use the simulator in "ComputerVision" mode, whereas for experiments simulating flight we use "Multirotor" mode.

    References


    40. A. Aissaoui, A. Ouafi, P. Pudlo, C. Gillet, Z.-E. Baarir, and A. Taleb-Ahmed. Designing a Camera Placement Assistance System for Human Motion Capture Based on a Guided Genetic Algorithm. Virtual Reality, 22(1):13–23, 2018.
    41. Z. Cao, T. Simon, S. Wei, and Y. Sheikh. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In Conference on Computer Vision and Pattern Recognition, pages 1302–1310, 2017.
    42. Y. Chao, J. Yang, B. Price, S. Cohen, and J. Deng. Forecasting Human Dynamics from Static Images. In Conference on Computer Vision and Pattern Recognition, 2017.
    43. X. Chen and J. Davis. Camera Placement Considering Occlusion for Robust Motion Capture. Computer Graphics Laboratory, Stanford University, Tech. Rep., 2(2.2):2, 2000.
    44. W. Cheng, L. Xu, L. Han, Y. Guo, and L. Fang. iHuman3D: Intelligent Human Body 3D Reconstruction Using a Single Flying Camera. In ACM Multimedia Conference, pages 1733–1741. ACM, 2018.
    45. S. Choudhury, A. Kapoor, G. Ranade, and D. Dey. Learning to Gather Information via Imitation. In ICRA, 2017.
    46. J. Daudelin and M. Campbell. An Adaptable, Probabilistic, Next-Best View Algorithm for Reconstruction of Unknown 3-D Objects. IEEE Robotics and Automation Letters, 2(3):1540–1547, 2017.
    47. A. J. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-Time Single Camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, June 2007.
    48. K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik. Recurrent Network Models for Human Dynamics. In International Conference on Computer Vision, 2015.
    49. C. Gebhardt, S. Stevsic, and O. Hilliges. Optimizing for Aesthetically Pleasing Quadrotor Camera Motion. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH), 37(4):90:1–90:11, 2018.
    50. B. Hepp, D. Dey, S. Sinha, A. Kapoor, N. Joshi, and O. Hilliges. Learn-To-Score: Efficient 3D Scene Exploration by Predicting View Utility. In European Conference on Computer Vision, 2018.
    51. B. Hepp, M. Nießner, and O. Hilliges. Plan3D: Viewpoint and Trajectory Optimization for Aerial Multi-View Stereo Reconstruction. ACM Transactions on Graphics (TOG), 38(1):4, 2018.
    52. S. Isler, R. Sabzevari, J. Delmerico, and D. Scaramuzza. An Information Gain Formulation for Active Volumetric 3D Reconstruction. In ICRA, 2016.
    53. J. Martinez, R. Hossain, J. Romero, and J. Little. A Simple Yet Effective Baseline for 3D Human Pose Estimation. In International Conference on Computer Vision, 2017.
    54. D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision. In International Conference on 3D Vision, 2017.
    55. T. Nägeli, L. Meier, A. Domahidi, J. Alonso-Mora, and O. Hilliges. Real-Time Planning for Automated Multi-View Drone Cinematography. 2017.
    56. T. Nägeli, S. Oberholzer, S. Plüss, J. Alonso-Mora, and O. Hilliges. Real-Time Environment-Independent Multi-View Human Pose Estimation with Aerial Vehicles. 2018.
    57. E. Palazzolo and C. Stachniss. Information-Driven Autonomous Exploration for a Vision-Based MAV. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 4:59, 2017.
    58. G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-To-Fine Volumetric Prediction for Single-Image 3D Human Pose. In Conference on Computer Vision and Pattern Recognition, 2017.
    59. G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Harvesting Multiple Views for Marker-Less 3D Human Pose Annotations. In Conference on Computer Vision and Pattern Recognition, 2017.
    60. D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli. 3D Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised Training. In Conference on Computer Vision and Pattern Recognition, 2019.
    61. A. Pirinen, E. Gärtner, and C. Sminchisescu. Domes to Drones: Self-Supervised Active Triangulation for 3D Human Pose Reconstruction. In Advances in Neural Information Processing Systems 32, pages 3907–3917, 2019.
    62. A.-I. Popa, M. Zanfir, and C. Sminchisescu. Deep Multitask Architecture for Integrated 2D and 3D Human Sensing. In Conference on Computer Vision and Pattern Recognition, 2017.
    63. S. Prokudin, P. Gehler, and S. Nowozin. Deep Directional Statistics: Pose Estimation with Uncertainty Quantification. In European Conference on Computer Vision, pages 534–551, 2018.
    64. P. Rahimian and J. K. Kearney. Optimal Camera Placement for Motion Capture Systems. IEEE Transactions on Visualization and Computer Graphics, 23(3):1209–1221, 2016.
    65. J. Redmon and A. Farhadi. YOLOv3: An Incremental Improvement. arXiv Preprint, 2018.
    66. H. Rhodin, C. Richardt, D. Casas, E. Insafutdinov, M. Shafiei, H.-P. Seidel, B. Schiele, and C. Theobalt. EgoCap: Egocentric Marker-Less Motion Capture with Two Fisheye Cameras. ACM SIGGRAPH Asia, 35(6), 2016.
    67. M. Roberts, D. Dey, A. Truong, S. Sinha, S. Shah, A. Kapoor, P. Hanrahan, and N. Joshi. Submodular Trajectory Optimization for Aerial 3D Scanning. In International Conference on Computer Vision, 2017.
    68. G. Rogez, P. Weinzaepfel, and C. Schmid. LCR-Net: Localization-Classification-Regression for Human Pose. In Conference on Computer Vision and Pattern Recognition, 2017.
    69. S. Shah, D. Dey, C. Lovett, and A. Kapoor. AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles. In Field and Service Robotics, 2017.
    70. X. Sun, J. Shang, S. Liang, and Y. Wei. Compositional Human Pose Regression. In International Conference on Computer Vision, 2017.
    71. B. Tekin, P. Marquez-Neila, M. Salzmann, and P. Fua. Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation. In International Conference on Computer Vision, 2017.
    72. A. Tkach, A. Tagliasacchi, E. Remelli, M. Pauly, and A. Fitzgibbon. Online Generative Model Personalization for Hand Tracking. ACM Transactions on Graphics (TOG), 36(6):243, 2017.
    73. D. Tome, C. Russell, and L. Agapito. Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image. arXiv Preprint arXiv:1701.00295, 2017.
    74. Y. Xu, X. Liu, Y. Liu, and S. Zhu. Multi-View People Tracking via Hierarchical Trajectory Composition. In Conference on Computer Vision and Pattern Recognition, pages 4256–4265, 2016.
    75. A. Zanfir, E. Marinoiu, and C. Sminchisescu. Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes – the Importance of Multiple Scene Constraints. In Conference on Computer Vision and Pattern Recognition, June 2018.
    76. X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei. Weakly-Supervised Transfer for 3D Human Pose Estimation in the Wild. arXiv Preprint, 2017.
    77. X. Zhou, S. Liu, G. Pavlakos, V. Kumar, and K. Daniilidis. Human Motion Capture Using a Drone. In International Conference on Robotics and Automation, 2018.
    78. CMU Graphics Lab Motion Capture Database. http://mocap.cs.cmu.edu/.
