ActiveMoCap: Optimized Drone Flight for Active Human Motion Capture
Abstract
The accuracy of monocular 3D human pose estimation depends on the viewpoint from which the image is captured. While camera-equipped drones provide control over this viewpoint, automatically positioning them at the location which will yield the highest accuracy remains an open problem. This is the problem that we address in this paper. Specifically, given a short video sequence, we introduce an algorithm that predicts where a drone should go in future frames so as to maximize 3D human pose estimation accuracy. A key idea underlying our approach is a method to estimate the uncertainty of the 3D body pose estimates. We integrate several sources of uncertainty, originating from deep-learning-based regressors and temporal smoothness. The resulting motion planner leads to improved 3D body pose estimates and outperforms or matches existing planners that are based on person following and orbiting.
1 Introduction
Monocular solutions to 3D human pose estimation have become increasingly competent in recent years, but their accuracy remains relatively low. In this paper, we explore the use of a moving camera whose motion we can control to resolve ambiguities inherent to monocular 3D reconstruction and to increase accuracy. This is known as active vision and has received surprisingly little attention in the context of modern approaches to body pose estimation. Fig. 1 depicts our approach for a camera carried by a drone, a key use case for it.
In this paper, we introduce an algorithm designed to continuously position a moving camera, possibly mounted on a drone, at optimal locations to maximize the 3D pose estimation accuracy for a freely moving subject. We achieve this by moving the camera in 6D pose space to locations that maximize a utility function designed to predict reconstruction accuracy. This is non-trivial because reconstruction accuracy cannot be directly used in our utility function: estimating it while planning the motion would require knowing the true person and drone positions, a chicken-and-egg problem. Instead, we use prediction uncertainty as a surrogate for accuracy. This is a common strategy to handle rigid scenes in robotics, for example to send the robot to locations where its internal map is most incomplete [57]. However, in our situation, estimating uncertainty is much more difficult since multiple sources of uncertainty need to be considered. They include uncertainties about what the subject will do next, the reliability of the pose estimation algorithm, and the accuracy of distance estimation along the camera’s line of sight.
Our key contribution is therefore a formal model that provides an estimate of the posterior variance and probabilistically fuses these sources of uncertainty with appropriate prior distributions. This has enabled us to develop an active motion capture technique that takes raw video footage as input from a moving aerial camera and continuously computes future target locations for positioning the camera, in a way that is optimized for human motion capture. We demonstrate our algorithm in two different scenarios and compare it against standard heuristics, such as constantly rotating around the subject and remaining at a constant angle with respect to the subject. We find that when allowed to choose the next location without constraints, our algorithm outperforms the baselines consistently. For simulated drone flight, our results are on par with constant rotation, which we conclude is the best trajectory to choose in the case of no obstacles blocking the circular flight path.
2 Related work
Most recent approaches to markerless motion capture rely on deep networks that regress 3D pose from single images [53, 54, 58, 73, 62, 68, 59, 76, 71, 70, 75]. While a few try to increase robustness by enforcing temporal consistency [60], none considers the effect that actively controlling the camera may have on accuracy. The methods most closely related to ours are therefore those that optimize camera placement in multi-camera setups and those that are used to guide robots in a previously unknown environment.
Optimal Camera Placement for Motion Capture. Optimal camera placement is a well-studied problem in the context of static multi-view setups. Existing solutions rely on maximizing image resolution while minimizing self-occlusion of body parts [43, 40] or target point occlusion and triangulation errors [64]. However, these methods operate offline and on pre-recorded exemplar motions. This makes them unsuitable for motion capture using a single moving camera that films a priori unknown motions in a much larger scene where estimation noise can be high.
Concurrent work [61] optimizes multiple camera poses for the triangulation of joints in a dome environment using a self-supervised reinforcement learning approach. In our case, we consider the monocular problem. Furthermore, our method is not learning-based; we obtain the next best view from the loss function itself.
View Planning for Static and People Reconstruction. There has been much robotics work on active reconstruction and view planning. This usually involves moving so as to maximize information gain while minimizing motion cost, for example by discretizing space into a volumetric grid and counting previously unseen voxels [52, 46] or by accumulating estimation uncertainty [57]. When a coarse scene model is available, an optimal trajectory can be found using offline optimization [67, 51]. This has also been done to achieve desired aesthetic properties in cinematography [49]. Another approach is to use reinforcement learning to define policies [45] or to learn a metric [50] for later online path planning. These methods deal with rigid, unchanging scenes, except the one in [44], which performs volumetric scanning of people while maximizing information gain. However, this approach can only deal with very slowly moving people who stay where they are.
Human Motion Capture on Drones. Drones can be viewed as flying cameras and are therefore natural targets for our approach. One problem, however, is that the drone must keep the person in its field of view. To achieve this, the algorithm of [77] uses 2D human pose estimation in a monocular video and non-rigid structure from motion to reconstruct the articulated 3D pose of a subject, while that of [55] reacts online to the subject’s motion to keep them in view and to optimize for screen-space framing objectives.
In [56], this was integrated into an autonomous system that actively directs a swarm of drones and simultaneously reconstructs 3D human and drone poses from onboard cameras. This strategy implements a predefined policy to stay at a constant distance from the subject and uses predefined view angles between two drones to maximize triangulation accuracy. This enables mobile large-scale motion capture, but relies on markers for accurate 2D pose estimation. In [74], three drones are used for markerless motion capture, using RGB-D video input for tracking the subject.
In short, existing methods either optimize drone placement for mostly rigid scenes, or estimate 3D human pose without optimizing the camera placement; [61] performs optimal camera placement for multiple cameras. Here, we propose an approach that aims to find the best next drone location in the monocular setting so as to maximize 3D human pose estimation accuracy.
3 Active Human Motion Capture
Our goal is to continually pre-position the camera in 6D pose space so that the images it acquires can be used to achieve the best overall human pose estimation accuracy. What makes this problem challenging is that, when we decide where to send the camera, we do not yet know where the subject will be and in what pose exactly. We therefore have to guess. To this end, we propose the following three-step approach depicted by Fig. 1:

1. Estimate the 3D human pose up to the current time instant.

2. Predict the person’s location and pose by the time the camera acquires the next image, including an uncertainty estimate.

3. Move the camera to the optimal 6D pose given that estimate.
We will consider two ways in which the camera can move. In the first, the camera can teleport from one location to the next without restriction. This can be simulated using a multi-camera setup and allows us to explore the theoretical limits of our approach. In the second, more realistic scenario, the camera is carried by a drone, and we must take into account the physical limits of the motion it can undertake.
3.1 3D Pose Estimation
The 3D pose estimation step takes as input the video feed from the onboard camera over the past frames and outputs, for each frame, the 3D human pose, represented as 15 3D joint positions, and the drone pose, represented as a 3D position and rotation angles.
The drone pose is computed relative to the static background with monocular structure from motion. Our focus is on estimating the human pose, on the basis of the deep-learning-based real-time method of [41], which detects the 2D locations of the human’s major joints in the image plane, and the subsequent use of [71], which lifts these 2D predictions to a 3D pose. However, these per-frame estimates are error-prone and relative to the camera.
To remedy this, we fuse the 2D and 3D predictions with temporal smoothness and bone-length constraints in a spacetime optimization. This exploits the fact that the drone is constantly moving, which disambiguates the individual estimates. The bone lengths of the subject’s skeleton are found through a calibration mode in which the subject has to stand still for 20 seconds. This is performed only once for each subject. Formally, we optimize for the global 3D human pose by minimizing an objective function $E$, which we detail below.
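As a small illustration of the bone-length constraint, the lengths can be read off a 3D pose given the skeleton's edge list (the helper name and edge encoding here are hypothetical, not the paper's data structures):

```python
import numpy as np

def bone_lengths(pose, edges):
    """Length of each bone in a skeleton.
    pose: (J, 3) array of 3D joint positions.
    edges: list of (parent, child) joint-index pairs, one per bone."""
    return np.array([np.linalg.norm(pose[i] - pose[j]) for i, j in edges])
```

During calibration these lengths are averaged over a static pose; during tracking, deviations from them are penalized.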
Formulation
Our primary goal is to improve the global 3D human pose estimate of a subject changing position and pose. We optimize the time-varying pose trajectories across the last frames of a temporal window. Let $t$ be the last observed frame. We capture the trajectory of poses in the pose matrix $\mathbf{M}$.
We then write an energy function

$$E(\mathbf{M}) = E_{\text{proj}}(\mathbf{M}) + E_{\text{lift}}(\mathbf{M}) + E_{\text{smooth}}(\mathbf{M}) + E_{\text{bone}}(\mathbf{M}). \quad (1)$$
The individual terms are defined as follows.
The lift term, $E_{\text{lift}}$, leverages the 3D pose estimates $\hat{L}_t$ from LiftNet [71]. Because these are relative to the hip and without absolute scale, we subtract the hip position from our absolute 3D pose $M_t$ and apply a scale factor $s$ to $\hat{L}_t$ to match the bone lengths in the least-squares sense. We write

$$E_{\text{lift}}(\mathbf{M}) = \lambda_{\text{lift}} \sum_{t} \left\| \left( M_t - M_t^{\text{hip}} \right) - s \hat{L}_t \right\|^2, \quad (2)$$

with $\lambda_{\text{lift}}$ its relative weight.
The projection term measures the difference between the detected 2D joint locations $\hat{x}_t$ and the projection of the estimated 3D pose in the least-squares sense. We write it as

$$E_{\text{proj}}(\mathbf{M}) = \lambda_{\text{proj}} \sum_{t} \left\| \hat{x}_t - \Pi(K, M_t) \right\|^2, \quad (3)$$

where $\Pi$ is the perspective projection function for the estimated drone pose at frame $t$, $K$ is the matrix of camera intrinsic parameters, and $\lambda_{\text{proj}}$ is a weight that controls the influence of this term.
The smoothness term exploits the fact that we are using a continuous video feed and that the motion is smooth, by penalizing the velocity computed by finite differences, as

$$E_{\text{smooth}}(\mathbf{M}) = \lambda_{\text{smooth}} \sum_{t} \left\| M_{t+1} - M_t \right\|^2, \quad (4)$$

with $\lambda_{\text{smooth}}$ as its weight.
To further constrain the solution space, we use our knowledge of the bone lengths found during calibration and penalize deviations in length. The length of bone $b$, connecting joints $b_1$ and $b_2$, is found as $\| M_t^{b_1} - M_t^{b_2} \|$ for frame $t$. The bone length term is then defined as

$$E_{\text{bone}}(\mathbf{M}) = \lambda_{\text{bone}} \sum_{t} \sum_{b \in B} \left( \| M_t^{b_1} - M_t^{b_2} \| - L_b \right)^2, \quad (5)$$

with $\lambda_{\text{bone}}$ as its weight, $B$ the set of all bones, and $L_b$ the calibrated length of bone $b$.
The complete energy is minimized by gradient descent at the beginning of each control cycle, to get a pose estimate for control. The resulting pose estimate is the maximum a posteriori estimate in a probabilistic view.
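A minimal sketch of such a gradient-descent minimization, on a simplified spacetime energy that keeps only a quadratic data term (standing in for the projection and lift terms) and the velocity smoothness term; all function names and weights here are illustrative assumptions, not the paper's implementation, which minimizes the full non-linear energy with PyTorch:

```python
import numpy as np

def energy(M, obs, w_data=1.0, w_smooth=0.5):
    """Simplified spacetime energy over a window of T frames.
    M, obs: (T, J, 3) arrays of estimated and observed joint positions."""
    e_data = w_data * np.sum((M - obs) ** 2)             # stands in for E_proj / E_lift
    e_smooth = w_smooth * np.sum((M[1:] - M[:-1]) ** 2)  # finite-difference velocity penalty
    return e_data + e_smooth

def minimize_energy(obs, w_data=1.0, w_smooth=0.5, steps=200, lr=0.05):
    """Plain gradient descent with the analytic gradient of the energy above."""
    M = obs.copy()
    for _ in range(steps):
        g = 2.0 * w_data * (M - obs)
        d = M[1:] - M[:-1]
        g[1:] += 2.0 * w_smooth * d    # derivative w.r.t. M_{t+1}
        g[:-1] -= 2.0 * w_smooth * d   # derivative w.r.t. M_t
        M -= lr * g
    return M
```

The real objective additionally contains the non-linear projection, lift, and bone-length terms; only the trade-off between a data term and temporal smoothness is illustrated here.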
Calibration Mode
Calibration mode only has to be run once for each subject to find the bone lengths $L_b$. In this mode, the scene is assumed to be stationary. The situation is equivalent to having the scene observed from multiple stationary cameras, such as in [66]. We find the single static pose $M^c$ that minimizes

$$E_{\text{calib}}(M^c) = E^{c}_{\text{proj}}(M^c) + E_{\text{sym}}(M^c). \quad (6)$$
In this objective, the projection term, $E^{c}_{\text{proj}}$, is akin to the one in our main formulation but acts on all calibration frames. It can be written as

$$E^{c}_{\text{proj}}(M^c) = \lambda^{c}_{\text{proj}} \sum_{t} \left\| \hat{x}_t - \Pi(K, M^c) \right\|^2, \quad (7)$$

with $\lambda^{c}_{\text{proj}}$ controlling its influence.
The symmetry term, $E_{\text{sym}}$, ensures that the left and right limbs of the estimated skeleton have the same lengths by penalizing the squared difference between the lengths of corresponding left and right bones.
3.2 Best Next View Selection
Our goal is to find the best next view for the drone at the future time step.
We will model the uncertainty of the pose estimate in a probabilistic setting. Let $p(\mathbf{M})$ be the posterior distribution of poses. Then, $E(\mathbf{M})$ is its negative logarithm, and its minimization corresponds to maximum a posteriori (MAP) estimation.
In this formalism, the sum of the individual terms in $E$ models that our posterior distribution is composed of independent likelihood and prior distributions. For a purely quadratic term $\lambda \| M - \mu \|^2$, the corresponding distribution is a Gaussian with mean $\mu$ and standard deviation $\sigma$. Notably, $\sigma$ is directly linked to the weight of the energy, with $\lambda \propto 1/\sigma^2$.
Most of our energy terms involve nonlinear operations, such as the perspective projection in $E_{\text{proj}}$, and therefore induce non-Gaussian distributions, as visualized in Fig. 2.
Nevertheless, as in the simple quadratic case, the weights $\lambda_{\text{proj}}$ and $\lambda_{\text{lift}}$ of $E_{\text{proj}}$ and $E_{\text{lift}}$ can be interpreted as surrogates for the amount of measurement noise in the 2D and 3D pose estimates.
A good measure of uncertainty is the sum of the eigenvalues of the covariance $\Sigma$ of the underlying distribution $p(\mathbf{M})$. The sum of the eigenvalues captures the spread of a multivariate distribution with a single scalar, similarly to the variance in the univariate case. To exploit this uncertainty estimate for our problem, we now extend $E$ to model not only the current and past poses but also the future ones, and condition it on the choice of the future drone position.
To determine the best next drone pose, we sample candidate positions and choose the one with the lowest uncertainty. This process is illustrated in Figure 3.
Future pose forecasting.
In our setting, accounting for the dynamic motion of the person is key to successfully positioning the camera. We model the motion of the person from the current frame into the future linearly, i.e., we aim to keep the velocity of the joints constant across our window of frames. The future pose vectors are constrained by the smoothness and bone length terms, but for now not by any image-based term, since the future images are not yet available. Minimizing this extended $E$ for the future poses gives the MAP forecasts. It continues the motion smoothly while maintaining the bone lengths.
As we predict only the near future, we have found this simple extrapolation to be sufficient. While more advanced methods [48] could be applied to forecast further, we leave this as future work.
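The constant-velocity core of this forecast can be sketched as follows (a hypothetical helper, shown before the smoothness and bone-length refinement described above):

```python
import numpy as np

def forecast_linear(poses, n_future):
    """Extrapolate future 3D poses assuming constant joint velocities.
    poses: (T, J, 3) array of past poses; returns (n_future, J, 3)."""
    v = poses[-1] - poses[-2]  # per-joint velocity from the last two frames
    return np.stack([poses[-1] + (k + 1) * v for k in range(n_future)])
```

In the full method, these extrapolated poses are further optimized under the smoothness and bone-length terms of the extended energy.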
Future measurement forecasting.
We aim to find the future drone position that reduces the posterior uncertainty, but we do not have footage from future viewpoints to condition the posterior on. Instead, we use the predicted future human pose $\hat{M}_{t+1}$ as a proxy and approximate the future 2D measurement with the projection

$$\hat{x}_{t+1} = \Pi(K, \hat{M}_{t+1}). \quad (8)$$

At first glance, constraining the future pose on these virtual estimates does not add anything, since the terms $E_{\text{proj}}$ and $E_{\text{lift}}$ are zero at the forecast by this construction. However, it changes the energy landscape and models how strongly a future observation would constrain the pose posterior. In particular, the projection term narrows down the solution space in the direction of the image plane but cannot constrain it in the depth direction, creating an elliptical uncertainty, as visualized in Fig. 3. The combined influence of all terms is conveniently modeled as the energy landscape of $E$ and its corresponding posterior.
In our current implementation, we assume that the 2D and 3D detections are affected by pose-independent noise, whose variance is captured by $\lambda_{\text{proj}}$ and $\lambda_{\text{lift}}$, respectively.
These factors could, in principle, be view-dependent and depend on the person’s pose. For instance, [42] may be more accurate at reconstructing a front view than a side view.
However, while estimating the uncertainty in deep networks is an active research field [63], predicting the expected uncertainty for an unobserved view has not yet been attempted in the pose estimation literature. It is an interesting avenue for future work.
Variance estimator.
$E$ and its corresponding posterior have a complex form due to the projection and prior terms. Hence, the sought-after covariance cannot be expressed in closed form, and approximating it by sampling the space of all possible poses would be expensive. Instead, for the sake of uncertainty estimation, we approximate the posterior locally with a Gaussian distribution, such that

$$p(\mathbf{M}) \approx \mathcal{N}(\mathbf{M}; \mu, \Sigma), \quad (9)$$

with $\mu$ and $\Sigma$ the Gaussian’s mean and covariance matrix, respectively. Such an approximation is exemplified in Figure 2.
For a Gaussian, the covariance can be computed in closed form as the inverse of the Hessian of the negative log-likelihood, $\Sigma = H^{-1}$ with $H = \nabla^2 \left( -\log p \right)$.
Under the Gaussian assumption, $H$ is thereby well approximated by the second-order gradients of $E$ at the MAP estimate.
Our experiments show that this simplification holds well for all of the introduced error terms, except for the bone length one, which we therefore exclude from uncertainty estimation.
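The Hessian-based uncertainty can be sketched numerically on a generic scalar energy (a toy stand-in; the paper obtains the second derivatives with PyTorch's automatic differentiation instead):

```python
import numpy as np

def numerical_hessian(f, x, eps=1e-4):
    """Central-difference second derivatives of a scalar function f at x."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * eps ** 2)
    return H

def uncertainty(f, x_map):
    """Sum of covariance eigenvalues, with covariance = inverse Hessian
    of the negative log posterior evaluated at the MAP estimate."""
    cov = np.linalg.inv(numerical_hessian(f, x_map))
    return float(np.sum(np.linalg.eigvalsh(cov)))
```

For a quadratic energy $\tfrac{1}{2} x^\top A x$ this recovers the sum of the eigenvalues of $A^{-1}$, since central differences are exact for quadratics.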
To select the view with minimum uncertainty among a set of candidate drone positions, we therefore

1. optimize $E$ once to forecast the future human poses,

2. use these forecasted poses to set the virtual future measurements $\hat{x}$ and $\hat{L}$ for each candidate position,

3. compute the second-order derivatives of $E$ for each candidate, which form the Hessian $H$, and

4. compute $\Sigma = H^{-1}$ and sum up its eigenvalues to select the candidate with the least uncertainty.
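These steps can be illustrated with a toy version of the selection loop, in which each candidate viewpoint constrains the two directions orthogonal to its viewing ray but not the depth along it (the elliptical uncertainty discussed above); all names, weights, and the rank-deficient observation model are illustrative assumptions, not the paper's full energy:

```python
import numpy as np

def view_information(d, w_proj=10.0):
    """Information contributed by a 2D observation from viewing direction d:
    it constrains the plane orthogonal to d, but not the depth along d."""
    d = d / np.linalg.norm(d)
    return w_proj * (np.eye(3) - np.outer(d, d))

def best_next_view(past_dirs, candidates, w_prior=0.1):
    """Pick the candidate minimizing the sum of covariance eigenvalues."""
    H_past = w_prior * np.eye(3) + sum(view_information(d) for d in past_dirs)
    scores = []
    for c in candidates:
        cov = np.linalg.inv(H_past + view_information(c))
        scores.append(np.sum(np.linalg.eigvalsh(cov)))  # total uncertainty
    return int(np.argmin(scores))
```

With one past view along the x-axis, the planner prefers an orthogonal candidate, which resolves the unconstrained depth, over revisiting the same ray.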
Discussion.
In principle, $p(\mathbf{M}^\ast)$, i.e., the probability of the most likely pose, could also act as a measure of certainty, as implicitly used in [64] on a known motion trajectory to minimize triangulation error. However, the projection term of $E$ is zero at the future time step, because the projection of the forecast pose is, by construction, equal to the virtual measurement and therefore uninformative.
Another alternative that has been proposed in the literature is to approximate the covariance through first-order estimates [72], as a function of the Jacobian matrix. However, since the first-order gradients of $E$ also vanish at the MAP estimate, this approximation is not possible in our case.
3.3 Drone Control Policies and Flight Model
We control the flight of our drone by passing it the desired velocity vector and the desired yaw rotation amount with the maximum speed kept constant at m/s. The drone is sent new commands once every seconds.
Modeling the flight of the drone allows us to foresee the positions the drone will be able to reach when we give it various commands. By forecasting the future locations of the drone, we can predict the 2D pose estimates for each candidate more accurately.
We model the drone flight in the following manner. We assume that the drone moves with constant acceleration during a time step $\Delta t$. If the drone has current position $p_t$ and velocity $v_t$, then with acceleration $a_t$, its next position will be

$$p_{t+1} = p_t + v_t \Delta t + \tfrac{1}{2} a_t \Delta t^2. \quad (10)$$

We model our input to the system as the acceleration $a_{\text{in}}$. The direction of the acceleration is assumed to be the direction of the velocity vector we pass as the movement command to the simulator, and the magnitude is a value determined through least-squares minimization. The current acceleration at time $t$ is found as a weighted average of the input acceleration and the acceleration of the previous step,

$$a_t = \alpha \, a_{\text{in}} + (1 - \alpha) \, a_{t-1}. \quad (11)$$
By estimating the future positions of the drone, we are able to forecast more accurate future 2D pose estimates, leading to more accurate decision making. Examples of predicted trajectories are shown in Figure 4. Further details are provided in the supplementary material.
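Rolling this model forward to predict candidate drone positions can be sketched as follows (dt and the blending weight alpha are illustrative values, not the fitted ones from the paper):

```python
import numpy as np

def predict_positions(p, v, a_prev, a_cmd, n_steps, dt=0.2, alpha=0.7):
    """Roll out the constant-acceleration flight model of Eqs. (10)-(11).
    p, v, a_prev, a_cmd: (3,) position, velocity, previous and commanded
    acceleration; returns (n_steps, 3) predicted positions."""
    out = []
    for _ in range(n_steps):
        a = alpha * a_cmd + (1.0 - alpha) * a_prev  # blended acceleration, Eq. (11)
        p = p + v * dt + 0.5 * a * dt ** 2          # position update, Eq. (10)
        v = v + a * dt
        a_prev = a
        out.append(p.copy())
    return np.stack(out)
```

Each candidate movement command yields one such rollout, and the uncertainty is evaluated at the predicted end positions.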
4 Evaluation
In this section, we evaluate the improvement in 3D human pose estimation that is achieved through optimization of the drone flight.
Simulation environment. Although [65, 41, 71] run in real time, and online SLAM from a monocular camera [47] is possible, we use a drone simulator since the integration of all components onto constrained drone hardware is difficult and beyond our expertise.
We make simulation realistic by driving our characters with real motion capture data from the CMU Graphics Lab Motion Capture Database [78] and using the AirSim [69] drone simulator that builds upon the Unreal game engine and therefore produces realistic images of natural environments. An image of AirSim is shown in Figure 5. Simulation also has the advantage that the same experiment can be repeated with different parameters and be directly compared to baseline methods and groundtruth motion.
Simulated test set.
We test our approach on three motions from the CMU database that increase in difficulty: walking straight (subject 2, trial 1), dancing with twirling (subject 5, trial 8), and running in a circle (subject 38, trial 3). Additionally, we use a validation set consisting of basketball dribbling (subject 6, trial 13) and sitting on a stool (subject 13, trial 6) to conduct a grid search for the hyperparameters.
Real test set.
To show that our planner also works outside the simulator, we evaluate our approach on a section of the MPI-INF-3DHP dataset, which includes motions such as running around in a circle and waving arms in the air. The dataset provides fixed viewpoints that are at varying distances from one another and from the subject, as depicted in Figure 7. In this case, the best next view is restricted to one of the fixed viewpoints. This dataset lets us evaluate whether the object detector of [65], the 2D pose estimation method of [42], and the 3D pose regression technique of [71] are reliable enough in real environments. Since we cannot control the camera in this setting, we remove from the candidate locations those cameras for which we predict that the subject will be out of view.
Table 1. Mean per-joint 3D pose error (m, ± std. dev. over 5 trials) for the teleportation experiment.

Method             CMU-Walk       CMU-Dance      CMU-Run        MPI-INF-3DHP   Total
Oracle             0.101±0.001    0.101±0.001    0.109±0.001    0.109±0.002    0.105±0.001
Ours (Active)      0.113±0.001    0.116±0.003    0.135±0.002    0.145±0.006    0.127±0.003
Random             0.123±0.002    0.125±0.003    0.159±0.003    0.259±0.011    0.167±0.005
Constant Rotation  0.157±0.002    0.146±0.004    0.223±0.003    0.254±0.008    0.195±0.004
Constant Angle     0.895±0.54     0.683±0.31     0.985±0.24     1.45±0.63      1.00±0.43
Baselines.
Existing dronebased pose estimation methods use predefined policies to control the drone position relative to the human. Either the human is followed from a constant angle and the angle is set externally by the user [56] or the drone undergoes a constant rotation around the human [77]. As another baseline, we use a random decision policy, where the drone picks uniformly randomly among the proposed viewpoints. Finally, the oracle is obtained by moving the drone to the viewpoint where the reconstruction in the next time step will have the lowest average error, which is achieved by exhaustively trying all proposed viewpoints with the corresponding image in the next time frame.
Hyperparameters. We set the weights of the reconstruction loss terms (projection, smoothness, lift, and bone length) by grid search, and set the weights for decision making separately. Our reasoning is that the weights of the projection and lift terms need to be slightly lower for reconstruction because the corresponding estimates are corrupted by large noise, introduced by the neural networks or as additive noise; however, they do not need to be as low for the uncertainty estimation.
4.1 Analyzing Reconstruction Accuracy
We report the mean Euclidean distance per joint, in meters, in the middle frame of the temporal window we optimize over. In the toy example, the temporal window comprises a set of past frames and one future frame; for the drone flight simulations, it comprises both past and future frames.
Simulation Initialization.
The poses are initialized by back-projecting the 2D joint locations estimated in the first frame to a distance from the camera that is chosen such that the back-projected bone lengths match the average human height. We then refine this initialization by running the optimization without the smoothness term, as there is only one frame. All the sequences are evaluated for a fixed number of frames, with the animation sequences played at a fixed rate.
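This depth initialization can be sketched in simplified form by scaling the back-projected skeleton so that its vertical extent matches an assumed average height (the function, the pinhole setup, and the 1.7 m constant are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def init_from_backprojection(joints_2d, focal, target_height=1.7):
    """Back-project 2D joints (pixel coords, principal point at origin)
    to unit depth, then pick a common depth so that the skeleton's
    vertical extent matches target_height (in meters)."""
    rays = np.concatenate([joints_2d / focal,
                           np.ones((len(joints_2d), 1))], axis=1)
    extent = rays[:, 1].max() - rays[:, 1].min()  # skeleton height at unit depth
    z = target_height / extent                    # shared depth scale
    return rays * z
```

The same idea applies per bone rather than to the overall extent; a single shared depth keeps the sketch minimal.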
Toy Example: Simulating Teleportation.
To understand whether our uncertainty predictions for potential viewpoints coincide with the actual 3D pose errors we would incur at these locations, we run the following simulation: we sample a number of points on a ring around the person, as shown in Fig. 6, and allow the drone to teleport to any of these points. We optimize over a window of past frames and forecast one frame into the future. The reasoning behind our choice of window size is that we wanted to emphasize the importance of the next frame choice.
We perform two variants of this experiment. In the first one, we simulate the 2D and 3D pose estimates by adding Gaussian noise to the ground-truth data. The mean and standard deviation of this noise are set to the error of [41] and [71], run on the validation set of animations. Figure 8 shows a comparison between the ground-truth values, the noisy ground-truth values, and the network results. The results of this experiment are reported in Table 1, where we also provide the standard deviations across 5 trials with varying noise, starting from different viewpoints. As a second variant, we use [41] and [71] on the simulator images to obtain the 2D and 3D pose estimates. The results are in the supplementary material.
Altogether, the results show that our active motion planner achieves consistently lower error values than the baselines, and we come the closest to achieving the best possible error for these sequences and viewpoints, despite having no access to the true error. The random baseline also performs quite well in these experiments, as it takes advantage of the drone teleporting to a varied set of viewpoints. The trajectories generated by our active planner and the baselines are depicted in Figure 9. Importantly, Figure 6 evidences that our predicted uncertainties accurately reflect the true pose errors, thus making them well suited to our goal.
Simulating Drone Flight.
To evaluate more realistic cases where the drone is actively controlled and constrained to only move to nearby locations, we simulate the drone flight using the AirSim environment.
While simulating drone flight, we target a fixed radius from the subject and therefore provide direction candidates that preserve this distance. We do not provide samples at different distances, as moving closer is unsafe and moving farther leads to more concentrated image projections and thus higher 3D errors.
We also restrict the drone from flying outside a fixed altitude range, so as to avoid crashing into the ground or endangering the subject by flying above them.
In this set of experiments, we fly the drone using the simulator’s realistic physics engine. To this end, we sample candidate directions up, down, left, right, up-right, up-left, down-right, down-left, and center. We then predict the consecutive future locations using our simplified (closed-form) physics model to estimate where the drone will be when continuing in each of the directions. We then estimate the uncertainty at these sampled viewpoints and choose the minimum.
Table 2. Mean per-joint 3D pose error (m, ± std. dev.) for simulated drone flight.

Method             CMU-Walk     CMU-Dance    CMU-Run      Total
Ours (Active)      0.26±0.03    0.22±0.04    0.44±0.04    0.31±0.04
Constant Rotation  0.28±0.06    0.21±0.04    0.41±0.02    0.30±0.04
Random             0.60±0.13    0.44±0.19    0.81±0.16    0.62±0.16
Constant Angle     0.41±0.07    0.63±0.06    1.26±0.17    0.77±0.10
We achieve results comparable to constant rotation on simulated drone flight. In fact, except for the first few frames where the drone starts flying, we observe the same trajectory as constant rotation; only the rotation direction varies. Constant rotation being optimal in this setting is not counterintuitive, as it preserves momentum, which allows the drone to sample viewpoints that are as far apart from one another as possible while keeping the subject in view. Figure 10 depicts the baseline trajectories and the active trajectory.
5 Limitations and Future Work
Our primary goal was to make uncertainty estimation tractable; further improvements are needed to run it on an embedded drone system. The current implementation runs slowly, as the optimization is implemented in Python using the convenient but slow automatic differentiation of PyTorch to obtain second derivatives.
Currently, we assume a constant error for the 2D and 3D pose estimates. In future work, we will investigate how to derive situation-dependent noise models of deep neural networks.
We have considered a physically plausible drone model but neglected physical obstacles and virtual nogo areas that would restrict the possible flight trajectories. In the case of complex scenes with dynamic obstacles, we expect our algorithm to outperform any simple, predefined policy.
6 Conclusion
We have proposed a theoretical framework for estimating the uncertainty of future measurements from a drone. This permits us to improve 3D human pose estimation by optimizing the drone flight to visit those locations with the lowest expected uncertainty.
We have demonstrated with increasingly complex examples, in simulation with synthetic and real footage, that this theory translates to closedloop drone control and improves pose estimation accuracy. Key to the success of our approach is the integration of several sources of uncertainty. In future work, we would like to find new ways of estimating the uncertainty of the deployed deep learning methods and extend our work to optimize drone trajectories for different computer vision tasks.
Appendix A Supplementary Material
a.1 Supplementary Video
The supplementary video provides a short overview of our work and summarizes the methodology and results. It includes video results of our active trajectories for both the teleportation and simulated flight cases. The video is available at https://youtu.be/Dqv7ZJQi28o.
a.2 The Drone Flight Model
As mentioned in our main document, in order to accurately predict where the drone will be positioned after passing it a goal velocity, we have formulated a drone flight model.
Ablation Study. We replace our drone flight model with uniform sampling around the drone. This is illustrated in Figure 11. We evaluate the performance of our active decision-making policy with uniform sampling in Table 3. The trajectories found using this sampling policy are shown in Figure 12. We find that the algorithm cannot recover the constant rotation policy when we remove the drone flight model and, in turn, performs worse.
Table 3. Ablation of the drone flight model (mean per-joint 3D pose error, m, ± std. dev.).

Method                    CMU-Dribble   CMU-Sitting   CMU-Dinosaur   Total
Active with Flight Model  0.28±0.006    0.15±0.007    0.12±0.02      0.18±0.01
Active w/o Flight Model   0.65±0.09     0.48±0.09     0.22±0.07      0.45±0.08
Constant Rot.             0.30±0.02     0.15±0.01     0.15±0.03      0.20±0.02
a.3 Results with OpenPose and LiftNet
We evaluate our results on the toy example case, using the networks of [41] and [71] to find the 2D pose detections and 3D relative pose detections . The results are reported in Table 4. We outperform the baselines significantly for the real image dataset MPIINF3DHP. For the synthetic images, somes we are outperformed by random, but its error has much higher standard deviation and the difference between ours and random is within 1 standard deviation.
We outperform the baselines significantly in the real image dataset as compared to the synthetic datasets because the error of network [41] for real data is much lower than for synthetic data. We verify this by comparing the normalized 2Dpose estimation errors of a synthetic sequence and a sequence taken from the MPIINF3DHP dataset. We find that the normalized average error of [41] of the synthetic sequence is with standard deviation, whereas the normalized average error of the real image sequence is with standard deviation. Therefore, the unrealistically high noise of OpenPose on the synthetic data deprives strong conclusions from the first three columns of Table 4.
The oracle still performs very well on the synthetic images in this case, but it makes its decisions knowing the results of [41] for all candidate locations, which is impossible in practice due to the inherent uncertainty.
When our 2D pose detector is reliable, as in the experiments of our main document, we outperform random in all cases, by well over two standard deviations.
For the MPI-INF-3DHP dataset, we remove the ceiling cameras for this set of experiments. Since the networks of [41] and [71] were not trained on views from such angles, they give highly noisy results on them, which would add noise to the values we report.
Method         | CMU-Walk   | CMU-Dance   | CMU-Run    | MPI-INF-3DHP | Total
Oracle         | 0.13±0     | 0.15±0      | 0.16±0.0005 | 0.17±0.0005 | 0.15±0.0003
Ours (Active)  | 0.16±0.005 | 0.25±0.0009 | 0.25±0.002  | 0.21±0.0008 | 0.22±0.002
Random         | 0.17±0.004 | 0.24±0.01   | 0.24±0.005  | 0.28±0.03   | 0.23±0.01
Constant Rot.  | 0.20±0.002 | 0.28±0.02   | 0.28±0.001  | 0.29±0.007  | 0.26±0.007
Constant Angle | 0.71±0.50  | 0.76±0.37   | 0.69±0.22   | 1.26±0.53   | 0.72±0.4
A.4 Further Details About the Simulation Environment
To test our algorithms, we use the AirSim drone simulator [69], a plugin for the Unreal game engine.
AirSim provides a Python API that can be used to control the drone realistically, since it uses the same flight controllers as actual drones. The position and orientation of the drone can be retrieved from the simulator in the world coordinate system, which takes the drone's starting point as its origin. The drone can be commanded to move with a specified velocity for a specified duration. We have added functionality to the simulator to control a human character, retrieve ground-truth information about the character, and animate it with motions from the CMU Graphics Lab Motion Capture Database [78].
For experiments requiring teleportation we use the simulator in "ComputerVision" mode, whereas for experiments simulating flight we use "Multirotor" mode.
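The velocity-command interface described above can be sketched as follows. In the real API one would use `airsim.MultirotorClient()` and its `moveByVelocityAsync` call; since that requires a running simulator, the sketch below substitutes a minimal mock with the same method shape, and its perfectly integrated motion is an idealization that AirSim's flight controller does not guarantee.

```python
class MockMultirotorClient:
    """Stand-in for airsim.MultirotorClient, mimicking the shape of its
    velocity command. Positions are in the world frame whose origin is
    the drone's starting point, using AirSim's NED convention (-z is up)."""

    def __init__(self):
        self.position = [0.0, 0.0, 0.0]

    def moveByVelocityAsync(self, vx, vy, vz, duration):
        # Idealized tracking: the position advances by exactly
        # velocity * duration; a real flight controller would not
        # follow the command perfectly.
        self.position[0] += vx * duration
        self.position[1] += vy * duration
        self.position[2] += vz * duration

client = MockMultirotorClient()
# Fly forward at 1 m/s while climbing at 0.5 m/s for two seconds.
client.moveByVelocityAsync(1.0, 0.0, -0.5, duration=2.0)
print(client.position)  # [2.0, 0.0, -1.0]
```

Against the real simulator, the same call pattern applies after connecting with `confirmConnection()` and `enableApiControl(True)`, and the mismatch between commanded and achieved motion is precisely what the drone flight model of Section A.2 accounts for.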
References

A. Aissaoui, A. Ouafi, P. Pudlo, C. Gillet, Z.-E. Baarir, and A. Taleb-Ahmed. Designing a Camera Placement Assistance System for Human Motion Capture Based on a Guided Genetic Algorithm. Virtual Reality, 22(1):13–23, 2018.

Z. Cao, T. Simon, S. Wei, and Y. Sheikh. Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. In Conference on Computer Vision and Pattern Recognition, pages 1302–1310, 2017.

Y. Chao, J. Yang, B. Price, S. Cohen, and J. Deng. Forecasting Human Dynamics from Static Images. In Conference on Computer Vision and Pattern Recognition, 2017.

X. Chen and J. Davis. Camera Placement Considering Occlusion for Robust Motion Capture. Computer Graphics Laboratory, Stanford University, Tech. Rep., 2(2.2):2, 2000.

W. Cheng, L. Xu, L. Han, Y. Guo, and L. Fang. iHuman3D: Intelligent Human Body 3D Reconstruction Using a Single Flying Camera. In ACM Multimedia Conference, pages 1733–1741. ACM, 2018.

S. Choudhury, A. Kapoor, G. Ranade, and D. Dey. Learning to Gather Information via Imitation. In International Conference on Robotics and Automation, 2017.

J. Daudelin and M. Campbell. An Adaptable, Probabilistic, Next-Best View Algorithm for Reconstruction of Unknown 3D Objects. IEEE Robotics and Automation Letters, 2(3):1540–1547, 2017.

A. J. Davison, I. Reid, N. Molton, and O. Stasse. MonoSLAM: Real-Time Single Camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1052–1067, June 2007.

K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik. Recurrent Network Models for Human Dynamics. In International Conference on Computer Vision, 2015.

C. Gebhardt, S. Stevsic, and O. Hilliges. Optimizing for Aesthetically Pleasing Quadrotor Camera Motion. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH), 37(4):90:1–90:11, 2018.

B. Hepp, D. Dey, S. Sinha, A. Kapoor, N. Joshi, and O. Hilliges. Learn-To-Score: Efficient 3D Scene Exploration by Predicting View Utility. In European Conference on Computer Vision, 2018.

B. Hepp, M. Nießner, and O. Hilliges. Plan3D: Viewpoint and Trajectory Optimization for Aerial Multi-View Stereo Reconstruction. ACM Transactions on Graphics (TOG), 38(1):4, 2018.

S. Isler, R. Sabzevari, J. Delmerico, and D. Scaramuzza. An Information Gain Formulation for Active Volumetric 3D Reconstruction. In International Conference on Robotics and Automation, 2016.

J. Martinez, R. Hossain, J. Romero, and J. Little. A Simple Yet Effective Baseline for 3D Human Pose Estimation. In International Conference on Computer Vision, 2017.

D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt. Monocular 3D Human Pose Estimation in the Wild Using Improved CNN Supervision. In International Conference on 3D Vision, 2017.

T. Nägeli, L. Meier, A. Domahidi, J. Alonso-Mora, and O. Hilliges. Real-Time Planning for Automated Multi-View Drone Cinematography. 2017.

T. Nägeli, S. Oberholzer, S. Plüss, J. Alonso-Mora, and O. Hilliges. Real-Time Environment-Independent Multi-View Human Pose Estimation with Aerial Vehicles. 2018.

E. Palazzolo and C. Stachniss. Information-Driven Autonomous Exploration for a Vision-Based MAV. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, 4:59, 2017.

G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Coarse-To-Fine Volumetric Prediction for Single-Image 3D Human Pose. In Conference on Computer Vision and Pattern Recognition, 2017.

G. Pavlakos, X. Zhou, K. G. Derpanis, and K. Daniilidis. Harvesting Multiple Views for Marker-Less 3D Human Pose Annotations. In Conference on Computer Vision and Pattern Recognition, 2017.

D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli. 3D Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised Training. In Conference on Computer Vision and Pattern Recognition, 2019.

A. Pirinen, E. Gärtner, and C. Sminchisescu. Domes to Drones: Self-Supervised Active Triangulation for 3D Human Pose Reconstruction. In Advances in Neural Information Processing Systems 32, pages 3907–3917, 2019.

A.-I. Popa, M. Zanfir, and C. Sminchisescu. Deep Multitask Architecture for Integrated 2D and 3D Human Sensing. In Conference on Computer Vision and Pattern Recognition, 2017.

S. Prokudin, P. Gehler, and S. Nowozin. Deep Directional Statistics: Pose Estimation with Uncertainty Quantification. In European Conference on Computer Vision, pages 534–551, 2018.

P. Rahimian and J. K. Kearney. Optimal Camera Placement for Motion Capture Systems. IEEE Transactions on Visualization and Computer Graphics, 23(3):1209–1221, 2016.

J. Redmon and A. Farhadi. YOLOv3: An Incremental Improvement. arXiv Preprint, 2018.

H. Rhodin, C. Richardt, D. Casas, E. Insafutdinov, M. Shafiei, H.-P. Seidel, B. Schiele, and C. Theobalt. EgoCap: Egocentric Marker-Less Motion Capture with Two Fisheye Cameras. ACM SIGGRAPH Asia, 35(6), 2016.

M. Roberts, D. Dey, A. Truong, S. Sinha, S. Shah, A. Kapoor, P. Hanrahan, and N. Joshi. Submodular Trajectory Optimization for Aerial 3D Scanning. In International Conference on Computer Vision, 2017.

G. Rogez, P. Weinzaepfel, and C. Schmid. LCR-Net: Localization-Classification-Regression for Human Pose. In Conference on Computer Vision and Pattern Recognition, 2017.

S. Shah, D. Dey, C. Lovett, and A. Kapoor. AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles. In Field and Service Robotics, 2017.

X. Sun, J. Shang, S. Liang, and Y. Wei. Compositional Human Pose Regression. In International Conference on Computer Vision, 2017.

B. Tekin, P. Marquez-Neila, M. Salzmann, and P. Fua. Learning to Fuse 2D and 3D Image Cues for Monocular Body Pose Estimation. In International Conference on Computer Vision, 2017.

A. Tkach, A. Tagliasacchi, E. Remelli, M. Pauly, and A. Fitzgibbon. Online Generative Model Personalization for Hand Tracking. ACM Transactions on Graphics (TOG), 36(6):243, 2017.

D. Tome, C. Russell, and L. Agapito. Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image. arXiv Preprint, arXiv:1701.00295, 2017.

Y. Xu, X. Liu, Y. Liu, and S. Zhu. Multi-View People Tracking via Hierarchical Trajectory Composition. In Conference on Computer Vision and Pattern Recognition, pages 4256–4265, 2016.

A. Zanfir, E. Marinoiu, and C. Sminchisescu. Monocular 3D Pose and Shape Estimation of Multiple People in Natural Scenes – the Importance of Multiple Scene Constraints. In Conference on Computer Vision and Pattern Recognition, June 2018.

X. Zhou, Q. Huang, X. Sun, X. Xue, and Y. Wei. Weakly-Supervised Transfer for 3D Human Pose Estimation in the Wild. arXiv Preprint, 2017.

X. Zhou, S. Liu, G. Pavlakos, V. Kumar, and K. Daniilidis. Human Motion Capture Using a Drone. In International Conference on Robotics and Automation, 2018.

CMU Graphics Lab Motion Capture Database.