EvoPose2D: Pushing the Boundaries of 2D Human Pose Estimation using Neuroevolution
Neural architecture search has proven to be highly effective in the design of computationally efficient, task-specific convolutional neural networks across several areas of computer vision. In 2D human pose estimation, however, its application has been limited by high computational demands. Hypothesizing that neural architecture search holds great potential for 2D human pose estimation, we propose a new weight transfer scheme that relaxes function-preserving mutations, enabling us to accelerate neuroevolution in a flexible manner. Our method produces 2D human pose network designs that are more efficient and more accurate than state-of-the-art hand-designed networks. In fact, the generated networks can process images at higher resolutions using less computation than previous networks at lower resolutions, permitting us to push the boundaries of 2D human pose estimation. Our baseline network designed using neuroevolution, which we refer to as EvoPose2D-S, provides comparable accuracy to SimpleBaseline while using 4.9x fewer floating-point operations and 13.5x fewer parameters. Our largest network, EvoPose2D-L, achieves new state-of-the-art accuracy on the Microsoft COCO Keypoints benchmark while using 2.0x fewer operations and 4.3x fewer parameters than its nearest competitor. The code is available at https://github.com/wmcnally/evopose2d.
Two-dimensional human pose estimation is a visual recognition task dealing with the autonomous localization of anatomical human joints, or more broadly, “keypoints,” in RGB images and video [44, 43, 2]. It is widely considered a fundamental problem in computer vision due to its many downstream applications, including action recognition [10, 32] and human tracking [19, 1, 49]. In particular, it is a precursor to 3D human pose estimation [31, 36], which serves as a potential alternative to invasive marker-based motion capture.
In line with other streams of computer vision, the use of deep learning, and specifically convolutional neural networks (CNNs), has been prevalent in 2D human pose estimation [44, 43, 35, 7, 9, 49, 40]. State-of-the-art methods use a two-stage, top-down pipeline, where an off-the-shelf person detector is first used to detect human instances in an image, and the 2D human pose network is run on the person detections to obtain keypoint predictions [9, 49, 40]. This paper focuses on the latter stage of this commonly used pipeline.
Recently, there has been a growing interest in the use of machines to help design CNN architectures through a process referred to as neural architecture search (NAS) [53, 3, 47]. These methods eliminate human bias and permit the automated exploration of diverse network architectures that often transcend human intuition, leading to better accuracy and computational efficiency. Despite the widespread success of NAS in many areas of computer vision [42, 8, 28, 12, 34, 52], the design of 2D human pose networks has remained, for the most part, human-principled.
Motivated by the success of NAS in other visual recognition tasks, this paper explores the application of neuroevolution, a form of NAS, to 2D human pose estimation for the first time. First, we propose a new weight transfer scheme that is highly flexible and reduces the computational expense of neuroevolution. Next, we exploit this weight transfer scheme, along with large-batch training on high-bandwidth Tensor Processing Units (TPUs), to accelerate neuroevolution within a tailor-made search space geared towards 2D human pose estimation. In experiments, our method produces a 2D human pose network that has a relatively simple design, provides state-of-the-art accuracy when scaled, and uses fewer operations and parameters than the best-performing networks in the literature (see Fig. 1). We summarize our research contributions as follows.
We propose a new weight transfer scheme to accelerate neuroevolution. In contrast to previous neuroevolution methods that exploit weight transfer, our method is not constrained by complete function preservation [48, 46]. Despite relaxing this constraint, our experiments indicate that the level of functional preservation afforded by our weight transfer scheme is sufficient to provide fitness convergence, thereby simplifying neuroevolution and making it more flexible.
We present empirical evidence that large-batch training can be used in conjunction with the Adam optimizer to accelerate the training of 2D human pose networks with no loss in accuracy. We reap the benefits of large-batch training in our neuroevolution by maximizing training throughput on high-memory TPUs.
We design a search space conducive to 2D human pose estimation and leverage the above contributions to run a full-scale neuroevolution of 2D human pose networks within a practical time-frame (1 day using eight v2-8 TPUs). As a result, we are able to produce a computationally efficient 2D human pose estimation model that achieves state-of-the-art accuracy on the most widely used benchmark.
2 Related Work
This work draws upon several research areas in deep learning to engineer a high-performing 2D human pose estimation model. We review the three areas of the literature that are most relevant in the following sections.
Large-batch Training of Deep Neural Networks. Recent experiments have indicated that training deep neural networks using large batch sizes (256) with stochastic gradient descent causes a degradation in the quality of the model as measured by its ability to generalize to unseen data [16, 20]. It has been shown that the difference in accuracy on training and test sets, sometimes referred to as the generalization gap, can be as high as 5% as a result of using large batch sizes. Goyal et al. implemented measures for mitigating the training difficulties caused by large batch sizes, including linear scaling of the learning rate and an initial warmup period during which the learning rate was gradually increased. These measures permitted them to train a ResNet-50 on the ImageNet classification task using a batch size of 8192 with no loss in accuracy, and training took just 1 hour on 256 GPUs.
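The two measures described above, linear scaling of the learning rate and gradual warmup, can be illustrated with a minimal sketch. The function name `scaled_lr` and the default values (a reference rate of 0.1 at a reference batch size of 256, and 500 warmup steps) are assumptions chosen for illustration only; they are not the settings used in this paper.

```python
def scaled_lr(step, batch_size, base_lr=0.1, base_batch_size=256, warmup_steps=500):
    """Learning rate at `step` under linear scaling with gradual warmup.

    All default values are illustrative assumptions, not settings from the paper.
    """
    # Linear scaling rule: scale the reference rate by batch_size / base_batch_size.
    target_lr = base_lr * batch_size / base_batch_size
    if step < warmup_steps:
        # Warmup: ramp linearly from the reference rate up to the scaled target.
        return base_lr + (target_lr - base_lr) * step / warmup_steps
    # After warmup, hold the scaled rate (any later decay schedule is omitted here).
    return target_lr


# Example: a batch size of 8192 is 32 times the reference batch size of 256,
# so the rate ramps from 0.1 at step 0 towards 32 * 0.1 once warmup has finished.
lr_start = scaled_lr(0, 8192)
lr_end = scaled_lr(500, 8192)
```

Whether the same rule carries over to optimizers with adaptive learning rates is a separate question that this sketch does not address.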
Maximizing training efficiency using large-batch training is critical in situations where the computational demand of training is very high, such as in neural architecture search. However, deep learning methods are often data-dependent, and so it remains unclear whether the training measures imposed by Goyal et al. apply in the general case. It is also unclear whether the learning rate modifications are applicable to optimizers that use adaptive learning rates. Adam is an example of such an optimizer and is widely used in 2D human pose estimation. To this end, we empirically investigate the use of large batch sizes in conjunction with the Adam optimizer in the training of 2D human pose networks in Section 4.2.2.
2D Human Pose Estimation using Deep Learning. Interest in human pose estimation dates back to 1973, when Fischler and Elschlager used pictorial structures to recognize facial attributes in photographs. The first use of deep learning for human pose estimation came in 2014, when Toshev and Szegedy regressed 2D keypoint coordinates directly from RGB images using a cascade of deep CNNs. Their method laid the foundation for a series of CNN-based methods offering superior performance over part-based models by learning features directly from the data as opposed to using primitive hand-crafted feature descriptors.
Arguing that the direct regression of pose vectors from images was a highly non-linear, difficult-to-learn mapping, Tompson et al. introduced the notion of learning a heatmap representation, which represented the per-pixel likelihood for the existence of keypoints. The mean squared error (MSE) was used to minimize the distance between the predicted and target heatmaps, where the targets were generated using Gaussians with small variance centered on the ground-truth keypoint locations. The heatmap representation was highly effective, and continues to be used in the most recent human pose estimation models.
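The heatmap representation described above can be sketched in a few lines. This is a minimal illustration, assuming a single keypoint, an unnormalized Gaussian with peak value 1, and a hypothetical standard deviation of 2 pixels; the helper names `gaussian_heatmap`, `mse`, and `argmax_2d` are introduced here for illustration and do not come from any particular implementation.

```python
import math


def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Target heatmap: a Gaussian centered on the keypoint (cx, cy).

    sigma is an illustrative assumption for the "small variance" of the target.
    """
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def mse(pred, target):
    """Mean squared error between two heatmaps, averaged over all pixels."""
    total = sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    return total / (len(target) * len(target[0]))


def argmax_2d(heatmap):
    """Decode a keypoint as the (x, y) location of the heatmap maximum."""
    _, x, y = max((v, x, y) for y, row in enumerate(heatmap) for x, v in enumerate(row))
    return x, y


# A 64x48 target for a keypoint at x=20, y=30, compared with an all-zero prediction.
target = gaussian_heatmap(64, 48, cx=20, cy=30)
blank = [[0.0] * 48 for _ in range(64)]
loss = mse(blank, target)        # positive, since the prediction misses the peak
keypoint = argmax_2d(target)     # recovers the keypoint location from the heatmap
```

In practice one heatmap is predicted per keypoint type, and the loss is averaged over all of them; the single-keypoint case is shown only to keep the sketch short.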
Several of the methods that followed built upon iterative heatmap refinement in a multi-stage fashion including intermediate supervision [45, 7, 35]. Noting the inefficiencies associated with multi-stage stacking, Chen et al. designed the Cascaded Pyramid Network (CPN), a holistic network constructed using a ResNet-50 feature pyramid. Xiao et al. presented yet another holistic architecture that stacked transpose convolutions on top of ResNet. The aptly named SimpleBaseline network outperformed CPN despite having a simple architecture and implementation. Sun et al. observed that most existing methods recover high-resolution features from low-resolution embeddings. They demonstrated with HRNet that maintaining high-resolution features throughout the entire network could provide greater accuracy. HRNet represents the state-of-the-art in 2D human pose estimation among peer-reviewed works at the time of writing.
An issue surrounding the 2D human pose estimation literature is that it is often difficult to make fair comparisons of model performance due to the heavy use of model-agnostic improvements. Examples include using different learning rate schedules [40, 25], more data augmentation [25, 5], loss functions that target more challenging keypoints, specialized post-processing steps [33, 18], or more accurate person detectors [25, 18]. These discrepancies can potentially account for reported differences in accuracy. To directly compare our method with the state-of-the-art, we re-implement SimpleBaseline and HRNet and train all networks under the same settings.
Neuroevolution. Until recently, the design of CNNs has primarily been human-principled, guided by rules of thumb based on previous experimental results. Hand-designing a CNN that performs optimally for a specific task is therefore very time consuming. Consequently, there has been a growing interest in NAS methods. Neuroevolution is a form of neural architecture search that harnesses evolutionary algorithms to search for optimal network architectures. We focus on neuroevolution due to its flexibility and simplicity compared to other approaches using reinforcement learning [53, 3, 54, 41, 42], one-shot NAS [4, 6, 37], or gradient-based NAS [29, 50].
Due to the large size of architectural search spaces, and the fact that sampled architectures need to be trained to convergence to evaluate their performance, NAS requires a substantial amount of computation. In fact, some of the first implementations required several GPU years [53, 54, 38]. This inevitably led to a branch of research aimed at making NAS practical by reducing the search time. Network morphisms and function-preserving mutations are techniques used in neuroevolution that tackle this problem. In essence, these methods iteratively mutate networks and transfer weights in such a way that the function of the network is completely preserved upon mutation, i.e., the output of the mutated network is identical to that of the parent network. Consequently, the mutated child networks need only be trained for a relatively small number of steps compared to training from a randomly initialized state. As a result, these techniques are capable of reducing the search time to a matter of GPU days. However, function-preserving mutations can be challenging to implement and are also restrictive (e.g., complexity cannot be reduced).
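To make the idea of complete function preservation concrete, the following toy sketch deepens a small fully connected network by inserting an identity-initialized layer. It illustrates the general class of mutations discussed above and is not the weight transfer scheme proposed in this paper; the bias-free dense layers, the helper names (`forward`, `deepen`), and the weight values are all assumptions made for illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def dense(weights, v):
    """Bias-free fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]


def forward(layers, v):
    """Run v through a stack of dense layers, each followed by ReLU."""
    for weights in layers:
        v = relu(dense(weights, v))
    return v


def deepen(layers, index):
    """Insert an identity-initialized layer directly after layers[index].

    Because the inserted layer receives non-negative ReLU outputs,
    relu(I @ h) == h, so the mutated network computes the same function.
    """
    n = len(layers[index])  # output width of the preceding layer
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    return layers[: index + 1] + [identity] + layers[index + 1:]


# Hypothetical parent network: 2 inputs -> 3 hidden units -> 2 outputs.
parent = [
    [[0.5, -1.0], [2.0, 0.25], [-0.75, 1.5]],
    [[1.0, 0.5, 0.3], [0.2, 0.4, -0.1]],
]
child = deepen(parent, 0)  # one layer deeper than the parent

x = [1.0, 2.0]
parent_out = forward(parent, x)
child_out = forward(child, x)  # identical to parent_out by construction
```

The sketch also shows why such mutations are restrictive: the child can only be as large as or larger than the parent, since removing a layer or unit would in general change the output.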
NAS algorithms have predominantly been developed and evaluated on small-scale image datasets. The use of NAS in more complex visual recognition tasks remains limited, in large part because the computational demands make it infeasible. This is especially true for 2D human pose estimation, where training a single model can take several days. Nevertheless, the use of NAS in the design of 2D human pose networks has been attempted in a few cases [50, 13, 51]. Although some of the resulting networks provided superior computational efficiency as a result of having fewer parameters and operations, none managed to surpass the best-performing hand-crafted networks in terms of accuracy.
3 Neuroevolution of 2D Human Pose Networks
The cornerstone of the proposed neuroevolution framework is a simple yet effective weight transfer scheme that enables searching for optimal deep neural networks in a fast and flexible manner. In this paper, we tailor our search space to the task of 2D human pose estimation using prior knowledge of cutting-edge hand-crafted pose networks, but emphasize that our method is generally applicable to all types of deep networks.
Weight transfer. Suppose that a parent network is represented by the function