Does Data Augmentation Benefit from Split BatchNorms
Data augmentation has emerged as a powerful technique for improving the performance of deep neural networks and has led to state-of-the-art results in computer vision. However, state-of-the-art data augmentation strongly distorts training images, leading to a disparity between examples seen during training and inference. In this work, we explore a recently proposed training paradigm in order to correct for this disparity: using an auxiliary BatchNorm for the potentially out-of-distribution, strongly augmented images. Our experiments then focus on how to define the BatchNorm parameters that are used at evaluation. To eliminate the train-test disparity, we experiment with using the batch statistics defined by clean training images only, yet surprisingly find that this does not yield improvements in model performance. Instead, we investigate using BatchNorm parameters defined by weak augmentations and find that this method significantly improves performance on common image classification benchmarks such as CIFAR-10, CIFAR-100, and ImageNet. We then explore a fundamental trade-off between accuracy and robustness that arises from using different BatchNorm parameters, providing greater insight into the benefits of data augmentation for model performance.
1 Introduction
Data augmentation has become a common technique for improving the diversity of examples within datasets for machine learning without needing to explicitly label additional examples. The benefits of this strategy have been seen in a number of application domains, including image classification [28, 22, 9, 39], object detection, and semantic segmentation. With respect to images, common examples of data augmentations include MixUp, Cutout, CutMix, and various forms of Gaussian noise. A more diverse set of transformations (such as those from the PIL library) underlies learned augmentation policies such as AutoAugment and RandAugment.
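To make these transformations concrete, below is a minimal pure-Python sketch of two of them: MixUp's convex blending of image/label pairs and Cutout's square masking. The function names, list-based image representation, and default parameters are illustrative, not the implementations from the cited works.

```python
import random

def mixup(image_a, image_b, label_a, label_b, alpha=0.2, rng=None):
    """MixUp-style sketch: blend two images (flat lists of pixel values)
    and their one-hot labels with a coefficient lam ~ Beta(alpha, alpha)."""
    rng = rng or random.Random(0)
    lam = rng.betavariate(alpha, alpha)
    image = [lam * a + (1 - lam) * b for a, b in zip(image_a, image_b)]
    label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label

def cutout(image, pad, cx, cy):
    """Cutout-style sketch: zero out a (2*pad)x(2*pad) square centered at
    (cx, cy) in an HxW nested-list image, clipped to the image bounds."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(max(0, cy - pad), min(h, cy + pad)):
        for x in range(max(0, cx - pad), min(w, cx + pad)):
            out[y][x] = 0.0
    return out
```

In practice the mask center for Cutout is sampled uniformly at random; it is fixed here to keep the sketch deterministic.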
Early work in data augmentation often assumed that beneficial techniques would produce images that would be close to the true data distribution [1, 28]. However, with many of the techniques above, it is clear that the resulting images are unnatural and likely to be out-of-distribution with respect to the test set (see Figure 1 for examples). Images with augmentations are often blended together or modified so strongly that the image semantics are destroyed.
Therefore, despite the improvements in model performance from these techniques, there still exists a train-test disparity in the resulting model. A number of recent works have attempted to adjust for this incongruity, suggesting that models can be improved by the early stopping of various data augmentation strategies [15, 12] or by density matching between the clean (un-augmented) and augmented datasets [25, 14].
However, despite the corrections, these methods have not outperformed augmentation strategies that heavily distort the training images [39, 36, 7].
Instead, this train-test disparity can also be addressed by correcting the parameters in BatchNorm layers, which are commonly thought to capture distributional information about input images. To do so, we adopt a recently proposed training paradigm where an auxiliary BatchNorm is used for the potentially out-of-distribution (OOD) augmented images [31, 38, 32]. This technique re-aligns the BatchNorm parameters to match the distribution of representations seen at inference. For example, Transfer Normalization achieved state-of-the-art results on a number of domain transfer tasks where there is an explicit train-test gap.
Applying this technique to data augmentation however raises a number of questions. For example, how should the BatchNorm parameters used at evaluation be defined? Do the statistics defined by clean images (compared to strongly augmented images) yield improved performance? What happens to the robustness properties of the network from using this separated BatchNorm setup? Our findings from these questions can be summarized as follows:
Using weak augmentations significantly improves the performance of the separated BatchNorm setup on multiple image classification benchmarks. We then explore how this strategy relates to proposed measures for in-distribution data augmentations (Section 5). Section 5.1 provides ablation studies to locate the improvements from weak augmentations.
2 Related Work
Early examples of data augmentations often focused on creating realistic but 'different' training examples, often including horizontal flips, crops, and minor color distortions applied to MNIST and CIFAR-10 images [5, 27, 28]. However, this is clearly no longer the case in modern models, as data augmentation techniques that produce out-of-distribution and heavily modified images have been shown to significantly improve performance [2, 13, 41, 39].
Based on similar observations, a number of recent studies have shown interest in understanding the effects of data augmentation and the relation to the true underlying data distribution. For example, Gontijo-Lopes et al.  proposed the Affinity metric to measure how in-distribution a given data augmentation is. The method is based on the evaluation performance of a clean model tested on augmented data. Their work found that beneficial augmentations can often be out-of-distribution. On the other hand, techniques such as AugMix have also shown success correcting for the distributional shift in data augmentation by producing more natural-looking outputs. Similarly, Fast AutoAugment and Faster AutoAugment [25, 14] are both AutoML-based techniques and use core ideas from density matching to minimize the distance between the augmented and clean data. Taken together, these studies suggest that data augmentation faces two competing effects: improvements coming from diverse examples and producing outputs that match the data distribution.
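As a concrete reading of the Affinity metric, the sketch below interprets it as a difference in accuracy points between a clean-trained model evaluated on augmented versus clean validation data; this interpretation is an assumption consistent with the negative values in Table 1, and the precise definition is due to Gontijo-Lopes et al.

```python
def affinity(clean_model_acc_on_aug_val, clean_model_acc_on_clean_val):
    """Affinity-style score: how much a clean-trained model's accuracy
    changes when evaluated on augmented validation data. Negative values
    indicate a more out-of-distribution augmentation."""
    return clean_model_acc_on_aug_val - clean_model_acc_on_clean_val
```

For example, a flip-and-crop pipeline that drops a clean model's validation accuracy from 94.9 to 92.2 would score an Affinity of -2.7, matching the scale used in Table 1.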
Importance of BatchNorm Statistics
In this work, we choose to address the train-test disparity coming from augmentation by correcting the BatchNorm layers, which are often thought to capture domain specific effects during training . A number of recent works have highlighted the importance of aligning these parameters to the test domain. For example, in the field of domain transfer, AdaBN  and AutoDIAL  have shown that recomputing BatchNorm parameters based on the test domain can improve model performance. Transfer Normalization  introduced the idea of separated batch statistics for different domains and designed an end-to-end trainable layer that achieved state-of-the-art performance on a number of domain transfer benchmarks.
Separated BatchNorm Layers
This idea of using separated BatchNorms for potentially out-of-distribution data has been adopted by a number of application areas with promising performance. In the field of semi-supervised learning, separated BatchNorm parameters allowed models to better incorporate unlabeled images that did not correspond to any of the labeled classes . Most closely related to our work is AdvProp , which found that strong adversarial noise could be incorporated into image classification models through the use of an auxiliary BatchNorm. EfficientNet  models trained with this approach significantly improved results on the ImageNet  and ImageNet-C benchmarks . While their work showed impressive performance gains, there still remain a number of open questions regarding the use of separated BatchNorms in the context of data augmentation. For example, are the batch statistics defined by clean images optimal? How does the use of separated BatchNorms impact model robustness?
In order to train models with the separated BatchNorm setup, this paper follows prior work; in particular, the setup can best be thought of as a simplified version of fine-grained AdvProp.
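A toy sketch of this routing, with hypothetical class names and 1-D lists of scalars standing in for real feature maps: weakly processed batches update the main BatchNorm's running statistics, strongly augmented batches update the auxiliary one, and evaluation uses only the main branch.

```python
import statistics

class ToyBatchNorm:
    """1-D batch norm over a list of scalars with running statistics.
    Illustrative only: real BatchNorm normalizes per-channel activations."""
    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0           # learnable scale/shift
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps

    def __call__(self, batch, training=True):
        if training:
            mean = statistics.fmean(batch)
            var = statistics.pvariance(batch)
            # exponential moving average of batch statistics
            self.running_mean += self.momentum * (mean - self.running_mean)
            self.running_var += self.momentum * (var - self.running_var)
        else:  # evaluation uses the accumulated running statistics
            mean, var = self.running_mean, self.running_var
        return [self.gamma * (x - mean) / (var + self.eps) ** 0.5 + self.beta
                for x in batch]

class SeparatedBatchNorm:
    """Route weak batches through the main BN and strong batches through
    the auxiliary BN; evaluation only ever touches the main BN."""
    def __init__(self):
        self.main, self.aux = ToyBatchNorm(), ToyBatchNorm()

    def forward_train(self, weak_batch, strong_batch):
        return self.main(weak_batch), self.aux(strong_batch)

    def forward_eval(self, batch):
        return self.main(batch, training=False)
```

After training, the two branches hold running statistics reflecting their respective input distributions, which is the mechanism the experiments below probe.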
Augmentation on the Auxiliary BatchNorm
As a popular and effective benchmark for data augmentation, RandAugment serves as a reasonable starting point for analyzing the separated BatchNorm setup.
Datasets and Models
Experiments are provided for three common image classification benchmarks. Two datasets, CIFAR-10 and CIFAR-100, consist of tiny images with 50K training and 10K test images apiece. Our analysis uses WideResNet-28-2 and WideResNet-28-10 models that are trained for 200 epochs with a learning rate of 0.1, batch size of 128, weight decay of 5e-4, and cosine learning rate decay. To ensure that the observed trends extend to larger datasets, experiments are also conducted on ImageNet, which contains approximately 1.2 million colored images. The corresponding ResNet-50 models follow default hyper-parameters and are trained for 180 epochs using an image size of 224×224. The models use a weight decay of 1e-4, a momentum optimizer with parameter value of 0.9, a batch size of 4096, and a learning rate of 0.1 (scaled by the batch size divided by 256).
In general, top-1 accuracy is reported for the clean test set; for ImageNet, top-5 numbers are also included. All metrics for CIFAR-10 and CIFAR-100 are reported from an average of 10 training runs. Robustness results are provided utilizing the Common Corruptions benchmark of CIFAR-10-C and ImageNet-C, which provides test data with 15 different types of input noise at 5 different intensities each. The error rate for a corruption $c$ at severity $s$ is denoted $E_{c,s}$. For CIFAR-10-C, the associated metric is the un-normalized corruption error, $\mathrm{uCE} = \frac{1}{15}\sum_{c}\frac{1}{5}\sum_{s=1}^{5} E_{c,s}$, whereas for ImageNet-C, the robustness metrics are normalized by the corruption error of AlexNet, $\mathrm{mCE} = \frac{1}{15}\sum_{c}\left(\sum_{s} E_{c,s} \middle/ \sum_{s} E^{\mathrm{AlexNet}}_{c,s}\right)$. Averages over all corruptions are reported. Note that in all cases, lower corruption error is better.
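The two corruption metrics can be sketched as follows, assuming each corruption maps to a list of per-severity error rates; the helper names are ours.

```python
def uce(errors):
    """Un-normalized corruption error: the mean of E_{c,s} over all
    corruptions c and severities s. `errors` maps each corruption name
    to a list of per-severity error rates."""
    flat = [e for severities in errors.values() for e in severities]
    return sum(flat) / len(flat)

def mce(errors, alexnet_errors):
    """AlexNet-normalized mean corruption error, as in the Common
    Corruptions benchmark: per corruption, the model's summed error over
    severities divided by AlexNet's, then averaged over corruptions."""
    ratios = [sum(errors[c]) / sum(alexnet_errors[c]) for c in errors]
    return sum(ratios) / len(ratios)
```

The normalization in `mce` makes corruptions with very different baseline difficulties contribute comparably to the average.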
4 Naively Using Separated BatchNorms Yields No Improvement
If the strongly augmented images use the auxiliary BatchNorm during training, a fundamental question becomes how to define the batch statistics that are used for evaluation (those on the main branch). A naive strategy would be to use clean images only, with the expectation that the clean images would best match the "test data". However, the resulting WideResNet-28-2 model trained on CIFAR-10 displayed significant degradation in performance, with a decrease of 1.6% in accuracy compared to a model trained with RandAugment only (as shown in the top row of Table 1). This diminished performance likely arises from the lack of diversity when defining the batch statistics for evaluation and possible over-fitting to the training set.
To counter this effect, we devise a simple change to the separated BatchNorm setup. Instead of using clean images directly, incorporating simple weak augmentations yields significant gains in performance. This strategy is similar to recently proposed methods in semi-supervised learning such as FixMatch, yet novel in its application to multiple BatchNorms. We refer to this method in the text as Weak Augment.
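A minimal sketch of such a weak augmentation (random horizontal flip followed by a pad-and-crop), operating on a nested-list image; the helper name and default pad size are illustrative, with CIFAR pipelines typically padding by 4.

```python
import random

def weak_augment(image, pad=1, rng=None):
    """Flip-and-crop weak augmentation sketch for an HxW nested-list image:
    random horizontal flip, zero-pad by `pad` on every side, then a random
    crop back to the original size."""
    rng = rng or random.Random(0)
    h, w = len(image), len(image[0])
    if rng.random() < 0.5:                          # random horizontal flip
        image = [row[::-1] for row in image]
    padded = [[0.0] * (w + 2 * pad) for _ in range(pad)]
    padded += [[0.0] * pad + row + [0.0] * pad for row in image]
    padded += [[0.0] * (w + 2 * pad) for _ in range(pad)]
    top, left = rng.randrange(2 * pad + 1), rng.randrange(2 * pad + 1)
    return [row[left:left + w] for row in padded[top:top + h]]
```

Unlike RandAugment's strong distortions, this transformation only shifts or mirrors content, keeping the image close to the clean distribution.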
|Performance from using Weak Augmentations|
|CIFAR-10||Affinity||Clean Test Accuracy|
|Flip + Crop||-2.7||0.1|
5 Weak Augmentations Allow Separated BatchNorms to Be Effective
To test for this hypothesized effect, a variety of weak augmentations were applied in the separated BatchNorm setup for WideResNet-28-2 models on CIFAR-10. The results presented in Table 1 show that incorporating the standard flip or crop augmentations into the main BatchNorm can recover the performance of a model trained only with RandAugment without separated BatchNorms. Applying a slightly stronger weak augmentation in the form of Cutout yields further improvements.
Nevertheless, this weak augmentation strategy devised on CIFAR-10 can be extended to a variety of benchmark tasks, with results provided in Table 2. Weak augmentations show significant gains in performance across the board for the CIFAR-10, CIFAR-100, and ImageNet benchmarks, over the already strong baselines set by RandAugment.
|Clean Test Accuracy From Using Weak Augmentations|
5.1 Ablations Locate Improvements from Weak Augmentations
While the results from weak augmentations are impressive, the results on the benchmark tasks do not fully elucidate where the benefits of the training strategy come from. Table 3 compares model performance when procedurally adding the components of the separated BatchNorm setup. The use of two stochastic applications of RandAugment on each training batch does not change test performance (Table 3, row 3), suggesting that the simple addition of data during training is not impacting model performance (in contrast to prior work). Similarly, the use of a separated BatchNorm when the augmentations applied to the two branches do not differ does not yield any improvements (Table 3, row 4).
Instead, the gains in model performance arise partially from the inclusion of weak augmentations even without utilizing separate BatchNorms (Table 3, row 5). Note that using separated BatchNorm layers with shared γ, β parameters (but differing moving means and variances) shows no additional improvement in performance, while two fully independent BatchNorms increase accuracy further (Table 3, row 7). These findings suggest novel insights into the effectiveness of separated BatchNorms. First, part of the benefit comes from the additional diversity of examples. Beyond that, additional improvements do not arise from having separate mean and variance moving averages alone.
This result is interesting as these moving averages are often thought to capture domain effects, yet in our experiments correcting these means and variances does not yield improvements. Instead, two fully independent BatchNorms with separate γ and β are required.
|Ablation Study for Clean Test Accuracy|
|(1) Baseline (Flips and Crops)||94.9|
|(3) Two RandAugment Batches||95.8|
|(4) Two RandAugment Batches with separated BatchNorm||95.8|
|(5) Weak Augment without separated BatchNorms||96.1|
|(6) Weak Augment with shared γ, β parameters||96.1|
|(7) Weak Augment||96.3|
5.2 Experimenting with Other Auxiliary BatchNorm Data Augmentation Types
The Weak Augment strategy is generic and extends to a variety of other augmentation setups. Specifically, the strong augmentation provided on the auxiliary BatchNorm does not need to be RandAugment. In Table 4, we experiment with Gaussian noise with a strength of σ = 0.2, PGD adversarial noise, and AugMix; the results show that weak augmentations are effective at improving performance for a variety of data augmentation methods on the auxiliary BatchNorm. While the use of a separated BatchNorm appears to yield improvements in all models, the best performing result still arises from the application of Weak Augment on top of RandAugment. This is perhaps not so surprising, as RandAugment has the highest baseline score of all tested models, but Weak Augment appears to be a generic strategy that can yield improvements across different data augmentation types.
Clean Test Accuracy for Alternate Augmentation Strategies
Applied to the Auxiliary BatchNorm
|Gaussian (σ = 0.2)||93.2||(+2.0%) 95.2|
|Adversarial Noise||94.7||(+0.5%) 95.2|
6 Separate BatchNorms Trade-off Between Accuracy and Robustness
Although impressive gains in test-set performance can be achieved with the Weak Augment setup from a separated BatchNorm, naively optimizing for accuracy introduces a new problem in these models: diminished robustness. Table 5 presents the performance of the new Weak Augment models on the Common Corruptions benchmarks of CIFAR-10-C and ImageNet-C. Both CIFAR-10 models display significantly diminished robustness.
|Standard||RA||Weak Augment (Main BN)|
6.1 Averaging Predictions from Both BatchNorms Leads to Improvements
While the main BatchNorm branch of the models appears to be more sensitive to input noise, this problem can be overcome by also incorporating the predictions coming from the auxiliary BatchNorm. Figure 3 explores the performance of predictions coming from using the main and auxiliary BatchNorms, also including combinations thereof weighted by a mixing coefficient λ.
This graph presents a fundamental trade-off between accuracy and robustness between the predictions coming from the two BatchNorms, yet a simple average shows overall improvement in both metrics. The same holds true for the ImageNet model, where simple averaging yields an improvement in top-1 accuracy and a 2.4-point reduction in mCE robustness (lower is better). This suggests that utilizing both BatchNorms for evaluation may be a promising method for countering diminished robustness, if one is willing to compute both at inference. We hope that future work explores this direction further, possibly determining whether the information in both BatchNorms can be distilled into a single forward pass.
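The interpolation between the two branches amounts to a convex combination of their predicted class probabilities; a short sketch (the function name is ours), where lam = 0.5 corresponds to the simple average discussed above:

```python
def blend_predictions(main_probs, aux_probs, lam=0.5):
    """Convex combination of the class probabilities produced with the
    main and auxiliary BatchNorm statistics. lam = 1.0 recovers the
    main branch alone; lam = 0.0 the auxiliary branch alone."""
    return [lam * m + (1 - lam) * a for m, a in zip(main_probs, aux_probs)]
```

Note that this requires two forward passes per test image, one per set of BatchNorm statistics, which is the inference cost mentioned above.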
6.2 Fourier Sensitivity Provides Another Perspective of Robustness
The robustness studies so far have only focused on a few parameterized noises from the Common Corruptions benchmark. In order to get a more general perspective on model robustness, we turn to analyzing noise in Fourier space using the strategy introduced by Yin et al. This evaluation strategy involves perturbing each image in the test set with noise sampled from different orientations and frequencies in Fourier space and then determining the difference in test set performance. These results are plotted in a heatmap, where the position indicates the noise in Fourier space, with the lowest frequencies in the center of the images. The color then represents the test error. The results are presented in Figure 4, specifically for the WideResNet-28-2 models trained on CIFAR-10. For these examples, the magnitude of the noise is set to an $\ell_2$ norm of 8.0, larger than in previous studies [13, 35] but allowing for better visualization of the differences between models. Following the interpolation results presented in the previous section, the figures show the heatmaps for the main BatchNorm, the auxiliary BatchNorm, and the average of these two predictions.
The frequency analysis showcases the same patterns as the Common Corruptions robustness in the previous section. Specifically, the main BatchNorm appears highly sensitive, particularly to high-frequency noise. In contrast, the auxiliary BatchNorm shows greater robustness overall. Interpolating between the two predictions showcases similar benefits as before, maintaining most of the additional benefit gained from strong augmentation.
6.3 Where are the improvements in robustness from separated BatchNorms coming from?
Finally, we analyze which frequencies are most important for model performance to gain insight into why the Weak Augment method is effective. This is achieved by defining a low-pass filter of bandwidth $B$ as the operation that sets all of the frequency components outside of a centered square of width $B$ in the Fourier spectrum, centered around the lowest frequency, to zero. Then, an inverse Discrete Fourier Transform is applied to recover the image. Figure 5 presents the effect of applying a low-pass filter with increasing bandwidth on 500 examples from the CIFAR-10 test set and measures the difference in accuracy for WideResNet-28-2 models tested with no augmentations, Weak Augment of varying parameters, and RandAugment. Generally, RandAugment uses significantly lower-frequency information to achieve peak classifier performance when compared to no augmentations. The Weak Augment model effectively interpolates between these two regimes. The improvements in clean accuracy coming from Weak Augment could potentially be explained by the more effective use of higher-frequency information available in the training images.
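A 1-D stand-in for this low-pass filtering, using a naive DFT from the standard library rather than the 2-D transform applied to images; here "bandwidth" counts distance from the lowest (DC) frequency, mirroring the centered-square filter described above.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real-valued list."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT, returning the real part of the reconstruction."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def low_pass(signal, bandwidth):
    """Zero every frequency component farther than `bandwidth` from the
    lowest frequency, then invert the transform."""
    n = len(signal)
    X = dft(signal)
    for k in range(n):
        freq = min(k, n - k)   # distance from DC, accounting for wrap-around
        if freq > bandwidth:
            X[k] = 0.0
    return idft(X)
```

Sweeping `bandwidth` from 0 to n/2 and measuring accuracy on the filtered inputs reproduces, in one dimension, the analysis behind Figure 5.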
In this work, we analyze the popular training paradigm of using separated BatchNorm parameters and show that it can be applied to generic data augmentation setups. Specifically, we find that naively applying this approach and using clean (un-augmented) and augmented images to define the two BatchNorms does not lead to any improvements. Instead, we propose using weak augmentations rather than clean ones to define the main BatchNorm and find significant improvements on CIFAR-10, CIFAR-100, and ImageNet benchmarks. Finally, we find that defining BatchNorm parameters based on weak augmentations leads to problems in model robustness but that this problem can be overcome by interpolating between the predictions of the two BatchNorm parameters.
We thank Irwan Bello and the Google Brain Team for helpful feedback on this manuscript.
- Note, AdvProp studies a specific case of the generic setup of separated BatchNorms, where AutoAugment is applied to all images and adversarial noise is limited to the auxiliary BatchNorm.
- RandAugment results assume the default augmentations of flips, crops, and Cutout are included, as per Cubuk et al.
- Cutout is performed in addition to flip and crop augmentations and is performed with a pad-size of 16/90 for CIFAR-10/ImageNet models respectively.
References

- Robust speaker adaptation using a piecewise linear acoustic mapping. In Proceedings of ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 445–448.
- Deep learners benefit more from out-of-distribution examples. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, PMLR Vol. 15, pp. 164–172.
- Improving 3D object detection through progressive population based augmentation. arXiv preprint arXiv:2004.00831.
- Data augmentation revisited: rethinking the distribution gap between clean and augmented data. arXiv preprint arXiv:1912.11188.
- Supervised contrastive learning. arXiv preprint arXiv:2004.11362.
- Learning multiple layers of features from tiny images.
- Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems.
- APAC: augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229.
- Transferable normalization: towards improving transferability of deep neural networks. In Advances in Neural Information Processing Systems 32.
- Wide residual networks. arXiv preprint arXiv:1605.07146.
- Random erasing data augmentation. arXiv preprint arXiv:1708.04896.
- Learning data augmentation strategies for object detection. arXiv preprint arXiv:1906.11172.