Does Data Augmentation Benefit from Split BatchNorms?


Abstract

Data augmentation has emerged as a powerful technique for improving the performance of deep neural networks and has led to state-of-the-art results in computer vision. However, state-of-the-art data augmentation strongly distorts training images, leading to a disparity between examples seen during training and inference. In this work, we explore a recently proposed training paradigm to correct for this disparity: using an auxiliary BatchNorm for the potentially out-of-distribution, strongly augmented images. Our experiments then focus on how to define the BatchNorm parameters used at evaluation. To eliminate the train-test disparity, we experiment with using the batch statistics defined by clean training images only, yet surprisingly find that this does not improve model performance. Instead, we investigate using BatchNorm parameters defined by weak augmentations and find that this method significantly improves performance on common image classification benchmarks such as CIFAR-10, CIFAR-100, and ImageNet. We then explore a fundamental trade-off between accuracy and robustness that arises from using different BatchNorm parameters, providing greater insight into the benefits of data augmentation on model performance.

1 Introduction

Data augmentation has become a common technique for improving the diversity of examples within datasets for machine learning without needing to explicitly label additional examples. The benefits of this strategy have been seen in a number of application domains, including image classification [28, 22, 9, 39], object detection [11], and semantic segmentation [10]. With respect to images, common examples of data augmentations include MixUp [39], Cutout [9], CutMix [36], and various forms of Gaussian noise [13]. A more diverse set of transformations (such as those from the PIL library; see footnote 1) has been utilized by Ratner et al. [26], AutoAugment [6, 42], and RandAugment [7]. Such data augmentation strategies have been used to achieve state-of-the-art results in image classification [6, 7, 40, 32], object detection [42, 4], semi-supervised learning [33, 34, 43], contrastive learning [20], and robustness to test-time image distortions [35, 32, 34, 17].

Early work in data augmentation often assumed that beneficial techniques would produce images that would be close to the true data distribution [1, 28]. However, with many of the techniques above, it is clear that the resulting images are unnatural and likely to be out-of-distribution with respect to the test set (see Figure 1 for examples). Images with augmentations are often blended together or modified so strongly that the image semantics are destroyed.

Therefore, despite the improvements in model performance from these techniques, there still exists a train-test disparity in the resulting model. A number of recent works have attempted to adjust for this incongruity, suggesting that models can be improved by the early stopping of various data augmentation strategies [15, 12] or by density matching between the clean (un-augmented) and augmented datasets [25, 14].
However, despite the corrections, these methods have not outperformed augmentation strategies that heavily distort the training images [39, 36, 7].

Instead, this train-test disparity can also be addressed by correcting the parameters in BatchNorm layers, which are commonly thought to capture distributional information about input images. To do so, we adopt a recently proposed training paradigm where an auxiliary BatchNorm is used for the potentially out-of-distribution (OOD) augmented images [31, 38, 32]. This technique re-aligns the BatchNorm parameters to match the distribution of representations seen at inference. For example, Transfer Normalization achieved state-of-the-art results on a number of domain transfer tasks where there is an explicit train-test gap [31].

Applying this technique to data augmentation, however, raises a number of questions. For example, how should the BatchNorm parameters used at evaluation be defined? Do the statistics defined by clean images (compared to strongly augmented images) yield improved performance? What happens to the robustness properties of the network when using this separated BatchNorm setup? Our findings from these questions can be summarized as follows:

  • Naively separating all clean (un-augmented) and augmented images into two separate BatchNorms during training is often harmful to model performance. We hypothesize that this lack of improvement occurs because leaving a portion of the training images un-augmented leads to overfitting (Section 4).

  • Using weak augmentations significantly improves the performance of the separated BatchNorm setup on multiple image classification benchmarks. We then explore how this strategy relates to proposed measures for in-distribution data augmentations [12] (Section 5). Section 5.1 provides ablation studies to locate the improvements from weak augmentations.

  • Analyzing the robustness properties of using a separated BatchNorm presents a fundamental trade-off between improvements in accuracy and corruption error, providing a clearer picture on the effects of data augmentation (Section 6).

[Figure 1 panels: (a) Clean Image, (b) MixUp, (c) CutMix, (d) AutoAugment, (e) RandAugment]
Figure 1: Examples of images produced by a number of common data augmentation techniques. The produced images often appear unnatural, and the examples above highlight the disparity between examples seen during training and the clean (un-augmented) examples seen at test-time.

2 Related Work

Data Augmentation

Early examples of data augmentation focused on creating realistic but 'different' training examples [1], typically horizontal flips, crops, and minor color distortions applied to MNIST and CIFAR-10 images [5, 27, 28]. However, this is clearly no longer the case in modern models, as data augmentation techniques that produce out-of-distribution and heavily modified images have been shown to significantly improve performance [2, 13, 41, 39].

Based on similar observations, a number of recent studies have sought to understand the effects of data augmentation and their relation to the true underlying data distribution. For example, Gontijo-Lopes et al. [12] proposed the Affinity metric to measure how in-distribution a given data augmentation is; the metric is based on the evaluation performance of a model trained on clean data and tested on augmented data. Their work found that beneficial augmentations can often be out-of-distribution. On the other hand, techniques such as AugMix have shown success in correcting for the distributional shift in data augmentation by producing more natural-looking outputs. Similarly, Fast AutoAugment and Faster AutoAugment [25, 14] are both AutoML-based techniques that use core ideas from density matching to minimize the distance between the augmented and clean data. Taken together, these studies suggest that data augmentation faces two competing effects: improvements coming from diverse examples and producing outputs that match the data distribution.

Importance of BatchNorm Statistics

In this work, we choose to address the train-test disparity coming from augmentation by correcting the BatchNorm layers, which are often thought to capture domain specific effects during training [19]. A number of recent works have highlighted the importance of aligning these parameters to the test domain. For example, in the field of domain transfer, AdaBN [24] and AutoDIAL [3] have shown that recomputing BatchNorm parameters based on the test domain can improve model performance. Transfer Normalization [31] introduced the idea of separated batch statistics for different domains and designed an end-to-end trainable layer that achieved state-of-the-art performance on a number of domain transfer benchmarks.

Separated BatchNorm Layers

This idea of using separated BatchNorms for potentially out-of-distribution data has been adopted by a number of application areas with promising performance. In the field of semi-supervised learning, separated BatchNorm parameters allowed models to better incorporate unlabeled images that did not correspond to any of the labeled classes [38]. Most closely related to our work is AdvProp [32], which found that strong adversarial noise could be incorporated into image classification models through the use of an auxiliary BatchNorm. EfficientNet [30] models trained with this approach significantly improved results on the ImageNet [22] and ImageNet-C benchmarks [16]. While their work showed impressive performance gains, there still remain a number of open questions regarding the use of separated BatchNorms in the context of data augmentation. For example, are the batch statistics defined by clean images optimal? How does the use of separated BatchNorms impact model robustness?

3 Methods

Separate BatchNorms

In order to train models with the separated BatchNorm setup, this paper follows prior work [31]; in particular, the setup can best be thought of as a simplified version of fine-grained AdvProp (see footnote 2) [32]. The associated training strategy first copies the input mini-batch and separately applies noise or augmentation techniques to each copy. One set is given to the main branch (with its own set of BatchNorm parameters), whereas the potentially out-of-distribution images utilize an auxiliary BatchNorm. During training, the losses from these two branches are averaged, whereas evaluation is performed using only the batch statistics from the main branch (as in Figure 2).

Augmentation on the Auxiliary BatchNorm

As a popular and effective benchmark for data augmentation, RandAugment serves as a reasonable starting point for analyzing the separated BatchNorm setup (see footnote 3) [7]. This augmentation strategy combines over 15 individual augmentation types, including shears, rotations, and color equalization. For each image, two or three of these augmentations are applied at a user-defined augmentation strength. RandAugment provides a particularly promising starting point as it has led to state-of-the-art performance on a number of image classification tasks, yet little to no effort has been put into ensuring that its outputs are natural and match the test distribution. The analysis for other common data augmentation techniques is provided in Section 5.2.
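To make the sampling procedure concrete, the following is a minimal RandAugment-style sketch in Python. It is illustrative only: the operation list and magnitude scaling below are placeholder assumptions, not the actual policy of Cubuk et al. [7], which spans 15+ operations.

```python
import random
from PIL import Image, ImageOps, ImageEnhance

# Placeholder operation list; the real RandAugment policy contains 15+ operations [7].
OPS = [
    lambda img, m: img.rotate(30 * m),                         # rotate by up to 30 degrees
    lambda img, m: img.transform(img.size, Image.AFFINE,
                                 (1, 0.3 * m, 0, 0, 1, 0)),    # shear along x
    lambda img, m: ImageOps.equalize(img),                     # color equalization
    lambda img, m: ImageEnhance.Contrast(img).enhance(1 + m),  # contrast adjustment
]

def rand_augment(img, n_ops=2, magnitude=0.5):
    """Apply n_ops randomly chosen operations at a single user-defined strength."""
    for op in random.sample(OPS, n_ops):
        img = op(img, magnitude)
    return img
```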


[Figure 2: schematic diagram omitted]
Algorithm 1: Single Step with Separated BatchNorms
Data: mini-batch of images
Produce the augmented batch for the main branch;
Produce the augmented batch for the auxiliary branch;
Compute the loss L_main using the main BN parameters;
Compute the loss L_aux using the auxiliary BN parameters;
Compute the total loss as L = (L_main + L_aux) / 2;
Update the model parameters with respect to L;
Figure 2: A schematic diagram (left) and explanation of a single update step (right) for a model with separated BatchNorms. Note how in our setup, the strong data augmentation techniques (e.g. RandAugment, Gaussian Noise) are placed onto the auxiliary BN parameters. In this paper, we vary the augmentations for the main branch, ranging from no augmentation to RandAugment.
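As a concrete illustration of Algorithm 1, the following is a minimal PyTorch-style sketch of one update with separated BatchNorms. The BatchNorm-switching helpers (use_main_bn / use_aux_bn) are hypothetical and stand in for whatever mechanism routes the forward pass through the main or auxiliary statistics; the augmentation functions are placeholders for the pipelines described above.

```python
import torch
import torch.nn.functional as F

def train_step(model, images, labels, weak_aug, strong_aug, optimizer):
    """One update with separated BatchNorms (sketch, not the authors' exact code)."""
    weak_batch = weak_aug(images)      # main branch: e.g. flips/crops (+ Cutout)
    strong_batch = strong_aug(images)  # auxiliary branch: e.g. RandAugment

    model.use_main_bn()                # hypothetical switch to the main BN statistics
    loss_main = F.cross_entropy(model(weak_batch), labels)

    model.use_aux_bn()                 # hypothetical switch to the auxiliary BN statistics
    loss_aux = F.cross_entropy(model(strong_batch), labels)

    loss = 0.5 * (loss_main + loss_aux)  # losses from the two branches are averaged

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```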

Datasets and Models

Experiments are provided for three common image classification benchmarks. Two datasets, CIFAR-10 and CIFAR-100 [21], consist of tiny images, with 50K training and 10K test images apiece. Our analysis uses WideResNet-28-2 and WideResNet-28-10 [37] models that are trained for 200 epochs with a learning rate of 0.1, a batch size of 128, a weight decay of 5e-4, and cosine learning rate decay. To ensure that the observed trends extend to larger datasets, experiments are also conducted on ImageNet [8], which contains approximately 1.2 million colored images. The corresponding ResNet50 models follow default hyper-parameters [7] and are trained for 180 epochs using an image size of 224×224. These models use a weight decay of 1e-4, a momentum optimizer with momentum 0.9, a batch size of 4096, and a learning rate of 0.1 (scaled by the batch size divided by 256).

Metrics

In general, top-1 accuracy is reported for the clean test set; for ImageNet, top-5 numbers are also included. All metrics for CIFAR-10 and CIFAR-100 are reported as an average of 10 training runs. Robustness results are provided using the Common Corruptions benchmark [16] of CIFAR-10-C and ImageNet-C, which provides test data with 15 different corruption types at 5 severities each. The error rate for corruption c at severity s is denoted E_{c,s}. For CIFAR-10-C, the associated metric is the un-normalized corruption error (uCE), the plain average of E_{c,s} over all corruptions and severities, whereas for ImageNet-C the robustness metric is normalized by the corruption error of AlexNet [22]: CE_c = (sum_s E_{c,s}) / (sum_s E^{AlexNet}_{c,s}). Averages over all corruptions are reported. Note that in all cases, lower corruption error is better.
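To make the metric definitions concrete, a small sketch of both corruption error variants follows, assuming error rates have already been measured per corruption and severity (the dictionary layout and function name are illustrative, not from the paper).

```python
import numpy as np

def corruption_error(err, alexnet_err=None):
    """Corruption error from err[corruption][severity] -> error rate.

    With alexnet_err=None, returns the un-normalized corruption error (uCE):
    the plain mean of E_{c,s} over all corruptions and severities (CIFAR-10-C).
    With AlexNet reference errors, returns the normalized CE used for ImageNet-C,
    averaged over corruptions.
    """
    per_corruption = []
    for c, severities in err.items():
        e = np.array([severities[s] for s in sorted(severities)])
        if alexnet_err is None:
            per_corruption.append(e.mean())                 # uCE_c
        else:
            ref = np.array([alexnet_err[c][s] for s in sorted(severities)])
            per_corruption.append(e.sum() / ref.sum())      # CE_c, normalized by AlexNet
    return float(np.mean(per_corruption))                   # average over all corruptions
```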

4 Naively Using Separated BatchNorms Yields No Improvement

If the strongly augmented images use the auxiliary BatchNorm during training, a fundamental question becomes how to define the batch statistics used for evaluation (those on the main branch). A naive strategy would be to use clean images only, with the expectation that the clean images would best match the "test data". However, the resulting WideResNet-28-2 model trained on CIFAR-10 displayed significant degradation in performance, with a decrease of 1.6% in accuracy compared to a model trained with RandAugment only (as shown in the top row of Table 1). This diminished performance likely arises from the lack of diversity when defining the batch statistics for evaluation and possible over-fitting to the training set.

To counter this effect, we devise a simple change to the separated BatchNorm setup. Instead of using clean images directly, incorporating simple weak augmentations yields significant gains in performance. This strategy is similar to recently proposed methods in semi-supervised learning such as FixMatch [29], yet novel in its application to multiple BatchNorms. We refer to this method in the text as Weak Augment.

Performance from using Weak Augmentations
CIFAR-10 (main-branch augmentation) | Affinity | Δ Clean Test Accuracy
None                                | 0.0      | -1.6
Flip                                | -0.8     | +0.1
Flip + Crop                         | -2.7     | +0.1
Cutout                              | -16.1    | +0.5
AugMix                              | -12.7    | +0.3
Gaussian                            | -25.7    | -0.3
RandAugment                         | -25.0    | 0.0
Table 1: Change (Δ) in clean test accuracy for a number of augmentation strategies applied to the main BatchNorm, with RandAugment applied on the auxiliary BatchNorm. The baseline is a model trained only with RandAugment and without separate BatchNorms (accuracy of 95.8%). The best-performing weak augmentation found (Cutout) improves accuracy significantly for these CIFAR-10 WideResNet-28-2 models. Interestingly, we note that this is not the most in-distribution augmentation, as measured by Affinity [12].
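For concreteness, a minimal sketch of a Weak Augment main-branch pipeline for CIFAR-10 is given below, assuming torchvision-style transforms: standard flips and crops followed by Cutout with a 16-pixel patch (per footnote 4). The Cutout class here is a simple illustrative re-implementation, not the authors' code.

```python
import torch
import torchvision.transforms as T

class Cutout:
    """Zero out a square patch at a random location (illustrative sketch of Cutout [9])."""
    def __init__(self, size=16):               # 16-pixel patch for CIFAR-10, per footnote 4
        self.size = size

    def __call__(self, img):                   # img: tensor of shape (C, H, W)
        _, h, w = img.shape
        cy = torch.randint(h, (1,)).item()
        cx = torch.randint(w, (1,)).item()
        y0, y1 = max(0, cy - self.size // 2), min(h, cy + self.size // 2)
        x0, x1 = max(0, cx - self.size // 2), min(w, cx + self.size // 2)
        img = img.clone()
        img[:, y0:y1, x0:x1] = 0.0
        return img

# Weak augmentation for the main branch: standard flips/crops followed by Cutout.
weak_augment = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomCrop(32, padding=4),
    T.ToTensor(),
    Cutout(size=16),
])
```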

5 Weak Augmentations Allow Separated BatchNorms to Be Effective

To test this hypothesized effect, a variety of weak augmentations were applied in the separated BatchNorm setup for WideResNet-28-2 models on CIFAR-10. The results presented in Table 1 show that incorporating the standard flip or crop augmentations into the main BatchNorm recovers the performance of a RandAugment-only model that does not use separated BatchNorms. Applying a slightly stronger weak augmentation in the form of Cutout (see footnote 4) leads to a significant gain in model performance of +0.5 on CIFAR-10. Interestingly, this finding does not necessarily coincide with the current understanding of in-distribution augmentations and BatchNorms. The Affinity metric [12] quantifies the distribution shift arising from a given augmentation by testing a model trained only on clean images on augmented images. Cutout is clearly not an exact match for the true data distribution, yet empirically performs best in the experimental trials.

Nevertheless, this weak augmentation strategy devised on CIFAR-10 can be extended to a variety of benchmark tasks, with results provided in Table 2. Weak augmentations show significant gains in performance across the board for the CIFAR-10, CIFAR-100, and ImageNet benchmarks, over the already strong baselines set by RandAugment.

Clean Test Accuracy From Using Weak Augmentations
Model                  | Standard  | Cutout | AA        | RA        | Weak Augment
CIFAR-10
WideResNet-28-2        | 93.8      | 94.9   | 95.9      | 95.8      | 96.3 ± 0.1
WideResNet-28-10       | 95.5      | 96.6   | 97.4      | 97.3      | 97.6 ± 0.1
CIFAR-100
WideResNet-28-2        | 70.9      | 75.4   | 78.5      | 78.3      | 79.2 ± 0.2
WideResNet-28-10       | 78.8      | 81.2   | 82.9      | 83.3      | 83.8 ± 0.3
ImageNet (Top-1/Top-5)
ResNet50               | 76.3/93.1 | –      | 77.6/93.8 | 77.6/93.8 | 77.9/93.9
Table 2: Clean test accuracy for CIFAR-10, CIFAR-100, and ImageNet, comparing Standard augmentation (horizontal flips and random crops), Cutout [9], AutoAugment (AA) [6], RandAugment (RA) [7], and the newly proposed Weak Augment setup. Weak Augment is defined by RandAugment on the auxiliary BatchNorm and Cutout on the main BatchNorm. All metrics are provided as the average over 10 runs.

5.1 Ablations Locate Improvements from Weak Augmentations

While the results from weak augmentations are impressive, the benchmark results alone do not fully elucidate where the benefits of the training strategy come from. Table 3 compares model performance when procedurally adding the components of the separated BatchNorm setup. Using two stochastic applications of RandAugment on each training batch does not change test performance (Table 3, row 3), suggesting that the simple addition of data during training does not impact model performance (in contrast to prior work [18]). Similarly, using a separated BatchNorm when the augmentations applied to the two branches do not differ yields no improvement (Table 3, row 4).

Instead, the gains in model performance arise partially from the inclusion of weak augmentations even without utilizing separate BatchNorms, improving accuracy by +0.3% (Table 3, row 5). Using separated BatchNorm layers with shared γ, β parameters (but differing moving means and variances) shows no additional improvement in performance, but two fully independent BatchNorms increase accuracy by a further +0.2% (Table 3, row 7). These findings suggest novel insights into the effectiveness of separated BatchNorms. First, part of the benefit comes from the additional diversity of examples. Beyond that, additional improvements do not arise from having separate mean and variance moving averages.
This result is interesting as these moving averages are often thought to capture domain effects [23], yet in our experiments correcting these means and variances does not yield improvements. Instead, two fully independent BatchNorms with separate γ and β are required.

Ablation Study for Clean Test Accuracy
Configuration                                             | CIFAR-10 Accuracy
(1) Baseline (Flips and Crops)                            | 94.9
(2) RandAugment                                           | 95.8
(3) Two RandAugment Batches                               | 95.8
(4) Two RandAugment Batches with separated BatchNorms     | 95.8
(5) Weak Augment without separated BatchNorms             | 96.1
(6) Weak Augment with shared γ, β parameters              | 96.1
(7) Weak Augment                                          | 96.3
Table 3: Ablation study for the improvements coming from weak augmentations, evaluated using a WideResNet-28-2 model on CIFAR-10. Relative to a RandAugment model, adding an extra batch of strongly augmented data does not improve model performance (3). Directly applying an auxiliary BatchNorm without differing augmentations also leaves performance unchanged (4). The benefits from Weak Augment come first from the inclusion of weakly augmented images (5) and then again from using two separate BatchNorms (7).
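The distinction between rows (6) and (7) can be made concrete with a small sketch of a split BatchNorm layer, written here in PyTorch as an assumed illustration rather than the authors' implementation: tying the affine parameters reproduces the shared γ, β variant, while leaving them untied gives two fully independent BatchNorms.

```python
import torch
import torch.nn as nn

class SplitBatchNorm2d(nn.Module):
    """Two sets of BatchNorm statistics behind one interface (illustrative sketch).

    share_affine=True ties gamma/beta between branches so only the moving means and
    variances differ (Table 3, row 6); share_affine=False gives two fully independent
    BatchNorms (row 7). `branch` selects which statistics the forward pass uses.
    """
    def __init__(self, num_features, share_affine=False):
        super().__init__()
        self.bn_main = nn.BatchNorm2d(num_features)
        self.bn_aux = nn.BatchNorm2d(num_features)
        if share_affine:
            # Point both branches at the same affine parameters.
            self.bn_aux.weight = self.bn_main.weight
            self.bn_aux.bias = self.bn_main.bias
        self.branch = "main"

    def forward(self, x):
        return self.bn_main(x) if self.branch == "main" else self.bn_aux(x)
```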

5.2 Experimenting with Other Auxiliary BatchNorm Data Augmentation Types

The Weak Augment strategy is generic and extends to a variety of other augmentation setups. Specifically, the strong augmentation applied on the auxiliary BatchNorm does not need to be RandAugment. In Table 4, we experiment with Gaussian noise with a strength of 0.2, PGD adversarial noise [32], and AugMix [17]; the results show that weak augmentations are effective at improving performance for each of these data augmentation methods on the auxiliary BatchNorm. While the use of a separated BatchNorm appears to yield improvements in all models, the best-performing result still arises from applying Weak Augment on top of RandAugment. This is perhaps not so surprising, as RandAugment has the highest baseline score of all tested models, but Weak Augment appears to be a generic strategy that can yield improvements across different data augmentation types.


Clean Test Accuracy for Alternate Augmentation Strategies Applied to the Auxiliary BatchNorm
CIFAR-10 (auxiliary augmentation) | Baseline | Weak Augment | Improvement
Gaussian (σ = 0.2)                | 93.2     | 95.2         | +2.0
Adversarial Noise                 | 94.7     | 95.2         | +0.5
AugMix                            | 94.8     | 95.7         | +0.9
RandAugment                       | 95.8     | 96.3         | +0.5
Table 4: Improvement from applying the Weak Augment training strategy to alternate data augmentation methods, compared to baseline models that use the same augmentation strategy with only a single BatchNorm. All results are for WideResNet-28-2 models trained on CIFAR-10. Weak Augment on top of RandAugment shows the best performance, but the separated BatchNorm setup appears to benefit the other trained models as well.

6 Separate BatchNorms Trade-off Between Accuracy and Robustness

Although impressive gains in test-set performance can be achieved with the Weak Augment setup and a separated BatchNorm, naively optimizing for accuracy introduces a new problem in these models: reduced robustness. Table 5 presents the performance of the new Weak Augment models on the Common Corruptions benchmarks CIFAR-10-C and ImageNet-C [16]. Both CIFAR-10 models display significantly diminished robustness.

Corruption Error
Model             | Standard | RA   | Weak Augment (Main BN)
CIFAR-10-C (uCE)
WideResNet-28-2   | 27.7     | 15.9 | 19.1
WideResNet-28-10  | 24.4     | 13.2 | 17.1
ImageNet-C (CE)
ResNet50          | 77.5     | 70.8 | 69.2
Table 5: Corruption error on the CIFAR-10-C and ImageNet-C tasks. For CIFAR-10-C, the un-normalized corruption error averaged over 10 runs is reported. For ImageNet-C, the corruption errors are scaled by those of AlexNet. Lower scores are better.

6.1 Averaging Predictions from Both BatchNorms Leads to Improvements

While the main BatchNorm branch of these models appears more sensitive to input noise, this problem can be overcome by also incorporating the predictions coming from the auxiliary BatchNorm. Figure 3 explores the performance of predictions made using the main and auxiliary BatchNorms, as well as combinations of the two weighted by a factor α.

This graph presents a fundamental trade-off between accuracy and robustness between the predictions coming from the two BatchNorms, yet a simple average shows overall improvement in both metrics. The same holds true for the ImageNet model, where simple averaging yields an improvement in top-1 accuracy and a 2.4-point reduction in CE robustness (lower is better). This suggests that utilizing both BatchNorms for evaluation may be a promising method for countering diminished robustness, if one is willing to compute both forward passes at inference. We hope that future work explores this direction further, possibly determining whether the information in both BatchNorms can be distilled into a single forward pass.

Figure 3: Effects of varying α on the performance (accuracy and corruption error) of the Weak Augment setup. The figures show a clear trade-off between accuracy and robustness when using only the main or only the auxiliary BatchNorm (α = 0 or α = 1). However, at intermediate values such as the simple average, the model shows impressive gains in test accuracy without sacrificing model robustness. Note, higher accuracy is better whereas lower corruption error is better.
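A minimal sketch of this prediction blending is shown below, reusing the hypothetical BatchNorm-switching helpers from the earlier training sketch; the convention that α weights the main branch is an assumption for illustration.

```python
import torch

@torch.no_grad()
def blended_predict(model, images, alpha=0.5):
    """Blend class probabilities from the two BatchNorm branches (illustrative sketch).

    alpha = 1.0 uses only the main BatchNorm, alpha = 0.0 only the auxiliary one, and
    alpha = 0.5 is the simple average discussed in the text.
    """
    model.eval()
    model.use_main_bn()                  # hypothetical switch (see the Section 3 sketch)
    p_main = torch.softmax(model(images), dim=-1)
    model.use_aux_bn()
    p_aux = torch.softmax(model(images), dim=-1)
    return alpha * p_main + (1.0 - alpha) * p_aux
```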

6.2 Fourier Sensitivity Provides Another Perspective of Robustness

The robustness studies so far have only focused on a few parameterized noises from the Common Corruptions benchmark. To obtain a more general perspective on model robustness, we turn to analyzing noise in Fourier space using the strategy introduced by Yin et al. [35]. This evaluation strategy involves perturbing each image in the test set with noise sampled from different orientations and frequencies in Fourier space and then measuring the change in test set performance. The results are plotted in a heatmap, where the position indicates the location of the noise in Fourier space, with the lowest frequencies at the center of the image, and the color represents the test error. Results are presented in Figure 4 for the WideResNet-28-2 models trained on CIFAR-10. For these examples, the magnitude of the noise is set to a norm of 8.0, larger than in previous studies [13, 35] but allowing for better visualization of the differences between models. Following the interpolation results presented in the previous section, the figures show the heatmaps for the main BatchNorm, the auxiliary BatchNorm, and the average of these two predictions.
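A rough sketch of the perturbation used in this kind of analysis is given below; the exact normalization and sign conventions of Yin et al. [35] may differ, so treat this as an assumed illustration of adding a single Fourier basis pattern at frequency (i, j) with a fixed norm.

```python
import numpy as np

def fourier_basis_noise(h, w, i, j, eps=8.0):
    """Spatial-domain pattern for the Fourier frequency (i, j), scaled to norm eps (sketch)."""
    freq = np.zeros((h, w), dtype=complex)
    freq[i % h, j % w] = 1.0
    basis = np.real(np.fft.ifft2(freq))              # real-valued sinusoidal pattern
    basis = basis / (np.linalg.norm(basis) + 1e-12)  # unit l2 norm
    return eps * basis

# Example usage (assumed 32x32 RGB image in [0, 1]): add the same pattern to every channel.
# noisy = np.clip(image + fourier_basis_noise(32, 32, 3, 5)[..., None], 0.0, 1.0)
```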

The frequency analysis shows the same patterns as the Common Corruptions results in the previous section. Specifically, the main BatchNorm appears highly sensitive, particularly to high-frequency noise. In contrast, the auxiliary BatchNorm shows greater robustness overall. Interpolating between the two predictions provides similar benefits as before, maintaining most of the robustness gained from strong augmentation.

[Figure 4 panels: (a) Main BatchNorm, (b) Average of Predictions, (c) Auxiliary BatchNorm]
Figure 4: Heatmaps of the Fourier sensitivity of the test error for various models, using the method from [35]. Each figure shows the sensitivity to various sinusoidal gratings. We show results for WideResNet-28-2 models trained on CIFAR-10, including the main BatchNorm, the auxiliary BatchNorm, and the average of these two predictions. The corruption errors scale from 1.0 (red) to 0.0 (dark blue).

6.3 Where are the improvements in robustness from separated BatchNorms coming from?

Figure 5: Effects of increasing the size of a low pass filter on the test performance of various WideResNet-28-2 models trained on CIFAR-10. The distributions show that RandAugment uses lower frequency information than a model with no augmentations. Our model, Weak Augment, appears to interpolate between these two regimes.

Finally, we analyze which frequencies are most important for model performance to gain insight into why the Weak Augment method is effective. This is achieved by defining a low-pass filter of bandwidth B as the operation that sets all frequency components outside of a centered square of width B in the Fourier spectrum (centered around the lowest frequency) to zero; an inverse Discrete Fourier Transform is then applied to recover the image. Figure 5 presents the effect of applying a low-pass filter with increasing bandwidth on 500 examples from the CIFAR-10 test set, measuring the difference in accuracy for WideResNet-28-2 models trained with no augmentations, with Weak Augment under varying parameters, and with RandAugment. Generally, RandAugment uses significantly lower frequency information to achieve peak classifier performance when compared to no augmentations. The Weak Augment model effectively interpolates between these two regimes. The improvements in clean accuracy coming from Weak Augment could potentially be explained by a more effective use of the higher frequency information available in the training images.
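A small sketch of this low-pass filtering procedure follows; the function name and array layout are illustrative assumptions, but the operation matches the description above: zero everything outside a centered B×B square in the shifted spectrum and invert the DFT.

```python
import numpy as np

def low_pass_filter(image, bandwidth):
    """Keep only the frequencies inside a centered square of width `bandwidth` (sketch)."""
    # Shift the 2D spectrum so the lowest frequency sits at the center.
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    h, w = image.shape[:2]
    mask = np.zeros_like(spectrum)
    y0 = h // 2 - bandwidth // 2
    x0 = w // 2 - bandwidth // 2
    mask[y0:y0 + bandwidth, x0:x0 + bandwidth] = 1.0
    # Zero everything outside the square, then invert the DFT to recover the image.
    filtered = np.fft.ifft2(np.fft.ifftshift(spectrum * mask, axes=(0, 1)), axes=(0, 1))
    return np.real(filtered)
```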

7 Conclusion

In this work, we analyze the popular training paradigm of using separated BatchNorm parameters and show that it can be applied to generic data augmentation setups. Specifically, we find that naively applying this approach and using clean (un-augmented) and augmented images to define the two BatchNorms does not lead to any improvements. Instead, we propose using weak augmentations rather than clean ones to define the main BatchNorm and find significant improvements on CIFAR-10, CIFAR-100, and ImageNet benchmarks. Finally, we find that defining BatchNorm parameters based on weak augmentations leads to problems in model robustness but that this problem can be overcome by interpolating between the predictions of the two BatchNorm parameters.

Acknowledgements

We thank Irwan Bello and the Google Brain Team for helpful feedback on this manuscript.

Footnotes

  1. https://pillow.readthedocs.io/en/5.1.x
  2. Note, AdvProp studies a specific case of the generic separated BatchNorm setup, where AutoAugment is applied to all images and adversarial noise is limited to the auxiliary BatchNorm.
  3. RandAugment assumes the default augmentations of flips, crops, and Cutout are included, as per Cubuk et al. [7].
  4. Cutout is applied in addition to flip and crop augmentations, with a pad size of 16/90 for CIFAR-10/ImageNet models, respectively.

References

  1. J. R. Bellegarda, P. V. de Souza, A. J. Nadas, D. Nahamoo, M. A. Picheny, and L. R. Bahl (1992). Robust speaker adaptation using a piecewise linear acoustic mapping. In Proceedings of ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 445–448.
  2. Y. Bengio, F. Bastien, A. Bergeron, N. Boulanger-Lewandowski, T. Breuel, Y. Chherawala, M. Cisse, M. Côté, D. Erhan, J. Eustache, X. Glorot, X. Muller, S. P. Lebeuf, R. Pascanu, S. Rifai, F. Savard, and G. Sicard (2011). Deep learners benefit more from out-of-distribution examples. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR Vol. 15, pp. 164–172.
  3. F. M. Cariucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulo (2017). AutoDIAL: automatic domain alignment layers. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 5077–5085.
  4. S. Cheng, Z. Leng, E. D. Cubuk, B. Zoph, C. Bai, J. Ngiam, Y. Song, B. Caine, V. Vasudevan, and C. Li (2020). Improving 3D object detection through progressive population based augmentation. arXiv preprint arXiv:2004.00831.
  5. D. Ciregan, U. Meier, and J. Schmidhuber (2012). Multi-column deep neural networks for image classification. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3642–3649.
  6. E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le (2019). AutoAugment: learning augmentation strategies from data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 113–123.
  7. E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2019). RandAugment: practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719.
  8. J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009). ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  9. T. DeVries and G. W. Taylor (2017). Improved regularization of convolutional neural networks with Cutout. arXiv preprint arXiv:1708.04552.
  10. H. Fang, J. Sun, R. Wang, M. Gou, Y. Li, and C. Lu (2019). InstaBoost: boosting instance segmentation via probability map guided copy-pasting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 682–691.
  11. R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He (2018). Detectron.
  12. R. Gontijo-Lopes, S. J. Smullin, E. D. Cubuk, and E. Dyer (2020). Affinity and diversity: quantifying mechanisms of data augmentation. arXiv preprint arXiv:2002.08973.
  13. R. Gontijo-Lopes, D. Yin, B. Poole, J. Gilmer, and E. D. Cubuk (2019). Improving robustness without sacrificing accuracy with Patch Gaussian augmentation. arXiv preprint arXiv:1906.02611.
  14. R. Hataya, J. Zdenek, K. Yoshizoe, and H. Nakayama (2019). Faster AutoAugment: learning augmentation strategies using backpropagation. arXiv preprint arXiv:1911.06987.
  15. Z. He, L. Xie, X. Chen, Y. Zhang, Y. Wang, and Q. Tian (2019). Data augmentation revisited: rethinking the distribution gap between clean and augmented data. arXiv preprint arXiv:1909.09148.
  16. D. Hendrycks and T. Dietterich (2019). Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261.
  17. D. Hendrycks, N. Mu, E. D. Cubuk, B. Zoph, J. Gilmer, and B. Lakshminarayanan (2019). AugMix: a simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781.
  18. E. Hoffer, T. Ben-Nun, I. Hubara, N. Giladi, T. Hoefler, and D. Soudry (2019). Augment your batch: better training with larger batches. arXiv preprint arXiv:1901.09335.
  19. S. Ioffe and C. Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
  20. P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020). Supervised contrastive learning. arXiv preprint arXiv:2004.11362.
  21. A. Krizhevsky and G. Hinton (2009). Learning multiple layers of features from tiny images.
  22. A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
  23. Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou (2016). Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779.
  24. Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou (2016). Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779.
  25. S. Lim, I. Kim, T. Kim, C. Kim, and S. Kim (2019). Fast AutoAugment. In Advances in Neural Information Processing Systems, pp. 6662–6672.
  26. A. J. Ratner, H. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré (2017). Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems, pp. 3236–3246.
  27. I. Sato, H. Nishimura, and K. Yokoi (2015). APAC: augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229.
  28. P. Y. Simard, D. Steinkraus, and J. C. Platt (2003). Best practices for convolutional neural networks applied to visual document analysis. In ICDAR.
  29. K. Sohn, D. Berthelot, C. Li, Z. Zhang, N. Carlini, E. D. Cubuk, A. Kurakin, H. Zhang, and C. Raffel (2020). FixMatch: simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.
  30. M. Tan and Q. V. Le (2019). EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
  31. X. Wang, Y. Jin, M. Long, J. Wang, and M. I. Jordan (2019). Transferable normalization: towards improving transferability of deep neural networks. In Advances in Neural Information Processing Systems 32, pp. 1953–1963.
  32. C. Xie, M. Tan, B. Gong, J. Wang, A. L. Yuille, and Q. V. Le (2020). Adversarial examples improve image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 819–828.
  33. Q. Xie, Z. Dai, E. Hovy, M. Luong, and Q. V. Le (2019). Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.
  34. Q. Xie, M. Luong, E. Hovy, and Q. V. Le (2020). Self-training with Noisy Student improves ImageNet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698.
  35. D. Yin, R. Gontijo-Lopes, J. Shlens, E. D. Cubuk, and J. Gilmer (2019). A Fourier perspective on model robustness in computer vision. In Advances in Neural Information Processing Systems, pp. 13255–13265.
  36. S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019). CutMix: regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6023–6032.
  37. S. Zagoruyko and N. Komodakis (2016). Wide residual networks. arXiv preprint arXiv:1605.07146.
  38. M. Zając, K. Żołna, and S. Jastrzębski (2019). Split batch normalization: improving semi-supervised learning under domain shift. arXiv preprint arXiv:1904.03515.
  39. H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017). MixUp: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
  40. X. Zhang, Q. Wang, J. Zhang, and Z. Zhong (2019). Adversarial AutoAugment. arXiv preprint arXiv:1912.11188.
  41. Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang (2017). Random erasing data augmentation. arXiv preprint arXiv:1708.04896.
  42. B. Zoph, E. D. Cubuk, G. Ghiasi, T. Lin, J. Shlens, and Q. V. Le (2019). Learning data augmentation strategies for object detection. arXiv preprint arXiv:1906.11172.
  43. B. Zoph, G. Ghiasi, T. Lin, Y. Cui, H. Liu, E. D. Cubuk, and Q. V. Le (2020). Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882.
