
Partially-Shared Variational Auto-encoders for Unsupervised Domain Adaptation with Target Shift



Abstract

This paper proposes a novel approach for unsupervised domain adaptation (UDA) with target shift. Target shift is a mismatch in label distribution between the source and target domains; typically, it appears as class-imbalance in the target domain.
In practice, this is an important problem in UDA: because we do not know the labels in the target domain dataset, we do not know whether its label distribution is identical to that of the source domain dataset.
Many traditional approaches achieve UDA through distribution matching, by minimizing maximum mean discrepancy or by adversarial training; however, these approaches implicitly assume a coincidence of the two distributions and do not work under target shift.
Some recent UDA approaches focus on class boundaries, and some of them are robust to target shift, but they are applicable only to classification, not to regression.

To overcome the target shift problem in UDA, the proposed method, partially-shared variational auto-encoders (PS-VAEs), uses pair-wise feature alignment instead of feature distribution matching. PS-VAEs inter-convert the domain of each sample via a CycleGAN-based architecture while preserving its label-related content.
To evaluate the performance of PS-VAEs, we carried out two experiments: UDA with class-imbalanced digit datasets (classification), and UDA from synthesized data to real observations in human-pose estimation (regression). The proposed method demonstrated its robustness against class-imbalance in the classification task and outperformed the other methods in the regression task by a large margin.


1 Introduction

Figure 1: (best viewed in color) Overview of the proposed approach. When two domains have different label distributions, feature distribution matching causes misalignment in feature space (with the classifier/regressor trained on source domain samples). The proposed method avoids this by sample-wise feature matching, where pseudo sample pairs with identical labels are generated via a CycleGAN-based architecture.

Unsupervised domain adaptation (UDA) is one of the most studied topics in recent years.
One attractive application of UDA is adaptation from computer graphics (CG) data to sensor-observed (Obs.) data.
By constructing a CG-rendering system, we can easily obtain a large amount of diverse supervised data for training.
We denote this task as CG→Obs UDA. It is typically tested on semantic segmentation of traffic scenes [11, 32] and achieves remarkable performance.

As in ADDA [36], the typical approach for UDA is to match feature distributions between the source and target domains [21, 8, 18].
This approach works impressively with balanced datasets, such as those for digits (MNIST, USPS, and SVHN) and traffic-scene semantic segmentation (GTA5 [28] → Cityscapes [5]). When the prior label distributions of the source and target domains are mismatched, however, such approaches hardly work without a countermeasure for the mismatch (see Figure 1).
Cluster finding [32, 31, 6] is another approach to UDA, and such approaches are more robust against mismatched distributions; they are not, however, applicable to regression problems.

In this paper, we propose a novel UDA method applicable to both classification and regression problems with mismatched label distributions.
The problem of mismatched label distributions is also known as target shift [39, 10] or prior probability shift [26]. The typical example of this problem is hidden class-imbalance in the target domain dataset.
Previous methods based on deep learning tried to overcome this problem by estimating importance weights for each category [37, 38, 1, 2, 3] or for each sample [14].
The former approach is applicable only to classification tasks, while the latter under-samples the source domain data.

In contrast, our method resolves this problem by oversampling with data augmentation via the CycleGAN architecture [41].
More concretely, the basic strategy of the method is to generate pseudo-pairs of source and target samples (with identical labels) by using the CycleGAN architecture.
Unlike other CycleGAN-based UDA methods [11, 30], the proposed method does not match feature distributions.
Instead, it aligns features extracted from each pseudo-pair in feature space as shown in Figure 1.
In addition, the two encoders are designed to share weights, so the same encoder is used to encode either sample of a pseudo-pair.
Naively minimizing the distance between the paired samples leads to bad convergence because of competition with the losses in CycleGAN training; those losses implicitly force the features on the source→target and target→source paths to contain different information in addition to the common label-related content.
Hence, we disentangle the features into domain-invariant and domain-specific components to avoid such competition.
We further stabilize training by making the encoders and decoders share weights, giving what we call partially-shared auto-encoders (PS-AEs). As a side benefit, this implementation enables us to introduce the mechanism of a variational auto-encoder (VAE) [7], which is known to be effective for DA tasks [18, 20].

The contribution of this paper is three-fold.

  • We propose a novel UDA method that overcomes the target shift problem by oversampling with data augmentation.

  • The proposed method achieved the best performance for UDA with heavily-class-imbalanced digit datasets.

  • We tackled the problem of human-pose estimation by UDA with target shift for the first time and outperformed the baselines by a large margin.

2 Related Work

                                     Balance          Imbalance
Method                               cat.    reg.     cat.    reg.
ADDA [36], UFDN [18], CyCADA [11]    ✓       (✓)      –       –
MCD [32]                             ✓       –        (✓)     –
PADA [37]                            ✓       –        (✓)     –
SimGAN [33]                          ✓       ✓        –       (✓)
Ours                                 ✓       (✓)      ✓       ✓

Table 1: Representative UDA methods and their supported situations. The symbol “(✓)” indicates that the method theoretically supports the situation but this was not experimentally confirmed in the original paper.
The abbreviations “cat.” and “reg.” indicate categorization and regression, respectively.

The most popular approach in recent UDA methods is to match the feature distributions of the source and target domains so that a classifier trained with the source domain dataset is applicable to target domain samples.
There are various options for matching the distributions, such as minimizing MMD [21, 37], using a gradient-reversal layer with domain discriminators [8], and using alternating adversarial training with domain discriminators [36, 15, 2, 18].
Adversarial training removes domain bias from the feature representation.
To preserve information in features as much as possible, UFDN [18] adds a decoder to the network for a loss-less encoding. Because the features have no domain information, this method feeds a domain code, a one-hot vector representation for domain reference, to the decoder (with the encoded feature). The encoder and decoder in this model compose a VAE [7].
Another approach is feature whitening [29], which whitens features from each domain at domain-specific alignment layers. This approach does not use adversarial training; instead, it analytically fits the feature distribution of each domain to a common spherical distribution. As shown in Table 1, all these methods are theoretically applicable to both classification and regression, but they are limited to situations without target shift.

Saito et al. proposed MCD [32, 31], which does not use distribution matching.
Instead, the classifier discrepancy is measured based on the difference of decision boundaries between multiple classifiers. DIRT-T [34] and CLAN [22] are additional approaches focusing on boundary adjustment.
These approaches are expected to be robust against class imbalance, because they focus only on the boundaries and do not try to match the distributions.
CAT [6] is a method that aligns clusters found by other backbone methods. These approaches assume the existence of boundaries between clusters.
Hence, they are not applicable to regression problems, which have continuous sample distributions (see the second row in Table 1).

Partial domain adaptation (PDA) is a variant of UDA with several papers on it [37, 38, 1, 2, 3] (see the third row in Table 1).
This problem assumes a situation in which some categories in the source domain do not appear in the target domain.
This problem is a special case of UDA with target shift in two senses: it always assumes the absence of a class rather than class-imbalance, and it does not assume a regression task.
The principal approach for this problem is to estimate an importance weight for each category and ignore the categories judged as unimportant (under-sampling).
PADACO [14] is an extension of PADA for a regression problem and was designed to estimate head poses in UDA with target shift.
It first trains the model with the source domain dataset and obtains pseudo-labels for the target domain dataset. Then, using the similarity of estimated labels, it sets a sampling weight for each sample in the source domain (under-sampling).
Finally, it performs UDA training with a weighted sampling strategy for the source domain dataset.
To obtain better results with this method, it is important to obtain good sampling weights at the first stage.
In this sense, like CAT [6], PADACO requires a good backbone method that provides a good label similarity metric.

Label-preserving domain conversion is another important approach and includes the proposed method (see fourth and fifth rows in Table 1).
Shrivastava et al. proposed SimGAN [33], which converts CG images into nearly real images by adversarial training.
This method tries to preserve labels by minimizing the self-regularization loss, the pixel-value difference between images before and after conversion.
In the sense that the method generates target-domain-like samples from the source domain dataset using a GAN, we can say it is a method based on over-sampling with data augmentation.
We note that this work can be regarded as the first deep-learning-based UDA method for regression that is theoretically applicable to the task with target shift.

CyCADA [11] combines CycleGAN, ADDA, and SimGAN for better performance.
It first generates fake target domain images via CycleGAN. The label-consistency of the generated samples is preserved by SimGAN’s self-regularization loss; however, CyCADA also has a discriminator that matches the feature distributions.
Hence, this method in principle has the same weakness against target shift.
SBADAGAN [30] is yet another CycleGAN-based method with a discriminator for feature distribution matching.

From the viewpoint of human-pose-estimation, a method has been proposed quite recently that estimates human-pose in a UDA manner [40].
It uses a synthesized depth image dataset as the source domain dataset.
The target domain is given with depth and RGB images.
The final goal is to estimate 3D poses from RGB images.
It performs domain adaptation by transferring knowledge via an additional domain of body-part label representations. The body-part label space is expected to be domain-invariant because of its discrete representation, and the method tries to transfer knowledge through this discrete space.
The method was evaluated in UDA, weakly-supervised DA, and fully supervised DA settings.
Target shift was not discussed in this paper because the target domain dataset has enough diversity.

3 Method

3.1 Problem statement

Let $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{N_s}$ be the samples and their labels in the source domain dataset ($y_i^s \in \mathcal{Y}$, where $\mathcal{Y}$ is the label space), and let $X_t = \{x_j^t\}_{j=1}^{N_t}$ be the samples in the target domain dataset. The target labels $y^t$ and their distribution $p(y^t)$ are unknown (and may shift from $p(y^s)$) in the problem of UDA with target shift.
The goal of this problem is to obtain a high-accuracy model for predicting the labels of samples obtained in the target domain.

3.2 Overview of the proposed method

The main strategy of the proposed method is to replace the feature distribution matching process with pair-wise feature alignment.
To achieve this, we adopt the CycleGAN architecture shown in Figure 2.
The model is designed to generate pseudo-pairs $(x_s, \hat{x}_t)$ and $(x_t, \hat{x}_s)$, where the two members of each pair are expected to have an identical label.
In addition to CycleGAN’s original losses, we add two new losses: $\mathcal{L}_{pred}$ for label prediction and $\mathcal{L}_{fc}$ for feature alignment, where both losses are calculated only on the domain-invariant component of the disentangled feature representation ($z_s^{inv}$ or $z_t^{inv}$).
After training, prediction in the target domain is done by the path encoder→predictor ($M(E_t(x_t))$).
Section 3.3 describes this modification in detail.

To preserve the label-related content during pseudo-pair generation, we further modify the network by sharing weights and introducing the VAE mechanism (see Figure 3). Section 3.4 describes this modification in detail.

Figure 2: (best viewed in color) Architecture of CycleGAN with disentangled features. The major changes from the original CycleGAN (variables, losses with their back-propagating paths) are shown in color.
Figure 3: (best viewed in color) Architecture of the partially-shared variational auto-encoders, with the path for calculating $\mathcal{L}_{ident}$. VAE’s re-sampling process is applied when calculating $\mathcal{L}_{ident}$ but not with the other losses.

3.3 Disentangled CycleGAN with feature consistency loss

The model in Figure 2 has pairs of encoders $E_d$, generators $G_d$, and discriminators $D_d$, where $d \in \{s, t\}$. $\hat{x}_{\bar{d}}$ is generated as $G_{\bar{d}}(E_d(x_d))$, and $\tilde{x}_d$ as $G_d(E_{\bar{d}}(\hat{x}_{\bar{d}}))$.
The original CycleGAN [41] is trained by minimizing the cycle-consistency loss $\mathcal{L}_{cyc}$, the identity loss $\mathcal{L}_{ident}$, and the adversarial loss $\mathcal{L}_{adv}$ defined in LSGAN [24]:

$$\mathcal{L}_{cyc} = \sum_{d \in \{s,t\}} \mathbb{E}_{x_d \sim X_d}\left[\,\ell(\tilde{x}_d, x_d)\,\right], \quad (1)$$

where $\bar{d}$ denotes the domain opposite to $d$ and $\ell(\cdot,\cdot)$ is a distance function,

$$\mathcal{L}_{ident} = \sum_{d \in \{s,t\}} \mathbb{E}_{x_d \sim X_d}\left[\,\ell(G_d(E_d(x_d)), x_d)\,\right], \quad (2)$$
$$\mathcal{L}_{adv} = \sum_{d \in \{s,t\}} \mathbb{E}_{x_d \sim X_d}\left[(D_d(x_d) - 1)^2\right] + \mathbb{E}_{x_{\bar{d}} \sim X_{\bar{d}}}\left[D_d(\hat{x}_d)^2\right]. \quad (3)$$

We note that we used spectral normalization [25] in $D_s$ and $D_t$ for stable adversarial training.
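To make the objective concrete, the following is a minimal PyTorch sketch of the losses in Eqs. (1)-(3) with LSGAN targets and spectral normalization in the discriminators. The tiny networks are placeholders of our own, not the architecture used in the paper (which appears in the supplementary material).

```python
# A minimal sketch of the CycleGAN losses of Eqs. (1)-(3). Placeholder nets.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import spectral_norm

Z_DIM = 64  # total feature channels (z^inv + z^spc)

class Encoder(nn.Module):        # E_d: image -> feature z
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, Z_DIM, 4, 2, 1))

    def forward(self, x):
        return self.net(x)

class Generator(nn.Module):      # G_d: feature z -> image
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(Z_DIM, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Tanh())

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):  # D_d with spectral normalization [25]
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Conv2d(1, 32, 4, 2, 1)), nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(32, 1, 4, 2, 1)))

    def forward(self, x):
        return self.net(x)

def generator_losses(x_s, x_t, E, G, D):
    """Generator-side terms of Eqs. (1)-(3). E, G, D are dicts keyed by
    domain name: {'s': ..., 't': ...}."""
    l_cyc = l_ident = l_adv = 0.0
    for d, d_bar, x in (('s', 't', x_s), ('t', 's', x_t)):
        x_hat = G[d_bar](E[d](x))              # domain-converted sample
        x_tilde = G[d](E[d_bar](x_hat))        # cycle-reconstructed sample
        l_cyc += F.smooth_l1_loss(x_tilde, x)          # Eq. (1)
        l_ident += F.smooth_l1_loss(G[d](E[d](x)), x)  # Eq. (2)
        l_adv += ((D[d_bar](x_hat) - 1) ** 2).mean()   # Eq. (3), G side
    return l_cyc, l_ident, l_adv

def discriminator_loss(x_real, x_hat, D_d):
    """Discriminator-side LSGAN terms of Eq. (3) for one domain."""
    return ((D_d(x_real) - 1) ** 2).mean() + (D_d(x_hat.detach()) ** 2).mean()
```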

To successfully achieve pair-wise feature alignment, the model divides the output of $E_d$ into $z_d = (z_d^{inv}, z_d^{spc})$, a domain-invariant component and a domain-specific component.
Then, it performs feature alignment by using the domain-invariant feature consistency loss $\mathcal{L}_{fc}$, defined as

$$\mathcal{L}_{fc} = \sum_{d \in \{s,t\}} \mathbb{E}_{x_d \sim X_d}\left[\,\ell(z_d^{inv}, \hat{z}_{\bar{d}}^{inv})\,\right], \quad (4)$$

where $\hat{z}_{\bar{d}}^{inv}$ is the domain-invariant component of $E_{\bar{d}}(\hat{x}_{\bar{d}})$.
Note that gradients from $\mathcal{L}_{fc}$ are not back-propagated through both ends of the pair (see the path of $\mathcal{L}_{fc}$ in Figure 2), because updating both $z_d^{inv}$ and $\hat{z}_{\bar{d}}^{inv}$ in one step leads to bad convergence.

In addition, $z_s^{inv}$ obtained from $x_s$ is fed into the predictor $M$ to train the classifier/regressor by minimizing the prediction loss $\mathcal{L}_{pred}$.
The concrete implementation of $\mathcal{L}_{pred}$ is task-dependent.
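The sketch below illustrates how the feature split and the two new losses can be realized. The channel split point, the detached side of Eq. (4) (our reading of the stop-gradient note above), and the linear predictor head are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the feature split and the losses L_fc (Eq. 4) and L_pred.
# Z_INV/Z_SPC sizes, the detached side, and the predictor are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

Z_INV, Z_SPC = 32, 32  # channels of z^inv and z^spc (placeholder sizes)

def split_feature(z):
    """Divide the encoder output into (z^inv, z^spc) along channels."""
    return z[:, :Z_INV], z[:, Z_INV:]

def feature_consistency_loss(z_inv, z_hat_inv):
    """Eq. (4): align z^inv of a sample and z^inv of its pseudo-pair.
    One side is detached so that gradients are not back-propagated
    through both ends of the pair in a single step."""
    return F.smooth_l1_loss(z_hat_inv, z_inv.detach())

predictor = nn.Linear(Z_INV * 7 * 7, 10)  # M: placeholder classifier head

def prediction_loss(z_inv_s, y_s):
    """L_pred for classification: cross-entropy on labeled source features."""
    return F.cross_entropy(predictor(z_inv_s.flatten(1)), y_s)
```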

We avoid applying $\mathcal{L}_{fc}$ to the whole feature $z_d$, as training can then hardly reach good local minima because of the competition between the pair-wise feature alignment (by $\mathcal{L}_{fc}$) and CycleGAN (by $\mathcal{L}_{cyc}$ and $\mathcal{L}_{adv}$). Specifically, training $G_t$ to generate $\hat{x}_t$ must yield a dependency of $\hat{x}_t$ on $z_s$.
This means that $z_s$ is trained to contain in-domain variation information for the target domain.
The situation is the same with $G_s$ and $z_t$.
Hence, $z_s$ and $z_t$ have dependencies on different factors, the target-domain and source-domain variations, respectively, and it is difficult to match the whole features $z_s$ and $z_t$. The disentanglement into $z^{inv}$ and $z^{spc}$ resolves this situation. Note that this architecture is similar to DRIT [17].

Figure 4: Misalignment caused by CycleGAN’s two image-space discriminators. This is typically seen with a model that does not share encoder weights.

3.4 Partially shared VAEs

Next, we expect $E_s$ and $E_t$ to output a domain-invariant feature $z^{inv}$. Even with this implementation, however, CycleGAN can misalign an image’s label-related content in domain conversion under a severe target shift, because its discriminators match image distributions rather than feature distributions.
Figure 4 shows examples of misalignment caused by the image-space discriminators.
This happens because the decoders $G_s$ and $G_t$ can convert identical $z^{inv}$s into different digits, for example, to better minimize $\mathcal{L}_{adv}$ under imbalanced distributions. In such cases, the corresponding encoders also extract identical $z^{inv}$s from images with totally different appearances.

To prevent such misalignment and obtain more stable results, we make the decoders share weights so that they generate similar content from $z^{inv}$, and we make the encoders extract $z^{inv}$ only from similar content.
Figure 3 shows the details of the parameter-sharing architecture, which consists of units called partially-shared auto-encoders (PS-AEs).
Formally, the partially shared encoder is described as a function $E: x \mapsto (z^{inv}, z^{s}, z^{t})$. In our implementation, only the last layer is divided into three parts, which output $z^{inv}$, $z^{s}$, and $z^{t}$, respectively.
$E$ can obviously be substituted for $E_s$ and $E_t$ by discarding $z^{t}$ or $z^{s}$ from the output, respectively.
Similarly, the generator $G$ shares all weights other than those of the first layer, which consists of three parts taking $z^{inv}$, $z^{s}$, and $z^{t}$ as inputs. $G$ can be substituted for $G_s$ and $G_t$ by inputting $(z^{inv}, z^{s})$ or $(z^{inv}, z^{t})$, respectively.
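One possible realization of this weight sharing is sketched below. The layer sizes are placeholders, and summing the branch outputs is one way to implement a first generator layer that is "divided into three parts" (feeding zeros to the unused branch is then equivalent to dropping it).

```python
# Sketch of the partially-shared encoder/generator (PS-AEs).
# All layers are shared except the split encoder heads and the
# branched first generator layer. Sizes are placeholders.
import torch
import torch.nn as nn

class PartiallySharedEncoder(nn.Module):
    def __init__(self, z_inv=32, z_spc=32):
        super().__init__()
        self.trunk = nn.Sequential(          # weights shared by both domains
            nn.Conv2d(1, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 64, 4, 2, 1), nn.ReLU())
        self.head_inv = nn.Conv2d(64, z_inv, 3, 1, 1)  # -> z^inv
        self.head_s = nn.Conv2d(64, z_spc, 3, 1, 1)    # -> z^s
        self.head_t = nn.Conv2d(64, z_spc, 3, 1, 1)    # -> z^t

    def forward(self, x, domain):
        h = self.trunk(x)
        # E_d is obtained by discarding the other domain's specific head.
        z_spc = self.head_s(h) if domain == 's' else self.head_t(h)
        return self.head_inv(h), z_spc

class PartiallySharedGenerator(nn.Module):
    def __init__(self, z_inv=32, z_spc=32):
        super().__init__()
        self.in_inv = nn.Conv2d(z_inv, 64, 3, 1, 1)  # branch for z^inv
        self.in_s = nn.Conv2d(z_spc, 64, 3, 1, 1)    # branch for z^s
        self.in_t = nn.Conv2d(z_spc, 64, 3, 1, 1)    # branch for z^t
        self.trunk = nn.Sequential(          # weights shared by both domains
            nn.ReLU(), nn.ConvTranspose2d(64, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, 4, 2, 1), nn.Tanh())

    def forward(self, z_inv, z_spc, domain):
        # G_d routes z^spc through its own domain branch; the unused
        # branch effectively receives zero input.
        h = self.in_inv(z_inv)
        h = h + (self.in_s(z_spc) if domain == 's' else self.in_t(z_spc))
        return self.trunk(h)
```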

This implementation brings another advantage for UDA tasks: it can further disentangle the feature space by making the model consist of two variational auto-encoders (VAEs), $\mathrm{VAE}_s$ and $\mathrm{VAE}_t$ (Figure 3).
Putting a VAE in a model to obtain a domain-invariant feature is reported as an effective option in recent domain adaptation studies [18, 20].
To make the PS-AEs a pair of VAEs, we apply VAE’s re-sampling process in the calculation of $\mathcal{L}_{ident}$ and add the KL loss $\mathcal{L}_{KL}$, defined as

$$\mathcal{L}_{KL} = \sum_{d \in \{s,t\}} \mathbb{E}_{x_d \sim X_d}\left[ D_{KL}\big(q(z^{inv} \mid x_d) \,\|\, \mathcal{N}^{inv}\big) + D_{KL}\big(q(z^{d} \mid x_d) \,\|\, \mathcal{N}^{spc}\big) \right], \quad (5)$$

where $D_{KL}(p \,\|\, q)$ is the KL divergence between two distributions $p$ and $q$, $q(z \mid x_d)$ is the distribution of $z$ sampled from $x_d$, and $\mathcal{N}^{inv}$ and $\mathcal{N}^{spc}$ are standard normal distributions with the same sizes as $z^{inv}$ and $z^{d}$, respectively.
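The sketch below shows the VAE mechanism under the assumption that each encoder head is reinterpreted to output a mean and log-variance; `reparameterize` is applied on the identity-mapping path, and `kl_loss` implements the closed-form KL terms of Eq. (5).

```python
# VAE re-sampling and the KL loss of Eq. (5). Assumes each encoder head
# outputs (mu, logvar) for its feature component.
import torch

def reparameterize(mu, logvar):
    """Re-sampling trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def kl_loss(mu, logvar):
    """Closed-form KL(N(mu, sigma^2) || N(0, I)), summed over feature
    dimensions and averaged over the batch; applied to z^inv and z^spc."""
    kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar)
    return kl.flatten(1).sum(dim=1).mean()
```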

Our full model, partially-shared variational auto-encoders (PS-VAEs), is trained by optimizing the weighted sum of all the above loss functions:

$$\mathcal{L} = \mathcal{L}_{adv} + \lambda_{cyc}\mathcal{L}_{cyc} + \lambda_{ident}\mathcal{L}_{ident} + \lambda_{fc}\mathcal{L}_{fc} + \lambda_{pred}\mathcal{L}_{pred} + \lambda_{KL}\mathcal{L}_{KL}, \quad (6)$$

where $\lambda_{cyc}$, $\lambda_{ident}$, $\lambda_{fc}$, $\lambda_{pred}$, and $\lambda_{KL}$ are hyper-parameters that should be tuned for each task.
For the distance function $\ell$, we use the smooth L1 distance [9], which is defined as

$$\ell(a, b) = \sum_{i} \begin{cases} 0.5\,(a_i - b_i)^2 & \text{if } |a_i - b_i| < 1, \\ |a_i - b_i| - 0.5 & \text{otherwise.} \end{cases} \quad (7)$$
4 Evaluation

4.1 Evaluation on class-imbalanced digit dataset

We first evaluated the performance of the proposed method on standard UDA tasks with digit datasets (MNIST [16] ↔ USPS [12] and SVHN [27] → MNIST). To evaluate the performance under a controlled situation with class-imbalance in the target domain, we adjusted the rate of samples of class ‘1’ from 10% to 50%. When the rate was 10%, the number of samples was exactly the same among the categories. When it was 50%, half the data belonged to category ‘1,’ which was the most imbalanced setting in this experiment. Note that the original data had slight differences in the numbers of samples between categories.
We adjusted these differences by randomly eliminating samples. Table 2 lists the numbers of samples in each dataset and class-imbalance.
Because SVHN was used only as a source domain, it had no imbalanced situation.
Note that USPS had only small numbers of samples (500 to 1000 for each category).
Hence, we over-sampled data from category ‘1’ with data augmentation (horizontal shifts of one or two pixels) to achieve the balance.
For the other datasets, we randomly discarded the samples.
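The following sketch illustrates how such a controlled subset can be constructed (with counts as in Table 2). The helper names and the exact augmentation policy are our own illustrative choices.

```python
# Sketch: build a class-imbalanced subset; class '1' is over-sampled with
# 1-2 pixel horizontal shifts when short, others are randomly discarded.
import numpy as np

def shift_horizontal(img, dx):
    """Horizontally shift a (H, W) image by dx pixels, zero-padding."""
    out = np.zeros_like(img)
    if dx >= 0:
        out[:, dx:] = img[:, :img.shape[1] - dx]
    else:
        out[:, :dx] = img[:, -dx:]
    return out

def build_imbalanced_subset(images, labels, n_one, n_other, rng):
    """Subset with n_one samples of class '1' and n_other per other class;
    class '1' is over-sampled by shifted copies when under-supplied."""
    xs, ys = [], []
    for c in range(10):
        idx = rng.permutation(np.where(labels == c)[0])
        n = n_one if c == 1 else n_other
        take = idx[:min(n, len(idx))]
        xs += [images[i] for i in take]
        ys += [c] * len(take)
        for k in range(n - len(take)):           # augment if under-supplied
            src = images[idx[k % len(idx)]]
            xs.append(shift_horizontal(src, int(rng.choice([-2, -1, 1, 2]))))
            ys.append(c)
    return np.stack(xs), np.asarray(ys)

# e.g., USPS target at 50% imbalance (Table 2): 4500 of '1', 500 of others
# (usps_x, usps_y are hypothetical pre-loaded arrays):
# x, y = build_imbalanced_subset(usps_x, usps_y, 4500, 500,
#                                np.random.default_rng(0))
```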

               10%     20%     30%     40%     50%
USPS (1)       500     1125    1922    3000    4500
USPS (other)   500     500     500     500     500
MNIST (1)      4000    4500    5400    6000    6300
MNIST (other)  4000    2000    1400    1000    700
SVHN (1)       4000    —       —       —       —
SVHN (other)   4000    —       —       —       —
Table 2: Number of samples under each condition. SVHN was used only as a source domain and thus had no imbalanced setting; 4,000 samples per class were used throughout.

In this task, $\mathcal{L}_{pred}$ is simply given as the following categorical cross-entropy loss:

$$\mathcal{L}_{pred} = -\mathbb{E}_{(x_s, y_s) \sim X_s}\left[ \sum_{c \in \mathcal{Y}} [\![\,y_s = c\,]\!] \log M_c\big(z_s^{inv}\big) \right], \quad (8)$$

where $M_c(\cdot)$ denotes the predicted probability of class $c$.

We compared the proposed method with the following baselines:

ADDA [36] and UFDN [18]

are methods based on simple distribution matching. They are applicable to both classification and regression.

PADA [2]

is also based on distribution matching, but it estimates an importance weight for each category to deal with class-imbalance.

SimGAN [33]

is a method based on image-to-image conversion. To prevent misalignment during conversion, it also minimizes changes in the pixel-values before and after conversion by using a self-regularization loss. The code is borrowed from the implementation of CyCADA.

CyCADA [11]

is a CycleGAN-based UDA method. The self-regularization loss is used in this method, too. In addition, it matches feature distributions, like ADDA.

MCD [32]

is a method that minimizes a discrepancy defined by the boundary differences obtained from multiple classifiers. This method is expected to be more robust against class-imbalance than methods based on distribution matching, because it does not focus on the entire distribution shape. On the other hand, this kind of approach is theoretically applicable only to classification but not to regression.

Note that some recent state-of-the-art methods for the balanced digit UDA task were not included in this experiment due to reproducibility problems.1 The detailed implementations (network architecture, hyper-parameters, and so on) of the proposed method and the above methods appear in the supplementary material.
Tables 3, 4, and 5 list the results. The methods based on distribution matching (ADDA and UFDN) were critically affected by class-imbalance in the target domain.
CyCADA was more robust than ADDA and UFDN in the MNIST↔USPS tasks, owing to the self-regularization loss, but it did not work for the SVHN→MNIST task because of the large pixel-value differences between the MNIST and SVHN samples.
In contrast, the proposed method was more robust against imbalance than the above methods, and it was more accurate than PADA and SimGAN.
As a result, our method achieved the best performance in most imbalance settings and demonstrated its robustness especially under the heaviest imbalance.

Method        Ref.    10%     20%     30%     40%     50%
Source only           71.0 (all settings)
ADDA          89.4    89.8    86.9    79.3    81.8    78.5
UFDN          97.1    94.0    90.4    83.2    82.3    83.8
PADA          —       75.3    77.7    79.3    77.8    80.2
SimGAN        —       72.4    86.5    84.0    84.3    76.3
CyCADA        95.6    91.8    91.0    80.3    86.4    87.6
MCD           94.2    91.2    90.4    79.0    78.5    80.3
Ours          —       93.9    94.8    93.4    94.6    92.6
Table 3: Accuracy in the MNIST→USPS task. “Ref.” indicates reference scores reported in the original papers; “Source only” does not use the target domain training data, so a single score is listed.

Method        Ref.    10%     20%     30%     40%     50%
Source only           55.6 (all settings)
ADDA          90.1    96.0    89.0    81.5    78.9    80.5
UFDN          93.7    93.6    81.9    79.2    72.0    69.1
PADA          —       47.9    39.2    36.0    29.8    25.2
SimGAN        —       68.3    50.2    49.9    63.8    49.3
CyCADA        96.5    75.3    75.3    75.2    76.7    70.7
MCD           94.1    96.0    81.5    79.1    78.1    77.4
Ours          —       94.8    94.4    90.8    82.6    82.4
Table 4: Accuracy in the USPS→MNIST task. “Ref.” indicates reference scores reported in the original papers.

Method        Ref.    10%     20%     30%     40%     50%
Source only           46.6 (all settings)
ADDA          76.0    75.5    65.0    65.2    50.8    54.3
UFDN          95.0    91.1    70.9    58.7    52.6    43.6
PADA          —       30.5    39.5    37.3    36.8    36.7
SimGAN        —       61.4    52.5    57.7    51.8    49.3
CyCADA        90.4    91.4    75.4    69.7    70.7    68.3
MCD           96.2    90.3    89.7    80.2    72.0    65.3
Ours          —       73.7    72.9    73.8    64.4    68.4
Table 5: Accuracy in the SVHN→MNIST task. “Ref.” indicates reference scores reported in the original papers.

4.2 Evaluation on human pose dataset

We also evaluated the proposed method with a regression task on human pose estimation.
For this task, we prepared as the source domain dataset a synthesized depth image dataset whose poses were sampled from CMU Mocap [4] and rendered with Poser Pro 2014 [35]. Each image had 18 joint positions. In the sampling, we avoided pose duplication by confirming that at least one joint had a position more than 50 mm away from its position in any other sample. The total number of source domain samples was 15,000. These were rendered with a choice of two human models (male and female), whose heights were sampled from normal distributions with respective means of 1.707 m and 1.579 m and standard deviations of 56.0 mm and 53.3 mm.
For the target dataset, we used depth images from the CMU Panoptic Dataset [13], which were observed with a Microsoft Kinect.
We automatically eliminated the background in the target domain data by preprocessing.2 Finally, 15,000 images were used for training and 500 images for testing, after manually annotating the joint positions.

Figure 5 shows the target shift between the source and target domains via the differences in joint positions at the head and foot.
In this experiment, we compared the proposed method with SimGAN, CyCADA, and MCD, with an ablation study.
All the methods were implemented with a common network structure, which appears in the supplementary materials.
$\mathcal{L}_{pred}$ was defined as the distance between the predicted and ground-truth joint positions:

$$\mathcal{L}_{pred} = \mathbb{E}_{(x_s, y_s) \sim X_s}\left[\,\ell\big(M(z_s^{inv}), y_s\big)\,\right], \quad (9)$$

where $y_s$ denotes the 18 annotated joint positions.
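A sketch of Eq. (9) under the assumption of direct joint-coordinate regression is shown below; the actual network configuration appears in the supplementary material, so the predictor head here is only illustrative.

```python
# Sketch of Eq. (9): smooth L1 loss between predicted and ground-truth
# positions of the 18 joints. The predictor head is an illustrative choice.
import torch
import torch.nn as nn
import torch.nn.functional as F

N_JOINTS = 18

class PosePredictor(nn.Module):
    """M: maps the domain-invariant feature z^inv to 18 (x, y) positions."""
    def __init__(self, z_inv_dim):
        super().__init__()
        self.fc = nn.Linear(z_inv_dim, N_JOINTS * 2)

    def forward(self, z_inv):
        return self.fc(z_inv.flatten(1)).view(-1, N_JOINTS, 2)

def pose_prediction_loss(predictor, z_inv_s, joints_s):
    """Eq. (9), computed on labeled source-domain samples only."""
    return F.smooth_l1_loss(predictor(z_inv_s), joints_s)
```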


Figure 5: Difference in human joint distributions between the source (CG) and target (Obs.) domains, shown for the head and left-foot positions. CG images are generated with diverse poses. In contrast, the observed poses tend to stagnate.
Figure 6: (best viewed in color) Averaged percentage of joints detected with errors of less than the given number of pixels. (Higher is better.)
Error less than 10 px         Head   Neck   Chest  Waist  Shoulder  Elbow  Wrist  Hand   Knee   Ankle  Foot   Avg.
Source only                   0.4    3.6    1.2    0.4    1.0       2.8    1.0    2.8    1.0    2.6    2.3    1.8
MCD                           4.6    7.0    0.2    0.6    1.4       0.2    0.3    0.9    0.4    21.0   16.6   5.3
SimGAN                        90.2   68.0   10.8   22.6   38.8      26.3   28.5   33.6   35.9   52.5   52.8   40.4
CyCADA                        90.0   69.0   15.4   28.2   39.5      27.3   31.3   32.5   35.4   54.4   53.2   41.0
Ours:
 CycleGAN+$\mathcal{L}_{fc}$  82.8   79.0   33.8   17.0   40.0      16.4   15.8   28.4   13.8   51.0   51.5   35.5
 D-CycleGAN                   93.0   85.8   21.4   47.8   42.5      42.5   35.8   39.2   42.5   66.9   64.1   50.8
 D-CycleGAN+VAE               40.6   34.2   17.6   41.2   10.1      10.2   7.5    6.4    20.0   28.0   20.2   18.6
 PS-AEs                       80.6   72.4   40.8   28.0   46.5      28.4   25.2   29.4   25.3   58.9   53.9   42.1
 PS-VAEs (full model)         89.4   84.6   21.4   43.4   51.7      54.4   49.4   43.9   45.6   74.5   74.0   57.0
Table 6: Accuracy in human-pose estimation by UDA (higher is better). Results were averaged over joints with left and right entries (e.g., the “Shoulder” column lists the average score for the left and right shoulders). The “Avg.” column lists the average score over all joints, rather than over only the joints appearing in this table.

Figure 7: (best viewed in color) Feature distributions visualized by t-SNE [23] for (a) Source Only, (b) SimGAN, (c) CyCADA, and (d) PS-VAEs (Ours): source domain CG data (blue points) and target domain observed data (red points).

Figure 8: (best viewed in color) Qualitative comparison of domain conversion. Detailed structure in the body region is lost with SimGAN but reproduced with our model.

Figure 9: (best viewed in color) Qualitative results of human-pose estimation. Due to the lack of detailed depth structure (as seen in Fig. 8), SimGAN and CyCADA often fail to estimate joints with self-occlusion.

Figure 6 shows the rate of samples whose estimated joint-position errors are less than a threshold (the horizontal axis shows the threshold in pixels). Table 6 lists the joint-wise results in terms of the rate with a threshold of ten pixels.
The full model of the proposed method achieved the best scores on average and for all joints other than the head, neck, chest, and waist. These four joints have less target shift than the others do (see Figure 5, for example).
SimGAN was originally designed for similar tasks (gaze estimation and hand-pose estimation) and achieved relatively good scores. CyCADA is an extension of SimGAN with additional losses for distribution matching, but these did not boost the accuracy in this task of UDA with target shift.
MCD was originally designed for classification tasks and did not work for this regression task, as expected.
Figure 7 shows the feature distributions obtained from four different methods.
Because SimGAN does not have any mechanism to align features in the feature space, the distributions did not merge well.
CyCADA mixed the distributions better, but the components remained separated.
In contrast, the proposed method merged the features quite well, even though no discriminator or discrepancy minimization was applied to them.
This indicates that the proposed pair-wise feature alignment by $\mathcal{L}_{fc}$ worked well in this UDA task.

A qualitative difference in domain conversion is shown in Figure 8.
SimGAN’s self-regularization loss worked to keep the silhouettes of the generated samples, but subtle depth differences within the body regions were not reproduced well.
In contrast, the proposed method was able to reproduce such subtle depth differences along with the silhouette. This difference contributed to the difference in prediction quality shown in Figure 9.

In the ablation study, we compared our full model with the following four different variations (see Table 6).

CycleGAN+$\mathcal{L}_{fc}$

does not divide $z_s$ and $z_t$ into the two components but applies $\mathcal{L}_{fc}$ to $z_s$ and $z_t$ directly.

D-CycleGAN

stands for disentangled CycleGAN, which divides $z_s$ and $z_t$ into the two components; however, the parameters of the encoders and decoders are not shared, and VAE is not used in the calculation of $\mathcal{L}_{ident}$.

D-CycleGAN+VAE

is a D-CycleGAN with $\mathcal{L}_{KL}$ and the re-sampling trick of VAE in the calculation of $\mathcal{L}_{ident}$.

PS-AEs

stands for partially-shared auto-encoders, whose encoders and decoders partially share parameters as described in Section 3.4, but without the VAE mechanism.

PS-VAEs

stands for partially-shared variational auto-encoders; this is the full model of the proposed method.

D-CycleGAN actually achieved the second-best result, while D-CycleGAN+VAE and PS-AEs did not work well.
As with UNIT [19],3 it seems difficult to use VAE with CycleGAN without sharing weights between the encoder-decoder models.
After combining all these modifications, the full model of the proposed method outperformed all the other methods by a large margin.

5 Conclusion

In this paper, we proposed a novel approach for unsupervised domain adaptation with target shift.
Our approach generates pseudo-feature pairs (with identical labels) to obtain an encoder that aligns target domain samples to the same locations as source domain samples according to their label similarities. Target shift is a common setting in UDA tasks because, in practice, target domain datasets are often not as well organized as the source domain datasets. To be robust against target shift, the method avoids feature distribution matching and instead obtains a common feature space by pair-wise feature alignment. To prevent the misalignment caused by adversarial training in image space, a CycleGAN-based model was modified to divide features into domain-invariant and domain-specific components, to share weights between the two encoder-decoder parts, and to further disentangle the features by the mechanism of a variational auto-encoder.
We evaluated the model on digit classification tasks and achieved the best performance under most of the imbalanced situations.
We also applied the method to a regression task of human-pose estimation and found that it outperformed the previous methods significantly.

Footnotes

  1. The authors of [29] provide no implementation, and there are currently no other authorized implementations. Two SBADAGAN [30] implementations were available, but it was difficult to customize them for this test, and the reported accuracy was not reproducible.
  2. The details of this background subtraction appear in the supplementary material.
  3. Another neural network model that combines CycleGAN and VAE, like the proposed model, but for image-to-image translation.

References

  1. Z. Cao, M. Long, J. Wang, and M. I. Jordan. Partial transfer learning with selective adversarial networks. In CVPR.
  2. Z. Cao, L. Ma, M. Long, and J. Wang. Partial adversarial domain adaptation. In ECCV, 2018.
  3. Z. Cao, K. You, M. Long, J. Wang, and Q. Yang. Learning to transfer examples for partial domain adaptation. In CVPR, 2019.
  4. CMU Graphics Lab. CMU graphics lab motion capture database. http://mocap.cs.cmu.edu/ (accessed on 11th Nov. 2019).
  5. M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016, pp. 3213–3223.
  6. Z. Deng, Y. Luo, and J. Zhu. Cluster alignment with a teacher for unsupervised domain adaptation. In ICCV, 2019.
  7. C. Doersch. Tutorial on variational autoencoders. 2016.
  8. M. Ghifary, W. B. Kleijn, and M. Zhang. Domain adaptive neural networks for object recognition. In Pacific Rim International Conference on Artificial Intelligence, 2014, pp. 898–904.
  9. R. Girshick. Fast R-CNN. In ICCV.
  10. M. Gong, K. Zhang, T. Liu, D. Tao, C. Glymour, and B. Schölkopf. Domain adaptation with conditional transferable components. In ICML, 2016, pp. 2839–2848.
  11. J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. CyCADA: cycle-consistent adversarial domain adaptation. 2018.
  12. J. J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(5):550–554, 1994.
  13. H. Joo, H. Liu, L. Tan, L. Gui, B. Nabbe, I. Matthews, T. Kanade, S. Nobuhara, and Y. Sheikh. Panoptic studio: a massively multiview system for social motion capture. In ICCV, 2015.
  14. F. Kuhnke and J. Ostermann. Deep head pose estimation using synthetic images and partial adversarial domain adaption for continuous label spaces. In ICCV.
  15. I. Laradji and R. Babanezhad. M-ADDA: unsupervised domain adaptation with deep metric learning. 2018.
  16. Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010.
  17. H. Lee, H. Tseng, J. Huang, M. Singh, and M. Yang. Diverse image-to-image translation via disentangled representations. In ECCV.
  18. A. H. Liu, Y. Liu, Y. Yeh, and Y. F. Wang. A unified feature disentangler for multi-domain image translation and manipulation. In Advances in Neural Information Processing Systems 31, 2018, pp. 2590–2599.
  19. M. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems 30, 2017, pp. 700–708.
  20. Y. Liu, Z. Wang, H. Jin, and I. Wassell. Multi-task adversarial network for disentangled feature learning. In CVPR, 2018, pp. 3743–3751.
  21. M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015, pp. 97–105.
  22. Y. Luo, L. Zheng, T. Guan, J. Yu, and Y. Yang. Taking a closer look at domain shift: category-level adversaries for semantics consistent domain adaptation. In CVPR.
  23. L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov):2579–2605, 2008.
  24. X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. Paul Smolley. Least squares generative adversarial networks. In ICCV, 2017, pp. 2794–2802.
  25. T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
  26. J. G. Moreno-Torres, T. Raeder, R. Alaiz-Rodríguez, N. V. Chawla, and F. Herrera. A unifying view on dataset shift in classification. Pattern Recognition 45(1):521–530, 2012.
  27. Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  28. S. R. Richter, V. Vineet, S. Roth, and V. Koltun. Playing for data: ground truth from computer games. In ECCV, 2016, pp. 102–118.
  29. S. Roy, A. Siarohin, E. Sangineto, S. R. Bulo, N. Sebe, and E. Ricci. Unsupervised domain adaptation using feature-whitening and consensus loss. In CVPR, 2019, pp. 9471–9480.
  30. P. Russo, F. M. Carlucci, T. Tommasi, and B. Caputo. From source to target and back: symmetric bi-directional adaptive GAN. In CVPR, 2018, pp. 8099–8108.
  31. K. Saito, Y. Ushiku, T. Harada, and K. Saenko. Adversarial dropout regularization. In ICLR, 2018.
  32. K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 2018, pp. 3723–3732.
  33. A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017, pp. 2107–2116.
  34. R. Shu, H. Bui, H. Narui, and S. Ermon. A DIRT-T approach to unsupervised domain adaptation. In ICLR, 2018.
  35. P. Software. Poser Pro 2014. https://www.renderosity.com/mod/bcs/poser-pro-2014/102000 (accessed on 10th Nov. 2019).
  36. E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell. Adversarial discriminative domain adaptation. In CVPR, 2017.
  37. H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo. Mind the class weight bias: weighted maximum mean discrepancy for unsupervised domain adaptation. In CVPR.
  38. J. Zhang, Z. Ding, W. Li, and P. Ogunbona. Importance weighted adversarial nets for partial domain adaptation. In CVPR.
  39. K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang. Domain adaptation under target and conditional shift. In ICML, 2013, pp. 819–827.
  40. X. Zhang, Y. Wong, M. S. Kankanhalli, and W. Geng. Unsupervised domain adaptation for 3D human pose estimation. In Proceedings of the 27th ACM International Conference on Multimedia, 2019.
  41. J. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
