Image2StyleGAN++: How to Edit the Embedded Images?
Abstract
We propose Image2StyleGAN++, a flexible image editing framework with many applications.
Our framework extends the recent Image2StyleGAN [1] in three ways.
First, we introduce noise optimization as a complement to the latent space embedding. Our noise optimization can restore high-frequency features in images and thus significantly improves the quality of reconstructed images, e.g. a big increase of PSNR from 20 dB to 45 dB.
Second, we extend the global latent space embedding to enable local embeddings.
Third, we combine embedding with activation tensor manipulation to perform high quality local edits along with global semantic edits on images.
Such edits motivate various high quality image editing applications, e.g. image reconstruction, image inpainting, image crossover, local style transfer, image editing using scribbles, and attribute level feature transfer.
Examples of the edited images are shown across the paper for visual inspection.
1 Introduction
Recent GANs [18, 6] demonstrated that synthetic images can be generated with very high quality. This motivates research into embedding algorithms that embed a given photograph into a GAN latent space. Such embedding algorithms can be used to analyze the limitations of GANs [5], do image inpainting [8, 39, 40, 36], local image editing [42, 16], global image transformations such as image morphing and expression transfer [1], and few-shot video generation [35, 34].
In this paper, we propose to extend a very recent embedding algorithm, Image2StyleGAN [1]. In particular, we would like to improve this previous algorithm in three aspects. First, we noticed that the embedding quality can be further improved by including Noise space optimization in the embedding framework. The key insight here is that stable Noise space optimization can only be conducted if the optimization is done sequentially with the W+ space, and not jointly. Second, we would like to improve the capabilities of the embedding algorithm to increase local control over the embedding. One way to improve local control is to include masks with undefined content in the embedding algorithm. The goal of the embedding algorithm should be to find a plausible embedding for everything outside the mask, while filling in reasonable semantic content in the masked pixels. Similarly, we would like to provide the option of approximate embeddings, where the specified pixel colors are only a guide for the embedding. In this way, we aim to achieve high quality embeddings that can be controlled by user scribbles. In the third technical part of the paper, we investigate the combination of the embedding algorithm and direct manipulations of the activation maps (called activation tensors in our paper).
Our main contributions are:

We propose Noise space optimization to restore the high frequency features in an image that cannot be reproduced by latent space optimization of GANs alone. The resulting images are very faithful reconstructions, with a PSNR of up to 45 dB compared to about 20 dB for the previously best results.

We propose an extended embedding algorithm into the W+ space of StyleGAN that allows for local modifications such as missing regions and locally approximate embeddings.

We investigate the combination of embedding and activation tensor manipulation to perform high quality local edits along with global semantic edits on images.

We apply our novel framework to multiple image editing and manipulation applications. The results show that the method can be successfully used to develop state-of-the-art image editing software.
2 Related Work
Generative Adversarial Networks (GANs) [13, 29] are one of the most popular generative models and have been successfully applied to many computer vision applications, e.g. object detection [22],
texture synthesis [21, 37, 31],
image-to-image translation [15, 43, 28, 24] and video generation [33, 32, 35, 34].
Backing these applications are the massive improvements on GANs in terms of architecture [18, 6, 28, 15], loss function design [25, 2], and regularization [27, 14].
On the bright side, such improvements significantly boost the quality of the synthesized images.
To date, the two highest quality GANs are StyleGAN [18] and BigGAN [6].
Between them, StyleGAN produces excellent results for unconditional image synthesis tasks, especially on face images; BigGAN produces the best results for conditional image synthesis tasks (e.g. ImageNet [9]).
On the dark side, these improvements make the training of GANs so expensive that nowadays it is almost a privilege of wealthy institutions to compete for the best performance.
As a result, methods built on pretrained generators have recently started to attract attention.
In the following, we discuss previous work on two such approaches: embedding images into a GAN latent space and the manipulation of GAN activation tensors.
Latent Space Embedding.
The embedding of an image into the latent space is a long-standing topic in both machine learning and computer vision.
In general, the embedding can be implemented in two ways: i) passing the input image through an encoder neural network (e.g. the Variational Auto-Encoder [19]); ii) optimizing a random initial latent code to match the input image [41, 7].
Between them, the first approach dominated for a long time.
Although it has an inherent difficulty generalizing beyond the training dataset, it produces higher quality results than the naive latent code optimization methods [41, 7].
Recently, however, Abdal et al. [1] obtained excellent embedding results by optimizing the latent codes in an enhanced latent space instead of the initial latent space.
Their method suggests a new direction for various image editing applications and makes the second approach interesting again.
Activation Tensor Manipulation.
With fixed neural network weights, the expression power of a generator can be fully utilized by manipulating its activation tensors.
Based on this observation, Bau et al. investigated what a GAN can and cannot generate by locating and manipulating relevant neurons in the activation tensors [4, 5].
Built on the understanding of how an object is “drawn” by the generator, they further designed a semantic image editing system that can add, remove or change the appearance of an object in an input image [3].
Concurrently, Frühstück et al. [11] investigated the potential of activation tensor manipulation in image blending. Observing that boundary artifacts can be eliminated by cropping and combining activation tensors at early layers of a generator, they proposed an algorithm to create large-scale texture maps of hundreds of megapixels by combining outputs of GANs trained at a lower resolution.
3 Overview
Our paper is structured as follows. First, we describe an extended version of the Image2StyleGAN [1] embedding algorithm (See Sec. 4). We propose two novel modifications:
1) to enable local edits, we integrate various spatial masks into the optimization framework. Spatial masks enable embeddings of incomplete images with missing values and embeddings of images with approximate color values such as user scribbles. In addition to spatial masks, we explore layer masks that restrict the embedding into a set of selected layers. The early layers of StyleGAN [18] encode content and the later layers control the style of the image. By restricting embeddings into a subset of layers we can better control what attributes of a given image are extracted.
2) to further improve the embedding quality, we optimize for an additional group of variables that control additive noise maps. These noise maps encode high frequency details and enable embedding with very high reconstruction quality.
Second, we explore multiple operations to directly manipulate activation tensors (See Sec. 5). We mainly explore spatial copying, channel-wise copying, and averaging.
Interesting applications can be built by combining multiple embedding steps and direct manipulation steps. As a stepping stone towards building such applications, we describe in Sec. 6 common building blocks that consist of specific settings of the extended optimization algorithm.
4 An Extended Embedding Algorithm
We implement our embedding algorithm as a gradientbased optimization that iteratively updates an image starting from some initial latent code.
The embedding is performed into two spaces using two groups of variables: the semantically meaningful W+ space and a Noise space encoding high-frequency details. The corresponding groups of variables we optimize for are w and n, respectively.
The inputs to the embedding algorithm are target RGB images x and y (they can also be the same image), and up to three spatial masks (M_p, M_m, and M_s).
Algorithm 1 is the generic embedding algorithm used in the paper.
4.1 Objective Function
Our objective function consists of three different types of loss terms, i.e. the pixel-wise MSE loss, the perceptual loss [17, 10], and the style loss [12].
L(w, n) = λ_style · L_style(M_s ⊙ G(w, n), M_s ⊙ y) + λ_percept · L_percept(M_p ⊙ G(w, n), M_p ⊙ x) + λ_mse · ‖M_m ⊙ (G(w, n) − x)‖²₂   (1)
where M_p, M_m, and M_s denote the spatial masks gating the perceptual, pixel-wise MSE, and style loss terms respectively, ⊙ denotes the Hadamard product, G is the StyleGAN generator, n are the Noise space variables, w are the W+ space variables, L_style denotes the style loss computed from a layer of an ImageNet-pretrained VGG-16 network [30], and L_percept is the perceptual loss defined in Image2StyleGAN [1]. Note that the perceptual loss is computed for four layers of the VGG network. Therefore, the target image needs to be downsampled to match the resolutions of the corresponding VGG-16 layers in the computation of the loss function.
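To make the role of the Hadamard-product masks concrete, the following is a toy numpy sketch of a mask-gated pixel-wise MSE term. The function name `masked_mse` and the tiny 4×4 "images" are illustrative inventions, not the paper's implementation; the point is only that pixels outside the mask contribute nothing to the loss.

```python
import numpy as np

def masked_mse(gen, target, mask):
    """Pixel-wise MSE restricted to a spatial mask via the Hadamard product."""
    diff = mask * (gen - target)  # element-wise (Hadamard) product zeroes out unmasked pixels
    return float(np.sum(diff ** 2) / diff.size)

# Toy 4x4 images: the "generator output" differs from the target only
# in the right half, while the mask selects the left half.
target = np.zeros((4, 4))
gen = np.zeros((4, 4))
gen[:, 2:] = 1.0                 # error confined to the right half
left_half = np.zeros((4, 4))
left_half[:, :2] = 1.0           # mask covers the error-free left half

loss_masked = masked_mse(gen, target, left_half)          # ignores the error
loss_full = masked_mse(gen, target, np.ones((4, 4)))      # sees the error
```

With the mask on the error-free region the loss is exactly zero, while the all-ones mask recovers the ordinary pixel-wise MSE.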
4.2 Optimization Strategies
Optimization of the variables w and n is not a trivial task. Since only w encodes semantically meaningful information, we need to ensure that as much information as possible is encoded in w, with only high-frequency details in the Noise space.
The first possible approach is the joint optimization of both groups of variables w and n. Fig. 2 (b) shows the result using the perceptual and the pixel-wise MSE loss. We can observe that many details are lost and were replaced with high-frequency image artifacts. This is due to the fact that the perceptual loss is incompatible with optimizing noise maps. Therefore, a second approach is to use the pixel-wise MSE loss only (see Fig. 2 (c)). Although the reconstruction is almost perfect, the representation is not suitable for image editing tasks. In Fig. 2 (d), we show that too much of the image information is stored in the noise layers, by resampling the noise variables n. We would expect to obtain another very good, but slightly noisy embedding. Instead, we obtain a very low quality embedding. We also show the result of jointly optimizing the variables w and n using the perceptual and pixel-wise MSE loss for w and the pixel-wise MSE loss only for the noise variables n. Fig. 2 (e) shows that the reconstructed image is not of high perceptual quality; the PSNR score decreases to 33.3 dB. We also tested these optimizations on other images. Based on our results, we do not recommend using joint optimization.
The second strategy is an alternating optimization of the variables w and n. In Fig. 3, we show the result of optimizing w while keeping n fixed and subsequently optimizing n while keeping w fixed. In this way, most of the information is encoded in w, which leads to a semantically meaningful embedding.
Performing another iteration of optimizing w (Fig. 3 (d)) reveals a smoothing effect on the image, and the PSNR drops from 39.5 dB to 20 dB. Subsequent Noise space optimization does not improve the PSNR of the images. Hence, repeated alternating optimization does not improve the quality of the image further. In summary, we recommend alternating optimization in which each set of variables is optimized only once: first we optimize w, then n.
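The recommended w-then-n schedule can be illustrated with a toy numpy sketch. The scalar "generator" below is a deliberately crude stand-in (a flat image controlled by w, plus a per-pixel residual n), not StyleGAN; it only demonstrates that optimizing w first captures the coarse content and a subsequent n step absorbs the high-frequency remainder.

```python
import numpy as np

rng = np.random.default_rng(0)
# Target: a flat 0.7 image plus small high-frequency "detail".
target = 0.7 * np.ones((8, 8)) + 0.05 * rng.standard_normal((8, 8))

# Stand-in generator: scalar w renders a flat image, n adds per-pixel detail
# (loosely analogous to w in W+ and the additive noise maps n).
def G(w, n):
    return w * np.ones((8, 8)) + n

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Step 1: optimize w alone (n fixed at zero) by gradient descent on the MSE.
w, n = 0.0, np.zeros((8, 8))
for _ in range(200):
    grad_w = np.mean(2.0 * (G(w, n) - target))  # d(MSE)/dw
    w -= 0.1 * grad_w
mse_after_w = mse(G(w, n), target)              # coarse fit: residual detail remains

# Step 2: optimize n with w fixed; for this toy model the exact minimizer
# is simply the remaining residual.
n = target - w * np.ones((8, 8))
mse_after_n = mse(G(w, n), target)              # near-perfect reconstruction
```

The first stage converges to the mean intensity of the target, and the noise stage then recovers the residual detail, mirroring the paper's sequential strategy.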
5 Activation Tensor Manipulations
Due to the progressive architecture of StyleGAN, one can perform meaningful tensor operations at different layers of the network [11, 4]. We consider the following editing operations: spatial copying, averaging, and channel-wise copying. We define the activation tensor A_i as the output of the i-th layer of the network initialized with the variables of the embedded image. Given two such tensors A and B, spatial copying replaces high-dimensional pixels in A by copying them from B. Averaging forms a linear combination λA + (1 − λ)B. Channel-wise copying creates a new tensor by copying selected channels from A and the remaining channels from B.
In our tests we found that spatial copying works somewhat better than averaging and channel-wise copying.
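The three tensor operations can be sketched directly in numpy. The channels-first (C × H × W) layout and the particular masks below are assumptions for illustration; the operations themselves are exactly the spatial copy, linear combination, and channel-wise copy described above.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((16, 8, 8))  # activation tensor: channels x height x width
B = rng.standard_normal((16, 8, 8))

# Spatial copying: a binary spatial mask selects which "high-dimensional
# pixels" (all channels at a spatial location) come from B instead of A.
mask = np.zeros((8, 8))
mask[:, 4:] = 1.0                    # right half comes from B
copied = A * (1 - mask) + B * mask   # mask broadcasts over the channel axis

# Averaging: per-element linear combination lambda*A + (1 - lambda)*B.
lam = 0.3
averaged = lam * A + (1 - lam) * B

# Channel-wise copying: selected channels come from B, the rest from A.
chan = np.zeros(16, dtype=bool)
chan[:4] = True                      # first 4 channels taken from B
channel_copied = np.where(chan[:, None, None], B, A)
```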
6 Frequently Used Building Blocks
We identify four fundamental building blocks that are used in the applications described in Sec. 7. While the terms of the loss function can be controlled by the spatial masks (M_p, M_m, M_s), we also use binary masks over the variables w and n to indicate which subset of variables should be optimized during an optimization process. For example, we might restrict the updates to the variables corresponding to the first layers only. These variable masks contain 1s for variables that should be updated and 0s for variables that should remain constant. In addition to the listed parameters, all building blocks need initial values for w and n. For all experiments, we use a 32 GB Nvidia V100 GPU.
Masked optimization:
This function optimizes w, leaving n constant. The loss weights and spatial masks of the loss function L in Eq. 1 are set so that the masked pixel-wise MSE and perceptual terms are active.
We denote the function as:
w* = argmin_w L(w, n)   (2)
where the argmin is taken only over the subset of the W+ variables selected by the binary variable mask. We either use Adam [20] with learning rate 0.01 or gradient descent with learning rate 0.8, depending on the application. Some common settings for Adam are: β₁ = 0.9, β₂ = 0.999, and ε = 1e-8. In Sec. 7, we use Adam unless specified.
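For reference, the Adam update used by these building blocks can be written out in a few lines. This is a generic, self-contained sketch of the Adam rule from [20] with the bias-corrected moment estimates (the quadratic objective is a placeholder for the real loss L):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moments."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy objective f(theta) = (theta - 3)^2 starting from 0.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2.0 * (theta - 3.0)
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
```

With learning rate 0.01 the iterate settles near the minimizer θ = 3; a variable mask, as used in the building blocks, would simply multiply `grad` element-wise before the update.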
Masked Noise optimization:
This function optimizes n, leaving w constant. The Noise space spans the resolutions from 4×4 to 1024×1024; in total there are 18 noise maps, two for each resolution.
We set the parameters of the loss function L in Eq. 1 so that only the masked pixel-wise MSE term is active, since the perceptual loss is incompatible with optimizing noise maps (see Sec. 4.2).
We denote the function as:
n* = argmin_n L(w, n)   (3)
For this optimization, we use Adam with learning rate 5, β₁ = 0.9, β₂ = 0.999, and ε = 1e-8. Note that the learning rate is very high.
Masked style transfer:
This function optimizes w to achieve a given target style defined by a style image y. We set the parameters of the loss function L in Eq. 1 so that the style loss term is active, with the style mask M_s restricting the region from which the style is computed.
We denote the function as:
w* = argmin_w L(w, n)   (4)
where the optimization is over the whole W+ space. For this optimization, we use Adam with learning rate 0.01, β₁ = 0.9, β₂ = 0.999, and ε = 1e-8.
Masked activation tensor operation:
This function describes an activation tensor operation. Here, we represent the generator as a function of the W+ space variable w, the Noise space variable n, and an input activation tensor. The operation is represented by:
G(w, n, A_x ⊙ M_x + A_y ⊙ M_y)   (5)
where A_x and A_y are the activation tensors corresponding to images x and y at the chosen layer, and M_x and M_y are the masks downsampled using nearest neighbour interpolation to match the resolution of the activation tensors.
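The mask downsampling and the masked blend of two activation tensors can be sketched as follows. The function name `downsample_nearest`, the 16×16 mask, and the 8-channel 4×4 layer are assumptions for illustration; only the nearest-neighbour resampling and the mask-weighted combination are taken from the text.

```python
import numpy as np

def downsample_nearest(mask, out_h, out_w):
    """Nearest-neighbour downsampling of a spatial mask to a layer's resolution."""
    h, w = mask.shape
    rows = np.arange(out_h) * h // out_h   # nearest source row per output row
    cols = np.arange(out_w) * w // out_w   # nearest source column per output column
    return mask[np.ix_(rows, cols)]

# A 16x16 half-plane mask, downsampled to a hypothetical 4x4 activation layer.
mask = np.zeros((16, 16))
mask[:, 8:] = 1.0
small = downsample_nearest(mask, 4, 4)

# Blend two activation tensors at that layer using the downsampled mask.
rng = np.random.default_rng(2)
A = rng.standard_normal((8, 4, 4))  # channels x height x width
B = rng.standard_normal((8, 4, 4))
blended = A * (1 - small) + B * small
```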
7 Applications
In the following we describe various applications enabled by our framework.
7.1 Improved Image Reconstruction
As shown in Fig. 4, any image can be embedded by optimizing for the variables w and n. Here we describe the details of this embedding (see Alg. 2). First, we initialize the variables:
w is set to the mean face latent code [18] or to a random code, depending on whether the image to embed is a face or a non-face, and n is sampled from a standard normal distribution [18].
Second, we apply the masked optimization without using spatial masks or variable masks; that is, all masks are set to all-ones, and x is the target image we try to reconstruct. Third, we perform the masked noise optimization, again without making use of masks.
The reconstructed images are of high fidelity. The PSNR range of 39 to 45 dB provides an insight into how expressive the Noise space of StyleGAN is. Unlike the W+ space, the Noise space is used for the spatial reconstruction of high-frequency features. We use 5000 iterations of the masked optimization and 3000 iterations of the masked noise optimization to get PSNR scores of 44 to 45 dB. Additional iterations did not improve the results in our tests.
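The PSNR numbers quoted throughout the paper follow the standard definition; a minimal sketch (assuming images scaled to [0, 1], so the peak value is 1):

```python
import numpy as np

def psnr(img, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# A uniform error of 0.01 per pixel gives MSE = 1e-4, i.e. 40 dB.
ref = np.zeros((8, 8))
noisy = ref + 0.01
score = psnr(noisy, ref)
```

On this scale, the jump from about 20 dB to 45 dB reported above corresponds to a roughly 300-fold reduction in mean squared error.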


7.2 Image Crossover
We define the image crossover operation as copying parts from a source image into a target image and blending the boundaries.
As initialization, we embed the target image to obtain the code w. We then perform the masked optimization with blurred masks to embed the regions of the two images that contribute to the final image. Blurred masks are obtained by convolving the binary mask with a Gaussian filter of suitable size. Then, we perform the noise optimization. Details are provided in Alg. 3.
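The blurred-mask construction can be sketched with a separable Gaussian filter in numpy. The kernel size and sigma below are hypothetical ("of suitable size" in the text); the essential behaviour is that the hard 0/1 boundary becomes a smooth transition, which is what lets the loss terms blend the two regions.

```python
import numpy as np

def gaussian_kernel(size=9, sigma=2.0):
    x = np.arange(size) - size // 2
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()               # normalize so flat regions are preserved

def blur_mask(mask, size=9, sigma=2.0):
    """Blur a binary mask with a separable Gaussian filter (rows, then columns)."""
    k = gaussian_kernel(size, sigma)
    out = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, mask)
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, out)
    return out

# Hard half-plane mask -> soft transition band around column 16.
mask = np.zeros((32, 32))
mask[:, 16:] = 1.0
soft = blur_mask(mask)
```

Far from the boundary the blurred mask stays 0 or 1; near it, values interpolate smoothly between the two regions.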
7.3 Image Inpainting
In order to perform a semantically meaningful inpainting, we embed into the early layers of the W+ space to predict the missing content and into the later layers to maintain color consistency. The input is a defective image, and we use a layer mask whose value is 1 for the first 9 layers (1 to 9) of the W+ space. As initialization, we set w to the mean face latent code [18]. A spatial mask describes the defective region. Using these parameters, we perform the masked optimization. Then we perform the masked noise optimization using a slightly larger blurred mask for blending. Other notations are the same as described in Sec. 7.1. Alg. 4 shows the details of the algorithm. We perform 200 steps of the gradient descent optimizer for the masked optimization and 1000 iterations of the masked noise optimization. Fig. 6 shows example inpainting results. The results are comparable with the current state of the art, Partial Convolution [23]. The Partial Convolution method frequently suffers from regular artifacts (see Fig. 6, third column); these artifacts are not present in our method. In Fig. 7 we show different inpainting solutions for the same image, achieved by using different initializations of w, each being the mean face latent code plus an offset sampled independently from a uniform distribution. The initialization mainly affects layers 10 to 16, which are not altered during optimization. Multiple inpainting solutions cannot be computed with existing state-of-the-art methods.
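The layer-restricted updates used for inpainting can be sketched with a binary mask over the w variables. The 18 × 512 shape of the w+ code is an assumption based on StyleGAN's 18-layer architecture mentioned above; `layer_mask` is a hypothetical helper name.

```python
import numpy as np

LAYERS, DIM = 18, 512  # assumed StyleGAN w+ code: one 512-d vector per layer

def layer_mask(active_layers):
    """Binary mask over the w+ variables: 1 for layers to update, 0 to freeze."""
    m = np.zeros((LAYERS, DIM))
    m[list(active_layers)] = 1.0
    return m

# Inpainting-style setting: update only the first 9 layers (indices 0-8),
# freezing the style-dominated later layers.
mask = layer_mask(range(9))

# A masked gradient step then leaves the frozen layers untouched.
w = np.zeros((LAYERS, DIM))
grad = np.ones((LAYERS, DIM))
w_new = w - 0.01 * mask * grad
```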
7.4 Local Edits using Scribbles
Another application is performing semantic local edits guided by user scribbles. We show that simple scribbles can be converted to photorealistic edits by embedding into the first 4 to 6 layers of the W+ space (see Fig. 8). This enables us to do local edits without training a network. The input is a scribble image, and we use a layer mask whose value is 1 for the first 4, 5, or 6 layers of the W+ space. As initialization, we set w to the code of the embedded image without scribbles. We perform the masked optimization using these parameters, and then the masked noise optimization. Other notations are the same as described in Sec. 7.1. Alg. 5 shows the details of the algorithm. We perform 1000 iterations of the masked optimization using Adam with a learning rate of 0.1, and then 1000 steps of the masked noise optimization to output the final image.
7.5 Local Style Transfer
Local style transfer modifies a region in the input image to transform it to the style defined by a style reference image. First, we embed the image in the W+ space to obtain the code w. Then we apply the masked optimization along with the masked style transfer using a blurred mask. Finally, we perform the masked noise optimization to output the final image. Alg. 6 shows the details of the algorithm. Results for the application are shown in Fig. 9. We perform 1000 steps of the masked optimization together with the masked style transfer, and then 1000 iterations of the masked noise optimization.
7.6 Attribute-level feature transfer
We extend our work to another application using tensor operations on images embedded in the W+ space. In this application we manipulate the activation tensors at the output of the fourth layer of StyleGAN. We feed the generator with the latent codes of two images x and y and store the outputs of the fourth layer as intermediate activation tensors A_x and A_y.
A mask specifies which values to copy from A_x and which from A_y. In Fig. 10, we show results of the operation. A design parameter of this application is which style code to use for the remaining layers. In the shown example, the first image is chosen to provide the style. Notice, in column 2 of Fig. 10, that in spite of the different alignment of the two faces and objects, the images are blended well. We also show blending results for the LSUN-car and LSUN-bedroom datasets. Hence, unlike global edits such as image morphing, style transfer, and expression transfer [1], here different parts of the image can be edited independently and the edits are localized. Moreover, we show a video in the supplementary material demonstrating that other semantic edits, e.g. masked image morphing, can be performed on such images by linear interpolation of the code of one image at a time.
8 Conclusion
We proposed Image2StyleGAN++, a powerful image editing framework built on the recent Image2StyleGAN.
Our framework is motivated by three key insights: first, high frequency image features are captured by the additive noise maps used in StyleGAN, which helps to improve the quality of reconstructed images;
second, local edits are enabled by including masks in the embedding algorithm, which greatly increases the capability of the proposed framework;
third, a variety of applications can be created by combining embedding with activation tensor manipulation.
From the high quality results presented in this paper, it can be concluded that our Image2StyleGAN++ is a promising framework for general image editing.
For future work, in addition to static images, we aim to extend our framework to process and edit videos.
Acknowledgement This work was supported by the KAUST Office of Sponsored Research (OSR) under Award No. OSR-CRG2018-3730.
References

[1] (2019) Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the IEEE International Conference on Computer Vision, pp. 4432–4441.

[2] (2017) Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Vol. 70, pp. 214–223.

[3] (2019) Semantic photo manipulation with a generative image prior. ACM Transactions on Graphics (Proceedings of ACM SIGGRAPH) 38 (4).

[4] (2019) GAN dissection: Visualizing and understanding generative adversarial networks. In Proceedings of the International Conference on Learning Representations (ICLR).

[5] (2019) Seeing what a GAN cannot generate. In Proceedings of the International Conference on Computer Vision (ICCV).

[6] (2019) Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations.

[7] (2018) Inverting the generator of a generative adversarial network. IEEE Transactions on Neural Networks and Learning Systems.

[8] (2018) Patch-based image inpainting with generative adversarial networks. arXiv preprint arXiv:1803.07422.

[9] (2009) ImageNet: A large-scale hierarchical image database. In CVPR.

[10] (2016) Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pp. 658–666.

[11] (2019) TileGAN. ACM Transactions on Graphics 38 (4), pp. 1–11.

[12] (2016) Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414–2423.

[13] (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems.

[14] (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777.

[15] (2017) Image-to-image translation with conditional adversarial networks. In CVPR.

[16] (2019) SC-FEGAN: Face editing generative adversarial network with user's sketch and color. In The IEEE International Conference on Computer Vision (ICCV).

[17] (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision.

[18] (2018) A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948.

[19] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[20] (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

[21] (2016) Precomputed real-time texture synthesis with Markovian generative adversarial networks. In Computer Vision – ECCV 2016 – 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part III.

[22] (2017) Perceptual generative adversarial networks for small object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23] (2018) Image inpainting for irregular holes using partial convolutions. In ECCV, Lecture Notes in Computer Science, pp. 89–105.

[24] (2019) Few-shot unsupervised image-to-image translation. arXiv preprint.

[25] (2017) Least squares generative adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV).

[26] Microsoft Azure Face. Microsoft. https://azure.microsoft.com/en-us/services/cognitive-services/face/

[27] (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations.

[28] (2019) Semantic image synthesis with spatially-adaptive normalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.

[29] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

[30] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

[31] (2018) High quality facial surface and texture synthesis via generative adversarial networks. In European Conference on Computer Vision, pp. 498–513.

[32] (2018) MoCoGAN: Decomposing motion and content for video generation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33] (2016) Generating videos with scene dynamics. In Advances in Neural Information Processing Systems 29.

[34] (2019) Few-shot video-to-video synthesis. arXiv preprint arXiv:1910.12713.

[35] (2018) Video-to-video synthesis. In Advances in Neural Information Processing Systems (NeurIPS).

[36] (2019) Detecting overfitting of deep generative networks via latent recovery. arXiv preprint arXiv:1901.03396.

[37] (2018) TextureGAN: Controlling deep image synthesis with texture patches. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38] (2018) Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589.

[39] (2018) Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5505–5514.

[40] (2018) Free-form image inpainting with gated convolution. arXiv preprint arXiv:1806.03589.

[41] (2016) Generative visual manipulation on the natural image manifold. In ECCV, Lecture Notes in Computer Science, pp. 597–613.

[42] (2016) Generative visual manipulation on the natural image manifold. In Proceedings of the European Conference on Computer Vision (ECCV).

[43] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on.
9 Additional Results
9.1 Image Inpainting
To evaluate the results quantitatively, we use three standard metrics, SSIM, MSE and PSNR, to compare our method with the state-of-the-art Partial Convolution [23] and Gated Convolution [38] methods.
As different methods produce outputs at different resolutions, we bilinearly interpolate the output images to test the methods at three different resolutions. We use masks (Fig. 11) and ground truth images (Fig. 12) to create defective images (i.e. images with missing regions) for the evaluation. These masks and images are chosen to make inpainting a challenging task: i) the masks are selected to contain very large missing regions, up to half of an image; ii) the ground truth images are selected to be of high variety, covering different genders, ages, races, etc.
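The resolution-alignment step before computing the metrics can be sketched as follows; `bilinear_resize` is a minimal hypothetical implementation of bilinear resampling, not the code used to produce the reported numbers.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Minimal bilinear resampling of a single-channel image."""
    h, w = img.shape
    # Map output pixel centers back into source coordinates.
    ys = (np.arange(out_h) + 0.5) * h / out_h - 0.5
    xs = (np.arange(out_w) + 0.5) * w / out_w - 0.5
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 1)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    wy = np.clip(ys - y0, 0, 1)[:, None]   # vertical interpolation weights
    wx = np.clip(xs - x0, 0, 1)[None, :]   # horizontal interpolation weights
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

# A constant image stays constant under bilinear resampling.
flat = np.full((8, 8), 0.5)
up = bilinear_resize(flat, 16, 16)
```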
Table 1 shows the quantitative comparison results.
It can be observed that our method outperforms both Partial Convolution [23] and Gated Convolution [38] across all the metrics.
More importantly, the advantages of our method can be easily verified by visual inspection.
As Fig. 13 and Fig. 14 show, although previous methods (e.g. Partial Convolution) perform well when the missing region is small, both of them struggle when the missing region covers a significant area (e.g. half) of the image. Specifically, Partial Convolution fails when the mask covers half of the input image (Fig. 13); due to its relatively low model resolution, Gated Convolution can fill in the details of large missing regions, but at much lower quality compared to the proposed method (Fig. 14).
In addition, our method is flexible and can generate different inpainting results (Fig. 15), which cannot be fulfilled by any of the abovementioned methods. All our inpainting results are of high perceptual quality.
Limitations
Although better than the two stateoftheart methods, our inpainting results still leave room for improvement.
For example, in Fig. 13, the lighting condition (first row), age (second row) and skin color (third and last rows) are not reproduced that well.
We plan to address these issues in future work.