Deep Image Prior
Abstract
Deep convolutional networks have become a popular tool for image generation and restoration. Generally, their excellent performance is imputed to their ability to learn realistic image priors from a large number of example images. In this paper, we show that, on the contrary, the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning. In order to do so, we show that a randomly-initialized neural network can be used as a handcrafted prior with excellent results in standard inverse problems such as denoising, super-resolution, and inpainting. Furthermore, the same prior can be used to invert deep neural representations to diagnose them, and to restore images based on flash-no flash input pairs.
Apart from its diverse applications, our approach highlights the inductive bias captured by standard generator network architectures. It also bridges the gap between two very popular families of image restoration methods: learning-based methods using deep convolutional networks and learning-free methods based on handcrafted image priors such as self-similarity. (Code and supplementary material are available at https://dmitryulyanov.github.io/deep_image_prior)
1 Introduction
Deep convolutional neural networks (ConvNets) currently set the state-of-the-art in inverse image reconstruction problems such as denoising [4, 19] or single-image super-resolution [18, 28, 17]. ConvNets have also been used with great success in more “exotic” problems such as reconstructing an image from its activations within certain deep networks or from its HOG descriptor [7]. More generally, ConvNets with similar architectures are nowadays used to generate images using approaches such as generative adversarial networks [10], variational autoencoders [15], and direct pixel-wise error minimization [8].
State-of-the-art ConvNets for image restoration and generation are almost invariably trained on large datasets of images. One may thus assume that their excellent performance is due to their ability to learn realistic image priors from data. However, learning alone is insufficient to explain the good performance of deep networks. For instance, the authors of [32] recently showed that the same image classification network that generalizes well when trained on genuine data can also overfit when presented with random labels. Thus, generalization requires the structure of the network to “resonate” with the structure of the data. However, the nature of this interaction remains unclear, particularly in the context of image generation.
In this work, we show that, contrary to expectations, a great deal of image statistics are captured by the structure of a convolutional image generator rather than by any learned capability. This is particularly true for the statistics required to solve various image restoration problems, where the image prior is required to integrate information lost in the degradation processes.
To show this, we apply untrained ConvNets to the solution of several such problems. Instead of following the common paradigm of training a ConvNet on a large dataset of example images, we fit a generator network to a single degraded image. In this scheme, the network weights serve as a parametrization of the restored image. The weights are randomly initialized and fitted to maximize their likelihood given a specific degraded image and a task-dependent observation model.
We show that this very simple formulation is very competitive for standard image processing problems such as denoising, inpainting and super-resolution. This is particularly remarkable because no aspect of the network is learned from data; instead, the weights of the network are always randomly initialized, so that the only prior information is in the structure of the network itself. To the best of our knowledge, this is the first study that directly investigates the prior captured by deep convolutional generative networks independently of learning the network parameters from images.
In addition to standard image restoration tasks, we show an application of our technique to understanding the information contained within the activations of deep neural networks. For this, we consider the “natural pre-image” technique of [20], whose goal is to characterize the invariants learned by a deep network by inverting it on the set of natural images. We show that an untrained deep convolutional generator can be used to replace the surrogate natural prior used in [20] (the TV norm) with dramatically improved results. Since the new regularizer, like the TV norm, is not learned from data but is entirely handcrafted, the resulting visualizations avoid potential biases arising from the use of powerful learned regularizers [7].
2 Method
Deep networks are applied to image generation by learning generator/decoder networks x = g(z) that map a random code vector z to an image x. This approach can be used to sample realistic images from a random distribution [10]. Furthermore, the distribution can be conditioned on a corrupted observation x_0 to solve inverse problems such as denoising [4] and super-resolution [6].
In this paper, we investigate the prior implicitly captured by the choice of a particular generator network structure, before any of its parameters are learned. We do so by interpreting the neural network as a parametrization x = f_θ(z) of an image x. Here z is a code tensor/vector and θ are the network parameters. The network itself alternates filtering operations such as convolution, upsampling and non-linear activation. In particular, most of our experiments are performed using an encoder-decoder (“hourglass”) architecture with as many as two million parameters (see Supplementary Material for details of all used architectures).
To demonstrate the power of this parametrization, we consider inverse tasks such as denoising, super-resolution and inpainting. These can be expressed as energy minimization problems of the type

    x* = argmin_x E(x; x_0) + R(x),    (1)

where E(x; x_0) is a task-dependent data term, x_0 the noisy/low-resolution/occluded image, and R(x) a regularizer.
The choice of data term E(x; x_0) is dictated by the application and will be discussed later. The choice of regularizer, which usually captures a generic prior on natural images, is more difficult and is the subject of much research. As a simple example, R(x) may be the Total Variation (TV) of the image, which encourages solutions to contain uniform regions. In this work, we replace the regularizer R(x) with the implicit prior captured by the neural network, as follows:

    θ* = argmin_θ E(f_θ(z); x_0),    x* = f_{θ*}(z).    (2)
The minimizer θ* is obtained using an optimizer such as gradient descent, starting from a random initialization of the parameters. Given a (local) minimizer θ*, the result of the restoration process is obtained as x* = f_{θ*}(z). Note that it is also possible to optimize over the code z, but we usually initialize it randomly and keep it fixed.
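To make the optimization scheme of (2) concrete, the following is a minimal numpy sketch we constructed, not the paper's actual method: the real generator is a deep convolutional hourglass network, whereas here a toy two-layer fully-connected network with a fixed random code z is fitted to an observation x_0 by hand-derived gradient descent. All sizes and the learning rate are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for f_theta: a two-layer network mapping a fixed random
# code z to a (flattened) image. The paper uses a deep convolutional
# encoder-decoder; this sketch only illustrates the optimization of (2).
n_code, n_hidden, n_pix = 32, 64, 100
z = rng.standard_normal(n_code)                 # random code, kept fixed
W1 = 0.1 * rng.standard_normal((n_hidden, n_code))
W2 = 0.1 * rng.standard_normal((n_pix, n_hidden))
x0 = rng.standard_normal(n_pix)                 # degraded observation

def forward(W1, W2, z):
    h = np.maximum(W1 @ z, 0.0)                 # ReLU activations
    return W2 @ h, h

x, h = forward(W1, W2, z)
mse_init = float(np.mean((x - x0) ** 2))

lr = 1e-2
for _ in range(500):                            # fixed iteration budget
    x, h = forward(W1, W2, z)
    r = x - x0                                  # grad of ||x - x0||^2 (up to a constant)
    gh = W2.T @ r                               # backprop through the output layer
    W2 -= lr * np.outer(r, h)
    W1 -= lr * np.outer(gh * (h > 0), z)        # backprop through the ReLU

x_star, _ = forward(W1, W2, z)                  # restored image x* = f_{theta*}(z)
mse_final = float(np.mean((x_star - x0) ** 2))
```

Note that only the parameters θ = (W1, W2) are updated; the code z stays fixed, exactly as in the formulation above.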
In terms of (1), the prior defined by (2) is an indicator function: R(x) = 0 for all images that can be produced from z by a deep ConvNet of a certain architecture, and R(x) = +∞ for all other signals. Since no aspect of the network is pre-trained from data, such a deep image prior is effectively handcrafted, just like the TV norm. We show that this handcrafted prior works very well for various image restoration tasks.
A parametrization with high noise impedance.
One may wonder why a high-capacity network can be used as a prior at all. Indeed, one may expect to be able to find parameters θ recovering any possible image x_0, including random noise, so that the network should not impose any restriction on the generated image. We now show that, while indeed almost any image can be fitted, the choice of network architecture has a major effect on how the solution space is searched by methods such as gradient descent. In particular, we show that the network resists “bad” solutions and descends much more quickly towards natural-looking images. The result is that minimizing (2) either results in a good-looking local optimum or, at least, the optimization trajectory passes near one.
In order to study this effect quantitatively, we consider the most basic reconstruction problem: given a target image x_0, we want to find the value of the parameters θ* that reproduces that image. This can be set up as the optimization of (2) using a data term comparing the generated image to x_0:

    E(x; x_0) = ||x − x_0||^2.    (3)

Plugging this into eq. (2) leads us to the optimization problem

    min_θ ||f_θ(z) − x_0||^2.    (4)
Figure 2 shows the value of the energy as a function of the gradient descent iterations for four different choices of the image x_0: 1) a natural image, 2) the same image plus additive noise, 3) the same image after randomly permuting the pixels, and 4) white noise. It is apparent from the figure that optimization is much faster for cases 1) and 2), whereas the parametrization presents significant “inertia” for cases 3) and 4).
Thus, although in the limit the parametrization can fit unstructured noise, it does so very reluctantly. In other words, the parametrization offers high impedance to noise and low impedance to signal. Therefore, for most applications, we restrict the number of iterations in the optimization process (2). The resulting prior then corresponds to a projection onto a reduced set of images that can be produced from z by ConvNets with parameters θ that are not too far from the random initialization θ_0.
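This impedance effect can be reproduced in a much simpler setting. Below is a small numpy experiment we constructed (not the paper's Figure 2 experiment) in which the “generator” is simply a Gaussian blur of the parameters, a crude stand-in for the smoothing effect of convolutional and upsampling layers; gradient descent fits a smooth, low-frequency target quickly, while offering high impedance to white noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256

# Symmetric Gaussian kernel; the "generator" maps parameters theta to the
# blurred signal K @ theta.
kernel = np.exp(-0.5 * (np.arange(-8, 9) / 3.0) ** 2)
kernel /= kernel.sum()

def generate(theta):
    return np.convolve(theta, kernel, mode="same")

def fit(target, steps=500, lr=1.0):
    """Gradient descent on ||generate(theta) - target||^2; returns loss curve."""
    theta = np.zeros(n)
    losses = []
    for _ in range(steps):
        r = generate(theta) - target
        theta -= lr * np.convolve(r, kernel[::-1], mode="same")  # K^T r
        losses.append(float(np.mean(r ** 2)))
    return losses

t = np.linspace(0.0, 4.0 * np.pi, n)
loss_smooth = fit(np.sin(t))                # low-frequency "natural" signal
loss_noise = fit(rng.standard_normal(n))    # white noise target

# The smooth target is fitted almost completely, while most of the noise
# energy remains: the parametrization shows "inertia" towards noise.
```

The mechanism is spectral: the blur passes low frequencies almost unchanged (fast convergence) but attenuates high frequencies, so the noise components are fitted extremely slowly, which is the same qualitative behaviour attributed above to convolutional generators.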
3 Applications
We now show experimentally how the proposed prior works for diverse image reconstruction problems. Due to space limitations, our evaluation in the main text is restricted to a few examples and numbers. The reader is therefore strongly encouraged to consult the Supplementary material [29] for a more extensive evaluation and extra details.
[Figure: reconstructions of a corrupted image after 100, 600, 2400 and 50K optimization iterations]
Denoising and generic reconstruction.
As our parametrization presents high impedance to image noise, it can be naturally used to filter out noise from an image. The aim of denoising is to recover a clean image x from a noisy observation x_0. Sometimes the degradation model is known: x_0 = x + ε, where ε follows a particular distribution. However, more often in blind denoising the noise model is unknown.
Here we work under the blindness assumption, but the method can be easily modified to incorporate information about the noise model. We use the exact same formulation as eqs. (3) and (4) and, given a noisy image x_0, recover a clean image x* = f_{θ*}(z) after substituting the minimizer θ* of eq. (4).
Our approach does not require a model for the image degradation process that it needs to revert. This allows it to be applied in a “plug-and-play” fashion to image restoration tasks, where the degradation process is complex and/or unknown and where obtaining realistic data for supervised training is highly problematic. We demonstrate this capability by several qualitative examples in fig. 4 and in the supplementary material, where our approach uses the quadratic energy (3) leading to formulation (4) to restore images degraded by complex and unknown compression artifacts. Figure 3 (top row) also demonstrates the applicability of the method beyond natural images (cartoon images in this case).
We evaluate our denoising approach on the standard dataset¹, consisting of 9 colored images with noise strength σ = 25. We achieve a PSNR of 29.22 after 1800 optimization steps. The score is improved up to 30.43 if we additionally average the restored images obtained in the last iterations (using an exponential sliding window). If averaged over two optimization runs our method further improves up to 31.00 PSNR. For reference, the scores for the two popular approaches CBM3D [5] and Non-local means [3] that do not require pre-training are 31.42 and 30.26 respectively.

¹ http://www.cs.tut.fi/~foi/GCFBM3D/index.html#ref_results
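For completeness, the evaluation metric and the sliding-window averaging trick mentioned above can be sketched as follows (a minimal sketch; the window parameter gamma is illustrative and not taken from the paper):

```python
import numpy as np

def psnr(x, y, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def ema_update(x_avg, x_t, gamma=0.99):
    """Exponential sliding-window average of the restored iterates:
    x_avg <- gamma * x_avg + (1 - gamma) * x_t."""
    return gamma * np.asarray(x_avg, float) + (1.0 - gamma) * np.asarray(x_t, float)
```

Averaging the outputs f_θ(z) over the last optimization iterations smooths out the residual iteration-to-iteration noise, which is consistent with the score improvement reported above.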
Superresolution.
The goal of super-resolution is to take a low-resolution (LR) image x_0 and an upsampling factor t, and generate a corresponding high-resolution (HR) version x. To solve this inverse problem, the data term in (2) is set to:

    E(x; x_0) = ||d(x) − x_0||^2,    (5)

where d(·) is a downsampling operator that decimates an image by a factor t. Hence, the problem is to find the HR image x that, when downsampled, is the same as the LR image x_0.
Super-resolution is an ill-posed problem because there are infinitely many HR images x that reduce to the same LR image x_0 (i.e. the operator d is far from injective). Regularization is required in order to select, among the infinitely many minimizers of (5), the most plausible ones.
Following eq. (2), we regularize the problem by considering the reparametrization x = f_θ(z) and optimizing the resulting energy w.r.t. θ. Optimization still uses gradient descent, exploiting the fact that both the neural network and the most common downsampling operators, such as Lanczos, are differentiable.
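The following numpy sketch illustrates the data term (5) and its gradient. As an assumption for simplicity, the downsampler d(·) here is t-fold average pooling rather than the Lanczos filter used in the paper; optimizing the pixels x directly (with no prior) drives the data term to zero while leaving the within-block detail completely unconstrained, which is exactly why regularization is needed:

```python
import numpy as np

t = 4  # downsampling factor

def downsample(x):
    """t-fold average pooling: a simple differentiable stand-in for d(.)."""
    H, W = x.shape
    return x.reshape(H // t, t, W // t, t).mean(axis=(1, 3))

def data_grad(x, x0):
    """Gradient of E(x; x0) = ||d(x) - x0||^2 w.r.t. x (up to a factor 2):
    the LR residual is upsampled back to the HR grid (adjoint of pooling)."""
    r = downsample(x) - x0
    return np.kron(r, np.ones((t, t))) / (t * t)

rng = np.random.default_rng(0)
x0 = rng.random((8, 8))          # low-resolution observation
x = np.zeros((32, 32))           # HR estimate, optimized directly (no prior)
for _ in range(2000):
    x -= 10.0 * data_grad(x, x0)

# The data term alone is driven to (near) zero, but any HR image with the
# same t x t block means is an equally good minimizer of (5).
residual = np.abs(downsample(x) - x0).max()
```

In the actual method this gradient is propagated further into the generator parameters θ through x = f_θ(z), rather than applied to raw pixels.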
We evaluate the super-resolution ability of our approach using the Set5 [2] and Set14 [31] datasets. We use a scaling factor of 4 to compare to other works, and show results with other scaling factors in [29]. We fix the number of optimization steps to be constant for every image.
A qualitative comparison with bicubic upsampling and the state-of-the-art learning-based methods SRResNet [18] and LapSRN [17] is presented in fig. 5. Our method can be fairly compared to bicubic, as both methods use no data other than the given low-resolution image. Visually, we approach the quality of learning-based methods that use the MSE loss. GAN-based [10] methods SRGAN [18] and EnhanceNet [27] (not shown in the comparison) intelligently hallucinate fine details of the image, which is impossible with our method that uses absolutely no information about the world of HR images.
We compute PSNRs using center crops of the generated images. Our method achieves 29.90 and 27.00 PSNR on the Set5 and Set14 datasets respectively. Bicubic upsampling gets a lower score of 28.43 and 26.05, while SRResNet has a PSNR of 32.10 and 28.53. While our method is still outperformed by learning-based approaches, it does considerably better than bicubic upsampling. Visually, it seems to close most of the gap between bicubic and state-of-the-art trained ConvNets (cf. fig. 1, fig. 5 and [29]).
Table 1: Inpainting comparison with convolutional sparse coding [24] on the dataset of [13] (PSNR, dB).

               Barbara  Boat   House  Lena   Peppers  C.man  Couple  Finger  Hill   Man    Montage
Papyan et al.  28.14    31.44  34.58  35.04  29.91    27.90  31.18   31.34   32.35  31.92  28.05
Ours           30.88    32.84  37.52  36.05  31.22    28.83  32.43   34.40   33.15  32.26  29.85
Inpainting.
In image inpainting, one is given an image x_0 with missing pixels in correspondence of a binary mask m ∈ {0, 1}^{H×W}; the goal is to reconstruct the missing data. The corresponding data term is given by

    E(x; x_0) = ||(x − x_0) ⊙ m||^2,    (6)

where ⊙ is Hadamard's product. The necessity of a data prior is obvious, as this energy is independent of the values of the missing pixels, which would therefore never change after initialization if the objective was optimized directly over pixel values x. As before, the prior is introduced by optimizing the data term w.r.t. the reparametrization (2).
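The observation that the data term says nothing about the missing pixels is easy to verify directly. In the following small numpy sketch (our illustration, with arbitrary image and mask sizes), the gradient of (6) with respect to the pixels vanishes identically on the masked-out region:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.random((16, 16))                        # observed image
m = (rng.random((16, 16)) > 0.5).astype(float)   # 1 = known pixel, 0 = missing

def data_grad(x):
    """Gradient of E(x; x0) = ||(x - x0) * m||^2 w.r.t. x (up to a factor 2)."""
    return (x - x0) * m

x = np.zeros_like(x0)
g = data_grad(x)
# The gradient is exactly zero wherever m = 0: direct optimization over
# pixel values would never update the missing region, so all information
# there must come from the prior, i.e. from the reparametrization x = f_theta(z).
```

Under the reparametrization, by contrast, every weight of the generator influences both known and missing pixels, so fitting the known region propagates structure into the holes.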
In the first example (fig. 7, top row) inpainting is used to remove text overlaid on an image. Our approach is compared to the method of [26] specifically designed for inpainting. We observe almost perfectly transparent inpainting, while for [26] the text mask remains visible in some regions.
Next, fig. 7 (bottom) considers inpainting with masks sampled according to a binary Bernoulli distribution. First, a mask is sampled to drop 50% of pixels at random. We compare our approach to the method of [24] based on convolutional sparse coding. To obtain results for [24], we first decompose the corrupted image into low- and high-frequency components, similarly to [11], and run their method on the high-frequency part. For a fair comparison, we use the version of their method where a dictionary is built using the input image (shown to perform better in [24]). The quantitative comparison on the standard dataset of [13] is given in table 1, showing a strong quantitative advantage of the proposed approach compared to convolutional sparse coding. In fig. 7 (bottom) we present a representative qualitative comparison with [24].
We also apply our method to inpainting of large holes. Being nontrainable, our method is not expected to work correctly for “highlysemantical” largehole inpainting (e.g. face inpainting). Yet, it works surprisingly well for other situations. We compare to a learningbased method of [14] in fig. 6. The deep image prior utilizes context of the image and interpolates the unknown region with textures from the known part. Such behaviour highlights the relation between the deep image prior and traditional selfsimilarity priors.
In fig. 8, we compare deep priors corresponding to several architectures. Our findings here (and in other similar comparisons) suggest that a deeper architecture is beneficial, and that skip connections, which work so well for recognition tasks such as semantic segmentation, are highly detrimental for the deep image prior.
[Figure 9: inversions of AlexNet layers conv1–fc8 obtained with the deep image prior, with the TV prior [20], and with a pre-trained deep inverting network [7]]
Natural preimage.
The natural pre-image method of [20] is a diagnostic tool to study the invariances of a lossy function, such as a deep network, that operates on natural images. Let Φ be the first several layers of a neural network trained to perform, say, image classification. The pre-image is the set Φ^{-1}(Φ(x_0)) = {x : Φ(x) = Φ(x_0)} of images that result in the same representation Φ(x_0). Looking at this set reveals which information is lost by the network, and which invariances are gained.

Finding pre-image points can be formulated as minimizing the data term

    E(x; x_0) = ||Φ(x) − Φ(x_0)||^2.
However, optimizing this function directly may find “artifacts”, i.e. non-natural images for which the behavior of the network is in principle unspecified and that can thus drive it arbitrarily. More meaningful visualizations can be obtained by restricting the pre-image to a set of natural images, called a natural pre-image in [20].
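The non-uniqueness of the pre-image is easy to see in a toy setting. In the following sketch (our illustration; a real Φ would be the first layers of a trained classification network), Φ is simple 4×4 average pooling, and two visibly different images share exactly the same representation, so the data term alone cannot distinguish them:

```python
import numpy as np

def phi(x):
    """Toy lossy representation: 4x4 average pooling (block means)."""
    H, W = x.shape
    return x.reshape(H // 4, 4, W // 4, 4).mean(axis=(1, 3))

rng = np.random.default_rng(0)
x0 = rng.random((16, 16))

# Another point of the pre-image set: perturb x0 by anything with zero
# block means, e.g. a high-frequency checkerboard. phi cannot tell them apart.
checker = 0.1 * (-1.0) ** (np.arange(16)[:, None] + np.arange(16)[None, :])
x1 = x0 + checker

rep_gap = np.abs(phi(x0) - phi(x1)).max()   # identical representations
pix_gap = np.abs(x0 - x1).max()             # visibly different images
```

Any image of the form x0 + (zero-block-mean perturbation) minimizes the data term equally well, which is why a prior, handcrafted or learned, is needed to pick a natural-looking member of the pre-image.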
In practice, finding points in the natural pre-image can be done by regularizing the data term similarly to the other inverse problems seen above. The authors of [20] prefer to use the TV norm, which is a weak natural image prior, but is relatively unbiased. On the contrary, papers such as [7] learn to invert a neural network from examples, resulting in better-looking reconstructions, which however may be biased towards the learned data-driven inversion prior. Here, we propose to use the deep image prior (2) instead. As this is handcrafted like the TV norm, it is not biased towards a particular training set. On the other hand, it results in inversions at least as interpretable as the ones of [7].
For evaluation, our method is compared to the ones of [21] and [7]. Figure 9 shows the results of inverting representations Φ obtained by considering progressively deeper subsets of AlexNet [16]: conv1, conv2, …, conv5, fc6, fc7, and fc8. Pre-images are found by optimizing the data term above using the deep image prior (2) as the structured regularizer.
As seen in fig. 9, our method results in dramatically improved image clarity compared to the simple TV norm. The difference is particularly remarkable for deeper layers such as fc6 and fc7, where the TV norm still produces noisy images, whereas the structured regularizer produces images that are often still interpretable. Our approach also produces more informative inversions than the learned prior of [7], which has a clear tendency to regress to the mean.
Flashno flash reconstruction.
While in this work we focus on single-image restoration, the proposed approach can be extended to the restoration of multiple images, e.g. for the task of video restoration. We therefore conclude the set of application examples with a qualitative example demonstrating how the method can be applied to perform restoration based on pairs of images. In particular, we consider flash-no flash image pair-based restoration [25], where the goal is to obtain an image of a scene with lighting similar to the no-flash image, while using the flash image as a guide to reduce the noise level.
In general, extending the method to more than one image is likely to involve some coordinated optimization over the input codes z, which for single-image tasks in our approach were most often kept fixed and random. In the case of flash-no flash restoration, we found that good restorations were obtained by using the denoising formulation (4), while using the flash image as the input z (in place of the random code vector). The resulting approach can be seen as a non-linear generalization of guided image filtering [12]. The results of the restoration are shown in fig. 10.
4 Related work
Our method is obviously related to the image restoration and synthesis methods based on learnable ConvNets referenced above. At the same time, it is as much related to an alternative group of restoration methods that avoid training on a hold-out set. This group includes methods based on joint modeling of groups of similar patches inside the corrupted image [3, 5, 9], which are particularly useful when the corruption process is complex and highly variable (e.g. spatially-varying blur [1]). Also in this group are methods based on fitting dictionaries to the patches of the corrupted image [22, 31], as well as methods based on convolutional sparse coding [30], which can also fit statistical models similar to shallow ConvNets to the reconstructed image [24]. The work [19] investigates a model that combines a ConvNet with self-similarity-based denoising and thus also bridges the two groups of methods, but still requires training on a hold-out set.
Overall, the prior imposed by deep ConvNets investigated in this work seems to be highly related to self-similarity-based and dictionary-based priors. Indeed, as the weights of the convolutional filters are shared across the entire spatial extent of the image, this ensures a degree of self-similarity of the individual patches that a generative ConvNet can potentially produce. The connections between ConvNets and convolutional sparse coding run even deeper and are investigated in [23] in the context of recognition networks, and more recently in [24], where a single-layer convolutional sparse coding is proposed for reconstruction tasks. The comparison of our approach with [24] (table 1 and fig. 7), however, suggests that using the deep ConvNet architectures popular in modern deep learning-based approaches may lead to more accurate restoration results, at least in some circumstances.
5 Discussion
We have investigated the role of the convolutional network architecture in the success of recent ConvNet-based image restoration methods. We have teased apart the contribution of the prior imposed by this architecture from the contribution of the information transferred from external images through learning. Along the way, we have shown that a simple approach of fitting randomly-initialized ConvNets to corrupted images works as a “Swiss knife” for restoration problems. The use of this “Swiss knife” does not require modeling of the degradation process or pre-training. Admittedly, the approach is computationally heavy (taking several minutes of GPU computation for a 512×512 image).
In many ways, our results go against the common narrative that attributes the recent successes of deep learning-based methods in imaging to the shift from using handcrafted priors to learning everything from data. It turns out that much of the success can also be attributed to switching from worse handcrafted priors to better handcrafted priors (hidden inside learnable deep ConvNets). This validates the importance of developing new deep learning architectures.
Acknowledgements
DU and VL are supported by the Ministry of Education and Science of the Russian Federation (grant 14.756.31.0001) and AV is supported by ERC grant 677195-IDIU.
References

[1] Y. Bahat, N. Efrat, and M. Irani. Non-uniform blind deblurring by reblurring. In Proc. CVPR, pages 3286–3294, 2017.

[2] M. Bevilacqua, A. Roumy, C. Guillemot, and M. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proc. BMVC, pages 1–10, 2012.

[3] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Proc. CVPR, volume 2, pages 60–65, 2005.

[4] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In Proc. CVPR, pages 2392–2399, 2012.

[5] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3D transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8):2080–2095, 2007.

[6] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In Proc. ECCV, pages 184–199, 2014.

[7] A. Dosovitskiy and T. Brox. Inverting convolutional networks with convolutional networks. In Proc. CVPR, 2016.

[8] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In Proc. CVPR, pages 1538–1546, 2015.

[9] D. Glasner, S. Bagon, and M. Irani. Super-resolution from a single image. In Proc. ICCV, pages 349–356, 2009.

[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Proc. NIPS, pages 2672–2680, 2014.

[11] S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, and L. Zhang. Convolutional sparse coding for image super-resolution. In Proc. ICCV, pages 1823–1831, 2015.

[12] K. He, J. Sun, and X. Tang. Guided image filtering. TPAMI, 35(6):1397–1409, 2013.

[13] F. Heide, W. Heidrich, and G. Wetzstein. Fast and flexible convolutional sparse coding. In Proc. CVPR, pages 5135–5143, 2015.

[14] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Globally and locally consistent image completion. ACM Transactions on Graphics (Proc. of SIGGRAPH), 36(4):107:1–107:14, 2017.

[15] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In Proc. ICLR, 2014.

[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, pages 1097–1105, 2012.

[17] W.-S. Lai, J.-B. Huang, N. Ahuja, and M.-H. Yang. Deep Laplacian pyramid networks for fast and accurate super-resolution. In Proc. CVPR, 2017.

[18] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In Proc. CVPR, 2017.

[19] S. Lefkimmiatis. Non-local color image denoising with convolutional neural networks. In Proc. CVPR, 2016.

[20] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In Proc. CVPR, 2015.

[21] A. Mahendran and A. Vedaldi. Visualizing deep convolutional neural networks using natural pre-images. IJCV, 2016.

[22] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19–60, 2010.

[23] V. Papyan, Y. Romano, and M. Elad. Convolutional neural networks analyzed via convolutional sparse coding. Journal of Machine Learning Research, 18(83):1–52, 2017.

[24] V. Papyan, Y. Romano, J. Sulam, and M. Elad. Convolutional dictionary learning via local processing. In Proc. ICCV, 2017.

[25] G. Petschnigg, R. Szeliski, M. Agrawala, M. F. Cohen, H. Hoppe, and K. Toyama. Digital photography with flash and no-flash image pairs. ACM Trans. Graph., 23(3):664–672, 2004.

[26] J. S. J. Ren, L. Xu, Q. Yan, and W. Sun. Shepard convolutional neural networks. In Proc. NIPS, pages 901–909, 2015.

[27] M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch. EnhanceNet: Single image super-resolution through automated texture synthesis. In Proc. ICCV, 2017.

[28] Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. In Proc. CVPR, 2017.

[29] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Supplementary material. https://dmitryulyanov.github.io/deep_image_prior.

[30] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In Proc. CVPR, pages 2528–2535, 2010.

[31] R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse representations. In Curves and Surfaces, pages 711–730. Springer, 2010.

[32] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In Proc. ICLR, 2017.