Permute, Quantize, and Finetune: Efficient Compression of Neural Networks
Abstract
Compressing large neural networks is an important step for their deployment in resource-constrained computational platforms.
In this context, vector quantization is an appealing framework that expresses multiple parameters using a single code, and has recently achieved state-of-the-art network compression on a range of core vision and natural language processing tasks.
Key to the success of vector quantization is deciding which parameter groups should be compressed together.
Previous work has relied on heuristics that group the spatial dimensions of individual convolutional filters,
but a general solution remains unaddressed.
Such a solution is desirable for pointwise convolutions (which dominate modern architectures),
linear layers (which have no notion of spatial dimension),
and convolutions where more than one filter is compressed into the same codeword.
In this paper we make the observation that the weights of two adjacent layers can be permuted while expressing the same function.
We then establish a connection to rate-distortion theory and search for permutations that result in networks that are easier to compress.
Finally, we rely on an annealed quantization algorithm to better compress the network and achieve higher final accuracy.
We show results on image classification, object detection, and segmentation,
reducing the gap with the uncompressed model by 40 to 70% w.r.t. the current state of the art.
We will release code to reproduce all our experiments.
1 Introduction
State-of-the-art approaches to many computer vision tasks are currently based on deep neural networks.
These networks often have large memory and computational requirements,
limiting the range of hardware platforms on which they can operate.
This poses a challenge for applications such as virtual reality and robotics,
which naturally rely on mobile and low-power computational platforms for large-scale deployment.
At the same time, these networks are often overparameterized [4],
which implies that it is possible to compress them – thereby reducing their memory and computation demands –
without much loss in accuracy.
Scalar quantization is a popular approach to network compression where
each network parameter is compressed individually, thereby limiting the achievable compression rates. To address this limitation, a recent line of work has focused on
vector quantization (VQ) [53, 12, 46],
which compresses multiple parameters into a single code.
Conspicuously, these approaches have recently achieved state-of-the-art compression-to-accuracy ratios on
core computer vision and natural language processing tasks [47, 9].
A key advantage of VQ is that it can naturally exploit redundancies among groups of network parameters,
for example, by grouping the spatial dimensions of convolutional filters in a single vector to achieve high compression rates.
However, finding which network parameters should be compressed jointly can be challenging;
for instance, there is no notion of spatial dimension in fully connected layers,
and it is not clear how vectors should be formed when the vector size is larger than a single convolutional filter –
which is always true for pointwise convolutions.
Current approaches either employ clustering (e.g., k-means)
using the order of the weights as
obtained by the trained network [12, 53, 46], which is suboptimal,
or search for groups of parameters that, when compressed jointly,
minimize the reconstruction error of the network activations [53, 47, 9], which is empirically hard to optimize.
In this paper,
we formalize the notion of redundancy among parameter groups using concepts from rate-distortion theory,
and leverage this analysis to search for permutations of the network weights that yield functionally equivalent,
yet easier-to-quantize networks.
The result is Permute, Quantize, and Finetune (PQF),
an efficient algorithm that first searches for permutations, codes and codebooks
that minimize the reconstruction error of the network weights,
and then uses gradient-based optimization to recover the accuracy of the uncompressed network.
Our main contributions can be summarized as follows:

We study the invariance of neural networks under permutation of their weights, focusing
on the constraints induced by the network topology. We then formulate a permutation optimization problem
to find functionally equivalent networks that are easier to quantize.
Our formulation focuses on improving a quantization lower bound on the weights.

We use an efficient annealed quantization algorithm that
reduces quantization error and
leads to higher accuracy of the compressed networks.

Put together, the above contributions define a novel method that produces
state-of-the-art results in terms of model size vs. accuracy.
We benchmark our method by compressing popular architectures for image classification, and
object detection & segmentation,
showcasing the wide applicability of our approach.
Our results show a 40-60% relative error reduction on ImageNet object classification over the current state of the art
when compressing a ResNet50 [20] down to about 3 MB.
We also demonstrate a relative 60% (resp. 70%) error reduction in object detection (resp. mask segmentation) on COCO over previous work,
by compressing a Mask-RCNN architecture down to about 6.6 MB.
2 Related Work
There is a vast literature on compressing neural networks.
Efforts in this area can broadly be divided into pruning, lowrank approximations, and quantization.
Weight pruning: In its simplest form, weight pruning can be achieved by removing small weights [17, 15],
or by approximating the importance of each parameter using second-order terms [29, 18, 6].
More sophisticated approaches use meta-learning to obtain pruning policies that generalize to multiple models [21],
or use regularization terms during training to reduce the parameter count [35].
Most of these methods prune individual weights,
and result in sparse networks that are difficult to accelerate on commonly available hardware.
To address these issues, another line of work aims to remove unimportant channels,
producing networks that are easier to accelerate in practice [22, 37, 30].
Lowrank approximations: These methods can achieve acceleration by design [5, 24, 28, 41],
as they typically factorize the original weight matrix into several smaller matrices.
As a result, the original computationally-heavy forward pass can be replaced by a multiplication of several smaller vectors and matrices.
Scalar quantization: These techniques constrain the number of bits that each parameter may take,
in the extreme case using binary [3, 43, 38, 52] or ternary [56] values.
8-bit quantization methods have proven robust and efficient,
which has motivated their native support by popular deep learning libraries such as
PyTorch
Tensorflow Lite
with acceleration often targeting CPUs.
We refer the reader to the survey by [40] for a recent comprehensive overview of the subject.
In this context, reducing each parameter to a single bit yields a theoretical compression ratio of 32×
(although, in practice, fully-connected and batch-norm layers are not quantized [38]).
To obtain higher compression ratios, researchers have turned to vector quantization.
Vector quantization (VQ): VQ of neural networks was pioneered by Gong et al. [12], who investigated scalar, vector,
and product quantization [25] (PQ) of fully-connected (FC) layers,
which were the most memory-demanding layers of convolutional neural networks (CNNs) at the time.
Wu et al. [53] used PQ to compress both FC and convolutional layers of CNNs;
they noticed that minimizing the quantization error of the network parameters
produces much worse results than minimizing the error of the activations,
so they sequentially quantized the layers to minimize error accumulation.
However, neither Gong et al. [12] nor Wu et al. [53] explored end-to-end training,
which is necessary to recover the network accuracy, especially as the compression ratio increases.
Son et al. [46] clustered 3×3 convolutions using vector quantization, and finetuned the centroids
via gradient descent, using additional bits to encode filter rotation, resulting in very compact codebooks.
However, they did not explore the compression of FC layers or pointwise convolutions (which dominate modern architectures),
and did not explore the relationship of quantization error to accuracy.
Stock et al. [47] use product quantization to compress convolutional and FC layers,
using a clustering technique designed to minimize the reconstruction error of the layer outputs
(which is computationally expensive),
followed by end-to-end training of the cluster centroids.
This approach obtains, to the best of our knowledge, state-of-the-art results in memory-to-accuracy compression
for image recognition and object detection.
However, it does not optimize the grouping of the network parameters for quantization,
which we find to be crucial to obtain good compression.
Different from previous approaches, our method exploits the invariance of
neural networks under permutation of their weights for the purpose of vector compression.
Based on this observation, we draw connections to rate-distortion theory,
and use an efficient permutation optimization algorithm that makes the network easier to quantize.
We also use an annealed clustering algorithm to further reduce quantization error,
and show that there is a direct correlation between the quantization error of a
network's weights and its final accuracy after finetuning.
These contributions result in an efficient method that largely outperforms its competitors on a wide range of applications.
3 Learning to Compress a Neural Network
In this paper we compress a neural network by compressing the weights of its layers.
Specifically, instead of storing the weight matrix W of a layer explicitly,
we learn an encoding of W that takes considerably less memory.
Intuitively, we can decode this encoding to a matrix Ŵ
that is “close” to W, and use Ŵ as the weight matrix for the layer.
The idea is that if Ŵ is similar to W, the activations
of the layer should also be similar. Note that the encoding will be different for each of the layers.
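As a quick sanity check of this intuition, the following toy snippet (not the paper's method; the layer sizes and perturbation scale are arbitrary) shows that a small change in the weights induces a comparably small relative change in the activations:

```python
import numpy as np

# Toy check: if a decoded weight matrix W_hat is close to the original W,
# the layer activations produced with W_hat stay close to the originals.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))                 # original layer weights
W_hat = W + 1e-3 * rng.standard_normal(W.shape)   # "decoded" approximation
x = rng.standard_normal(32)

rel_err = np.linalg.norm(W @ x - W_hat @ x) / np.linalg.norm(W @ x)
```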
3.1 Designing the Encoding
For a desired compression rate, we design the encoding to consist of
a codebook C, a set of codes B, and a permutation matrix P.
The permutation matrix preprocesses the weights so that they are easier to compress without affecting the inputoutput mapping of the network,
while the codes and codebook attempt to express the permuted weights as accurately as possible using limited memory.
Codes and codebook: Let W ∈ R^{n×m} denote the weight matrix of a fully-connected (FC) layer,
with n the input size of the layer, and m the size of its output.
We split each column of W into n/d column subvectors w_i ∈ R^d, which are then compressed individually:

W = [w_1; w_2; …; w_{mn/d}],   (1)

where each w_i ∈ R^d.
Thus, mn/d is the total number of subvectors and, intuitively, a larger d results in fewer subvectors and thus higher compression rates.
The set {w_i} is thus a collection of d-dimensional blocks that can be used to construct W.
Instead of storing all these subvectors, we approximate them by a smaller set
C = {c_1, c_2, …, c_k}, with c_j ∈ R^d,
which we call the codebook
for the layer.
We refer to the elements of C as centroids.
Let b_i be the index of the centroid in C
that is closest to w_i in Euclidean space:

b_i = argmin_j ||w_i − c_j||₂².   (2)

The codes B = {b_i} are the indices of the centroids in the codebook that best reconstruct every subvector w_i.
The approximation Ŵ of W is thus the matrix obtained by replacing each subvector w_i with c_{b_i}:

Ŵ = [c_{b_1}; c_{b_2}; …; c_{b_{mn/d}}].   (3)

We refer to the process of expressing the weight matrix in terms of codes and a codebook as quantization.
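This quantization mapping can be sketched in a few lines; the helper names and tiny sizes below are illustrative, not the paper's implementation:

```python
import numpy as np

def pq_encode(W, C):
    """Assign each d-dimensional column subvector of W to its nearest
    centroid (a minimal sketch of Eqs. (1)-(2); d must divide n)."""
    d = C.shape[1]
    sub = W.T.reshape(-1, d)                         # column subvectors
    dists = ((sub[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return dists.argmin(1)                           # codes b_i

def pq_decode(codes, C, shape):
    """Rebuild the approximation W_hat from codes and codebook (Eq. (3))."""
    n, m = shape
    return C[codes].reshape(m, n).T

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 3))      # n=8 rows, m=3 columns
C = rng.standard_normal((16, 4))     # k=16 centroids of dimension d=4
codes = pq_encode(W, C)
W_hat = pq_decode(codes, C, W.shape)
```

Note that re-encoding the decoded matrix yields the same codes, since the nearest centroid to a centroid is itself.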
Permutation: The effectiveness of a given set of codes and codebooks depends on their ability to represent the original weight matrix accurately.
Intuitively, this is easier to achieve if the subvectors are similar to one another.
Therefore, it is natural to consider transformations of that make the resulting subvectors easier to compress.
A feedforward network can be thought of as a directed acyclic graph (DAG), where nodes represent layers and edges represent the
flow of information in the network.
We refer to the starting node of an edge as a parent layer, and the end node as a child layer.
We note that the network is invariant under permutation of its weights,
as long as the same permutation is applied to the output dimension of parent layers and the input dimension of child layers.
Here, our key insight is that we can search for permutations that make the network easier to quantize.
Formally, consider a network comprised of two layers:

y₁ = f(W₁ x),   (4)
y₂ = W₂ y₁,   (5)

where f represents a non-linear activation function.
The network can be described as the function

g(x) = W₂ f(W₁ x),   (6)

where x is the input to the network.
Furthermore, from a topological point of view, W₁ is the parent of W₂.
Given a permutation of n elements π,
we denote as P the permutation matrix that results from reordering the rows of the identity matrix according to π.
Left-multiplying W with P has the effect of reordering the rows of W according to π.
Let W̃₂ be the layer that results from applying the permutation matrix P
to the input dimension of the weights of W₂:

W̃₂ = W₂ Pᵀ.   (7)

Analogously, let W̃₁
be the layer that results from applying the permutation
to the output dimension of the weights of W₁:

W̃₁ = P W₁.   (8)
Importantly, so long as f is an element-wise operator, W̃₁ produces the same output as W₁, only permuted:

f(W̃₁ x) = f(P W₁ x) = P f(W₁ x);   (9)

then we have

g̃(x) = W̃₂ f(W̃₁ x)   (10)
     = W₂ Pᵀ f(P W₁ x)   (11)
     = W₂ Pᵀ P f(W₁ x)   (12)
     = W₂ I f(W₁ x)   (13)
     = W₂ f(W₁ x)   (14)
     = g(x).   (15)
This functional equivalence has previously been used to characterize the optimization landscape of neural networks [1, 42].
In contrast, here we focus on quantizing the permuted weight W̃,
and denote its subvectors as w̃_i.
We depict the process of applying a permutation and obtaining new subvectors in Figure 1.
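The equivalence above is easy to verify numerically. This sketch (arbitrary layer sizes, ReLU standing in for any elementwise nonlinearity) permutes the output dimension of the parent and the input dimension of the child, and checks that the network output is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 10))   # parent layer: maps 10 -> 16
W2 = rng.standard_normal((5, 16))    # child layer: maps 16 -> 5
x = rng.standard_normal(10)
relu = lambda t: np.maximum(t, 0)    # any elementwise nonlinearity works

perm = rng.permutation(16)
P = np.eye(16)[perm]                 # rows of the identity, reordered

y = W2 @ relu(W1 @ x)                          # original network
y_tilde = (W2 @ P.T) @ relu((P @ W1) @ x)      # permuted parent/child pair
```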
Extension to convolutional layers: The encoding of convolutional layers is closely related to that of fullyconnected layers.
Let K ∈ R^{c_out × c_in × k × k} denote the weights
of a convolutional layer with c_in input channels,
c_out output channels, and a kernel size of k × k.
The idea is to reshape K into a 2d matrix W of size (c_in · k²) × c_out,
and then apply the same encoding method that we use with fully-connected layers.
The result is an approximation Ŵ to W.
We then apply the inverse of the reshaping operation on Ŵ to get our approximation to K.
When k > 1, we set the codeword size to a multiple of k² and
limit the permutation matrix to have a block structure such that the spatial dimensions of filters are quantized together.
For pointwise convolutions (i.e., k = 1), we set d to 4 or 8, depending on the desired compression rate.
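A minimal sketch of the reshape described above; the exact memory layout used in the paper may differ, but any consistent reshape with a well-defined inverse suffices:

```python
import numpy as np

# Reshape a (c_out, c_in, k, k) conv weight tensor to a 2-D matrix whose
# columns are flattened filters, so the same subvector encoding as for
# FC layers applies; then invert the reshape exactly.
c_out, c_in, k = 2, 3, 3
w = np.arange(c_out * c_in * k * k, dtype=np.float64).reshape(c_out, c_in, k, k)
W2d = w.reshape(c_out, -1).T                 # shape (c_in * k * k, c_out)
w_back = W2d.T.reshape(c_out, c_in, k, k)    # inverse reshape
```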
We have so far considered networks where each layer has a single parent and a single child
(i.e., the topology is a chain).
We now consider architectures where some layers may have more than one child or more than one parent.
Extension beyond chain architectures: AlexNet [27] and VGG [45] are examples of popular architectures with a chain topology.
As a consequence, each layer can have a different permutation.
However, architectures with more complicated topologies have more constraints on the permutations that they admit.
For example, consider Figure 2, which depicts six resblocks as used in the popular ResNet50 architecture.
We start by finding a permutation for layer 4a, and realize that its parent is layer 3c.
We also notice that
layers 3c and 2c must share the same permutation for the residual addition to have matching channels.
By induction, this is also true of layers 1c and 1d, which are now all parents of our initial layer 4a.
These parents have children of their own (layers 2a, 3a and 4d),
so these must be counted as siblings of 4a, and must share the same permutation as 4a.
However, note that all b and c layers are only children, so they can have their own independent permutation.
Operations such as reshaping and concatenation in parent layers may also affect the permutations that a layer can tolerate while preserving functional equivalence.
For example, in the detection head of Mask-RCNN [19], the output of a convolutional layer is reshaped before entering a FC layer.
Moreover, the same tensor is used in another convolutional layer (without reshaping) for mask prediction.
Therefore, the FC layer and the child convolutional layer must share the same permutation.
In this case, the FC layer must keep blocks of contiguous dimensions together
to respect the channel ordering of its parent (and to match the permutation of its sibling).
Determining the maximum set of independent permutations that an arbitrary network may admit (and finding efficient algorithms to do so)
is a problem we leave for future work.
3.2 Learning the Encoding
Our overarching goal is to learn an encoding of each layer such that the final output of the network is preserved.
Towards this goal, we search for a set of codes, codebook, and permutation that minimizes the quantization error
of every layer of the network:

min_{C, B, P} ||P W − Ŵ||₂².   (16)
Our optimization procedure consists of three steps:

Permute: We search for a permutation of each layer that results in subvectors that are easier to quantize.
We do this by minimizing the determinant of the covariance of the resulting subvectors.

Quantize: We estimate the codes and codebooks that minimize the quantization error of the permuted weights (Eq. (16)) using an annealed clustering algorithm.

Finetune: We have found that minimizing the quantization error of the network weights (Eq. (16))
results in small inaccuracies that accumulate over multiple layers, reducing performance;
therefore, it is important to jointly finetune the network so that it can recover its original accuracy with gradient descent.
We have also observed that the quality of the initial reconstruction has a direct impact on the final network accuracy.
We now describe the three steps in detail.
Permute
In this step, our goal is to estimate a permutation P such that the permuted weight matrix P W
has subvectors that are easily quantizable.
Intuitively, we want to minimize the spread of the vectors,
as more compact vectors can be expressed more accurately given a fixed number of centroids.
We now formalize this intuition and propose a simple algorithm to find good permutations.
A quantization lower bound: We assume that the weight subvectors that form the input to the quantization step
come from a Gaussian distribution N(0, Σ) with zero mean
and covariance Σ, which is a positive semi-definite d × d matrix.
Thanks to rate-distortion theory [11], we know that the expected reconstruction error must follow

E ||w − ŵ||₂² ≥ k_R · det(Σ)^{1/d},   (17)

where k_R is a constant that depends on the bitrate;
in other words, for a fixed bitrate, the error is lower-bounded by the determinant of the covariance of the subvectors.
We assume that we have access to a good minimizer such that, roughly, this bound is equal to the reconstruction
error achieved by our quantization algorithm.
Thus, for a fixed target compression bitrate, we can focus on finding a permutation
that minimizes det(Σ).
Searching for permutations: We make use of Expression (17) and focus on obtaining a permutation that minimizes the determinant
of the covariance Σ of the set {w̃_i}. We follow an argument similar to that of Ge et al. [10], and
note that the determinant of any positive semi-definite matrix Σ,
with elements σ_{ij}, satisfies Hadamard's inequality:

det(Σ) ≤ ∏_i σ_{ii};   (18)

that is, the determinant of Σ is upper-bounded by the product of its diagonal elements.
Motivated by this inequality, we greedily obtain an initial P that minimizes the product of
the diagonal elements of Σ by creating d buckets of row indices, each with capacity to hold n/d elements.
We then compute the variance of each row of W, and greedily assign each row index to the non-full bucket
that results in the lowest bucket variance.
Finally, we obtain P by interlacing rows from the buckets so that rows from the same bucket are placed d rows apart.
Larger convolutions can be handled similarly, assuming that P has a block structure, and making use of
the more general Fischer's inequality. Please refer to the supplementary material for more details.
There are n! possible permutations of the rows of W, so greedy algorithms are bound to have
limitations on the quality of the solution that they can find.
Thus, we refine our solution via stochastic local search [23].
Specifically, we iteratively improve the candidate permutation by swapping two dimensions chosen at random, and
keeping the new permutation if it results in a set of subvectors whose covariance has a lower determinant.
We repeat this procedure for a fixed number of iterations, and return the best permutation obtained.
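The local-search step can be sketched as follows; this simplified version (random swaps accepted when the log-determinant of the subvector covariance decreases) omits the paper's greedy bucket initialization, and the iteration count is arbitrary:

```python
import numpy as np

def subvector_cov_logdet(W, d):
    """log-det of the covariance of the d-dim column subvectors of W."""
    sub = W.T.reshape(-1, d)
    sub = sub - sub.mean(axis=0)
    cov = sub.T @ sub / len(sub)
    sign, logdet = np.linalg.slogdet(cov + 1e-9 * np.eye(d))
    return logdet

def local_search(W, d, iters=500, seed=0):
    """Swap two random rows; keep the swap if the log-det decreases.
    A simplified stand-in for the paper's greedy init + local search."""
    rng = np.random.default_rng(seed)
    perm = np.arange(W.shape[0])
    best = subvector_cov_logdet(W, d)
    for _ in range(iters):
        i, j = rng.integers(0, len(perm), size=2)
        perm[[i, j]] = perm[[j, i]]          # propose a swap
        cand = subvector_cov_logdet(W[perm], d)
        if cand < best:
            best = cand                      # keep the improvement
        else:
            perm[[i, j]] = perm[[j, i]]      # revert the swap
    return perm, best

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 32))
before = subvector_cov_logdet(W, 4)
perm, after = local_search(W, d=4)
```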
Quantize
In this step, we estimate the codes and codebook that approximate the permuted weight P W.
Given a fixed permutation, this is equivalent to the well-known k-means problem.
We use an annealed quantization algorithm called SR-C,
originally due to Zeger et al. [55], and recently adapted by Martinez et al. [39] to multi-codebook quantization.
SR-C empirically achieves lower quantization error than the vanilla k-means algorithm,
and is thus a better minimizer of Expression (17).
A stochastic relaxation of k-means: The quantization lower bound from Expression (17) suggests that the k-means algorithm can be annealed by scheduling a perturbation
such that the covariance of the set of subvectors decreases over time.
Due to Hadamard's inequality (i.e., Expression (18)),
this can be achieved by adding decreasing amounts of noise to the subvectors, sampled from a zero-mean Gaussian with diagonal covariance.
In particular, after randomly initializing the codes, we iteratively update the codebook and codes with
a noisy codebook update (which operates on subvectors with additive diagonalized Gaussian noise), and a standard k-means code update.
We decay the noise according to a schedule that depends on
the current iteration,
the total number of update iterations, and a constant exponent,
which we keep fixed in all our experiments.
For a detailed description, please refer to Algorithm 1.
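A loose sketch of the idea: k-means where the codebook update sees data with additive, diagonally-scaled Gaussian noise that decays over iterations. The schedule `(1 - it/iters)**p` and the constant `p` below are assumptions, not the paper's exact Algorithm 1:

```python
import numpy as np

def src_kmeans(X, k, iters=100, p=0.5, seed=0):
    """Annealed k-means sketch: the codebook update operates on the data
    plus decaying Gaussian noise scaled by the per-dimension std of X."""
    rng = np.random.default_rng(seed)
    codes = rng.integers(0, k, len(X))      # random code initialization
    std = X.std(axis=0)
    C = np.zeros((k, X.shape[1]))
    for it in range(iters):
        scale = (1.0 - it / iters) ** p     # assumed decay schedule
        Xn = X + scale * std * rng.standard_normal(X.shape)
        for j in range(k):                  # noisy codebook update
            members = Xn[codes == j]
            if len(members):
                C[j] = members.mean(axis=0)
        dists = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        codes = dists.argmin(1)             # standard k-means code update
    return C, codes

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
C, codes = src_kmeans(X, k=8)
err = ((X - C[codes]) ** 2).sum()
```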
Finetune
Encoding each layer independently causes errors in the activations to accumulate, resulting in degraded performance.
It is thus important to finetune the encoding in order to recover the original accuracy of the network.
In particular, we fix the codes and permutations for the remainder of the procedure.
Let L be the original loss function of the network (e.g., cross-entropy for classification).
We note that L is differentiable with respect to each of the learned centroids – since these are continuous – so
we use the original training set to finetune the centroids with gradient-based learning:

c_j ← U(c_j, ∇_{c_j} L; λ),   (19)

where U is an update rule (such as SGD, RMSProp [49] or Adam [26])
with hyperparameters λ (such as learning rate, momentum, and decay rates).
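With the codes fixed, the gradient of the loss with respect to a centroid is the sum of the gradients of the subvectors assigned to it. A minimal SGD version of the update rule (the paper uses Adam-style rules; the numbers below are a made-up example) might look like:

```python
import numpy as np

def centroid_sgd_step(C, codes, grad_W_sub, lr=0.1):
    """One plain-SGD centroid update (illustrative stand-in for U).
    C: (k, d) codebook; codes: (m,) assignments; grad_W_sub: (m, d)
    gradients of the loss w.r.t. the decoded subvectors."""
    grad_C = np.zeros_like(C)
    np.add.at(grad_C, codes, grad_W_sub)   # scatter-add per assigned code
    return C - lr * grad_C

C = np.zeros((2, 2))
codes = np.array([0, 0, 1])
g = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
C_new = centroid_sgd_step(C, codes, g, lr=0.1)
```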
4 Experiments
We test our method on ResNet [20] architectures for image classification and
Mask RCNN [19] for object detection and instance segmentation.
We compress standard ResNet18 and ResNet50 models that have been pretrained on ImageNet,
taking the weights directly from the PyTorch model zoo.
We train different networks with varying numbers of centroids k.
We also clamp the size of the codebook for each layer to a fixed maximum.
Model  Regime  

ResNet18 
Small blocks  4  4  
Large blocks  4  4  
ResNet50 
Small blocks  4  4  
Large blocks  8  4 
Small vs. large block sizes: To further assess the tradeoff between compression and accuracy,
we use two compression regimes.
In the large blocks regime, we use a larger subvector size for each layer,
which allows the weight matrix to be encoded with fewer codes, and thus leads to higher compression rates.
To describe the subvector sizes we use for each layer, we let d_conv denote the subvector
size for a convolutional layer with filters of size k × k (with k > 1). In the special case when k = 1,
corresponding to a pointwise convolution, we denote the subvector size by d_pw. Finally,
fully-connected layers have a subvector size of d_fc.
We summarize our subvector sizes for each model and compression regime in Table 1.
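To make the size accounting concrete, here is a rough calculator for the compression ratio of a single PQ-encoded layer; the layer shape, d, and k below are arbitrary examples, and the paper's exact bit allocation differs:

```python
import math

def compression_ratio(n, m, d, k, codebook_bits=16):
    """Rough size accounting for PQ-encoding an n x m fp32 weight matrix
    (an illustrative sketch, not the paper's exact bit allocation).

    Each of the m columns splits into n/d subvectors; each subvector is
    stored as a log2(k)-bit code, and the k centroids of dimension d are
    stored at codebook_bits per value.
    """
    assert n % d == 0
    original_bits = 32 * n * m
    code_bits = (n // d) * m * math.log2(k)
    codebook_total = k * d * codebook_bits
    return original_bits / (code_bits + codebook_total)

r = compression_ratio(2048, 512, d=4, k=256)   # roughly 16x for this layer
```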
Bit allocation: We compress all the fullyconnected and convolutional layers of a network.
However, following [47], we do not compress the first convolutional layer (since it occupies less than 0.05% of the network size),
the bias of the fullyconnected layers, or the batchnorm layers.
While we train with 32bit floats, we store our final model using 16bit floats,
which has a negligible impact on validation accuracy (less than 0.02%).
Finally, we fuse batchnorm layers into two vectors,
which can be done with algebraic manipulation and is
a trick normally used to speed up inference.
Please refer to the supplementary material for a detailed breakdown of the bit allocation in our models.
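The batch-norm fusion mentioned above collapses each batch-norm layer into a per-channel scale and shift; a minimal sketch of this standard trick:

```python
import numpy as np

# Fold batch-norm into a scale a and shift b:
# y = gamma * (x - mu) / sqrt(var + eps) + beta  ==  a * x + b.
def fuse_bn(gamma, beta, mu, var, eps=1e-5):
    a = gamma / np.sqrt(var + eps)
    b = beta - mu * a
    return a, b

gamma, beta = np.array([2.0]), np.array([1.0])
mu, var = np.array([0.5]), np.array([4.0])
a, b = fuse_bn(gamma, beta, mu, var)

x = np.array([3.0])
bn = gamma * (x - mu) / np.sqrt(var + 1e-5) + beta   # reference BN output
```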
Hyperparameters: We use a batch size of 128 for ResNet18 and a batch size of 64 for ResNet50.
For annealed k-means, we implement SR-C on the GPU, and run it for 1 000 iterations.
We finetune the codebooks for 9 epochs using Adam [26]
with an initial learning rate that is gradually reduced using cosine annealing [36].
Finetuning is the most expensive part of this process, and takes around 8 hours both for ResNet18 (with 1 GPU)
and for ResNet50 (with 4 GPUs). In the latter case, we scale the learning rate by a factor of 4, following Goyal et al. [13].
For permutation optimization, we perform 1 000 local search iterations;
this is done on the CPU in parallel for each independent permutation.
This process takes less than 5 minutes for ResNet18, and about 10 minutes for ResNet50 on a 12core CPU.
Baselines: We compare the results of our method against a variety of network compression methods:
Binary Weight Network (BWN) [43], Trained Ternary
Quantization (TTQ) [56], ABC-Net [34], LR-Net [44],
Deep Compression (DC) [16], Hardware-Aware Automated Quantization (HAQ) [51],
CLIP-Q [50],
Hessian AWare Quantization of Neural Networks with Mixed Precision (HAWQ) [8], and HAWQV2 [7].
We compare extensively against the recently-proposed Bit Goes Down (BGD) method of [47],
because it is the current state of the art by a large margin.
BGD uses the method of Wu et al. [53] as initialization, and thus subsumes it.
All results presented are taken either from the original papers,
or from two additional surveys [14, 2].
4.1 Image Classification
A summary of our results can be found in Figure 3.
From the Figure, it is clear that our method outperforms all its competitors.
On ResNet18 for example, we can surpass the performance of ABCNet (M=5)
with our small blocks models at roughly the compression rate.
Our biggest improvement
generally comes from higher compression rates, and is especially apparent for the larger ResNet50.
When using large blocks and centroids, we obtain a top1 accuracy of 72.18% using
only 3 MB of memory.
This represents an absolute 4% improvement over the state of the art.
On ResNet50, our method consistently reduces the remaining error by 40-60% w.r.t. the state of the art.
Ratio  Size  Acc.  

Semisup R50 [54]  –  97.50 MB  79.30 
BGD [47]  19  5.09 MB  76.12 
Semisup R50 [54]  –  97.50 MB  78.72 
Our PQF  19  5.09 MB  77.15 
ImageNet classification starting from a semisupervised ResNet50.
We set a new state of the art in terms of accuracy vs model size.
Semisupervised ResNet50: We also benchmark our method using a stronger backbone as a starting point.
We start from the recently released ResNet50 model due to Yalniz /etal [54],
which has been pretrained on unlabelled images from the YFCC100M dataset [48],
and finetuned on ImageNet.
While the accuracy of this model is reported to be 79.30%, we obtain a slightly lower 78.72%
after downloading the publicly-available model.
We use the small blocks compression regime,
mirroring the procedure described previously.
We show our results in Table 2.
Our model attains a top1 accuracy of 77.15% in the equalmemory setting.
This means that we are able to outperform previous work by over 1% absolute accuracy,
which defines a new state of the art in small models for image classification.
We find this result particularly interesting, as we originally expected
distillation to be necessary to transfer the knowledge of the larger
network pretrained on a large corpus of unlabeled images.
However, our results show that at least part of this knowledge is retained through
the initialization and structure that lowerror clustering imposes on the compressed network.
Ablation study:
Perm.  SRC  Adam  Acc.  

62.29  1.02  
✓  62.55  0.76  
✓  ✓  62.92  0.39  
✓  ✓  ✓  63.31  0.00 
Ablation study.
ResNet18 on ImageNet w/large blocks.
In Table 3, we show results for ResNet18 using large blocks,
for which we obtain a final accuracy of 63.31%.
We add permutation optimization (Sec. 3.2.1),
annealed k-means, as opposed to plain k-means (called SR-C in Sec. 3.2.2),
and the use of the Adam optimizer with cosine annealing instead of plain SGD, as in previous work.
From the Table, we can see that all our components are important and complementary to achieve top accuracy.
It is also interesting to note that a baseline that simply does k-means and SGD finetuning
is already 1% better than the current state of the art.
Since both annealed means and permutation optimization directly reduce quantization error before finetuning,
these experiments demonstrate that minimizing the quantization error of the weights leads to higher final network accuracy.
4.2 Object Detection and Segmentation
Size  Ratio  AP^{bb}  AP^{bb}_{50}  AP^{bb}_{75}  AP^{mk}  AP^{mk}_{50}  AP^{mk}_{75}

RetinaNet [32] (uncompressed)  145.00 MB  –  35.6  –  –  –  –  – 
Direct  18.13 MB  8.0  31.5  –  –  –  –  – 
FQN [31]  18.13 MB  8.0  32.5  51.5  34.7  –  –  – 
HAWQV2 [7]  17.9 MB  8.1  34.8  –  –  –  –  – 
MaskRCNN R50 FPN [19] (uncompressed)  169.40 MB  –  37.9  59.2  41.1  34.6  56.0  36.8 
BGD [47]  6.51 MB  26.0  33.9  55.5  36.2  30.8  52.0  32.2 
Our PQF  6.65 MB  25.5  36.3  57.9  39.4  33.5  54.7  35.6 
Object detection results on MS COCO 2017.
We compress a Mask RCNN network with a ResNet50 backbone, and include different object detection architectures used by other baselines.
We report both bounding box (bb) and mask (mk) metrics for Mask RCNN.
We also report the accuracy at different IoU when available.
Ratio  AP^{bb}  AP^{mk}  
MaskRCNN R50 FPN [19]  –  37.9  34.6 
BGD [47]  26.0  33.9  30.8 
Our PQF (no perm., no SRC)  25.5  35.6  33.0
Our PQF (no perm.)  35.8  33.1  
Our PQF (full)  36.3  33.5 
Ablation results on MS COCO 2017. Permutation optimization is particularly important for Mask-RCNN.
We also benchmark our method on the task of object detection
by compressing the popular ResNet50 MaskRCNN FPN architecture [19] using the
Microsoft COCO 2017 Dataset [33].
As before, we start from the pretrained model available on the PyTorch model zoo, and apply the same procedure
described above for all the convolutional and linear layers.
In this case, one of the heads contains a deconvolutional layer, which we
treat as a convolutional layer for the purpose of compression.
We use the small blocks regime,
which results in a model of 6.65 MB.
We compress and finetune the network on a single Nvidia GTX 1080Ti GPU
with a batch size of 2 for just 4 epochs. As before, we use Adam [26] and cosine annealing [36],
but with a different initial learning rate.
Our results are presented in Table 4.
We also compare against recent baselines such as the Fully Quantized Network (FQN) [31],
and the second version of Hessian Aware Quantization (HAWQV2) [7],
which showcase results compressing RetinaNet [32].
Our method obtains a box AP of 36.3, and a mask AP of 33.5,
which represent improvements of 2.4 and 2.7 points over the best previously reported results,
closing the gap to the uncompressed model by 60-70%.
Compared to BGD [47], we also use fewer computational resources,
as they used 8 V100 GPUs and distributed training for compression,
while we use a single 1080Ti GPU.
Based on our inspection, our models have the same code and codebook dimensions for each layer.
However, we were unable to reproduce the 6.51 MB model size they report,
and instead obtain a slightly larger 6.65 MB.
In Table 5, we show again that using both SR-C and permutation optimization is crucial to obtain the best results.
These results demonstrate the ability of our method to generalize to more complex tasks beyond image classification.
5 Conclusion and Future Work
We have demonstrated that the quantization error of the weights of a neural network is inversely correlated with
its accuracy after codebook finetuning.
We have further proposed a method that exploits the functional equivalence of the network under permutation
of its weights to find configurations of the weights that are easier to quantize.
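This permutation equivalence is easy to verify numerically. Below is a minimal NumPy sketch for a two-layer ReLU network without biases (all variable names are illustrative): permuting the hidden units of the first layer and the corresponding input dimensions of the second leaves the output unchanged, because the ReLU acts elementwise.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((64, 32))   # first layer: 32 inputs -> 64 hidden units
W2 = rng.standard_normal((10, 64))   # second layer: 64 hidden units -> 10 outputs
x = rng.standard_normal(32)

perm = rng.permutation(64)           # any permutation of the hidden units

def forward(A, B, v):
    # Two-layer ReLU network without biases.
    return B @ np.maximum(A @ v, 0.0)

y_original = forward(W1, W2, x)
# Permute the rows of W1 and the columns of W2 with the same permutation.
y_permuted = forward(W1[perm, :], W2[:, perm], x)

assert np.allclose(y_original, y_permuted)
```

With biases, the first layer's bias vector must be permuted by the same permutation; per-channel parameters such as batch-normalization statistics follow the same rule.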
We have also shown that an annealed k-means algorithm further reduces quantization error and improves final network accuracy.
Put together, our contributions result in more accurate models compared to previous work.
When compressing the ResNet-50 architecture, our method closes the relative gap to the uncompressed model by 40–60%
compared to the previous state of the art in classification,
and by 60–70% on detection and instance segmentation.
Our optimization method consists of three stages that focus on different variables of the encoding.
Future work may focus on techniques that jointly finetune the codes and the codebooks,
or optimization methods that learn the weight permutation jointly with the codes and codebook.
The determinant of the covariance of the weights is a continuous metric that could be minimized as the
network is trained from scratch, resulting in networks that are easier to compress by design.
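As an illustrative sketch of this metric (assuming, as in vector quantization, that the weights are grouped into fixed-size subvectors; the function below is our own construction, not part of the method described above), the log-determinant of the subvector covariance can be computed as follows:

```python
import numpy as np

def logdet_covariance(W: np.ndarray, block_dim: int, eps: float = 1e-6) -> float:
    """Split a weight matrix into block_dim-dimensional subvectors and
    return the log-determinant of their covariance: a continuous proxy
    for how hard the subvectors are to compress. A small jitter keeps
    the covariance positive definite."""
    vectors = W.reshape(-1, block_dim)
    cov = np.cov(vectors, rowvar=False) + eps * np.eye(block_dim)
    sign, logdet = np.linalg.slogdet(cov)
    assert sign > 0  # jittered covariance is positive definite
    return logdet

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))
score = logdet_covariance(W, block_dim=4)
```

Because `slogdet` is differentiable in frameworks with automatic differentiation, the same quantity could in principle serve as a training-time regularizer.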
References
[1] An Mei Chen, Haw-minn Lu, and Robert Hecht-Nielsen. On the geometry of feedforward neural network error surfaces. Neural Computation, 5(6):910–927, 1993.
[2] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey on model compression and acceleration for deep neural networks. CoRR, 2017.
[3] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830, 2016.
[4] Misha Denil, Babak Shakibi, Laurent Dinh, Marc'Aurelio Ranzato, and Nando de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems, 2013.
[5] Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, 2014.
[6] Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, 2017.
[7] Zhen Dong, Zhewei Yao, Yaohui Cai, Daiyaan Arfeen, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. HAWQ-V2: Hessian aware trace-weighted quantization of neural networks. arXiv preprint arXiv:1911.03852, 2019.
[8] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. Hessian-aware quantization of neural networks with mixed precision. In ICCV, 2019.
[9] Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, and Armand Joulin. Training with quantization noise for extreme model compression. arXiv preprint, 2020.
[10] Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization for approximate nearest neighbor search. In CVPR, 2013.
[11] Allen Gersho and Robert M. Gray. Vector Quantization and Signal Compression, chapter 8, pages 228–243. Springer Science & Business Media, 1991.
[12] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
[13] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[14] Yunhui Guo. A survey on methods and theories of quantized neural networks. arXiv preprint arXiv:1808.04752, 2018.
[15] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems, 2016.
[16] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
[17] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, 2015.
[18] Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems, 1993.
[19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[21] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In ECCV, 2018.
[22] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In ICCV, 2017.
[23] Holger H. Hoos and Thomas Stützle. Stochastic Local Search: Foundations and Applications. Elsevier, 2004.
[24] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014.
[25] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 2010.
[26] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[28] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.
[29] Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, 1990.
[30] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.
[31] Rundong Li, Yan Wang, Feng Liang, Hongwei Qin, Junjie Yan, and Rui Fan. Fully quantized network for object detection. In CVPR, 2019.
[32] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, pages 2980–2988, 2017.
[33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[34] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems, 2017.
[35] Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In ICCV, 2017.
[36] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[37] Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In ICCV, 2017.
[38] Brais Martinez, Jing Yang, Adrian Bulat, and Georgios Tzimiropoulos. Training binary neural networks with real-to-binary convolutions. In ICLR, 2020.
[39] Julieta Martinez, Shobhit Zakhmi, Holger H. Hoos, and James J. Little. LSQ++: Lower running time and higher recall in multi-codebook quantization. In ECCV, pages 491–506, 2018.
[40] James O'Neill. An overview of neural network compression. arXiv preprint arXiv:2006.03669, 2020.
[41] Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, 2015.
[42] A. Emin Orhan and Xaq Pitkow. Skip connections eliminate singularities. In ICLR, 2018.
[43] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, pages 525–542, 2016.
[44] Oran Shayer, Dan Levi, and Ethan Fetaya. Learning discrete weights using the local reparameterization trick. arXiv preprint arXiv:1710.07739, 2017.
[45] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[46] Sanghyun Son, Seungjun Nah, and Kyoung Mu Lee. Clustering convolutional kernels to compress deep neural networks. In ECCV, pages 216–232, 2018.
[47] Pierre Stock, Armand Joulin, Rémi Gribonval, Benjamin Graham, and Hervé Jégou. And the bit goes down: Revisiting the quantization of neural networks. In ICLR, 2020.
[48] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
[49] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.
[50] Frederick Tung and Greg Mori. Deep neural network compression by in-parallel pruning-quantization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[51] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: Hardware-aware automated quantization with mixed precision. In CVPR, 2019.
[52] Ziwei Wang, Jiwen Lu, Chenxin Tao, Jie Zhou, and Qi Tian. Learning channel-wise interactions for binary convolutional neural networks. In CVPR, pages 568–577, 2019.
[53] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In CVPR, 2016.
[54] I. Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv Mahajan. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546, 2019.
[55] Kenneth Zeger, Jacques Vaisey, Allen Gersho, et al. Globally optimal vector quantizer design by stochastic relaxation. IEEE Transactions on Signal Processing, 40(2):310–322, 1992.
[56] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. arXiv preprint arXiv:1612.01064, 2016.