Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks

论文 Deep Talk 4周前 (10-31) 15次浏览 已收录 0个评论 扫描二维码

Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks


Compressing large neural networks is an important step for their deployment in resource-constrained computational platforms.
In this context, vector quantization is an appealing framework that expresses multiple parameters using a single code, and has recently achieved state-of-the-art network compression on a range of core vision and natural language processing tasks.
Key to the success of vector quantization is deciding which parameter groups should be compressed together.
Previous work has relied on heuristics that group the spatial dimension of individual convolutional filters,
but a general solution remains unaddressed.
This is desirable for pointwise convolutions (which dominate modern architectures),
linear layers (which have no notion of spatial dimension),
and convolutions (when more than one filter is compressed to the same codeword).
In this paper we make the observation that the weights of two adjacent layers can be permuted while expressing the same function.
We then establish a connection to rate-distortion theory and search for permutations that result in networks that are easier to compress.
Finally, we rely on an annealed quantization algorithm to better compress the network and achieve higher final accuracy.
We show results on image classification, object detection, and segmentation,
reducing the gap with the uncompressed model by 40 to 70% /wrtthe current state of the art.
We will release code to reproduce all our experiments.


s m/IfBooleanF#1#2


1 Introduction

State-of-the-art approaches to many computer vision tasks are currently based on deep neural networks.
These networks often have large memory and computational requirements,
limiting the range of hardware platforms on which they can operate.
This poses a challenge for applications such as virtual reality and robotics,
which naturally rely on mobile and low-power computational platforms for large-scale deployment.
At the same time, these networks are often overparameterized [4],
which implies that it is possible to compress them – thereby reducing their memory and computation demands –
without much loss in accuracy.

Scalar quantization is a popular approach to network compression where
each network parameter is compressed individually, thereby limiting the achievable compression rates. To address this limitation, a recent line of work has focused on
vector quantization (VQ) [53, 12, 46],
which compresses multiple parameters into a single code.
Conspicuously, these approaches have recently achieved state-of-the-art compression-to-accuracy ratios on
core computer vision and natural language processing tasks [47, 9].

A key advantage of VQ is that it can naturally exploit redundancies among groups of network parameters,
for example, by grouping the spatial dimensions of convolutional filters in a single vector to achieve high compression rates.
However, finding which network parameters should be compressed jointly can be challenging;
for instance, there is no notion of spatial dimension in fully connected layers,
and it is not clear how vectors should be formed when the vector size is larger than a single convolutional filter
which is always true for pointwise convolutions.
Current approaches either employ clustering (/eg, -means)
using the order of the weights as
obtained by the network [12, 53, 46], which is suboptimal,
or search for groups of parameters that, when compressed jointly,
minimize the reconstruction error of the network activations [53, 47, 9], which is empirically hard to optimize.

In this paper,
we formalize the notion of redundancy among parameter groups using concepts from rate-distortion theory,
and leverage this analysis to search for permutations of the network weights that yield functionally equivalent,
yet easier-to-quantize networks.
The result is Permute, Quantize, and Fine-tune (PQF),
an efficient algorithm that first searches for permutations, codes and codebooks
that minimize the reconstruction error of the network weights,
and then uses gradient-based optimization to recover the accuracy of the uncompressed network.
Our main contributions can be summarized as follows:

  1. We study the invariance of neural networks under permutation of their weights, focusing
    on constraints induced by the network topology. We then formulate a permutation optimization problem
    to find functionally equivalent networks that are easier to quantize.
    Our result focuses on improving a quantization lower bound of the weights; therefore

  2. We use an efficient annealed quantization algorithm that
    reduces quantization error and
    leads to higher accuracy of the compressed networks. Finally,

  3. We show that the reconstruction error of the network parameters is inversely correlated
    with the final network accuracy after gradient-based fine-tuning.

Put together, the above contributions define a novel method that produces
state-of-the-art results in terms of model size vs. accuracy.
We benchmark our method by compressing popular architectures for image classification, and
object detection & segmentation,
showcasing the wide applicability of our approach.
Our results show a 40-60% relative error reduction on Imagenet object classification over the current state-of-the-art
when compressing a ResNet-50 [20] down to about 3 MB ( compression).
We also demonstrate a relative 60% (resp. 70%) error reduction in object detection (resp. mask segmentation) on COCO over previous work,
by compressing a Mask-RCNN architecture down to about 6.6 MB ( compression).

2 Related Work

There is a vast literature on compressing neural networks.
Efforts in this area can broadly be divided into pruning, low-rank approximations, and quantization.

Weight pruning: In its simplest form, weight pruning can be achieved by removing small weights [17, 15],
or approximating the importance of each parameter using second-order terms [29, 18, 6].
More sophisticated approaches use meta-learning to obtain pruning policies that generalize to multiple models [21],
or use regularization terms during training to reduce parameter count [35].
Most of these methods prune individual weights,
and result in sparse networks that are difficult to accelerate on commonly available hardware.
To address these issues, another line of work aims to remove unimportant channels,
producing networks that are easier to accelerate in practice [22, 37, 30].

Low-rank approximations: These methods can achieve acceleration by design [5, 24, 28, 41],
as they typically factorize the original weight matrix into several smaller matrices.
As a result, the original computationally-heavy forward pass can be replaced by a multiplication of several smaller vectors and matrices.

Scalar quantization: These techniques constrain the number of bits that each parameter may take,
in the extreme case using binary [3, 43, 38, 52] or ternary [56] values.
8-bit quantization methods have proven robust and efficient,
which has motivated their native support by popular deep learning libraries such as
PyTorch1 and
Tensorflow Lite2,
with acceleration often targeting CPUs.
We refer the reader to the survey by [40] for a recent comprehensive overview of the subject.
In this context, reducing each parameter to a single bit yields a theoretical compression ratio of
(although, in practice, fully-connected and batch norm layers are not quantized [38]).
To obtain higher compression ratios, researchers have turn to vector quantization.

Vector quantization (VQ): VQ of neural networks was pioneered by Gong /etal [12], who investigated scalar, vector,
and product quantization [25] (PQ) of fully-connected (FC) layers,
which were the most memory-demanding layers of convolutional neural networks (CNNs) at the time.
Wu /etal [53] used PQ to compress both FC and convolutional layers of CNNs;
they noticed that minimizing the quantization error of the network parameters
produces much worse results than minimizing the error of the activations,
so they sequentially quantized the layers to minimize error accumulation.
However, neither Gong /etal [12] nor Wu /etal [53], explored end-to-end training,
which is necessary to recover the network accuracy, especially as the compression ratio increases.

Son /etal [46] clustered 33 convolutions using vector quantization, and fine-tuned the centroids
via gradient descent using additional bits to encode filter rotation, resulting in very compact codebooks.
However, they did not explore the compression of FC layers nor pointwise convolutions (which dominate modern architectures),
and did not explore the relationship of quantization error to accuracy.

Stock /etal [47] use product quantization to compress convolutional and FC layers,
using a clustering technique designed to minimize the reconstruction error of the layer outputs
(which is computationally expensive),
followed by end-to-end training of the cluster centroids.
This approach obtains, to the best our our knowledge, state-of-the-art results in memory-to-accuracy compression
for image recognition and object detection.
However, it does not optimize the grouping of the network parameters for quantization,
which we find to be crucial to obtain good compression.

Different from previous approaches, our method exploits the invariance of
neural networks under permutation of their weights for the purpose of vector compression.
Based on this observation, we draw connections to rate distortion theory,
and use an efficient permutation optimization algorithm that makes the network easier to quantize.
We also use an annealed clustering algorithm to further reduce quantization error,
and show that there is a direct correlation between the quantization error of a
network weights and its final accuracy after fine-tuning.
These contributions result in an efficient method that largely outperforms its competitors on a wide range of applications.

3 Learning to Compress a Neural Network

In this paper we compress a neural network by compressing the weights of its layers.
Specifically, instead of storing the weight matrix of a layer explicitly,
we learn an encoding that takes considerably less memory.
Intuitively, we can decode to a matrix
that is “close” to , and use as the weight matrix for the layer.
The idea is that if is similar to , the activations
of the layer should also be similar. Note that the encoding will be different for each of the layers.

3.1 Designing the Encoding

For a desired compression rate, we design the encoding to consist of
a codebook , a set of codes , and a permutation matrix .
The permutation matrix preprocesses the weights so that they are easier to compress without affecting the input-output mapping of the network,
while the codes and codebook attempt to express the permuted weights as accurately as possible using limited memory.

Codes and codebook: Let denote the weight matrix of a fully-connected (FC) layer,
with the input size of the layer, and the size of its output.
We split each column of into column subvectors , which are then compressed individually:


where .
Thus, is the total number of subvectors and, intuitively, larger results in fewer subvectors and thus higher compression rates.
The set is thus a collection of -dimensional blocks that can be used to construct .

Instead of storing all these subvectors, we approximate them by a smaller set
which we call the codebook
for the layer.
We refer to the elements of as centroids.
Let be the index of the element in
that is closest to in Euclidean space:


The codes are the indices of the codes in the codebook that best reconstruct every subvector .
The approximation of is thus the matrix obtained by replacing each subvector with :


We refer to the process of expressing the weight matrix in terms of codes and a codebook as quantization.

Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks
Figure 1:
Permutation optimization of a fully-connected layer. Our goal is to find a permutation
of the weights such that the resulting subvectors are easier to compress.

Permutation: The effectiveness of a given set of codes and codebooks depends on their ability to represent the original weight matrix accurately.
Intuitively, this is easier to achieve if the subvectors are similar to one another.
Therefore, it is natural to consider transformations of that make the resulting subvectors easier to compress.

A feedforward network can be thought of as a directed acyclic graph (DAG), where nodes represent layers and edges represent the
flow of information in the network.
We refer to starting node of an edge as a parent layer, and the end node as a child layer.
We note that the network is invariant under permutation of its weights,
as long as the same permutation is applied to the output dimension for parent layers and the input dimension for children layers.
Here, our key insight is that we can search for permutations that make the network easier to quantize.

Formally, consider a network comprised of two layers:


where represents a non-linear activation function.
The network can be described as the function


where is the input to the network.
Furthermore, from a topological point of view, is the parent of .

Given a permutation of elements ,
we denote as the permutation matrix that results from reordering the rows of the identity matrix according to .
Left-multiplying with has the effect of reordering the rows of according to .

Let be the layer that results from applying the permutation matrix
to the input dimension of the weights of :


Analogously, let
be the layer that results from applying the permutation
to the output dimension of the weights of :


Importantly, so long as is an element-wise operator, produces the same output as , only permuted:


then we have


This functional equivalence has previously been used to characterize the optimization landscape of neural networks [1, 42].
In contrast, here we focus on quantizing the permuted weight ,
and denote its subvectors as .
We depict the process of applying a permutation and obtaining new subvectors in Figure 1.

Extension to convolutional layers: The encoding of convolutional layers is closely related to that of fully-connected layers.
Let denote the weights
of a convolutional layer with input channels,
output channels, and a kernel size of .
The idea is to reshape into a 2d matrix of size ,
and then apply the same encoding method that we use with fully-connected layers.
The result is an approximation to .
We then apply the inverse of the reshaping operation on to get our approximation to .

When , we set the codeword size to a multiple of and
limit the permutation matrix to have a block structure such that the spatial dimensions of filters are quantized together.
For pointwise convolutions (/ie, ), we set to 4 or 8, depending on the desired compression rate.

We have so far considered networks where each layer has a single parent and a single child
(/ie, the topology is a chain).
We now consider architectures where some layers may have more than one child or more than one parent.

Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks
Figure 2:
Parent-child dependencies in the Resblocks of a ResNet-50 architecture.
Purple nodes are children and yellow nodes are parents, and must share the same permutation
(in for children and in for parents) for the network to produce the same output.

Extension beyond chain architectures: AlexNet [27] and VGG [45] are examples of popular architectures with a chain topology.
As a consequence, each layer can have a different permutation.
However, architectures with more complicated topologies have more constraints on the permutations that they admit.

For example, consider Figure 2, which depicts six resblocks as used in the popular ResNet-50 architecture.
We start by finding a permutation for layer 4a, and realize that its parent is layer 3c.
We also notice that
layers 3c and 2c must share the same permutation for the residual addition to have matching channels.
By induction, this is also true of layers 1c and 1d, which are now all parents of our initial layer 4a.
These parents have children of their own (layers 2a, 3a and 4d),
so these must be counted as siblings of 4a, and must share the same permutation as 4a.
However, note that all b and c layers are only children, so they can have their own independent permutation.

Operations such as reshaping and concatenation in parent layers may also affect the permutations that a layer can tolerate while preserving functional equivalence.
For example, in the detection head of Mask-RCNN [19], the output (of shape ) of a convolutional layer is reshaped (to ) before before entering a FC layer.
Moreover, the same tensor is used in another convolutional layer (without reshaping) for mask prediction.
Therefore, the FC layer and the child convolutional layer must share the same permutation.
In this case, the FC layer must keep blocks of contiguous dimensions together
to respect the channel ordering of its parent (and to match the permutation of its sibling).
Determining the maximum set of independent permutations that an arbitrary network may admit (and finding efficient algorithms to do so)
is a problem we leave for future work.

3.2 Learning the Encoding

Our overarching goal is to learn an encoding of each layer such that the final output of the network is preserved.
Towards this goal, we search for a set of codes, codebook, and permutation that minimizes the quantization error
of every layer of the network:


Our optimization procedure consists of three steps:

  1. Permute: We search for a permutation of each layer that results in subvectors that are easier to quantize.
    We do this by minimizing the determinant of the covariance of the resulting subvectors.

  2. Quantize: We obtain codes and codebooks for each layer by minimizing the difference between the approximated weight and the permuted weight.

  3. Fine-tune: Finally, we jointly fine-tune all the codebooks with gradient-based optimization by minimizing the loss
    function of the original network over the training dataset.

We have found that minimizing the quantization error of the network weights (Eq. (16))
results in small inaccuracies that accumulate over multiple layers reducing performance;
therefore, it is important to jointly fine-tune the network so that it can recover its original accuracy with gradient descent.
We have also observed that the quality of the intial reconstruction has a direct impact on the final network accuracy.
We now describe the three steps in detail.


In this step, our goal is to estimate a permutation such that the permuted weight matrix
has subvectors that are easily quantizable.
Intuitively, we want to minimize the spread of the vectors,
as more compact vectors can be expressed more accurately given a fixed number of centroids.
We now formalize this intuition and propose a simple algorithm to find good permutations.

A quantization lower bound: We assume that the weight subvectors that form the input to the quantization step
come from a Gaussian distribution, , with zero-mean
and covariance , which is a positive semi-definite matrix.
Thanks to rate distortion theory [11], we know that the expected reconstruction error must follow


in other words, the error is lower-bounded by the determinant of the covariance of the subvectors of .
We assume that we have access to a good minimizer such that, roughly, this bound is equal to the reconstruction
error achieved by our quantization algorithm.
Thus, for a fixed target compression bit-rate, we can focus on finding a permutation
that minimizes .

Searching for permutations: We make use of Expression (17) and focus on obtaining a permutation that minimizes the determinant
of the covariance of the set . We follow an argument similar to that of Ge /etal [10], and
note that the determinant of any positive semi-definite matrix ,
with elements , satisfies Hadamard’s inequality:


that is, the determinant of is upper-bounded by the product of its diagonal elements.

Motivated by this inequality, we greedily obtain an initial that minimizes the product of
the diagonal elements of by creating buckets of row indices, each with capacity to hold elements.
We then compute the variance of each row of , and greedily assign each row index to the non-full bucket
that results in lowest bucket variance.
Finally, we obtain by interlacing rows from the buckets so that rows from the same bucket are placed rows apart.
convolutions can be handled similarly, assuming that has a block structure, and making use of
the more general Fischer’s inequality. Please refer to the supplementary material for more details.

There are possible permutations of , so greedy algorithms are bound to have
limitations on the quality of the solution that they can find.
Thus, we refine our solution via stochastic local search [23].
Specifically, we iteratively improve the candidate permutation by flipping two dimensions chosen at random, and
keeping the new permutation if it results in a set of subvectors whose covariance has lower determinant .
We repeat this procedure for a fixed number of iterations, and return the best permutation obtained.


In this step, we estimate the codes and codebook that approximate the permuted weight .
Given a fixed permutation, this is equivalent to the well-known -means problem.
We use an annealed quantization algorithm called SR-C,
originally due to Zeger /etal [55], and recently adapted by Martinez /etal [39] to multi-codebook quantization.
SR-C empirically achieves lower quantization error than the vanilla -means algorithm,
and is thus a better minimizer of Expresion (17).

A stochastic relaxation of clustering: The quantization lower bound from Expression (17) suggests that the -means algorithm can be annealed by scheduling a perturbation
such that the covariance of the set decreases over time.
Due to Hadamard’s inequality (/ie, Expresion (18)),
this can be achieved by adding decreasing amounts of noise to sampled from a zero-mean Gaussian with diagonal covariance.

In particular, after randomly initializing the codes, we iteratively update the codebook and codes with
a noisy codebook update (which operates on subvectors with additive diagonalized Gaussian noise), and a standard -means code update.
We decay the noise according to the schedule ,
where is the current iteration,
is the total number of update iterations, and is a constant.
We use in all our experiments.
For a detailed description, please refer to Algorithm 1.

1:procedure SR-C()
2:      InitializeCodes()
3:     for  do
/LeftComment        # Add scheduled noise to subvectors
4:         for  do
7:         end for/LeftComment# Noisy codebook update
/LeftComment# Regular codes update
10:     end for
11:     return
12:end procedure
Algorithm 1 SR-C: Stochastic relaxation of -means.


Encoding each layer independently causes errors in the activations to accumulate resulting in degradation of performance.
It is thus important to fine-tune the encoding in order to recover the original accuracy of the network.
In particular, we fix the codes and permutations for the remainder of the procedure.
Let be the original loss function of the network (/eg, cross-entropy for classification).
We note that is differentiable with respect to each of the learned centroids – since these are continuous – so
we use the original training set to fine-tune the centroids with gradient-based learning:


where is an update rule (such as SGD, RMSProp [49] or Adam [26])
with hyperparameters (such as learning rate, momentum, and decay rates).

4 Experiments

We test our method on ResNet [20] architectures for image classification and
Mask R-CNN [19] for object detection and instance segmentation.
We compress standard ResNet-18 and ResNet-50 models that have been pre-trained on ImageNet,
taking the weights directly from the PyTorch model zoo.
We train different networks with .
We also clamp the size of the codebook for each layer to .

Model Regime

Small blocks 4 4
Large blocks 4 4

Small blocks 4 4
Large blocks 8 4
Table 1: Subvector sizes and compression regimes.

Small vs. large block sizes: To further assess the trade-off between compression and accuracy,
we use two compression regimes.
In the large blocks regime, we use a larger subvector size for each layer,
which allows the weight matrix to be encoded with fewer codes, and thus leads to higher compression rates.
To describe the subvector sizes we use for each layer, we let denote the subvector
size for a convolutional layer with filters of size . In the special case when ,
corresponding to a pointwise convolution, we denote the subvector size by . Finally,
fully-connected layers have a subvector size of .
We summarize our subvector sizes for each model and compression regime in Table 1.

Bit allocation: We compress all the fully-connected and convolutional layers of a network.
However, following [47], we do not compress the first convolutional layer (since it occupies less than 0.05% of the network size),
the bias of the fully-connected layers, or the batchnorm layers.
While we train with 32-bit floats, we store our final model using 16-bit floats,
which has a negligible impact on validation accuracy (less than 0.02%).
Finally, we fuse batchnorm layers into two vectors,
which can be done with algebraic manipulation and is
a trick normally used to speed up inference.
Please refer to the supplementary material for a detailed breakdown of the bit allocation in our models.

Hyperparameters: We use a batch size of 128 for ResNet-18 and a batch size of 64 for ResNet-50.
For annealed -means, we implement SR-C in the GPU, and run it for 1 000 iterations.
We fine-tune the codebooks for 9 epochs using Adam [26]
with an initial learning rate of , which is gradually reduced to using cosine annealing [36].
Fine-tuning is the most expensive part of this process, and takes around 8 hours both for ResNet-18 (with 1 GPU)
and for ResNet-50 (with 4 GPUs). In the latter case, we scale the learning rate by a factor of 4, following Goyal /etal [13].
For permutation optimization, we perform 1 000 local search iterations;
this is done in the CPU in parallel for each independent permutation.
This process takes less than 5 minutes for ResNet-18, and about 10 minutes for ResNet-50 on a 12-core CPU.

Baselines: We compare the results of our method against a variety of network compression methods:
Binary Weight Network (BWN) [43], Trained Ternary
Quantization (TTQ) [56], ABC-Net [34], LR-Net [44],
Deep Compression (DC) [16], Hardware-Aware Automated Quantization (HAQ) [51],
CLIP-Q [50],
Hessian AWare Quantization of Neural Networks with Mixed Precision (HAWQ) [8], and HAWQ-V2 [7].
We compare extensively against the recently-proposed Bit Goes Down (BGD) method of [47]
because it is the current state of the art by a large margin.
BGD uses as initialization the method due to Wu /etal [53], and thus subsumes it.
All results presented are taken either from the original papers,
or from two additional surveys [14, 2].

Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks
Figure 3:
Compression results on ResNet-18 and ResNet-50.
We compare accuracy vs. model size, using models from the PyTorch zoo as a starting point.
In general, our method achieves higher accuracy compared to previous work.

4.1 Image Classification

A summary of our results can be found in Figure 3.
From the Figure, it is clear that our method outperforms all its competitors.
On ResNet-18 for example, we can surpass the performance of ABC-Net (M=5)
with our small blocks models at roughly the compression rate.
Our biggest improvement
generally comes from higher compression rates, and is especially apparent for the larger ResNet-50.
When using large blocks and centroids, we obtain a top-1 accuracy of 72.18% using
only 3 MB of memory.
This represents an absolute 4% improvement over the state of the art.
On ResNet-50, our method consistently reduces the remaining error by % /wrt the state of the art.

Ratio Size Acc.
Semi-sup R50 [54] 97.50 MB 79.30
BGD [47] 19 5.09 MB 76.12
Semi-sup R50 [54] 97.50 MB 78.72
Our PQF 19 5.09 MB 77.15
Table 2:
ImageNet classification starting from a semi-supervised ResNet-50.
We set a new state of the art in terms of accuracy vs model size.

Semi-supervised ResNet-50: We also benchmark our method using a stronger backbone as a starting point.
We start from the recently released ResNet-50 model due to Yalniz /etal [54],
which has been pre-trained on unlabelled images from the YFCC100M dataset [48],
and fine-tuned on ImageNet.
While the accuracy of this model is reported to be 79.30%, we obtain a slightly lower 78.72%
after downloading the publicly-available model3.
We use the small blocks compression regime with ,
mirroring the procedure described previously.

We show our results in Table 2.
Our model attains a top-1 accuracy of 77.15% in the equal-memory setting.
This means that we are able to outperform previous work by over 1% absolute accuracy,
which defines a new state-of-the-art in small models for image classification.
We find this result particularly interesting, as we originally expected
distillation to be necessary to transfer the knowledge of the larger
network pretrained on a large corpus of unlabeled images.
However, our results show that at least part of this knowledge is retained through
the initialization and structure that low-error clustering imposes on the compressed network.

Ablation study:
Perm. SR-C Adam Acc.
62.29 1.02
62.55 0.76
62.92 0.39
63.31 0.00
Table 3:
Ablation study.
ResNet18 on ImageNet w/large blocks.

In Table 3, we show results for ResNet-18 using large blocks,
for which we obtain a final accuracy of 63.31%.
We add permutation optimization (Sec. 3.2.1),
annealed -means, as opposed to plain -means (called SR-C in Sec. 3.2.2),
and the use of the Adam optimizer with cosine annealing instead of plain SGD, as in previous work.
From the Table, we can see that all our components are important and complementary to achieve top accuracy.
It is also interesting to note that a baseline that simply does -means and SGD fine-tuning
is already 1% better than the current state-of-the-art.
Since both annealed -means and permutation optimization directly reduce quantization error before fine-tuning,
these experiments demonstrate that minimizing the quantization error of the weights leads to higher final network accuracy.

4.2 Object Detection and Segmentation

Size Ratio APbb AP AP APmk AP AP
RetinaNet [32] (uncompressed) 145.00 MB 35.6
Direct 18.13 MB 8.0 31.5
FQN [31] 18.13 MB 8.0 32.5 51.5 34.7
HAWQ-V2 [7] 17.9 MB 8.1 34.8
Mask-RCNN R-50 FPN [19] (uncompressed) 169.40 MB 37.9 59.2 41.1 34.6 56.0 36.8
BGD [47] 6.51 MB 26.0 33.9 55.5 36.2 30.8 52.0 32.2
Our PQF 6.65 MB 25.5 36.3 57.9 39.4 33.5 54.7 35.6
Table 4:
Object detection results on MS COCO 2017.
We compress a Mask R-CNN network with a ResNet-50 backbone, and include different object detection architectures used by other baselines.
We report both bounding box (bb) and mask (mk) metrics for Mask R-CNN.
We also report the accuracy at different IoU when available.
Ratio APbb APmk
Mask-RCNN R-50 FPN [19] 37.9 34.6
BGD [47] 26.0 33.9 30.8
Our PQF (no perm., no SR-C)

35.6 33.0
Our PQF (no perm.) 35.8 33.1
Our PQF (full) 36.3 33.5
Table 5:
Ablation results results on MS COCO 2017. Permutation optimization is particularly important for Mask-RCNN

We also benchmark our method on the task of object detection
by compressing the popular ResNet-50 Mask-RCNN FPN architecture [19] using the
Microsoft COCO 2017 Dataset [33].
As before, we start from the pretrained model available on the PyTorch model zoo, and apply the same procedure
described above for all the convolutional and linear layers.
In this case, one of the headers contains a deconvolutional layer, which we
treat as a convolutional layer for the purpose of compression.
We use the small blocks regime with centroids,
which results in a model of 6.65 MB.

We compress and fine-tune the network on a single Nvidia GTX 1080Ti GPU
with a batch size of 2 for just 4 epochs. As before, we use Adam [26] and cosine annealing [36],
but with an initial learning rate of .
Our results are presented in Table 4.
We also compare against recent baselines such as the Fully Quantized Network (FQN) [31],
and the second version of Hessian Aware Quantization (HAWQ-V2) [7],
which showcase results compressing RetinaNet [32].

Our method obtains a box AP of 36.3, and a mask AP of 33.5,
which represent improvements of 2.4% and 2.7% over the best previously reported result,
closing the gap to the uncompressed model by 60-70%.
Compared to BGD [47], we also use fewer computational resources,
as they used 8 V100 GPUs and distributed training for compression,
while we use a single 1080Ti GPU.
Based on our inspection, our models have the same code and codebook dimensions for each layer.
However, we were unable to reproduce the 6.51 MB model size they report,
and instead obtain a slightly larger 6.65 MB.
On Table 5, we show again that using both SR-C and permutation optimization is crucial to obtain the best results.
These results demonstrate the ability of our method to generalize to more complex tasks beyond image classification.

5 Conclusion and Future Work

We have demonstrated that the quantization error of the weights of a neural network is inversely correlated with
its accuracy after codebook fine tuning.
We have further proposed a method that exploits the functional equivalence of the network under permutation
of its weights to find configurations of the weights that are easier to quantize.
We have also shown that using an annealed -means algorithm further reduces quantization error (and improves final network accuracy).
Put together, our contributions result in more accurate models compared to previous work.
When compressing the ResNet-50 architecture, our method closes the relative gap to the uncompressed model by 40-60%
compared to the previous state-of-the-art in classification,
and by 60-70% on detection and instance segmentation.

Our optimization method consists of three stages that focus on different variables of the encoding.
Future work may focus on techniques that jointly fine-tune the codes and the codebooks,
or optimization methods that learn the weight permutation jointly with the codes and codebook.
The determinant of the covariance of the weights is a continuous metric that could be minimized as the
network is trained from scratch, resulting in networks that are easier to compress by design.




  1. An Mei Chen, Haw-minn Lu, and Robert Hecht-Nielsen.

    On the geometry of feedforward neural network error surfaces.

    Neural computation, 5(6):910–927, 1993.

  2. Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang.

    A survey on model compression and acceleration for deep neural

    CoRR, 2017.

  3. Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv, and Yoshua

    Binarized neural networks: Training deep neural networks with weights
    and activations constrained to + 1 or – 1.

    arXiv preprint arXiv:1602.02830, 2016.

  4. Misha Denil, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando de

    Predicting parameters in deep learning.

    In Advances in neural information processing systems, 2013.

  5. Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus.

    Exploiting linear structure within convolutional networks for
    efficient evaluation.

    In Advances in neural information processing systems, 2014.

  6. Xin Dong, Shangyu Chen, and Sinno Pan.

    Learning to prune deep neural networks via layer-wise optimal brain

    In Advances in Neural Information Processing Systems, 2017.

  7. Zhen Dong, Zhewei Yao, Yaohui Cai, Daiyaan Arfreen, Amir Gholami, Michael W.
    Mahoney, and Kurt Keutzer.

    HAWQ-V2: Hessian aware trace-weighted quantization of neural

    arXiv preprint arXiv:1911.03852, 2019.

  8. Zhen Dong, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer.

    Hessian-aware quantization of neural networks with mixed precision.

    In ICCV, 2019.

  9. Angela Fan, Pierre Stock, Benjamin Graham, Edouard Grave, Rémi Gribonval,
    Hervé Jégou, and Armand Joulin.

    Training with quantization noise for extreme model compression.

    arXiv Prepr. arXiv2004, 2020.

  10. Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun.

    Optimized product quantization for approximate nearest neighbor

    In CVPR, 2013.

  11. Allen Gersho and Robert M Gray.

    Vector quantization and signal compression, chapter 8, pages

    Springer Science & Business Media, 1991.

  12. Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev.

    Compressing deep convolutional networks using vector quantization.

    arXiv preprint arXiv:1412.6115, 2014.

  13. Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz
    Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He.

    Accurate, large minibatch sgd: Training imagenet in 1 hour.

    arXiv preprint arXiv:1706.02677, 2017.

  14. Yunhui Guo.

    A survey on methods and theories of quantized neural networks.

    arXiv preprint arXiv:1808.04752, 2018.

  15. Yiwen Guo, Anbang Yao, and Yurong Chen.

    Dynamic network surgery for efficient DNNs.

    In Advances In Neural Information Processing Systems, 2016.

  16. Song Han, Huizi Mao, and William J Dally.

    Deep compression: Compressing deep neural networks with pruning,
    trained quantization and Huffman coding.

    arXiv preprint arXiv:1510.00149, 2015.

  17. Song Han, Jeff Pool, John Tran, and William Dally.

    Learning both weights and connections for efficient neural network.

    In Advances in neural information processing systems, 2015.

  18. Babak Hassibi and David G Stork.

    Second order derivatives for network pruning: Optimal brain surgeon.

    In Advances in neural information processing systems, 1993.

  19. Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick.

    Mask R-CNN.

    In ICCV, 2017.

  20. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.

    Deep residual learning for image recognition.

    In CVPR, 2016.

  21. Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han.

    AMC: AutoML for model compression and acceleration on mobile

    In ECCV, 2018.

  22. Yihui He, Xiangyu Zhang, and Jian Sun.

    Channel pruning for accelerating very deep neural networks.

    In ICCV, 2017.

  23. Holger H Hoos and Thomas Stützle.

    Stochastic local search: Foundations and applications.

    Elsevier, 2004.

  24. Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman.

    Speeding up convolutional neural networks with low rank expansions.

    arXiv preprint arXiv:1405.3866, 2014.

  25. Hervé Jégou, Matthijs Douze, and Cordelia Schmid.

    Product quantization for nearest neighbor search.

    IEEE transactions on pattern analysis and machine intelligence,
    33(1), 2010.

  26. Diederik P Kingma and Jimmy Ba.

    Adam: A method for stochastic optimization.

    arXiv preprint arXiv:1412.6980, 2014.

  27. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.

    Imagenet classification with deep convolutional neural networks.

    In Advances in neural information processing systems, 2012.

  28. Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor

    Speeding-up convolutional neural networks using fine-tuned

    arXiv preprint arXiv:1412.6553, 2014.

  29. Yann LeCun, John S Denker, and Sara A Solla.

    Optimal brain damage.

    In Advances in neural information processing systems, 1990.

  30. Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf.

    Pruning filters for efficient convnets.

    arXiv preprint arXiv:1608.08710, 2016.

  31. Rundong Li, Yan Wang, Feng Liang, Hongwei Qin, Junjie Yan, and Rui Fan.

    Fully quantized network for object detection.

    In CVPR, 2019.

  32. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Dollár.

    Focal loss for dense object detection.

    In ICCV, pages 2980–2988, 2017.

  33. Tsung-Yi Lin, Michael Maire, Serge Belognie, James Hays, Pietro Perona, Deva
    Ramanan, Piotr Dollár, and C Lawrence Zitnich.

    Microsoft COCO: Common objects in context.

    In ECCV, 2014.

  34. Xiaofan Lin, Cong Zhao, and Wei Pan.

    Towards accurate binary convolutional neural network.

    In Advances in Neural Information Processing Systems, 2017.

  35. Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui

    Learning efficient convolutional networks through network slimming.

    In ICCV, 2017.

  36. Ilya Loshchilov and Frank Hutter.

    Sgdr: Stochastic gradient descent with warm restarts.

    arXiv preprint arXiv:1608.03983, 2016.

  37. Jian-Hao Luo, Jianxin Wu, and Weiyao Lin.

    Thinet: A filter level pruning method for deep neural network

    In ICCV, 2017.

  38. Brais Martinez, Jing Yang, Adrian Bulat, and Georgios Tzimiropoulos.

    Training binary neural networks with real-to-binary convolutions.

    In ICLR, 2020.

  39. Julieta Martinez, Shobhit Zakhmi, Holger H Hoos, and James J Little.

    LSQ++: Lower running time and higher recall in multi-codebook

    In Proceedings of the European Conference on Computer Vision
    , pages 491–506, 2018.

  40. James O’ Neill.

    An overview of neural network compression.

    arXiv preprint arXiv:2006.03669, 2020.

  41. Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov.

    Tensorizing neural networks.

    In Advances in Neural Information Processing Systems, 2015.

  42. A Emin Orhan and Xaq Pitkow.

    Skip connections eliminate singularities.

    In ICLR, 2018.

  43. Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi.

    XNOR-net: Imagenet classification using binary convolutional neural

    In ECCV, pages 525–542. Springer, 2016.

  44. Oran Shayer, Dan Levi, and Ethan Fetaya.

    Learning discrete weights using the local reparameterization trick.

    arXiv preprint arXiv:1710.07739, 2017.

  45. Karen Simonyan and Andrew Zisserman.

    Very deep convolutional networks for large-scale image recognition.

    arXiv preprint arXiv:1409.1556, 2014.

  46. Sanghyun Son, Seungjun Nah, and Kyoung Mu Lee.

    Clustering convolutional kernels to compress deep neural networks.

    In Proceedings of the European Conference on Computer Vision
    , pages 216–232, 2018.

  47. Pierre Stock, Armand Joulin, Rémi Gribonval, Benjamin Graham, and Hervé

    And the bit goes down: Revisiting the quantization of neural

    In ICLR, 2020.

  48. Bart Thomee, David A Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni,
    Douglas Poland, Damian Borth, and Li-Jia Li.

    YFCC100M: The new data in multimedia research.

    Communications of the ACM, 59(2):64–73.

  49. Tijmen Tieleman and Geoffrey Hinton.

    Lecture 6.5-rmsprop: Divide the gradient by a running average of its
    recent magnitude.

    COURSERA: Neural networks for machine learning, 4(2):26–31,

  50. Frederick Tung and Greg Mori.

    Deep neural network compression by in-parallel pruning-quantization.

    IEEE Transactions on Patten analysis and Machine Intelligence,

  51. Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han.

    HAQ: Hardware-aware automated quantization with mixed precision.

    In CVPR, 2019.

  52. Ziwei Wang, Jiwen Lu, Chenxin Tao, Jie Zhou, and Qi Tian.

    Learning channel-wise interactions for binary convolutional neural

    In Proceedings of the IEEE Conference on Computer Vision and
    Pattern Recognition
    , pages 568–577, 2019.

  53. Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng.

    Quantized convolutional neural networks for mobile devices.

    In CVPR, 2016.

  54. I Zeki Yalniz, Hervé Jégou, Kan Chen, Manohar Paluri, and Dhruv

    Billion-scale semi-supervised learning for image classification.

    arXiv preprint arXiv:1905.00546, 2019.

  55. Kenneth Zeger, Jacques Vaisey, Allen Gersho, et al.

    Globally optimal vector quantizer design by stochastic relaxation.

    IEEE Transactions on Signal Processing, 40(2):310–322, 1992.

  56. Chenzhuo Zhu, Song Han, Huizi Mao, and William J Dally.

    Trained ternary quantization.

    arXiv preprint arXiv:1612.01064, 2016.

CSIT FUN , 版权所有丨如未注明 , 均为原创丨本网站采用BY-NC-SA协议进行授权
转载请注明原文链接:Permute, Quantize, and Fine-tune: Efficient Compression of Neural Networks
喜欢 (0)
分享 (0)
表情 贴图 加粗 删除线 居中 斜体 签到


  • 昵称 (必填)
  • 邮箱 (必填)
  • 网址