Fast Sparse ConvNets
Abstract
Historically, the pursuit of efficient inference has been one of the driving forces behind research into new deep learning architectures and building blocks. Some recent examples include: the squeeze-and-excitation module [17], depthwise separable convolutions in Xception [4], and the inverted bottleneck in MobileNet v2 [36]. Notably, in all of these cases, the resulting building blocks enabled not only higher efficiency, but also higher accuracy, and found wide adoption in the field. In this work, we further expand the arsenal of efficient building blocks for neural network architectures; but instead of combining standard primitives (such as convolution), we advocate for the replacement of these dense primitives with their sparse counterparts. While the idea of using sparsity to decrease the parameter count is not new [43], the conventional wisdom is that this reduction in theoretical FLOPs does not translate into real-world efficiency gains. We aim to correct this misconception by introducing a family of efficient sparse kernels for ARM and WebAssembly, which we open-source for the benefit of the community as part of the XNNPACK [8] library. Equipped with our efficient implementation of sparse primitives, we show that sparse versions of MobileNet v1, MobileNet v2 and EfficientNet architectures substantially outperform strong dense baselines on the efficiency-accuracy curve. On a Snapdragon 835 our sparse networks outperform their dense equivalents by a margin equivalent to approximately one entire generation of MobileNet-family improvement. We hope that our findings will facilitate wider adoption of sparsity as a tool for creating efficient and accurate deep learning architectures.
1 Introduction
Convolutional neural networks (CNNs) have proven to be excellent at solving
a diverse range of tasks [2]. Standard network architectures are used in classification,
segmentation, object detection and generation tasks [32, 28, 48]. Given their wide utility, there has been significant effort to design
efficient architectures that are capable of being run on mobile and other
low power devices while still achieving high classification accuracy on
benchmarks such as ImageNet [35].
For example, MobileNets [16, 36] employ the depthwise separable convolutions introduced by Sifre in [37] to significantly reduce resource requirements over
previous architectures. Inference time, FLOPs and parameter counts in these architectures
are dominated by the 1×1 convolutions, which directly map to matrix-matrix multiplications.
Weight sparsity is generally known to lead [3] to theoretically smaller and more computationally efficient (in terms of number of floating-point operations)
models, but it is often disregarded
as a practical means of accelerating models because of the misconception that sparse operations cannot be fast enough to achieve actual speedups during inference. To address this misconception and firmly establish sparsity as a tool in the deep learning practitioner’s arsenal, we introduce fast kernels for
sparse matrix-dense matrix multiplication (SpMM) specifically targeted at the acceleration of sparse neural networks. The main distinction of our SpMM kernel from prior art [31, 46] is that we focus on a different point in the design space. While prior work focused on extremely sparse problems (typically >99%, found in scientific and graph problems), we target the sparsity range of 70-95%, more common when inducing weight sparsity in neural networks.
As a result our kernels significantly outperform the kernels generated by the TACO compiler [24] and the Intel MKL [19].
Using these kernels, we demonstrate the effectiveness of weight sparsity across three generations of
MobileNet [16, 36, 41] architectures. Sparsity leads to an improvement of approximately one whole generation in each architecture, with sparse EfficientNets being significantly more efficient than all previous models. These models represent a new generation of efficient CNNs, reducing inference times, parameter counts and the number of floating-point operations (FLOPs) relative to the previous generations.
2 Related Work
Improvements in convolutional network architectures [25, 38, 13, 18], as measured by increased
classification accuracy on benchmark tasks such as ImageNet [35],
have generally been concomitant with increases in model parameter counts, FLOPs
and memory requirements. Recently this evolution has led to networks found through neural architecture
search [51, 34] which can achieve over 82% top-1 accuracy, but require
nearly 25 GFLOPs for one inference.
Given these prohibitive inference costs, there have been many lines of work attempting
to improve CNN efficiency, which is often defined as one of three metrics:
(1) Wall-clock inference time on target hardware.
(2) Number of floating-point operations (FLOPs).
(3) Number of parameters (model size).
These axes are neither parallel nor orthogonal. The effect of (3) and (2) on (1) in particular can be quite complicated and highly varied depending on the hardware in question.
The MobileNet family of architectures [16, 36]
has focused on improving efficiency by taking advantage of the depthwise separable convolutions introduced
in [37], which can be thought of as a handcrafted sparsification
of full convolutions with a predefined sparse topology, and which are responsible for the parameter efficiency of these
architectures. MobileNet v1 (MBv1) used alternating 1×1 convolutions and depthwise convolutions. MobileNet v2 (MBv2) introduced the inverted residual block, which consists of a 1×1 convolution expanding the channel count, a depthwise convolution on the expanded channel count, and a 1×1 convolution reducing the channel count again. Across MobileNet architectures, the depthwise convolutions account for only a small fraction of the total FLOPs, parameters, and inference time of these models: in MBv1 they account for less than 2% of the total FLOPs, and in MBv2 less than 3%.
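For concreteness, the sketch below gives a minimal Keras-style rendering of the MBv2 inverted residual block (the ReLU6/batch-norm placement follows the commonly used formulation; the channel counts in the usage example are illustrative, and this is not the original authors' code):

```python
import tensorflow as tf

def inverted_residual(x, expansion=6, out_channels=64, stride=1):
    """Sketch of an MBv2-style inverted residual block:
    1x1 expand -> 3x3 depthwise -> 1x1 project (linear)."""
    in_channels = x.shape[-1]
    h = tf.keras.layers.Conv2D(expansion * in_channels, 1, use_bias=False)(x)  # 1x1 expand
    h = tf.keras.layers.BatchNormalization()(h)
    h = tf.keras.layers.ReLU(6.0)(h)
    h = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = tf.keras.layers.BatchNormalization()(h)
    h = tf.keras.layers.ReLU(6.0)(h)
    h = tf.keras.layers.Conv2D(out_channels, 1, use_bias=False)(h)  # 1x1 project
    h = tf.keras.layers.BatchNormalization()(h)
    if stride == 1 and in_channels == out_channels:
        h = tf.keras.layers.Add()([x, h])  # residual connection
    return h

# Illustrative usage: a 56x56 feature map with 24 channels.
inputs = tf.keras.Input((56, 56, 24))
outputs = inverted_residual(inputs, expansion=6, out_channels=24)
```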
A different line of work attempted to make more efficient CNNs by directly
pruning the weights of full convolutional filters accompanied by the necessary
inference kernels [33, 26]. [33] was not able to accelerate convolutions, [26] did not attempt it. The latter also required generating a new set of kernels for each instance of a model, which is often impractical for deployment. Due to the difficultly of accelerating sparse computation, channel pruning approaches have been preferred [11, 5, 30, 29, 42, 14]. These approaches prune away entire filters leaving the final model dense, and function more as an architecture search over channel counts.
Full Neural Architecture Search has also been applied directly to architectures resembling MBv2 resulting in MobileNet v3 [40],
FBNet [45], and EfficientNet [41].
Alternatively, factorizations of the 1×1 convolutions have been considered
in ShuffleNet [47] and Learnable Butterfly Factorizations [6].
ShuffleNet factorizes the weight matrix into a product of a permutation matrix
and a block-diagonal matrix. Butterfly Factorizations factorize the weight matrix into
a sequence of permutation matrices and weight matrices with special structure that
can represent many common transforms such as Fast Fourier Transforms.
Work in Text-to-Speech (TTS) [23] demonstrated that increasing sparsity, and the concomitant
increase in state size in RNN models, leads to increased model quality for a given non-zero parameter count. They additionally demonstrated fast block-sparse matrix-vector (SpMV) multiplication routines necessary for RNN inference.
3 Methods
To understand how to design the most efficient convolutional models, we investigate both how to construct and train sparse MBv1, MBv2 and EfficientNet models and also the performance of our SpMM kernels.
3.1 Sparsifying Networks
We train on the ImageNet [35] dataset with standard augmentation and report top-1 accuracies on the provided 50,000-image validation set. To make the networks sparse we use the gradual magnitude pruning technique of [49].
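As a minimal NumPy sketch of gradual magnitude pruning in the spirit of [49] (the cubic ramp follows that work; function and hyperparameter names are illustrative, and the actual implementation is the TensorFlow model-pruning library):

```python
import numpy as np

def target_sparsity(step, begin_step, end_step, final_sparsity, initial_sparsity=0.0):
    """Cubic schedule ramping from initial to final sparsity between begin_step and end_step."""
    if step < begin_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - begin_step) / (end_step - begin_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

def update_mask(weights, sparsity):
    """Zero out the smallest-magnitude weights so that `sparsity` fraction is removed."""
    k = int(round(sparsity * weights.size))
    if k == 0:
        return np.ones_like(weights)
    threshold = np.sort(np.abs(weights).ravel())[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

# Every pruning step the mask is recomputed at the current target sparsity
# and applied; training then continues on the masked weights.
w = np.random.randn(256, 512)
mask = update_mask(w, target_sparsity(step=50_000, begin_step=20_000,
                                      end_step=100_000, final_sparsity=0.9))
w_sparse = w * mask
```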
We do not prune the first full convolution at the beginning of all three networks. Its overall contribution to the parameter count, FLOP count, and runtime is small and does not warrant introducing a new sparse operation. Instead, we implement a dense convolutional kernel which takes as input the image in the standard HWC layout and outputs the CHW layout consumed by the sparse operations in the rest of the network. In HWC layout, the values for different channels corresponding to one spatial location are adjacent in memory. In CHW layout, the values of all the spatial locations for one channel are adjacent in memory.
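To illustrate the two layouts, a small NumPy sketch follows (a reference transposition only; as described above, the library fuses this conversion into the initial dense convolution):

```python
import numpy as np

# HWC: channel values for one spatial location are contiguous.
# CHW: all spatial locations of one channel are contiguous.
H, W, C = 112, 112, 32
activations_hwc = np.arange(H * W * C, dtype=np.float32).reshape(H, W, C)

# Convert HWC -> CHW, i.e. the layout consumed by the sparse 1x1 kernels.
activations_chw = np.ascontiguousarray(activations_hwc.transpose(2, 0, 1))

# Same value, addressed as [c, h, w] instead of [h, w, c].
assert activations_chw[5, 3, 7] == activations_hwc[3, 7, 5]
```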
We also do not prune the squeeze-and-excitation [17] blocks in EfficientNet as they contribute <1% of the total FLOPs to the dense model. The last fully-connected layer in all models also contributes insignificantly (<1%) to the total FLOP count, but does contribute a significant fraction (20-50%) of total parameters, especially after the rest of the model is
pruned. As we are concerned with maximizing top-1 accuracy for a given runtime, we do not prune the final layer in MobileNet v1 and v2, as doing so leads to a small decrease in top-1 accuracy. Standard EfficientNets do not scale the number of filters in the last convolution with the width of the model; however, we find that when introducing sparsity it is beneficial to do so, and in all sparse EfficientNet models we double the units from 1280 to 2560. We also find that it is possible to make the fully-connected layer sparse without loss of accuracy in EfficientNet, so we do so.
3.2 Kernel Implementation
A diagram of the 1×1 convolution as an SpMM is shown in figure 2. Our scheme requires activation tensors to be stored in CHW format, in contrast to dense mobile inference libraries [21, 7, 22], which favor HWC.
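In CHW layout a 1×1 convolution is exactly a product between the sparse C_out × C_in weight matrix and the dense C_in × (H·W) activation matrix. A SciPy/NumPy sketch of this view follows (a reference computation under assumed shapes, not the optimized ARM kernel):

```python
import numpy as np
from scipy import sparse

C_in, C_out, H, W = 64, 128, 56, 56

# Sparse 1x1-convolution weights: one row per output channel, ~90% zeros.
dense_w = np.random.randn(C_out, C_in).astype(np.float32)
dense_w[np.random.rand(C_out, C_in) < 0.9] = 0.0
w_csr = sparse.csr_matrix(dense_w)

# Activations in CHW layout, viewed as a C_in x (H*W) dense matrix.
x_chw = np.random.randn(C_in, H, W).astype(np.float32)
x_mat = x_chw.reshape(C_in, H * W)

# The 1x1 convolution is a sparse-matrix x dense-matrix product (SpMM).
y_mat = w_csr @ x_mat                 # shape (C_out, H*W)
y_chw = y_mat.reshape(C_out, H, W)    # output stays in CHW layout
```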
There are three key insights enabling the high performance of our kernels:

While the weight matrix is sparse, the activation matrix is dense. This means that we can perform vector loads from the activation matrix and process multiple spatial locations simultaneously.

By processing the matrix in the right order we can keep values that will be randomly accessed in the L1 cache, from which random access is fast and constant time.

When the number of input channels is small enough, prefetching from the activations can further reduce cache misses.
Figure 3 shows the memory read and write patterns over a few steps of the kernel. The figure shows 8 elements being processed together for visualization, but 16 is more natural for the actual implementation as it corresponds to one cache line. The outer loop is over columns and the inner loop is over rows; this allows each strip of 16 spatial locations in the activations to remain in the L1 cache until it is no longer needed. In figure 3, steps 1 and 2 prime the cache, while subsequent steps 3 and 4 load all right-hand-side values from the L1 cache.
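The loop structure below sketches this ordering in plain Python for a CSR weight matrix (scalar and unoptimized, for exposition only): the outer loop walks over strips of spatial locations and the inner loop over sparse rows, so each strip stays resident in the L1 cache while every nonzero that needs it is processed.

```python
import numpy as np

def spmm_strips(values, col_idx, row_ptr, rhs, strip=16):
    """Reference CSR SpMM processing the dense right-hand side in strips of
    `strip` columns (spatial locations), mirroring the kernel's loop order."""
    n_rows = len(row_ptr) - 1
    n_cols = rhs.shape[1]
    out = np.zeros((n_rows, n_cols), dtype=rhs.dtype)
    for c0 in range(0, n_cols, strip):                  # outer loop: spatial strip
        c1 = min(c0 + strip, n_cols)
        for row in range(n_rows):                       # inner loop: sparse rows
            acc = np.zeros(c1 - c0, dtype=rhs.dtype)
            for j in range(row_ptr[row], row_ptr[row + 1]):
                # Random accesses into rhs stay inside the current strip, which
                # fits in L1; the real kernel does this with NEON vector loads.
                acc += values[j] * rhs[col_idx[j], c0:c1]
            out[row, c0:c1] = acc
    return out
```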
Table 1: Top-1 accuracy, FLOPs, parameters, and measured single-core inference times (ms) for dense and sparse models.

Model | Variant | Width | Top-1 | MFLOPs | MParams | Time SD835 | Time SD670 | Time Wasm
MBv1 | Dense | 1.0 | 70.9 | 1120 | 4.30 | 125 | 106 | 271
MBv1 | Sparse | 1.4 | 72.0 | 268 | 2.28 | 58 | 64 | 97
MBv1 | Dense | 0.75 | 68.4 | 636 | 2.59 | 73 | 64 | 170
MBv1 | Sparse | 1.0 | 68.4 | 146 | 1.48 | 31 | 34 | 56
MBv1 | Dense | 0.5 | 63.3 | 290 | 1.34 | 36 | 33 | 96
MBv1 | Sparse | 0.75 | 64.4 | 90 | 1.30 | 21 | 21 | 36
MBv2 | Dense | 1.4 | 75.0 | 1110 | 6.06 | 150 | 129 | 319
MBv2 | Sparse | 2.0 | 74.5 | 406 | 4.24 | 93 | 91 | 155
MBv2 | Sparse* | 1.8 | 74.9 | 411 | 4.13 | 102 | 108 | 155
MBv2 | Dense | 1.0 | 71.8 | 580 | 3.47 | 83 | 74 | 197
MBv2 | Sparse | 1.4 | 72.0 | 220 | 2.68 | 54 | 53 | 95
MBv2 | Dense | 0.75 | 69.8 | 375 | 2.61 | 64 | 57 | 154
MBv2 | Sparse | 1.15 | 70.2 | 165 | 2.11 | 40 | 39 | 74
MBv2 | CA Sparse† | 1.0 | 69.7 | 119 | 1.73 | 33 | 35 | –
MBv2 | Dense | 0.5 | 65.4 | 182 | 2.05 | 33 | 30 | 92
MBv2 | Sparse | 0.80 | 65.2 | 90 | 1.66 | 26 | 24 | 41
EN | Dense | b0 | 76.8 | 730 | 5.28 | 158 | 148 | –
EN | Sparse | b1 | 76.7 | 220 | 3.07 | 110 | 118 | –
*This model is 80% sparse in all layers (except the final fully-connected layer) and uses a block size of 1 everywhere.
†This is the cache-aware MBv2 architecture described in section 4.4. It uses a block size of 1 throughout.
We also tried WebDNN [15], but it does not support some operations necessary to run MobileNet and EfficientNet models.
In addition to vectorization in the spatial dimension, taking advantage of small amounts of structure in the weight matrix can offer significant performance boosts by increasing data reuse after values are loaded into registers. Constraining the sparsity pattern so that multiple output or input channels all share the same zero/nonzero pattern creates "blocks" in the weight matrix (see figure 3, right). Blocks in the output channel dimension allow for more data reuse than blocks in the input channel dimension. Experiments (see figure 5) show that either choice has the same effect on accuracy, so we implement output channel blocking with block sizes of 2 and 4. Our nomenclature for kernels is to give their spatial vectorization width followed by the output channel block size - 16x2 means 16 pixels and 2 output channels are processed in the inner loop.
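A small NumPy sketch of constraining the pruning mask so that groups of output channels share a zero/nonzero pattern follows (the L1-norm scoring rule and function names are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def block_mask_output_channels(weights, sparsity, block=4):
    """Prune a (C_out, C_in) 1x1-convolution weight matrix so that groups of
    `block` consecutive output channels share the same zero/nonzero pattern."""
    c_out, c_in = weights.shape
    assert c_out % block == 0
    # Score each (output-channel block, input channel) pair by its L1 norm.
    scores = np.abs(weights).reshape(c_out // block, block, c_in).sum(axis=1)
    k = int(round(sparsity * scores.size))
    threshold = np.sort(scores.ravel())[k - 1] if k > 0 else -np.inf
    block_mask = (scores > threshold).astype(weights.dtype)  # (c_out/block, c_in)
    # Broadcast the block decision to every output channel in the block.
    return np.repeat(block_mask, block, axis=0)               # (c_out, c_in)

w = np.random.randn(128, 64)
mask = block_mask_output_channels(w, sparsity=0.9, block=4)
w_blocked = w * mask   # rows 0-3, 4-7, ... now share a sparsity pattern
```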
3.3 Library
We will provide a library that can run sparse models trained with the model pruning library in TensorFlow [1]. This includes conversion from a dense representation to a Block Compressed Sparse Row (BCSR)-like representation suitable for inference. In addition to the high performance convolutions, we also provide all supporting CHW kernels - depthwise convolutions, global average pooling and a stride-2 dense convolution that takes the standard HWC format as input and outputs CHW - necessary for running all three generations of models. While we provide high performance versions of these kernels, we do not detail them here; they are also available as part of XNNPACK [8]. Times for these kernels are included in end-to-end measurements.
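For illustration, a minimal sketch of converting a dense, output-channel-blocked weight matrix into a BCSR-like form (field names and layout are assumptions for exposition, not the library's actual packing):

```python
import numpy as np

def dense_to_bcsr(weights, block=4):
    """Convert a (C_out, C_in) matrix whose output channels are pruned in groups
    of `block` into BCSR-like arrays: block_row_ptr, col_indices and a flat value
    array where each nonzero column contributes `block` contiguous values."""
    c_out, c_in = weights.shape
    assert c_out % block == 0
    block_row_ptr, col_indices, values = [0], [], []
    for br in range(c_out // block):
        rows = weights[br * block:(br + 1) * block]            # (block, C_in)
        nz_cols = np.nonzero(np.any(rows != 0, axis=0))[0]     # shared pattern
        for col in nz_cols:
            col_indices.append(int(col))
            values.extend(rows[:, col].tolist())               # keep block values together
        block_row_ptr.append(len(col_indices))
    return (np.asarray(block_row_ptr, np.int32),
            np.asarray(col_indices, np.int32),
            np.asarray(values, np.float32))
```

Keeping the values of one block contiguous mirrors the reuse pattern of the kernel: once a column of activations is loaded, it is multiplied against all output channels in the block before moving on.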
3.4 WebAssembly
In some settings it is useful to run inference (and even training) of deep neural networks directly in web browsers. This is now supported by multiple frameworks, including WebDNN [15], TensorFlow.js [39] and the WebML polyfill [20]. Frameworks generally support using WebAssembly (Wasm) [12] to run on CPUs or WebGL to run on GPUs. In this work we target WebAssembly backends and show that sparse vision networks significantly outperform their dense counterparts in this setting. Our kernel decomposition strategy is similar; however, due to the current lack of vectorization support in Wasm, the kernels are implemented with scalar instructions. Due to the smaller register file, we find that unrolling by 8 instead of 16 is optimal. Scalar versions of all kernels are also provided as part of XNNPACK [8].
4 Results
Due to space limitations, the main text includes plots mainly for MBv1 and MBv2. EfficientNets generally follow the same trends as the MBv2 models; plots for EfficientNet can be found in the supplementary material. Table 1 contains the main results, comparing inference times for models achieving approximately the same top-1 accuracy.
To break down the end-to-end results, we first present performance results for our SpMM kernels, then show how the networks respond to sparsity, and finally how we combined this information to find the models with the lowest inference time.
4.1 ARM Kernel Performance
We use Ruy [22], the current TensorFlow Lite ARM64 backend written largely in hand-coded assembly, as the dense baseline. For a sparse baseline we use the kernel generated by the TACO compiler [24]. We present results by plotting the FLOPs achieved at each layer in the model, with depth increasing to the right, in figure 4. For MBv1 we use a width multiplier of 1.4 at 90% sparsity and for MBv2 a width multiplier of 1.4 at 85% sparsity, as these configurations approximately match the top-1 accuracy of the width-1.0 dense models. The kernel variants that process 16 spatial locations at a time (16x1, 16x2, etc.) are the highest performing, and all reported numbers are from these kernel variants. TACO results should be compared with the 16x1 kernels.
The raw performance of the sparse kernels falls in the range of 40-90% of the dense kernels, and, as they must do much less work, the effective FLOP throughput when taking the sparsity of the layer into account is in the 2-7× range relative to the dense kernels. In MBv1 performance falls significantly in the last two layers of the model, where the number of channels (1024) causes the size of one "strip" of spatial locations to exceed the size of the L1 cache. In MBv2 the sawtooth pattern is caused by the alternating expand and contract operations. Performance is higher for the expand kernels due to greater data reuse of each "strip" that is brought into the L1 cache. The performance of the contract operations drops significantly once the number of expanded channels in each residual block exceeds the size of the L1 cache; a smaller drop is observed when it exceeds half of the L1 cache, as this prevents effective prefetching.
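To make the "effective FLOPs" comparison concrete, a small sketch of the arithmetic follows, under the interpretation that the sparse kernel is credited with the FLOPs the dense kernel would have needed (the layer shape and runtime below are made-up placeholders, not measurements):

```python
# Effective throughput credits the sparse kernel with dense-equivalent work,
# since the sparse kernel simply skips the zeros.
def effective_gflops(dense_equivalent_flops, runtime_s):
    return dense_equivalent_flops / runtime_s / 1e9

# Hypothetical 1x1 conv: 512 input and output channels on a 14x14 map.
dense_equivalent_flops = 2 * 512 * 512 * 14 * 14   # multiply-adds counted as 2 FLOPs
sparse_runtime_s = 45e-6                            # placeholder runtime
print(effective_gflops(dense_equivalent_flops, sparse_runtime_s))
```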
4.2 Model Performance
The hyperparameters used to train MBv1 and MBv2 are listed in table 2. They were found with a grid search on dense models at a width multiplier of 1.0 so as to reproduce the original results, which used RMSProp [44], using SGD with momentum instead. The same hyperparameters are used to train the sparse models. This change allows us to match or exceed the reported accuracies with only 38,000 iterations of training.
Table 2: Hyperparameters used to train MBv1 and MBv2.

 | MBv1 | MBv2
learning rate | |
momentum | 0.9 | 0.92
L2 coefficient | 5e-5 | 4e-5
The hyperparameters used to train EfficientNet are largely unmodified from the official code release, with the exception of extending training from 350 to 650 epochs and increasing the learning rate decay exponent from 0.97 to 0.985 so that the learning rate decays more slowly. These changes do not improve the dense baseline.
We induce sparsity in MBv1 and MBv2 by extending training (along with the learning rate anchor points) by a factor of four; we find this increases the performance of the sparse models but not the baselines. We start and stop the sparsification process at fixed iterations with a pruning frequency of 2,000. For EfficientNet we start at iteration 23,000 and end at iteration 105,000, also with a pruning frequency of 2,000. See [49] for the meaning of these hyperparameters.
We train on the ImageNet [35] dataset with standard data augmentation. Top-1 accuracies are reported on the validation set with a single center crop.
To understand the effect of block size, in figure 5 we plot accuracy against FLOPs for different block sizes. In these plots, every sparse tensor in the network uses the same output channel block size. The trade-off for block sparsity appears to depend only on how many elements are in each block, and not on their configuration: in MBv1, for example, curves for block configurations with the same number of elements per block lie on top of one another. The loss in accuracy due to blocking appears to decrease slightly for larger width models.
To understand how the sparsity level affects the efficiency of the models, we train models at 70%, 80% and 90% unstructured sparsity, held constant throughout the model. The results are plotted in figure 6. MBv1 and MBv2 become more efficient the more sparse they are, confirming that the results of [23] hold not just for RNNs but for convolutional models as well.
In figure 1 we plot top-1 accuracy against FLOPs for all three generations of sparse and dense models. MobileNet v1 is 90% sparse; the other models are 80% sparse. A sparse MBv1 exceeds MBv2 in FLOP and parameter efficiency; a sparse MBv2 matches EfficientNet in FLOP and parameter efficiency; and a sparse EfficientNet exceeds all other models in both categories.
4.3 Model Design for Block Size
To design the models with the best top1 accuracy vs. inference time frontiers we make the following assumptions to reduce the search space:

We leave the models themselves unchanged.

We induce the same level of sparsity in all convolutions.
Then, for a model with N residual blocks, we do a search at width multiplier 1.4 over N+1 variants. An x-axis location of n corresponds to a model in which the first n residual blocks are unstructured and the last N−n residual blocks have an output channel block size of 4. We train each model, note its top-1 accuracy and then measure its inference time. From this we calculate the ratio of the inference time reduction, relative to a fully unstructured model, to the top-1 accuracy lost; these ratios are plotted in figure 7. We choose the model with the highest ratio and train models at all widths with this choice. This amounts to making layers 6 and deeper blocked in MBv1 models and layers 11 and deeper blocked in MBv2.
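A short sketch of this selection criterion follows (variable names and data structures are illustrative, not from the released code):

```python
# For each candidate split n (first n residual blocks unstructured, the rest
# blocked with output-channel block size 4), pick the split with the best ratio
# of inference-time reduction to top-1 accuracy lost, both measured relative to
# the fully unstructured model.
def best_split(candidates, unstructured_time_ms, unstructured_top1):
    # candidates: list of (n, inference_time_ms, top1_accuracy)
    def ratio(candidate):
        n, time_ms, top1 = candidate
        time_reduction = (unstructured_time_ms - time_ms) / unstructured_time_ms
        accuracy_lost = max(unstructured_top1 - top1, 1e-6)  # guard against divide-by-zero
        return time_reduction / accuracy_lost
    return max(candidates, key=ratio)
```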
4.4 Cache Aware Model Design
The sawtooth pattern in figure 4 is due to the large number of channels during the contract phase causing the size of a stripe of spatial locations to exceed the size of the L1 cache. One possible solution, common with dense kernels, would be to split the matrix into N pieces such that each piece only accesses a number of channels that fits within the cache. However, this introduces additional complexity in the software - in the sparse case it would require repacking the matrix, which is not required in the dense case. As another possible solution, we examine a simple modification to the MBv2 architecture to make it cache-aware.
The inverted residual block introduced by MBv2 expands the number of channels before the depthwise convolution and then reduces them afterwards; the expansion factor is fixed at 6 for all layers. Once there are more than 512 channels after expansion, the size of the 32 KB L1 cache is exceeded during contraction. Additionally, once there are more than 256 channels after expansion, the data exceeds half of the cache, precluding effective use of prefetching during contraction. To design an architecture that is aware of these constraints, we take MBv2 and increase the expansion factor of early layers while reducing the expansion factor as the number of channels increases, so that the number of expanded channels still fits within approximately half of the cache. To compensate for the decrease in capacity this would otherwise cause, we also increase the depth of the model by adding one more layer with 32 channels, two more with 64 channels and three more each with 96 and 160 channels. The final architecture is given in table 3, and a short sketch of the cache constraint follows the table.
Table 3: Cache-aware MBv2 architecture. e is the expansion factor, c the number of output channels, n the number of repeats and s the stride.

Input | Operation | e | c | n | s
 | conv2d | – | 16 | 1 | 2
 | Bottleneck | 1 | 16 | 1 | 1
 | Bottleneck | 8 | 24 | 2 | 2
 | Bottleneck | 8 | 32 | 4 | 2
 | Bottleneck | 4 | 64 | 6 | 2
 | Bottleneck | 3 | 96 | 6 | 1
 | Bottleneck | 2 | 160 | 6 | 2
 | Bottleneck | 2 | 320 | 1 | 1
 | conv2d 1×1 | – | 1280 | 1 | 1
 | gavgpool | – | – | 1 | –
 | conv2d 1×1 | – | k | 1 | –
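The back-of-the-envelope arithmetic behind the cache constraint from section 4.4 is sketched below (the cache size, strip width and channel thresholds come from the text; float32 activations and the helper function are assumptions):

```python
L1_BYTES = 32 * 1024          # L1 data cache of the Snapdragon big core
STRIP = 16                    # spatial locations processed together by the kernel
BYTES_PER_VALUE = 4           # assuming float32 activations

def strip_bytes(channels):
    """Size of one strip of activations for a given channel count."""
    return channels * STRIP * BYTES_PER_VALUE

# 512 expanded channels fill the whole L1; 256 fill half of it, the threshold
# below which prefetching during contraction remains effective.
print(strip_bytes(512) == L1_BYTES)        # True
print(strip_bytes(256) == L1_BYTES // 2)   # True

def max_expansion(input_channels, budget=L1_BYTES // 2):
    """Largest expansion factor keeping the expanded strip within the budget."""
    return budget // strip_bytes(input_channels)
```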
4.5 Wall-clock Times
Table 1 contains the timings for running our sparse models on a single big core of two different processors, a Snapdragon 835 and a Snapdragon 670. We compare them with MBv1 and MBv2 models from their official repositories [9, 10], run with dense inference in the TensorFlow Lite framework with the standard Ruy backend. The dense EfficientNet-b0 model we trained ourselves using the official code, with a slight modification to make exporting a TFLite model easier. Model files for all sparse models in table 1 are available.
5 Conclusion
We demonstrate that for a constant computational budget, sparse convolutional networks are more accurate than dense ones; this corroborates the findings of [23], which demonstrated that for a fixed number of floating-point operations, sparse RNNs are more accurate than dense RNNs.
We enable the use of weight sparsity to accelerate state-of-the-art convolutional networks by providing fast SpMM kernels, along with all necessary supporting kernels, for ARM processors. On a Snapdragon 835 the sparse networks we present in this paper outperform their dense equivalents in terms of wall-clock time for a given top-1 accuracy while needing only a fraction as many parameters - equivalent to approximately one entire generation of improvement. By overturning the misconception that "sparsity is slow", we hope to open new avenues of research that would previously not be considered.
References

[1] TensorFlow: large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Applications of convolutional neural networks. International Journal of Computer Science and Information Technologies 7(5), pp. 2206-2215, 2016.
[3] A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282, 2017.
[4] Xception: deep learning with depthwise separable convolutions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1800-1807, 2017.
[5] Compressing neural networks using the variational information bottleneck. In Proceedings of the 35th International Conference on Machine Learning, PMLR 80, pp. 1135-1144, 2018.
[6] Learning fast algorithms for linear transforms using butterfly factorizations. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 1517-1527, 2019.
[7] QNNPACK: open source library for optimized mobile deep learning, 2019.
[8] XNNPACK. Google, 2019.
[9] Official MobileNet model repository, 2018.
[10] Official MobileNet model repository, 2018.
[11] MorphNet: fast & simple resource-constrained structure learning of deep networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1586-1595, 2018.
[12] Bringing the web up to speed with WebAssembly. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 185-200, 2017.
[13] Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778, 2016.
[14] AMC: AutoML for model compression and acceleration on mobile devices. In Computer Vision - ECCV 2018, Part VII, pp. 815-832, 2018.
[15] WebDNN: fastest DNN execution framework on web browser. In Proceedings of the 25th ACM International Conference on Multimedia, pp. 1213-1216, 2017.
[16] MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861, 2017.
[17] Squeeze-and-excitation networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7132-7141, 2018.
[18] Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261-2269, 2017.
[19] Intel Math Kernel Library reference manual. Intel Corporation, Santa Clara, USA, 2009.
[20] WebML polyfill, 2019.
[21] gemmlowp: a small self-contained low-precision GEMM library, 2017.
[22] Ruy, 2019.
[23] Efficient neural audio synthesis. In Proceedings of the 35th International Conference on Machine Learning (ICML), pp. 2415-2424, 2018.
[24] The tensor algebra compiler. Proceedings of the ACM on Programming Languages 1 (OOPSLA), Article 77, 2017.
[25] ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097-1105, 2012.
[26] Sparse convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 806-814, 2015.
[27] DARTS: differentiable architecture search. In International Conference on Learning Representations, 2019.
[28] Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.
[29] Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations, 2018.
[30] ThiNet: a filter level pruning method for deep neural network compression. In IEEE International Conference on Computer Vision (ICCV), pp. 5068-5076, 2017.
[31] High-performance sparse matrix-matrix products on Intel KNL and multicore architectures. In Proceedings of the 47th International Conference on Parallel Processing Companion (ICPP '18), pp. 34:1-34:10, 2018.
[32] Recent progress on generative adversarial networks (GANs): a survey. IEEE Access 7, pp. 36322-36333, 2019.
[33] Holistic SparseCNN: forging the trident of accuracy, speed, and size. CoRR abs/1608.01409, 2016.
[34] Regularized evolution for image classifier architecture search. In Proceedings of the AAAI Conference on Artificial Intelligence 33, pp. 4780-4789, 2019.
[35] ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), pp. 211-252, 2015.
[36] MobileNetV2: inverted residuals and linear bottlenecks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4510-4520, 2018.
[37] Rigid-motion scattering for image classification. Ph.D. thesis, Ecole Polytechnique, CMAP, 2014.
[38] Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[39] TensorFlow.js: machine learning for the web and beyond. In SysML Conference, Palo Alto, CA, USA, 2019.
[40] MnasNet: platform-aware neural architecture search for mobile. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2820-2828, 2019.
[41] EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 6105-6114, 2019.
[42] Faster gaze prediction with dense networks and Fisher pruning. CoRR abs/1801.05787, 2018.
[43] Evaluating pruning methods. National Chiao-Tung University, 1995.
[44] Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
[45] FBNet: hardware-aware efficient ConvNet design via differentiable neural architecture search. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[46] Design principles for sparse matrix multiplication on the GPU. In International European Conference on Parallel and Distributed Computing (Euro-Par), 2018.
[47] ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6848-6856, 2018.
[48] Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems, 2019.
[49] To prune, or not to prune: exploring the efficacy of pruning for model compression. In International Conference on Learning Representations, 2018.
[50] Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations (ICLR), Toulon, France, 2017.
[51] Learning transferable architectures for scalable image recognition. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8697-8710, 2018.
Appendix A Comparison with Intel MKL
We also implement our scheme with AVX-512 intrinsics to compare with the Intel MKL. With minimal tuning we find that it achieves a geometric mean speedup of 1.2× across all layers for both MBv1 and MBv2 compared with the MKL. Results can be found in figure 9.
Appendix B Non-Uniform Layer-wise Sparsity with Variational Dropout
Variational Dropout (VD; see Molchanov et al., 2017) is a Bayesian technique for inducing weight (or activation) sparsity in neural networks. It does not outperform pruning (see Gale et al., 2019), but it prunes globally, so it can induce non-uniform distributions of sparsity across layers without manual intervention. Parameter count and FLOP count do not have a 1:1 relationship in convolutional models - removing parameters from early layers, where the spatial dimension is large, reduces the FLOP count by significantly more than removing parameters from later layers, where it is small. Combined with the fact that later layers tend to have many more parameters than early ones, this means that global pruning approaches such as VD, which operate under a parameter constraint, find models with many more FLOPs than techniques such as magnitude pruning, which are generally used to induce near-uniform sparsity across the model, even when both models have the same number of parameters.
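The FLOP/parameter asymmetry can be seen directly from the standard formulas for a convolutional layer (a short illustrative calculation; the feature-map sizes are examples):

```python
# For a k x k convolution with C_in inputs and C_out outputs on an H x W map:
#   parameters = k*k * C_in * C_out
#   FLOPs     ~= 2 * H*W * k*k * C_in * C_out   (multiply-adds counted as 2 FLOPs)
# so FLOPs per parameter ~= 2 * H*W, which is much larger for early layers
# (large spatial extent) than for late ones.
def flops_per_param(h, w):
    return 2 * h * w

print(flops_per_param(112, 112))   # early layer: 25088 FLOPs per parameter
print(flops_per_param(7, 7))       # late layer: 98 FLOPs per parameter
```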
An additional practical difficulty with using VD to prune models is that one cannot specify a desired final sparsity directly. Instead, one must tune the weight of the KL penalty term so that, after training and thresholding, the resulting model has the desired sparsity. This makes finding a model with a desired sparsity a time-consuming process.
Despite these limitations, we nonetheless find it insightful to examine the pattern of layer-wise sparsity found by using VD to prune MBv1, MBv2 and EfficientNet, with the hope of using some of this knowledge when using magnitude pruning to find even more efficient models.
Results for MBv1 can be found in figure 8 and results for MBv2 and EfficientNet in figure 10. As expected under a global parameter constraint, the layer-wise sparsity increases with depth as the number of parameters per layer grows. Interestingly, there is a trend, especially evident in the MBv2 and EfficientNet models, that layers containing a spatial downsampling and concomitant channel increase should be less sparse than those before or after. We were unable to exploit this observation through hand-tuning to find architectures more efficient than those with uniform sparsity, but it may provide a useful prior for a full architecture search.
Appendix C EfficientNet Plots
Figure 11 contains the plots showing how EfficientNet behaves in the presence of blocks and at varying levels of sparsity. It generally follows the same trends as the MobileNet models in the main text; the difference is that sparsities above 80% do not appear to lead to more efficient models.