1 Introduction
Convolutional neural networks (CNNs) learn discriminant feature representations from labeled training data, and have achieved state-of-the-art accuracy across a wide range of visual recognition tasks, e.g., image classification, object detection, and assisted medical diagnosis. Since the breakthrough results achieved with AlexNet in the 2012 ImageNet Challenge [24], CNN accuracy has been continually improved with architectures like VGG [43], ResNet [12] and DenseNet [20], at the expense of growing complexity (deeper and wider networks) that requires more training samples and computational resources [21]. In particular, the speed of CNNs can degrade significantly with such increased complexity.
In order to deploy these powerful CNN architectures on compact platforms with limited resources (e.g., embedded systems, mobile phones, portable devices) and for real-time processing (e.g., video surveillance and monitoring, virtual reality), the time and memory complexity and energy consumption of CNNs should be reduced. For instance, the application of CNN-based architectures to real-time face detection in video surveillance remains a challenging task [35]: while more accurate detectors such as region proposal networks are too slow for real-time applications [40, 8], faster detectors such as single-shot detectors are less accurate [38, 29]. Consequently, effective methods to accelerate and compress deep networks in general, and CNNs in particular, are required to provide a reasonable trade-off between accuracy and efficiency.
Several techniques have recently been proposed to reduce the complexity of CNNs, ranging from the design of specialized compact architectures like MobileNet [17] to the distillation of knowledge from larger architectures to smaller ones [15]. Among these, pruning techniques provide an automated approach to remove insignificant network elements, e.g., filters, channels, etc. This paper focuses on channel-level pruning techniques, which can significantly reduce the number of CNN parameters while preserving network accuracy [27, 34]. These techniques attempt to remove output and input channels at each convolutional layer using various criteria based on, e.g., the L1 norm [27], or the product of feature maps and gradients computed from a validation dataset [34].
Pruning techniques can be applied under two different scenarios: either (1) from a pre-trained network, or (2) from scratch. In the first scenario, pruning is applied as a post-processing procedure, once the network has already been trained, through a one-time pruning (followed by fine-tuning) [27] or a complex iterative [34] process using a validation dataset [27, 31], or by minimizing the reconstruction error [32]. In the second scenario, pruning is applied from scratch by introducing sparsity constraints and/or modifying the loss function used to train the network [53, 30, 47]. The latter scenario can have more difficulty converging to accurate network solutions (due to the modified loss function), thereby increasing the computational complexity of the optimization process. For greater training efficiency, the progressive soft filter pruning (PSFP) method was introduced [13], allowing for iterative pruning from scratch, where channels are set to zero (instead of being removed) so that the network preserves a greater learning capacity. This method, however, does not account for the optimization of soft-pruned weights. Not handling the momentum tensor can have a negative impact on accuracy, because pruned weights are still being optimized with old momentum values accumulated from previous epochs.
In this paper, a new Progressive Gradient-based Pruning (PGP) technique is proposed for iterative channel pruning, providing a better accuracy-complexity trade-off. To this end, channels are efficiently pruned in a progressive fashion while training a network from scratch, and accuracy is maintained without requiring validation data or additional optimization constraints. In particular, PGP integrates effective hard and soft pruning strategies to adapt the momentum tensor during the backward propagation pass. It also integrates an improved version of the Taylor criterion [34] that relies on the gradient with respect to the weights (instead of the output feature maps), which is more suitable for a progressive weight-based pruning strategy. For performance evaluation, the accuracy and complexity of the proposed and state-of-the-art techniques are compared using ResNet, LeNet and VGG networks trained on the MNIST and CIFAR-10 image classification datasets.
2 Compression and Acceleration of CNNs
In general, the time complexity of a CNN depends mostly on the convolutional layers, while the fully connected layers contain most of the parameters. Therefore, CNN acceleration methods typically target lowering the complexity of the convolutional layers, while compression methods usually target reducing the complexity of the fully connected layers [10, 11]. This section provides an overview of recent acceleration and compression approaches for CNNs, namely quantization, low-rank approximation, knowledge distillation, compact network design and pruning. Finally, a brief survey of channel pruning methods and challenges is presented.
2.1 Overview of methods:
Quantization:
A deep neural network can be accelerated by reducing the precision of its parameters. Such techniques are often used on general embedded systems, where a low-precision representation, e.g., 8-bit integer, provides faster processing than a higher-precision one, e.g., 32-bit floating point. There are two main approaches to quantizing a neural network: the first focuses on quantizing only the weights [10, 52], and the second quantizes both weights and activations [9, 6]. These techniques can be either scalable [10, 52] or non-scalable [3, 9, 6, 37], where scalable means that an already quantized network can be further compressed.
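To illustrate the idea, the sketch below applies a simple symmetric per-tensor 8-bit quantization to a weight tensor; it is a generic, minimal example and not the exact scheme of any of the methods cited above (function names are ours).

```python
import torch

def quantize_weights_int8(w: torch.Tensor):
    """Symmetric per-tensor 8-bit quantization of a weight tensor (illustrative only)."""
    scale = w.abs().max() / 127.0                      # map the largest magnitude onto the int8 range
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale                           # approximate reconstruction of the weights

w = torch.randn(64, 3, 3, 3)                           # e.g., a conv layer with 64 output channels
q, scale = quantize_weights_int8(w)
w_hat = dequantize(q, scale)
print((w - w_hat).abs().max())                         # quantization error
```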
Low-rank decomposition:
Low-rank approximation (LRA) can accelerate CNNs by approximating the result of one tensor using a lower-rank tensor [22, 44, 26]. There are different ways of decomposing convolution filters. Techniques like [22, 44] focus on approximating filters/output channels by low-rank filters, obtained either layer by layer [22] or by scanning the whole network [44]. The method in [48] forces filters to coordinate more information into a lower-rank space during training, and then decomposes them once the model is trained. Another technique employs the CP-decomposition (Canonical Polyadic decomposition), achieving a good trade-off between accuracy and efficiency [26].
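The following sketch illustrates the general idea, not the exact procedure of the cited works: a convolutional kernel is flattened into a matrix, truncated with an SVD, and re-expressed as a small "basis" convolution followed by a 1x1 convolution; the function name and rank value are illustrative.

```python
import torch

def lowrank_factorize_conv(weight: torch.Tensor, rank: int):
    """Split a (C_out, C_in, k, k) kernel into a (rank, C_in, k, k) kernel followed by
    a (C_out, rank, 1, 1) kernel using a truncated SVD (illustrative sketch)."""
    c_out, c_in, k, _ = weight.shape
    mat = weight.reshape(c_out, c_in * k * k)           # flatten input/spatial dims
    u, s, vh = torch.linalg.svd(mat, full_matrices=False)
    u_r = u[:, :rank] * s[:rank]                         # absorb singular values
    v_r = vh[:rank, :]
    first = v_r.reshape(rank, c_in, k, k)                # rank "basis" filters
    second = u_r.reshape(c_out, rank, 1, 1)              # 1x1 recombination filters
    return first, second

w = torch.randn(128, 64, 3, 3)
first, second = lowrank_factorize_conv(w, rank=16)
```

Applying a convolution with `first` followed by a 1x1 convolution with `second` approximates the original convolution, with cost roughly proportional to the chosen rank.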
Knowledge distillation:
This family of techniques focuses on training a small network, called the student, using a larger model, called the teacher [16]. Unlike traditional supervised learning, the student is trained by the teacher, and these methods can obtain considerable improvements in terms of sparsity and generalization of the produced networks. Most distillation techniques use large pre-trained models as teachers [16, 41]. More recently, there has been interest in training student-teacher models online, on the fly [25, 51], or in using GANs in order to increase training speed and accuracy [46]. Knowledge distillation has been applied to multiple problems, including object detection [4], NLP [23] and differential privacy [45].
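As a minimal illustration of the idea in [16], the sketch below combines a softened teacher/student KL term with the usual cross-entropy loss; the temperature and mixing-weight values are illustrative defaults, not taken from any specific paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    """Soft-target distillation loss in the spirit of [16] (illustrative sketch)."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)    # match the softened teacher outputs
    hard = F.cross_entropy(student_logits, targets)     # regular supervised term
    return alpha * soft + (1.0 - alpha) * hard
```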
Compact network design:
Compact model design is an alternative way to produce fast deep neural networks; the aim of these techniques is to produce light models for high-speed processing. Different methods have been applied to produce compact models: for instance, MobileNet [17], MobileNetV2 [42] and Xception [5] can achieve real-time speed by using depthwise convolutions to reduce computation, while other architectures like ShuffleNet [50, 33] and CondenseNet [19] use group convolutions to reduce computation.
Pruning:
Pruning is a family of techniques that removes unimportant parameters from a neural network. There are several pruning approaches for deep neural networks. The first is weight pruning, where individual weights are pruned; this approach has been shown to significantly compress and accelerate deep neural networks [10, 49, 11], and usually relies on sparse convolution algorithms [28, 39]. The other approach is filter or channel pruning, where complete filters or channels are pruned [27, 32, 13, 53]. Since this paper proposes a method for channel pruning, we provide more details on this approach in the next section.
2.2 Channel pruning
Channel-level pruning techniques attempt to remove output and input channels at each convolutional layer using various criteria, such as the L1-norm [27], entropy [31], the L2-norm, APoZ [18], or a combination of feature maps and gradients [34]. Channel-based pruning methods have the advantage of being independent of sparse convolution algorithms, since the convolutions remain dense, which provides a platform-independent speedup (a sparse algorithm may not be implemented on all platforms).
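For example, the L1-norm criterion of [27] can be computed per output channel as in the following sketch (function and variable names are ours):

```python
import torch

def l1_channel_importance(weight: torch.Tensor) -> torch.Tensor:
    """L1-norm importance per output channel for a (C_out, C_in, k, k) conv weight,
    as in [27]; smaller values indicate weaker channels."""
    return weight.abs().reshape(weight.size(0), -1).sum(dim=1)

w = torch.randn(64, 32, 3, 3)
scores = l1_channel_importance(w)
weakest = torch.argsort(scores)[: int(0.3 * w.size(0))]   # e.g., the 30% weakest channels
```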
Following the work on Optimal Brain Damage [7], one of the first papers to show the efficiency of channel-level pruning was [27], where the weight norm is used to identify and prune weak channels, i.e., channels that do not contribute much to the network. Since then, several works have proposed pruning procedures and channel importance metrics. These methods can be organized into five pruning approaches: 1) pruning as a one-time post-processing step followed by fine-tuning, which is simple and easy to apply [27]; 2) pruning iteratively once the model has been trained, where alternating pruning and fine-tuning increases the chance of recovering after a channel is pruned [34]; 3) pruning by minimizing the reconstruction error at each layer, which allows the model to approximate the original performance [32, 14]; 4) pruning by using sparsity constraints with a modified objective function, so that the network accounts for pruning during optimization [53, 30, 2, 1]; 5) pruning progressively while training from scratch or from a pre-trained model, where soft pruning sets channels to zero instead of actually removing them (hard pruning), which leaves the network with more capacity to learn [13].
While the first three approaches are capable of reducing the complexity of a model, they can only be applied once the model is already trained; it would certainly be more beneficial to be able to start pruning from scratch during training. The fourth approach can start pruning from scratch by adding sparsity constraints and modifying the optimization objective, but this adds complexity and computation to the training process. It can be particularly problematic when the original loss function is hard to optimize, since modifying the loss function potentially makes it harder for the model to converge to a good solution. The fifth approach eases this process by not removing channels and by using the original loss function. However, we believe this approach can be improved: currently, it does not handle pruning in the backward pass and only sets the weak channels to zero. Also, it calculates the criterion separately from the parameter updates, i.e., not while iterating inside an epoch. In our approach, we want to compute the criterion directly during the update, i.e., while iterating through an epoch and updating the parameters.
Another important aspect of channel pruning is the capacity to evaluate the importance of a channel. In the literature, many criteria have been used to evaluate channel importance, e.g., L1 [27], APoZ [18], entropy [31], L2 [13] and Taylor [34]. Among these, we believe the Taylor criterion [34] has the most potential for pruning during training, since it results from trying to minimize the impact of pruning a channel, although we argue that it can be improved for progressive pruning.
3 Progressive Gradient Pruning
3.1 Pruning procedure
In a regular CNN, the weight tensor of a convolutional layer $\ell$ can be defined as $\mathcal{W}^{\ell} \in \mathbb{R}^{C^{\ell}_{out} \times C^{\ell}_{in} \times k \times k}$, where $C^{\ell}_{in}$ and $C^{\ell}_{out}$ are the number of input and output channels, respectively, and $k$ is the kernel size. The weight tensor of output channel $i$ can then be defined as $\mathcal{W}^{\ell}_{i} \in \mathbb{R}^{C^{\ell}_{in} \times k \times k}$. In order to select the weak channels of a layer, we evaluate the importance of an output channel using a criterion function $\Theta$, usually defined as $\Theta: \mathbb{R}^{C^{\ell}_{in} \times k \times k} \rightarrow \mathbb{R}$. Given an output channel, it yields a scalar that represents its rank, e.g., the L1 norm [27], or the gradient norm in our case.
In order to prune a convolutional layer progressively, an exponential decay function is defined such that there is always a solution in $(0, 1]$. (This is slightly different from [13], where the decay function can have solutions in $[0, 1]$.) This decay function selects the number of weak channels at each epoch. The decay function is defined as the ratio of output channels remaining after training on epoch $e$:
$r_e = (1 - p)^{\,e/E}$   (1)
where $p$ is a hyperparameter that defines the ratio of output channels to be pruned, $e$ is the current epoch, and $E$ is the total number of training epochs. Since we prune progressively, layer by layer and epoch by epoch, we calculate the number of weak channels (and the number of remaining channels) at each layer $\ell$. Given the ratio $r_e$ at epoch $e$, the number of weak output channels for any layer is defined as:
$v^{\ell}_e = \lceil (1 - r_e)\, C^{\ell}_{out} \rceil$   (2)
where $C^{\ell}_{out}$ is the original number of output channels of the layer. Using the number of weak channels $v^{\ell}_e$ and a pruning criterion function $\Theta$, we obtain a subset $\mathcal{S}_{weak}$ containing the $v^{\ell}_e$ output channels with the smallest criterion values. This subset is further divided into two subsets using a hyperparameter $\gamma$ (the removal rate), which determines the ratio of hard-removed output channels. The subset $\mathcal{S}_{hard}$ is removed completely, while the subset $\mathcal{S}_{soft}$ is reset to zero, keeping the indexes $I_{hard}$ and $I_{soft}$ for the backward pass. Additionally, hard pruning is performed on the input channels of the next layer using $I_{hard}$.
Since we set some weights back to zero and still apply backpropagation to them, we also need to handle the pruning of the momentum tensor. The momentum tensor $\mathcal{M}^{\ell}$ has the same dimensions as the weight tensor. In existing work on progressive pruning during training [13], the authors only set the weights to zero, without handling the momentum accumulated from previous epochs, which is critical for the optimization. Using the indexes of $\mathcal{S}_{weak}$, we set to zero the entries of $\mathcal{M}^{\ell}$ indexed by $I_{soft}$ and hard prune the entries indexed by $I_{hard}$. Figure 1 illustrates the local PGP procedure for hard and soft pruning between two successive convolutional layers.
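A simplified sketch of one PGP pruning step for a single layer is given below; all names are illustrative, and for brevity hard-pruned channels are only zeroed here, whereas the actual procedure physically removes them, together with the corresponding input channels of the next layer and the matching momentum entries.

```python
import torch

def pgp_prune_step(conv, optimizer, criterion_scores, n_weak, removal_rate):
    """One simplified PGP pruning step for a single conv layer (illustrative sketch).

    criterion_scores : per-output-channel importance (e.g., accumulated gradient norm)
    n_weak           : number of weak channels for this epoch (Eq. 2)
    removal_rate     : fraction of weak channels to hard-prune (gamma)
    """
    weak = torch.argsort(criterion_scores)[:n_weak]        # weakest output channels
    n_hard = int(removal_rate * n_weak)
    hard_idx, soft_idx = weak[:n_hard], weak[n_hard:]

    with torch.no_grad():
        # soft pruning: reset the weights to zero so these channels may still recover
        conv.weight[soft_idx] = 0.0
        # hard pruning is only sketched by zeroing here; in practice these channels
        # (and the matching input channels of the next layer) are removed
        conv.weight[hard_idx] = 0.0

        # reset the accumulated momentum of the pruned channels so that stale
        # momentum values do not keep updating them with old directions
        state = optimizer.state.get(conv.weight, {})
        buf = state.get("momentum_buffer")
        if buf is not None:
            buf[soft_idx] = 0.0
            buf[hard_idx] = 0.0
    return hard_idx, soft_idx
```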
3.2 Criteria for progressive pruning
Molchanov et al. [34] proposed the following criterion to measure the importance of a feature map $z^{\ell}_i$ from output channel $i$, computed at each layer and for each output channel:
$\Theta_{TE}(z^{\ell}_i) = \left| \mathcal{L}(\mathcal{D}, z^{\ell}_i = 0) - \mathcal{L}(\mathcal{D}, z^{\ell}_i) \right|$   (3)
The term $\mathcal{L}(\mathcal{D}, z^{\ell}_i = 0)$ refers to the loss of the model on a labeled dataset $\mathcal{D}$ when feature map $z^{\ell}_i$ is pruned, and $\mathcal{L}(\mathcal{D}, z^{\ell}_i)$ is the original loss before the model has been pruned. In summary, the criterion of Equation 3 is the difference between the loss of the pruned model and that of the original model; the criterion grows with the impact of the feature map. This criterion has been shown to work well on some datasets and networks. However, in the scenario where the network is pruned from scratch, we argue that the information measured from the feature maps is not informative, since the model is not yet trained. Empirical results in Section 4 also show that the criterion of Equation 3 is not as effective as other criteria for progressive pruning.
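For reference, the feature-map criterion of Equation 3 is typically estimated from the activations and their gradients, roughly as in the following sketch (assuming both tensors have been captured, e.g., with forward/backward hooks; names are ours):

```python
import torch

def taylor_feature_map_criterion(feature_map: torch.Tensor,
                                 feature_grad: torch.Tensor) -> torch.Tensor:
    """First-order Taylor criterion of [34] per output channel (illustrative sketch).

    feature_map, feature_grad : (N, C, H, W) activations and their gradients.
    Returns one importance score per channel, averaged over batch and spatial dims.
    """
    contrib = feature_map * feature_grad          # element-wise product dL/dz * z
    return contrib.mean(dim=(0, 2, 3)).abs()      # |E[dL/dz * z]| per channel
```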
Instead of using $z^{\ell}_i$ to prune a feature map or filter, we can replace $z^{\ell}_i$ with $\mathcal{W}^{\ell}_i$, since setting an output channel to zero is the same as pruning it [13]. The same Taylor expansion from [34] can then be applied to $\mathcal{W}^{\ell}_i$, resulting in:
$\Theta_{TW}(\mathcal{W}^{\ell}_i) = \left| \dfrac{\partial \mathcal{L}}{\partial \mathcal{W}^{\ell}_i} \odot \mathcal{W}^{\ell}_i \right|$   (4)
Equation 4 can be further simplified by taking into account the soft pruning nature of our approach. We can decompose this equation because the product is element-wise:
$\Theta_{TW}(\mathcal{W}^{\ell}_i) = \left| \dfrac{\partial \mathcal{L}}{\partial \mathcal{W}^{\ell}_i} \right| \odot \left| \mathcal{W}^{\ell}_i \right|$   (5)
where $\left| \mathcal{W}^{\ell}_i \right|$ is the absolute value of the weights of channel $i$. This means that $\Theta_{TW}(\mathcal{W}^{\ell}_i)$ can be zero, or very close to zero, if channel $i$ was one of the soft-pruned channels. In this case, the channel has little chance to recover, since it will likely be pruned again. In order to allow more recovery of soft-pruned channels, we propose to remove the weight term:
$\Theta_{GN}(\mathcal{W}^{\ell}_i) = \left| \dfrac{\partial \mathcal{L}}{\partial \mathcal{W}^{\ell}_i} \right|$   (6)
where $\Theta_{GN}(\mathcal{W}^{\ell}_i)$ is the criterion of our approach for channel $i$. There are two ways of calculating our criterion:

PGP: perform a training epoch without updating the model, and compute the pruning criterion.

RPGP: compute the pruning criterion directly during a forward/backward pass of the training loop while updating.
The first case amounts to batch gradient descent without updating the parameters at the end; this approach can lead to better performance since the optimization is less noisy than SGD. The second approach uses an SGD optimizer and calculates the criterion directly during the optimization and update of the model.
In either case, the criterion is accumulated over several iterations, which means there are two ways of interpreting Equation 6. The first is the natural way of accumulating gradients, where the gradients are summed up; since we want the total gradient of an output channel, we can use an L1 norm to sum up the variation inside the output channel, which translates to the following equation:
$\Theta_{GN\_G}(\mathcal{W}^{\ell}_i) = \Big\| \sum_{t=1}^{T} g^{\ell}_{i,t} \Big\|_1$   (7)
where $g^{\ell}_{i,t}$ is the gradient tensor of output channel $i$ at iteration $t$ inside an epoch, and $T$ is the number of iterations in the epoch. Equation 7 measures how much change an output channel accumulates globally by the end of an epoch, which makes it well suited to PGP since we go through the whole epoch without updates; we refer to this criterion as GN_G. The second interpretation is to accumulate the actual changes of an output channel at each update, in which case the equation becomes:
$\Theta_{GN\_S}(\mathcal{W}^{\ell}_i) = \sum_{t=1}^{T} \big\| g^{\ell}_{i,t} \big\|_1$   (8)
Equation 8 computes the L1 norm of the gradient tensor of an output channel at each iteration during an epoch. Instead of measuring only the global change at the end of the epoch as in Equation 7, it measures the gradual changes during the epoch; we refer to this criterion as GN_S. It works better for RPGP, since the weights are updated at the same time as the gradients are accumulated. In summary, PGP is summarized in Algorithm 1; the algorithm for RPGP is similar, except that the criterion is calculated directly in the training step.
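The two criteria can be accumulated over an epoch as in the following sketch (names are illustrative): GN_G sums the gradients first and takes the L1 norm at the end of the epoch, while GN_S adds up the per-iteration L1 norms.

```python
import torch

def init_accumulators(conv):
    c_out = conv.weight.size(0)
    return {"sum_grad": torch.zeros_like(conv.weight),                    # for GN_G (Eq. 7)
            "gn_s": torch.zeros(c_out, device=conv.weight.device)}        # for GN_S (Eq. 8)

def accumulate(acc, conv):
    """Call after each backward pass; conv.weight.grad holds dL/dW for this iteration."""
    g = conv.weight.grad.detach()
    acc["sum_grad"] += g                                                   # sum the gradients first
    acc["gn_s"] += g.abs().reshape(g.size(0), -1).sum(dim=1)               # L1 norm at each iteration

def criteria(acc):
    # GN_G: L1 norm of the summed gradient; GN_S: sum of per-iteration L1 norms
    gn_g = acc["sum_grad"].abs().reshape(acc["sum_grad"].size(0), -1).sum(dim=1)
    return gn_g, acc["gn_s"]
```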
4 Experimental Results
In this section, we compare the proposed technique with baseline techniques such as L1-norm pruning [27] and Taylor pruning [34], which represent a prune-once approach and an iterative approach, respectively. We also compare with state-of-the-art techniques like DCP [53] and PSFP [13] that prune while training. For techniques like ours, PSFP [13], DCP [53] and L1 [27], it is possible to set a target pruning rate hyperparameter, and we compare them at target rates of 0.3, 0.5, 0.7 and 0.9 (i.e., 30, 50, 70 and 90 percent of channels pruned away). For techniques like Taylor [34], we prune until the end and select the points where 0.3, 0.5, 0.7 and 0.9 of the output channels have been pruned away. Our experiments consider two datasets, MNIST and CIFAR-10; we plan to add at least one more dataset in the supplemental material.
Implementation details
One of the problems of pruning during training is handling the shape of the gradient and momentum tensors during the backward pass. Usually, and specifically in the case of PyTorch [36], the shapes of the gradient and momentum tensors are handled by the optimizer, which does not necessarily update them during the forward pass. Also, naively redefining a new optimizer for the pruned model is not good enough, since we would lose all the values accumulated in the momentum buffer. One way to overcome this is to also prune the gradient and momentum tensors, using the same indexes used to prune the weight tensor, and then transfer them to a newly defined optimizer. For the pruning of ResNet, we follow the popular pruning strategy proposed in [27]: we prune the downsampling layer and then use the same indexes to prune the last convolution of the residual block. Our method was implemented in PyTorch [36], and the source code for our paper will be available at https://github.com/Anon6627/PruningPGP.
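A minimal sketch of this momentum transfer for a single parameter, assuming a plain torch.optim.SGD optimizer (all names are illustrative), is given below:

```python
import torch

def rebuild_optimizer_after_pruning(old_optimizer, old_param, new_param, keep_idx,
                                    lr=0.1, momentum=0.9):
    """Create a new SGD optimizer for the pruned parameter and transfer the
    momentum buffer restricted to the surviving output channels (illustrative sketch)."""
    new_optimizer = torch.optim.SGD([new_param], lr=lr, momentum=momentum)
    old_state = old_optimizer.state.get(old_param, {})
    buf = old_state.get("momentum_buffer")
    if buf is not None:
        # keep only the momentum entries of the channels that survived pruning
        new_optimizer.state[new_param]["momentum_buffer"] = buf[keep_idx].clone()
    return new_optimizer
```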
Results on the MNIST dataset:
For the comparison on MNIST, we use the same hyperparameters as the original papers. For LeNet-5, we use a learning rate of 0.01, momentum of 0.9 and 40 epochs, with a removal rate of 50% for PGP (our progressive pruning that does not compute the criterion during an epoch) and RPGP (which does). For PSFP, we used the same settings as mentioned before, except for a removal rate of 50%. For Taylor [34], we iteratively remove 5 channels at a time and fine-tune for 5 epochs afterwards; we slightly changed this from the original procedure because this configuration does not collapse and returns the best result. For L1 pruning, we fine-tune for 20 epochs after pruning. We use the same settings for ResNet-20. For the DCP [53] algorithm, we took the available code and ran it on MNIST using 40 epochs, with 20 epochs for the channel pruning and 20 epochs for fine-tuning.
Table 1: Pruning LeNet-5 trained on MNIST.

| Methods | Pruned | Params | FLOPS | Error % (gap) |
|---|---|---|---|---|
| Baseline LeNet-5 | 0 | 61K | 446K | 0.84 (0) |
| L1 [27] | 30% | 34.1K | 304K | 0.9 (+0.06) |
| | 50% | 18K | 152K | 1.05 (+0.21) |
| | 70% | 84K | 82K | 2.22 (+1.38) |
| | 90% | 8.8K | 31K | 8.17 (+7.33) |
| Taylor [34] | 30% | 38K | 404K | 0.9 (+0.06) |
| | 50% | 24K | 387K | 1.05 (+0.21) |
| | 70% | 13K | 374K | 1.22 (+0.38) |
| | 90% | 3K | 286K | 7.73 (+6.89) |
| DCP [53] | 30% | - | - | 4.27 |
| | 50% | - | - | 4.18 |
| | 70% | - | - | 6.28 |
| | 90% | - | - | 6.81 |
| PSFP [13] | 30% | 34.1K | 304K | 1.32 (+0.48) |
| | 50% | 18K | 152K | 2.27 (+1.43) |
| | 70% | 84K | 82K | 2.99 (+2.15) |
| | 90% | 8.8K | 31K | 32.28 (+31.44) |
| PGP(TW) | 30% | 34.1K | 304K | 0.82 (-0.02) |
| | 50% | 18K | 152K | 1.06 (+0.22) |
| | 70% | 84K | 82K | 1.52 (+0.68) |
| | 90% | 8.8K | 31K | 4.7 (+3.86) |
| PGP(GN_G) | 30% | 34.1K | 304K | 0.87 (+0.03) |
| | 50% | 18K | 152K | 1.08 (+0.24) |
| | 70% | 84K | 82K | 1.74 (+0.9) |
| | 90% | 8.8K | 31K | 4.25 (+3.41) |
| RPGP(GN_S) | 30% | 34.1K | 304K | 0.9 (+0.06) |
| | 50% | 18K | 152K | 1.25 (+0.41) |
| | 70% | 84K | 82K | 1.75 (+0.91) |
| | 90% | 8.8K | 31K | 8.14 (+7.3) |
Table 2: Pruning ResNet-20 trained on MNIST.

| Methods | Pruned | Params | FLOPS | Error % (gap) |
|---|---|---|---|---|
| Baseline ResNet-20 | 0 | 272K | 41M | 0.74 (0) |
| L1 [27] | 30% | 137K | 22M | 0.75 (+0.01) |
| | 50% | 68K | 10M | 1.09 (+0.35) |
| | 70% | 27K | 4.2M | 2.02 (+1.28) |
| | 90% | 3K | 714K | 8.17 (+7.43) |
| Taylor [34] | 30% | 268K | 40.9M | 0.87 (+0.13) |
| | 50% | 260K | 40.5M | 0.95 (+0.21) |
| | 70% | 250K | 39.8M | 1.04 (+0.3) |
| | 90% | 241K | 39.3M | 7.73 (+6.99) |
| DCP [53] | 30% | 193K | 30.3M | 1.11 (+0.37) |
| | 50% | 138K | 21.1M | 0.62 (-0.12) |
| | 70% | 87.7K | 13.5M | 1.19 (+0.45) |
| | 90% | 34K | 5M | 1.13 (+0.39) |
| PSFP [13] | 30% | 137K | 22M | 0.5 (-0.24) |
| | 50% | 68K | 10M | 0.61 (-0.13) |
| | 70% | 27K | 4.2M | 0.72 (-0.02) |
| | 90% | 3K | 714K | 0.73 (-0.01) |
| PGP(TW) | 30% | 137K | 22M | 0.45 (-0.29) |
| | 50% | 68K | 10M | 0.35 (-0.39) |
| | 70% | 27K | 4.2M | 0.52 (-0.22) |
| | 90% | 3K | 714K | 1.52 (+0.78) |
| PGP(GN_G) | 30% | 137K | 22M | 0.4 (-0.34) |
| | 50% | 68K | 10M | 0.51 (-0.23) |
| | 70% | 27K | 4.2M | 0.57 (-0.17) |
| | 90% | 3K | 714K | 0.86 (+0.12) |
| RPGP(GN_S) | 30% | 137K | 22M | 0.4 (-0.34) |
| | 50% | 68K | 10M | 0.48 (-0.29) |
| | 70% | 27K | 4.2M | 0.5 (-0.24) |
| | 90% | 3K | 714K | 1.87 (+1.13) |
Results in Table 1 show that our methods compare favorably against baseline techniques like L1 [27] and Taylor [34], and also perform better than the state-of-the-art PSFP [13]. Note that the model used in this experiment is small, which makes it harder to prune. Table 2 shows the same trend; in this setting, our methods also perform slightly better than DCP [53] in some configurations. As for the difference between PGP(GN) and RPGP(GN), both algorithms use the same criterion and differ only in the procedure; the slightly better performance of PGP(GN) can probably be explained by the fact that its pruning criterion is computed with batch gradient descent rather than stochastic gradient descent.
Results on the CIFAR-10 dataset:
For the comparison on CIFAR-10, we mostly use the same hyperparameters as the considered papers. For VGG, we use a VGG19 adapted to CIFAR-10, with a learning rate of 0.1, momentum of 0.9 and 400 epochs, decreasing the learning rate by a factor of 10 at epochs 160 and 240. For PSFP, PGP and RPGP, we set the removal rate hyperparameter to 0.5 (50%), fine-tune for 100 epochs after pruning, and keep the best score. For Taylor [34], we iteratively remove 5 channels at a time and fine-tune for 5 epochs afterwards; we slightly changed the procedure compared to the original paper because the original procedure pruned one feature map per iteration, which is not very efficient on a large model, and we found empirically that pruning 5 feature maps gives the best accuracy. For L1 pruning, we fine-tune for 100 epochs after pruning and keep the best score. For PSFP [13] and DCP [53], we used the settings provided by the authors in order to obtain the best possible performance. For ResNet, we use a ResNet-56 adapted to CIFAR-10 and keep the same settings for this experiment, except that the number of epochs for our techniques is now 500.
Table 3: Pruning VGG19 trained on CIFAR-10.

| Methods | Pruned | Params | FLOPS | Error % (gap) |
|---|---|---|---|---|
| Baseline VGG19 | 0% | 20M | 400M | 6.23 (0) |
| L1 [27] | 30% | 9M | 198M | 16.94 (+8.41) |
| | 50% | 5M | 100M | 16.51 (+7.98) |
| | 70% | 1M | 37M | 16.17 (+7.64) |
| | 90% | 209K | 4M | 19.91 (+11.38) |
| Taylor [34] | 30% | 156M | 156M | 9.82 (+1.29) |
| | 50% | 5M | 72M | 11.94 (+3.41) |
| | 70% | 1.9M | 24M | 16.85 (+8.32) |
| | 90% | 249K | 2.2M | 41.58 (+33.05) |
| DCP [53] | 30% | - | - | - |
| | 50% | - | 139M | (-0.58) |
| | 70% | - | - | - |
| | 90% | - | - | - |
| PSFP [13] | 30% | 9M | 198M | 8.98 (+2.75) |
| | 50% | 5M | 100M | 11.2 (+4.97) |
| | 70% | 1M | 37M | 12.06 (+5.83) |
| | 90% | 209K | 4M | 18.73* (+12.5) |
| PGP(TW) | 30% | 9M | 198M | 8.78 (+0.25) |
| | 50% | 5M | 100M | 9.89 (+1.36) |
| | 70% | 1M | 37M | 11.97 (+3.44) |
| | 90% | 209K | 4M | 21.08 (+12.55) |
| PGP(GN_G) | 30% | 9M | 198M | 7.37 (+1.14) |
| | 50% | 5M | 100M | 8.38 (+2.15) |
| | 70% | 1M | 37M | 9.7 (+3.47) |
| | 90% | 209K | 4M | 16.46 (+10.33) |
| RPGP(GN_S) | 30% | 9M | 198M | 7.65 (+1.42) |
| | 50% | 5M | 100M | 8.79 (+2.56) |
| | 70% | 1M | 37M | 10.56 (+4.33) |
| | 90% | 209K | 4M | 17.53 (+11.3) |
Table 4: Pruning ResNet-56 trained on CIFAR-10.

| Methods | Pruned | Params | FLOPS | Error % (gap) |
|---|---|---|---|---|
| Baseline ResNet-56 | 0 | 855K | 128M | 6.02 (0) |
| L1 [27] | 30% | 431K | 198M | 13.34 (+7.32) |
| | 50% | 215K | 100M | 15.57 (+9.55) |
| | 70% | 84K | 37M | 17.89 (+11.87) |
| | 90% | 11K | 4M | 40.24 (+34.22) |
| Taylor [34] | 40% | 491K | 51M | 13.9 (+7.88) |
| | 50% | 268K | 23M | 15.34 (+9.32) |
| | 70% | 100K | 8M | 22.1 (+16.08) |
| | 90% | 10K | 1M | 45.69 (+39.67) |
| DCP [53] | 30% | 600K | 90M | 5.67 (-0.35) |
| | 50% | 430K | 65M | 6.43 (+0.41) |
| | 70% | 270K | 41M | 7.18 (+1.16) |
| | 90% | 100K | 17M | 9.42 (+3.4) |
| PSFP [13] | 30% | 431K | 198M | 8.94 (+2.92) |
| | 50% | 215K | 100M | 10.93 (+4.91) |
| | 70% | 84K | 37M | 14.18 (+8.16) |
| | 90% | 11K | 4M | 28.09 (+22.07) |
| PGP(TW) | 30% | 431K | 198M | 10.38 (+4.36) |
| | 50% | 215K | 100M | 11.95 (+5.93) |
| | 70% | 84K | 37M | 13.63 (+7.61) |
| | 90% | 11K | 4M | 30.68 (+24.66) |
| PGP(GN_G) | 30% | 431K | 198M | 8.95 (+2.93) |
| | 50% | 215K | 100M | 10.59 (+4.57) |
| | 70% | 84K | 37M | 13.02 (+7) |
| | 90% | 11K | 4M | 26.02 (+20) |
| RPGP(GN_S) | 30% | 431K | 198M | 9.37 (+3.35) |
| | 50% | 215K | 100M | 10.46 (+4.44) |
| | 70% | 84K | 37M | 14.16 (+8.14) |
| | 90% | 11K | 4M | 36.65 (+30.63) |
From Tables 3 and 4, our technique consistently performs better than the baseline techniques. It also has slightly better performance than the state-of-the-art PSFP [13] on VGGNet. For ResNet, PSFP [13] uses a different pruning strategy than ours: it does not prune the downsampling layer and therefore does not prune the last convolutional layer of the residual block either, which translates into slightly better accuracy in some settings. In our supplemental material, we also provide a comparison using the same pruning strategy on ResNet in order to have a fair comparison between the approaches. DCP [53] performs better than ours on this dataset; however, it is difficult to compare directly, since the two techniques do not end up with the same number of FLOPS and parameters.
Training and pruning time:
The training and pruning times of a model are important factors for a technique, for instance when deploying or adapting a model in an operational environment. One of the advantages of progressive pruning techniques is the reduction of processing time at each epoch, since filters are removed while training. Table 5 presents the training and pruning times for the evaluated techniques. For the progressive pruning and DCP techniques, values represent combined pruning and training times, while for L1 and iterative pruning, values represent (training time) + pruning and retraining times. Experiments are conducted on the CIFAR-10 dataset with the same settings as above, running on an isolated computer (Intel Xeon Gold 5118 @ 2.3GHz) with an Nvidia Tesla P100 GPU.
Table 5: Training and pruning times (in minutes) on CIFAR-10, at pruning rates of 0.5 and 0.9.

| Methods | VGG19 (0.5) | VGG19 (0.9) | ResNet-56 (0.5) | ResNet-56 (0.9) |
|---|---|---|---|---|
| Baseline | 219m | 219m | 307m | 307m |
| L1 [27] | (219) + 32m | (219) + 32m | (307) + 48m | (307) + 48m |
| Taylor [34] | (219) + 254m | (219) + 457m | (307) + 488m | (307) + 878m |
| DCP [53] | - | - | 489m | 443m |
| PSFP [13] | 219m | 219m | 307m | 307m |
| PGP | 329m | 329m | 441m | 441m |
| RPGP | 211m | 168m | 263m | 241m |
From Table 5, the fastest pruning method (without considering training time) is L1 [27]. However, the original training of the model takes around 219 minutes for VGG19 and 307 minutes for ResNet-56, so when training time is also taken into account, L1 is slower than our approach. Other techniques like Taylor [34] prune iteratively, alternating between pruning several feature maps and fine-tuning; this can be very slow, depending on the number of channels pruned at each iteration. DCP [53] is also slow due to its channel pruning optimization process and the fine-tuning after pruning. PSFP [13] takes a similar time to the original training, since it does not technically change the size of the model during training. Between PGP and RPGP, the difference is the use of an entire extra epoch to compute the pruning criterion with PGP, versus the direct computation of the criterion during a training epoch with RPGP. Also, since we hard-prune channels at each epoch, the epoch time decreases as the model is pruned and trained. Overall, the progressive pruning methods train and prune in considerably less time than the other methods.
Pruning criteria:
For this comparison, we compare our pruning criteria with other approaches on CIFAR-10, using the same pruning strategy and method: we apply our progressive pruning with the L2-norm criterion on the weights [13] and with the Taylor criterion [34]. The configuration for this experiment is the same as in the general comparison, except that we set the target pruning rate to 50% and use the RPGP pruning strategy for all criteria.
Table 6: Error rates (%) of different pruning criteria with RPGP on CIFAR-10 (50% target pruning rate).

| Networks | L2 | Taylor | TW | GN_G | GN_S |
|---|---|---|---|---|---|
| VGG19 | 8.47% | 9.27% | 8.78% | 8.47% | 8.79% |
| ResNet-56 | 10.30% | 10.97% | 10.46% | 10.24% | 10.28% |
Table 6 shows that our criteria perform better than the others in the context of progressive pruning, with the exception of the L2-norm. The comparison between Taylor Weight (TW) and the gradient norm (GN) shows that a small gradient norm during training may be a good indicator of the importance of a channel. The table also shows that Taylor Weight performs better than the original Taylor criterion. Overall, GN_G, which captures the accumulated variation over an epoch, seems to work best with progressive pruning. Finally, we notice that L2 and our criteria have similar performance. This can be understood by considering the following:
$\big\| \mathcal{W}^{\ell}_{i,T} \big\|_2 = \Big\| \mathcal{W}^{\ell}_{i,0} - \eta \sum_{t=0}^{T-1} \dfrac{\partial \mathcal{L}_t}{\partial \mathcal{W}^{\ell}_{i,t}} \Big\|_2$   (9)
where $\mathcal{W}^{\ell}_{i,t}$ represents the weights of output channel $i$ at iteration $t$ in an epoch, $\eta$ is the learning rate, and $\mathcal{L}_t$ denotes the loss function at iteration $t$. From this equation, we can observe that the difference between L2 and the gradient norm lies in the initial values $\mathcal{W}^{\ell}_{i,0}$. Taking into account the partial soft pruning nature of our approach, $\mathcal{W}^{\ell}_{i,0}$ can be zero when the channel is soft pruned. Therefore, the two approaches tend to have similar values (since $\eta$ is a scalar, it is not important in this context).
Progressive pruning procedure:
We compare PSFP and RPGP when using the same pruning criterion and the same settings. This experiment shows the importance of handling the pruning of the backward pass during training.
Table 7: Error rates (%) of PSFP and RPGP with the same pruning criterion on CIFAR-10.

| Method | VGG19 | ResNet-56 |
|---|---|---|
| PSFP | 11.20% | 10.93% |
| RPGP | 8.47% | 10.13% |
From Table 7, on VGG19, the proposed pruning procedure outperforms the original PSFP [13]. On ResNet, the error rates of the two approaches are much closer. This is because, as mentioned earlier, PSFP [13] does not perform pruning on the downsampling layers or on the last layer of the residual connection, which helps to improve its performance.
Pruning from scratch vs after pretraining:
In Table 8, we compare the performance obtained by our method on a model that was randomly initialized (Scratch) and a model that was already trained (Pretrained) on CIFAR-10. We set the target pruning rate to 50% and use a removal rate of 0.5 (50%) on VGGNet and ResNet-56, with the same settings as before.
Table 8: Error rates (%) when pruning from scratch vs. from a pretrained model on CIFAR-10 (50% target pruning rate).

| Training scenario | VGG19 | ResNet-56 |
|---|---|---|
| Scratch | 8.79% | 10.46% |
| Pretrained | 8.23% | 9.51% |
From this experiment, we can see that the difference in accuracy between a network pruned from scratch and a network pruned after training is fairly small and varies with the architecture. This shows that, instead of starting from a trained model and then pruning it, the proposed technique can attain similar performance starting from a randomly initialized model, and thus with reduced training and pruning time.
Hard vs soft pruning:
We also compare the impact of different removal rates. For this experiment, we use RPGP with our gradient criterion and fix the target pruning rate at 50%, using the same hyperparameters as before. We vary the removal rate in order to see the impact of allowing more recovery versus less.
Table 9: Error rates (%) of RPGP for different removal rates on CIFAR-10 (50% target pruning rate).

| Networks | 0.3 | 0.5 | 0.7 | 1.0 |
|---|---|---|---|---|
| VGG19 | 8.74% | 8.79% | 8.99% | 8.92% |
| ResNet-56 | 10.57% | 10.46% | 11.03% | 10.78% |
The results shown in Table 9 indicate that a removal rate of 0.3 (30%) or 0.5 (50%) offers the best balance between hard and soft pruning. It is also interesting to see that, without any soft pruning (removal rate of 1.0), the performance of the approach remains close to the other removal rates.
5 Conclusion
PGP is a new progressive pruning technique that measures the change in channel weights and applies effective hard and soft pruning strategies. In this paper, we showed that it is possible to prune a deep learning model from scratch with the PGP technique while improving the trade-off between compression and accuracy. We proposed a criterion, based on the norm of the gradient, that is well adapted to progressive pruning from scratch. Experimental results obtained by pruning various CNNs on the MNIST and CIFAR-10 datasets show that the proposed method can maintain a high level of accuracy with compact neural networks. Future research will involve analyzing the performance of different CNNs pruned with the proposed method on larger real-world datasets, and for other visual recognition tasks (e.g., person and face detection in video surveillance).
References

[1] J. M. Alvarez and M. Salzmann. Learning the number of neurons in deep networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2270–2278. Curran Associates, Inc., 2016.
 [2] J. M. Alvarez and M. Salzmann. Compression-aware training of deep networks. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 856–867. Curran Associates, Inc., 2017.
 [3] Z. Cai, X. He, J. Sun, and N. Vasconcelos. Deep learning with low precision by half-wave Gaussian quantization. CoRR, abs/1702.00953, 2017.
 [4] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker. Learning efficient object detection models with knowledge distillation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 742–751. Curran Associates, Inc., 2017.
 [5] F. Chollet. Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357, 2016.
 [6] M. Courbariaux and Y. Bengio. BinaryNet: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016.
 [7] Y. L. Cun, J. S. Denker, and S. A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems, pages 598–605. Morgan Kaufmann, 1990.
 [8] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object detection via region-based fully convolutional networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 379–387. Curran Associates, Inc., 2016.
 [9] J. Faraone, N. J. Fraser, M. Blott, and P. H. W. Leong. SYQ: learning symmetric quantization for efficient deep neural networks. CoRR, abs/1807.00301, 2018.
 [10] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.
 [11] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1135–1143. Curran Associates, Inc., 2015.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
 [13] Y. He, X. Dong, G. Kang, Y. Fu, and Y. Yang. Progressive deep neural networks acceleration via soft filter pruning. CoRR, abs/1808.07471, 2018.
 [14] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. CoRR, abs/1707.06168, 2017.
 [15] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
 [16] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, 2015.
 [17] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
 [18] H. Hu, R. Peng, Y. Tai, and C. Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. CoRR, abs/1607.03250, 2016.
 [19] G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger. Condensenet: An efficient densenet using learned group convolutions. CoRR, abs/1711.09224, 2017.
 [20] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
 [21] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. CoRR, abs/1611.10012, 2016.
 [22] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. CoRR, abs/1405.3866, 2014.
 [23] Y. Kim and A. M. Rush. Sequence-level knowledge distillation. CoRR, abs/1606.07947, 2016.
 [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. Commun. ACM, 60(6):84–90, May 2017.
 [25] X. Lan, X. Zhu, and S. Gong. Knowledge distillation by on-the-fly native ensemble. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7517–7527. Curran Associates, Inc., 2018.
 [26] V. Lebedev, Y. Ganin, M. Rakhuba, I. V. Oseledets, and V. S. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. CoRR, abs/1412.6553, 2014.
 [27] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. CoRR, abs/1608.08710, 2016.

[28] B. Liu, M. Wang, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
 [29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV (1), volume 9905 of Lecture Notes in Computer Science, pages 21–37. Springer, 2016.
 [30] Z. Liu, J. Li, Z. Shen, G. Huang, S. Yan, and C. Zhang. Learning efficient convolutional networks through network slimming. CoRR, abs/1708.06519, 2017.
 [31] J. Luo and J. Wu. An entropy-based pruning method for CNN compression. CoRR, abs/1706.05791, 2017.
 [32] J. Luo, J. Wu, and W. Lin. Thinet: A filter level pruning method for deep neural network compression. CoRR, abs/1707.06342, 2017.
 [33] N. Ma, X. Zhang, H.T. Zheng, and J. Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In The European Conference on Computer Vision (ECCV), September 2018.
 [34] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. Pruning convolutional neural networks for resource efficient transfer learning. CoRR, abs/1611.06440, 2016.
 [35] L. T. Nguyen-Meidine, E. Granger, M. Kiran, and L. Blais-Morin. A comparison of CNN-based face and head detectors for real-time video surveillance applications. In 2017 Seventh International Conference on Image Processing Theory, Tools and Applications (IPTA), pages 1–7, Nov 2017.
 [36] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPSW, 2017.
 [37] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. CoRR, abs/1603.05279, 2016.
 [38] J. Redmon and A. Farhadi. Yolov3: An incremental improvement. CoRR, abs/1804.02767, 2018.
 [39] M. Ren, A. Pokrovsky, B. Yang, and R. Urtasun. Sbnet: Sparse blocks network for fast inference. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [40] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015.
 [41] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014.
 [42] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation. CoRR, abs/1801.04381, 2018.
 [43] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 [44] C. Tai, T. Xiao, X. Wang, and W. E. Convolutional neural networks with low-rank regularization. CoRR, abs/1511.06067, 2015.
 [45] J. Wang, W. Bao, L. Sun, X. Zhu, B. Cao, and P. S. Yu. Private model compression via knowledge distillation. CoRR, abs/1811.05072, 2018.
 [46] X. Wang, R. Zhang, Y. Sun, and J. Qi. KDGAN: Knowledge distillation with generative adversarial networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 775–786. Curran Associates, Inc., 2018.
 [47] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. CoRR, abs/1608.03665, 2016.
 [48] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li. Coordinating filters for faster deep neural networks. CoRR, abs/1703.09746, 2017.
 [49] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang. A systematic dnn weight pruning framework using alternating direction method of multipliers. In The European Conference on Computer Vision (ECCV), September 2018.
 [50] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CoRR, abs/1707.01083, 2017.
 [51] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. Deep mutual learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
 [52] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. CoRR, abs/1702.03044, 2017.
 [53] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu. Discrimination-aware channel pruning for deep neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 883–894. Curran Associates, Inc., 2018.
Appendix A Additional Experimental Results
A.1 Comparison of PSFP and RPGP with ResNet:
As described in the previous experiments, PSFP [13] does not prune the downsampling layer of ResNet-56, and therefore does not prune the last layer of the residual connection either. The common strategy with ResNet consists of pruning the downsampling layer and then pruning the last layer of the residual connection with the same indexes. In this section, the performance of PSFP is compared with our proposed RPGP technique using the same strategy on ResNet-56, i.e., neither the downsampling layer nor the last layer of the residual connection is pruned. We use the CIFAR-10 dataset and the same hyperparameters as in the previous experiments of our paper.
The results in Table 11 indicate that the proposed RPGP approach typically performs better than PSFP. Interestingly, when no pruning is performed on the downsampling layer and the last layer of the residual connection, our method performs much better. These results suggest that the residual connection is very sensitive to pruning and may require a different pruning strategy.
Table 11: Error rates (%) of PSFP and RPGP(GN_S) on ResNet-56 (CIFAR-10) when neither prunes the downsampling layer or the last layer of the residual connection.

| Methods | 30% | 50% | 70% | 90% |
|---|---|---|---|---|
| PSFP | 8.94 | 10.93 | 14.18 | 28.09 |
| RPGP(GN_S) | 8.87 | 10.09 | 11.02 | 13.94 |
A.2 Graphical comparison on CIFAR-10 with VGG:
The results presented in this section are similar to the ones shown in Tables 1 to 4 of our paper. In the main paper, we could only compare the performance of methods at 4 pruning rates due to space constraints. In this section, we compare the performance of methods using the same experimental settings as in our paper, but with 10 data points for L1 [27], Taylor [34], PSFP [13] and our approach. Since the number of remaining parameters can differ slightly from one algorithm to another, some of the values on the X-axis are rounded for better visualization.
Results in Figure 2 show the proposed PGP and RPGP pruning methods consistently outperforming the other methods. Note that the proposed methods maintain a low level of error even with an important increase in the pruning rate.
A.3 Comparison on Faster R-CNN
As stated previously, object detection is an important application, and there have been many efforts to reduce the computational complexity of existing detectors. For this comparison, we evaluate our pruning algorithms on a very popular object detector, Faster R-CNN. Due to the difficulty of attaining good performance on object detection, we start this experiment from a COCO-pretrained model.
| Methods | VGG16 | ResNet-101 |
|---|---|---|
| PGP | 65.5% | - |
| RPGP | 64.0 | - |