Quantization

 
[http://www.google.com/search?q=Quantization+aware+model+training ...Google search]
 
===== Quantization-aware model training =====
 
* [http://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/quantize#quantization-aware-training Quantization-aware training]
 
  
Quantization-aware model training ensures that the forward pass matches precision for both training and inference. There are two aspects to this:
  
 
* Operator fusion at inference time is accurately modeled at training time.
* Quantization effects at inference are modeled at training time.
  
 
For efficient inference, [[TensorFlow]] combines batch normalization with the preceding convolutional and fully-connected layers prior to quantization by folding batch norm layers.
 
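A minimal sketch of how this rewrite can be applied with the tf.contrib.quantize API linked above (TensorFlow 1.x); the small convolution/batch-norm network, the placeholder shapes, and the quant_delay value are only illustrative:

<pre>
import tensorflow as tf  # TensorFlow 1.x, where tf.contrib.quantize is available

def build_model(images, is_training=True):
    # A tiny conv -> batch norm -> relu -> dense network, just to give the
    # quantization rewriter a foldable pattern to act on.
    net = tf.layers.conv2d(images, 32, 3, use_bias=False)
    net = tf.layers.batch_normalization(net, training=is_training)
    net = tf.nn.relu(net)
    net = tf.layers.flatten(net)
    return tf.layers.dense(net, 10)

train_graph = tf.Graph()
with train_graph.as_default():
    images = tf.placeholder(tf.float32, [None, 28, 28, 1])
    labels = tf.placeholder(tf.int64, [None])
    loss = tf.losses.sparse_softmax_cross_entropy(
        labels=labels, logits=build_model(images, is_training=True))

    # Rewrite the training graph: fake-quantization ops are inserted and the
    # batch norm is folded into the preceding convolution, so the forward pass
    # seen during training matches the fused, quantized inference graph.
    tf.contrib.quantize.create_training_graph(input_graph=train_graph,
                                              quant_delay=2000)
    train_op = tf.train.GradientDescentOptimizer(1e-3).minimize(loss)

# For export, build the model again in inference mode and rewrite it with
# create_eval_graph() before freezing and converting to TensorFlow Lite.
eval_graph = tf.Graph()
with eval_graph.as_default():
    images = tf.placeholder(tf.float32, [None, 28, 28, 1])
    logits = build_model(images, is_training=False)
    tf.contrib.quantize.create_eval_graph(input_graph=eval_graph)
</pre>
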
===== Post-training quantization =====
* [http://www.tensorflow.org/lite/performance/post_training_quantization Post-training quantization]
Post-training quantization is a general technique to reduce model size while also providing up to 3x lower latency, with little degradation in model accuracy. It quantizes weights from floating point down to 8 bits of precision.
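
A minimal sketch of post-training quantization using the TensorFlow Lite converter described at the link above; the SavedModel directory and output filename are placeholders:

<pre>
import tensorflow as tf

# Load a trained model and enable the converter's default optimization, which
# quantizes the weights from 32-bit floating point down to 8-bit precision.
# "my_saved_model" and "model_quant.tflite" are placeholder paths.
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
</pre>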
  
 
<youtube>eZdOkDtYMoo</youtube>
 
