[http://www.google.com/search?q=Quantization+aware+model+training ...Google search]
===== Quantization-aware model training =====
* [http://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/quantize#quantization-aware-training Quantization-aware training]
ensures that the forward pass matches the precision of inference during training. There are two aspects to this (see the sketch after the list):
* Operator fusion at inference time is accurately modeled at training time.
* Quantization effects at inference are modeled at training time.
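A minimal sketch of how this graph rewrite might be invoked with the tf.contrib.quantize API linked above (TensorFlow 1.x); the model-building and training code is elided, and the graph <code>g</code> and <code>quant_delay</code> value are placeholder assumptions:

<syntaxhighlight lang="python">
import tensorflow as tf

g = tf.Graph()
with g.as_default():
    # ... build the float model and its training loss here (elided) ...

    # Rewrite the graph in place: fake-quantization ops are inserted so
    # the forward pass during training models inference-time quantization,
    # including the operator fusions listed above.
    tf.contrib.quantize.create_training_graph(input_graph=g, quant_delay=0)

    # ... create the optimizer / train_op after the rewrite (elided) ...

# For export, the matching rewrite is applied to a separate eval graph:
# tf.contrib.quantize.create_eval_graph(input_graph=eval_graph)
</syntaxhighlight>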
For efficient inference, [[TensorFlow]] folds batch normalization into the preceding convolutional and fully-connected layers before quantization.
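Folding a batch norm layer means baking its learned scale and shift into the preceding layer's weights and bias, so a plain convolution reproduces the normalized output. A minimal NumPy sketch of the arithmetic, assuming output channels on the last weight axis (the function name and layout are illustrative, not TensorFlow's internals):

<syntaxhighlight lang="python">
import numpy as np

def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-3):
    # Batch norm computes gamma * (y - mean) / sqrt(var + eps) + beta,
    # where y = conv(x, w) + b. Rescaling w and b lets the convolution
    # alone reproduce the normalized output at inference time.
    scale = gamma / np.sqrt(var + eps)   # per-output-channel scale factor
    w_fold = w * scale                   # broadcasts over the channel axis
    b_fold = (b - mean) * scale + beta
    return w_fold, b_fold
</syntaxhighlight>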
===== Post-training quantization =====
* [http://www.tensorflow.org/lite/performance/post_training_quantization Post-training quantization]
is a general technique to reduce model size while also providing up to 3x lower latency, with little degradation in model accuracy. Post-training quantization converts weights from floating point to 8-bit precision.
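A minimal sketch of post-training weight quantization with the TensorFlow Lite converter documented at the link above; <code>saved_model_dir</code> is a placeholder path for an already-trained model:

<syntaxhighlight lang="python">
import tensorflow as tf

saved_model_dir = "/tmp/my_saved_model"  # placeholder: a trained SavedModel

# Convert the float model to TensorFlow Lite, asking the converter to
# quantize the weights from floating point down to 8-bit precision.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_quant_model)
</syntaxhighlight>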
<youtube>eZdOkDtYMoo</youtube>