Quantization

Revision as of 20:42, 2 March 2019 by BPeat (talk | contribs)


Quantization-aware model training

Quantization-aware training ensures that the forward pass uses matching precision for both training and inference. There are two aspects to this:

  • Operator fusion at inference time is accurately modeled at training time.
  • Quantization effects at inference are modeled at training time.
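Modeling quantization effects at training time is often done by "fake quantization": a value is rounded to the grid that inference will use and then converted straight back to floating point, so the forward pass sees the rounding error. The sketch below is a minimal illustration in plain Python, not TensorFlow's implementation. The function name `fake_quantize`, the fixed range of -1.0 to 1.0, and the sample weights are assumptions made for the example.

```python
def fake_quantize(x, x_min, x_max, num_bits=8):
    """Quantize x onto a uniform grid over [x_min, x_max], then dequantize.

    The result is still a float, but it can only take one of
    2**num_bits values, which mimics what inference will see.
    """
    levels = 2 ** num_bits - 1            # number of steps between grid points
    scale = (x_max - x_min) / levels      # width of one quantization step
    clamped = min(max(x, x_min), x_max)   # values outside the range saturate
    q = round((clamped - x_min) / scale)  # integer grid index in [0, levels]
    return x_min + q * scale              # back to floating point


# Hypothetical weights, assumed to lie in the range [-1.0, 1.0].
weights = [-1.0, -0.3337, 0.0, 0.5012, 1.0]
simulated = [fake_quantize(w, -1.0, 1.0) for w in weights]
```

Each entry of `simulated` differs from the original weight by at most half a quantization step, which is the error the training procedure learns to tolerate.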

For efficient inference, TensorFlow folds batch normalization layers into the preceding convolutional and fully-connected layers before quantization.
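Folding works because batch normalization at inference time is a fixed per-channel scale and shift, which can be absorbed into the weights and bias of the layer before it. The following is a minimal sketch for a fully-connected layer in plain Python, assuming the standard batch norm formula gamma * (y - mean) / sqrt(variance + eps) + beta. The helper names (`fold_batch_norm`, `dense`, `batch_norm`) and the value of `eps` are illustrative assumptions, not TensorFlow APIs.

```python
import math


def fold_batch_norm(weights, bias, gamma, beta, mean, variance, eps=1e-3):
    """Fold batch norm parameters into a dense layer.

    weights[o][i] is the weight from input i to output channel o.
    Returns (folded_weights, folded_bias) for a single equivalent layer.
    """
    folded_w, folded_b = [], []
    for o, row in enumerate(weights):
        factor = gamma[o] / math.sqrt(variance[o] + eps)  # per-channel scale
        folded_w.append([w * factor for w in row])
        folded_b.append((bias[o] - mean[o]) * factor + beta[o])
    return folded_w, folded_b


def dense(x, weights, bias):
    """Plain fully-connected layer: one dot product per output channel."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def batch_norm(y, gamma, beta, mean, variance, eps=1e-3):
    """Inference-time batch normalization applied per channel."""
    return [g * (v - m) / math.sqrt(var + eps) + b
            for v, g, b, m, var in zip(y, gamma, beta, mean, variance)]


# Hypothetical layer with 3 inputs and 2 output channels.
w = [[0.2, -0.5, 0.1], [0.7, 0.3, -0.4]]
b = [0.05, -0.1]
gamma, beta = [1.5, 0.8], [0.1, -0.2]
mean, variance = [0.3, -0.1], [0.9, 1.7]
x = [1.0, 2.0, -1.0]

unfolded = batch_norm(dense(x, w, b), gamma, beta, mean, variance)
fw, fb = fold_batch_norm(w, b, gamma, beta, mean, variance)
folded = dense(x, fw, fb)  # same result from a single layer
```

After folding, `folded` and `unfolded` agree up to floating-point rounding, so only the folded weights need to be quantized.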

Post-training quantization

Post-training quantization is a general technique that reduces model size while also providing up to 3x lower latency with little degradation in model accuracy. It quantizes weights from floating point to 8 bits of precision.
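To make the size reduction concrete, the sketch below quantizes a small list of 32-bit float weights to signed 8-bit integers in plain Python. It assumes a symmetric scheme with one scale for the whole tensor, which is only one of several possible schemes and is not claimed to be what TensorFlow does internally. The function names and the sample weights are hypothetical.

```python
from array import array


def quantize_weights(weights):
    """Map float weights to int8 using one symmetric scale for the tensor."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0  # avoid dividing by zero
    q = array('b', (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize_weights(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [v * scale for v in q]


# Hypothetical weights stored as 32-bit floats.
float_weights = array('f', [0.12, -0.5, 0.33, 0.9, -0.07])
q_weights, scale = quantize_weights(float_weights)
restored = dequantize_weights(q_weights, scale)

float_bytes = len(float_weights.tobytes())  # 4 bytes per weight
int8_bytes = len(q_weights.tobytes())       # 1 byte per weight
```

Here the quantized weights take a quarter of the storage of the float weights, and each restored value is within half a quantization step of the original. The sketch says nothing about latency, which depends on the kernels used at inference time.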