Quantization

[http://www.youtube.com/results?search_query=Quantization+aware+model+training YouTube search...]
 
[http://www.google.com/search?q=Quantization+aware+model+training ...Google search]
 
* [http://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/ How to Quantize Neural Networks with TensorFlow | Pete Warden]
* [http://heartbeat.fritz.ai/8-bit-quantization-and-tensorflow-lite-speeding-up-mobile-inference-with-low-precision-a882dfcafbbd 8-Bit Quantization and TensorFlow Lite: Speeding up mobile inference with low precision | Manas Sahni]
* [[TensorFlow Lite]]

Quantization is the process of constraining an input from a continuous or otherwise large set of values (such as the real numbers) to a discrete set (such as the integers). It is an umbrella term that covers many different techniques for storing numbers and performing calculations on them in more compact formats than 32-bit floating point.

http://cdn-images-1.medium.com/max/800/0*lKwwM6_WSyBRkPCe.png
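
As a minimal illustration (a NumPy sketch, not taken from any of the linked articles), a float32 tensor can be mapped onto the 256 values representable in an unsigned 8-bit integer, stored at a quarter of the size, and mapped back to approximate floats when needed:

<pre>
import numpy as np

def quantize_uint8(x):
    """Affine (asymmetric) quantization of a float32 array to uint8."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 if x_max > x_min else 1.0
    q = np.round((x - x_min) / scale).astype(np.uint8)   # values now in {0..255}
    return q, scale, x_min

def dequantize(q, scale, x_min):
    """Map the 8-bit codes back to approximate float32 values."""
    return q.astype(np.float32) * scale + x_min

weights = np.random.randn(256, 256).astype(np.float32)     # 256 KB as float32
q, scale, x_min = quantize_uint8(weights)                   # 64 KB as uint8
max_error = np.abs(weights - dequantize(q, scale, x_min)).max()
print(q.nbytes, weights.nbytes, max_error)                  # 4x smaller, error ~ scale/2
</pre>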
  
  
  
 
<youtube>eZdOkDtYMoo</youtube>
 
Quantization-aware model training

ensures that the forward pass matches precision for both training and inference. There are two aspects to this:

  • Operator fusion at inference time is accurately modeled at training time.
  • Quantization effects at inference are modeled at training time (see the sketch below).
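
A rough sketch of how this rewrite was typically wired up in TensorFlow 1.x with tf.contrib.quantize (the layer sizes, quant_delay value, and optimizer here are illustrative choices, not from the page):

<pre>
import tensorflow as tf  # TensorFlow 1.x with tf.contrib available

g = tf.Graph()
with g.as_default():
    images = tf.placeholder(tf.float32, [None, 28, 28, 1])
    labels = tf.placeholder(tf.int64, [None])

    # Small conv net; the batch norm below is folded into the conv for inference.
    net = tf.layers.conv2d(images, 32, 3, padding='same')
    net = tf.layers.batch_normalization(net, training=True)
    net = tf.nn.relu(net)
    logits = tf.layers.dense(tf.layers.flatten(net), 10)
    loss = tf.losses.sparse_softmax_cross_entropy(labels, logits)

    # Rewrite the forward graph with fake-quantization ops so training sees the
    # same 8-bit rounding that quantized inference will see. quant_delay lets
    # the model train in float for some steps before quantization is simulated.
    tf.contrib.quantize.create_training_graph(input_graph=g, quant_delay=2000)

    # Build the optimizer after the rewrite, then train as usual.
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
</pre>

At export time, tf.contrib.quantize.create_eval_graph() produces the matching inference graph, which is then converted with TensorFlow Lite.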

For efficient inference, TensorFlow folds batch normalization into the preceding convolutional and fully-connected layers prior to quantization.
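
The folding step itself is a small arithmetic rewrite. A NumPy sketch of the standard identity (shapes and names here are illustrative):

<pre>
import numpy as np

def fold_batch_norm(w, b, gamma, beta, mean, var, eps=1e-3):
    """Fold batch-norm parameters into the preceding conv layer.

    w: conv kernel of shape (kh, kw, in_ch, out_ch); b: conv bias of shape (out_ch,)
    gamma, beta, mean, var: per-output-channel batch-norm parameters.
    Returns (w_fold, b_fold) such that conv(x, w_fold) + b_fold equals
    batch_norm(conv(x, w) + b) with the given statistics.
    """
    scale = gamma / np.sqrt(var + eps)   # per-channel multiplier
    w_fold = w * scale                   # broadcasts over the output-channel axis
    b_fold = beta + (b - mean) * scale
    return w_fold, b_fold
</pre>

After folding, each conv + batch-norm pair behaves as a single convolution, which can then be quantized as one operation.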

Post-training quantization

is a general technique to reduce model size while also providing up to 3x lower latency, with little degradation in model accuracy. Post-training quantization converts weights from floating point to 8 bits of precision.
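
A minimal sketch of post-training quantization with the TensorFlow Lite converter (the SavedModel path is a placeholder; assumes a TensorFlow version that exposes the tf.lite.Optimize API):

<pre>
import tensorflow as tf

# Convert an already-trained float model, quantizing its weights to 8 bits.
converter = tf.lite.TFLiteConverter.from_saved_model('/tmp/my_saved_model')  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]   # enables post-training quantization
tflite_model = converter.convert()

with open('model_quant.tflite', 'wb') as f:
    f.write(tflite_model)
</pre>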


http://petewarden.files.wordpress.com/2016/05/quantization2.png