Quantization reduces the numeric precision of the weight variables where possible. We use the TensorFlow quantization tooling to represent the network's internal weights as int16 or int32 instead of double or float. Our pipeline measures the impact of this change to verify that the network's accuracy, or its objective function, is only negligibly affected.
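The core idea can be sketched with a symmetric per-tensor scheme: a single scale maps floats into the integer range, and dequantizing bounds the reconstruction error by half the scale per weight. This is a minimal illustrative sketch in plain Python, not the TensorFlow API; the helper names are hypothetical.

```python
# Minimal sketch of symmetric linear quantization to int16.
# Hypothetical helper names for illustration; not the TensorFlow API.

def quantize_int16(weights):
    """Map floats to int16 codes using one per-tensor scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 32767 if max_abs else 1.0  # int16 max is 32767
    codes = [round(w / scale) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate float weights from integer codes."""
    return [c * scale for c in codes]

weights = [0.91, -0.42, 0.003, -1.27, 0.65]
codes, scale = quantize_int16(weights)
restored = dequantize(codes, scale)

# Rounding bounds the per-weight error by scale / 2.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

A pipeline like the one described above would compare the model's objective value before and after this substitution and reject the quantized weights if the degradation exceeds a tolerance.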
Our runtime takes advantage of available GPU optimizations to execute the neural network efficiently. The Adreno GPU line, found in Snapdragon processors, exposes this acceleration through its neural network library, which we use at runtime.
Pruning removes chunks of a rather sparse neural network, zeroing out weights or structures that contribute little to the output. Neural-network pruning is a well-known technique that can significantly reduce network size and improve efficiency.
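One common variant is magnitude-based pruning: the weights with the smallest absolute values are set to exact zero. The sketch below is an illustrative, stdlib-only example with a hypothetical function name; production pipelines typically prune structured blocks and fine-tune afterwards to recover accuracy.

```python
# Minimal sketch of magnitude-based weight pruning.
# Hypothetical helper for illustration; real pipelines retrain after pruning.

def prune_by_magnitude(weights, sparsity):
    """Zero out roughly the fraction `sparsity` of smallest-magnitude weights."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Threshold at the n_prune-th smallest absolute value.
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.9, -0.02, 0.4, 0.001, -0.7, 0.05, 0.3, -0.008]
pruned = prune_by_magnitude(weights, 0.5)
# Half of the weights (the four smallest magnitudes) become exact zeros.
```

The resulting zeros can be skipped or stored in a sparse format, which is where the size and efficiency gains come from.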