[MUSIC PLAYING]
JIAN LI: Hello, everyone.
My name's Jian.
I'm a software engineer on the TensorFlow team.
Today, my colleague Pulkit and I will
be talking about the TensorFlow model optimization toolkit.
Model optimization means transforming your machine
learning models to make them efficient to execute.
That means faster computation as well as lower memory,
storage, and battery usage.
And it is focused on inference instead of training.
And because of the above mentioned benefits,
optimization can unlock use cases
that are otherwise impossible.
Examples include speech recognition, face unlock,
object detection, music recognition, and many more.
The model optimization toolkit is a suite
of TensorFlow and TensorFlow Lite tools
that make it simple to optimize your model.
Optimization is an active research area
and there are many techniques.
Our goal is to prioritize the ones that
are general across model architectures
and across various hardware accelerators.
There are two major techniques in the toolkit, quantization
and pruning.
Quantization simulates float calculation in lower bits,
and pruning forces some of the connections to zero.
Today we are going to focus on quantization
and we'll briefly talk about pruning.
Now let's take a closer look at quantization.
Quantization is a general term describing technologies
that reduce the numerical precision of static parameters
and execute the operations in lower precision.
Precision reduction makes the model smaller,
and a lower precision execution makes the model faster.
Now let's dig a bit more into how we perform quantization.
As a concrete example, imagine we have
a tensor with float values.
In most cases, we are wasting most of the representation
space in the float number line.
If we can find a linear transformation that
maps the float value onto int8, we can reduce the model size
by a factor of four.
Then computations can be carried out between int8 values,
and that is where the speed up comes from.
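To make the linear mapping concrete, here is a minimal NumPy sketch with made-up numbers; the tensor values, the int8 range, and the rounding choices are illustrative rather than the exact TensorFlow Lite scheme:

import numpy as np

# A hypothetical float tensor we want to quantize.
w = np.array([-1.8, -0.5, 0.0, 0.7, 2.3], dtype=np.float32)

# Affine quantization: map the float range [min, max] onto the int8 range [-128, 127].
qmin, qmax = -128, 127
scale = (w.max() - w.min()) / (qmax - qmin)
zero_point = int(round(qmin - w.min() / scale))

# Quantize: round to the nearest int8 value.
q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.int8)

# Dequantizing recovers an approximation of the original float values.
w_approx = scale * (q.astype(np.float32) - zero_point)

Each float now occupies one byte instead of four, which is where the four-times size reduction comes from.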
So there are two main approaches to quantization: post-training
and during training.
Post-training quantization operates on an already trained model
and is built on top of the TensorFlow Lite converter.
During-training quantization performs additional weight
fine-tuning, and since training is required,
it is built on top of the TensorFlow Keras API.
Different techniques offer a trade-off
between ease of use and model accuracy.
The easiest technique to use is dynamic range
quantization, which doesn't require any data.
There can be some accuracy loss but we get a two to three times
speed up.
Because floating point calculation
is still needed for the activation,
it's only meant to run on CPU.
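As a minimal sketch, assuming a model already exported with tf.saved_model.save to a hypothetical path, dynamic range quantization only needs the optimization flag on the TensorFlow Lite converter:

import tensorflow as tf

# Dynamic range quantization: no calibration data needed.
converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_saved_model")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()
with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_model)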
If we want extra speed up on CPU or want
to run the model on hardware accelerators,
we can use integer quantization.
It runs a small set of unlabeled calibration data
to collect the min-max range on activation.
This removes the floating point calculation
from the compute graph, so there is a speed up on CPU.
But more importantly, it allows the model
to run on hardware accelerators such as DSP and TPU,
which are faster and more energy efficient than CPU.
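A hedged sketch of what integer quantization could look like with the TensorFlow Lite converter, assuming a vision-style saved model at a hypothetical path and random arrays standing in for real calibration samples:

import numpy as np
import tensorflow as tf

def representative_dataset():
    # Hypothetical calibration data: a few unlabeled samples shaped like the model input.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_saved_model")  # hypothetical path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict the converter to integer-only kernels so the model can run
# on integer accelerators such as DSPs.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()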
And if accuracy is a concern, we can
use Quantization Aware Training to fine-tune the weights.
It has all the benefits of integer quantization,
but it requires training.
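Quantization Aware Training is covered in more depth later in the talk; as a rough sketch using the tensorflow_model_optimization package, with a toy Keras model standing in for a real one, the basic pattern looks like this:

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# A toy Keras model; in practice you would start from your trained model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(10),
])

# Wrap the model so that quantization is simulated during training,
# letting the weights adapt to the lower precision.
q_aware_model = tfmot.quantization.keras.quantize_model(model)
q_aware_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# Fine-tune on your training data (x_train and y_train are placeholders here).
# q_aware_model.fit(x_train, y_train, epochs=1)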
Now let's have an operator-level breakdown of post-training
quantization.
Dynamic range quantization is fully supported
and integer quantization is supported
for most of the operators.
The missing piece is recurrent neural network
support, and that blocks use cases
such as speech and language, where context is needed.
To unblock those use cases, we have recently
added recurrent neural network quantization
and built a turnkey solution through the post-training API.
RNN models built with Keras 2.0 can be converted and quantized
with the post-training API.
This slide shows the end to end workflow
in the post training setup.
We create the TensorFlow Lite converter
and load the saved RNN model.
We then set the post training optimization flags
and provide calibration data.
After that, we are able to call the convert method to convert
and quantize the model.
This is the exact same API and workflow for models
without RNN, so there is no API change for the end users.
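Assuming the RNN model has been exported as a saved model to a hypothetical path, and with a randomly generated calibration set whose shape is purely illustrative, the workflow just described looks roughly like this:

import numpy as np
import tensorflow as tf

# Hypothetical unlabeled calibration examples; the shape is made up for illustration.
sample_inputs = [np.random.rand(1, 20, 40).astype(np.float32) for _ in range(100)]

# Step 1: create the converter and load the saved RNN model (hypothetical path).
converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_rnn_model")

# Step 2: set the post-training optimization flag.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Step 3: provide the calibration data.
def representative_dataset():
    for example in sample_inputs:
        yield [example]
converter.representative_dataset = representative_dataset

# Step 4: convert and quantize the model.
tflite_quantized_model = converter.convert()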
Let's take a look at the challenges
of the RNN quantization.
Quantization is a lossy transformation.
An RNN cell has a memory state that persists
across multiple time steps, so quantization errors
can accumulate in both the layer direction and the time
direction.
An RNN cell contains many calculations,
and determining the number of bits and the scales
is a global optimization problem.
Also, quantized operations are restricted
by hardware capabilities.
Some operations are not allowed on certain hardware platforms.
We solved these challenges and created a quantization spec
for RNNs.
The full spec is quite complicated,
and this slide shows it by zooming
into one of the LSTM gates.
As I mentioned, there are many calculations in one cell.
To balance performance and accuracy,
we keep 8-bit calculations as much as possible
and only go to higher bits when accuracy requires it.
As you can see from the diagram, matrix-related
operations are in 8 bits, and the remaining operations
are a mixture of 8 and 16 bits.
And please note, the use of higher bits
is only internal to the cell.
The input and output activations of the RNN cell are all 8-bit.
Now that we have seen the details of RNN quantization,
let's look at the accuracy and the performance.
This table shows some published accuracy numbers
on a few data sets.
It's a speech recognition model that consists
of 10 layers of quantized LSTM.
As you can see, the integer quantized model
has the same accuracy as the dynamic range quantized model,
and the accuracy loss is negligible
compared with the float model.
Also, this is a pruned model, so RNN quantization
works with pruning as well.
As expected, there is a four-times reduction in model size
because the static weights are quantized to 8 bits.
Performance-wise, there is a two to four times
speed up on a CPU and a more than 10 times speed
up on DSP and TPU.
So those numbers are consistent with the numbers
from other operators.
So here are the main takeaways.
TensorFlow now supports RNN/LSTM quantization.
It is a turnkey solution through the post-training API.
It enables smaller, faster, and more energy-efficient
execution that can run on DSP and TPU.
There are already production models
that use the quantization.
And please check the link for more details on the use cases.
Looking forward, our next step will
be to expand quantization to other recurrent neural
networks, such as the GRU and SRU.
We also plan to add Quantization Aware Training for RNN.
Now I'll hand it over to my colleague Pulkit.
Thank you.
PULKIT BHUWALKA: Thanks, Jian.
Hi, my name is Pulkit.
I work on the model optimization toolkit.
And let's talk about--
clicker doesn't seem to be working.
Sorry, can we go back a slide?
Yes.
Quantization Aware Training.
So Quantization Aware Training is a training time technique
for improving the accuracy of quantized models.
The way it works is that we introduced