[MUSIC PLAYING] JIAN LI: Hello, everyone. My name's Jian. I'm a software engineer on the TensorFlow team. Today, my colleague Pulkit and I will be talking about the TensorFlow model optimization toolkit. Model optimization means transforming your machine learning models to make them efficient to execute. That means faster computation as well as lower memory, storage, and battery usage. It is focused on inference instead of training. And because of the above-mentioned benefits, optimization can unlock use cases that are otherwise impossible. Examples include speech recognition, face unlock, object detection, music recognition, and many more.

The model optimization toolkit is a suite of TensorFlow and TensorFlow Lite tools that make it simple to optimize your model. Optimization is an active research area and there are many techniques. Our goal is to prioritize the ones that are general across model architectures and across various hardware accelerators. There are two major techniques in the toolkit: quantization and pruning. Quantization simulates float calculations in lower bits, and pruning forces zeros in the network connections. Today we are going to focus on quantization, and we'll briefly talk about pruning.

Now let's take a closer look at quantization. Quantization is a general term describing techniques that reduce the numerical precision of static parameters and execute the operations in lower precision. Precision reduction makes the model smaller, and lower-precision execution makes the model faster. Now let's dig a bit more into how we perform quantization. As a concrete example, imagine we have a tensor with float values. In most cases, we are wasting most of the representation space on the float number line. If we can find a linear transformation that maps the float values onto int8, we can reduce the model size by a factor of four. Then computations can be carried out between int8 values, and that is where the speedup comes from.
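The linear transformation described here can be sketched in plain Python. The `quantize`/`dequantize` helpers below are illustrative only, not the TensorFlow Lite implementation:

```python
def quantize(values, num_bits=8):
    """Map float values onto signed num_bits integers with a linear transform."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1  # int8: [-128, 127]
    lo = min(min(values), 0.0)  # range must include 0 so 0.0 stays exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against an all-zero tensor
    zero_point = round(qmin - lo / scale)
    q = [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from the quantized integers."""
    return [(qi - zero_point) * scale for qi in q]
```

Round-tripping a tensor through these two helpers introduces an error on the order of one scale step per value, which is the lossiness the RNN discussion later refers to.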
So there are two main approaches to quantization: post-training and during training. Post-training operates on an already trained model and is built on top of the TensorFlow Lite converter. During training, quantization performs additional weight fine-tuning, and since training is required, it is built on top of the TensorFlow Keras API. Different techniques offer a trade-off between ease of use and model accuracy.

The easiest technique to use is dynamic range quantization, which doesn't require any data. There can be some accuracy loss, but we get a two to three times speedup. Because floating point calculation is still needed for the activations, it's only meant to run on CPU. If we want an extra speedup on CPU or want to run the model on hardware accelerators, we can use integer quantization. It runs a small set of unlabeled calibration data to collect the min-max ranges of the activations. This removes the floating point calculations from the compute graph, so there is a speedup on CPU. But more importantly, it allows the model to run on hardware accelerators such as DSPs and TPUs, which are faster and more energy efficient than CPUs. And if accuracy is a concern, we can use Quantization Aware Training to fine-tune the weights. It has all the benefits of integer quantization, but it requires training.

Now let's have an operator-level breakdown of post-training quantization. Dynamic range quantization is fully supported, and integer quantization is supported for most of the operators. The missing piece is recurrent neural network support, and that blocks use cases such as speech and language where context is needed. To unblock those use cases, we have recently added recurrent neural network quantization and built a turnkey solution through the post-training API. RNN models built with Keras in TensorFlow 2.0 can be converted and quantized with the post-training API. This slide shows the end-to-end workflow in the post-training setup.
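As a minimal sketch of the easiest path, dynamic range quantization needs only the trained model and a single optimization flag. The tiny dense model here is a stand-in for whatever model you have trained:

```python
import tensorflow as tf

# Stand-in trained model; any Keras model would do here.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Dynamic range quantization: no calibration data required.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # weights stored as int8
tflite_model = converter.convert()  # bytes of the .tflite flatbuffer
```

Because activations are still computed in float, this variant targets CPU only, as described above.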
We create the TensorFlow Lite converter and load the saved RNN model. We then set the post-training optimization flags and provide calibration data. After that, we are able to call the convert method to convert and quantize the model. This is the exact same API and workflow as for models without RNNs, so there is no API change for end users.

Let's take a look at the challenges of RNN quantization. Quantization is a lossy transformation. An RNN cell has a memory state that persists across multiple timesteps, so quantization errors can accumulate in both the layer direction and the time direction. An RNN cell contains many calculations, and determining the number of bits and the scale is a global optimization problem. Also, quantized operations are restricted by hardware capabilities; some operations are not allowed on certain hardware platforms.

We solved these challenges and created the quantization spec for RNNs. The full spec is quite complicated, and this slide shows the spec by zooming into one of the LSTM gates. As I mentioned, there are many calculations in one cell. To balance performance and accuracy, we keep 8-bit calculations as much as possible and only go to higher bits when required by accuracy. As you can see from the diagram, matrix-related operations are in 8 bits, and element-wise operations are a mixture of 8 and 16 bits. And please note, the use of higher bits is only internal to the cell. The input and output activations for the RNN cell are all 8 bits.

Now that we have seen the details of RNN quantization, let's look at the accuracy and the performance. This table shows some published accuracy numbers on a few data sets. It's a speech recognition model that consists of 10 layers of quantized LSTM. As you can see, the integer quantized model has the same accuracy as the dynamic range quantized model, and the accuracy loss is negligible compared with the float case. Also, this is a pruned model, so RNN quantization works with pruning as well.
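The workflow just described might look like the sketch below. The dense model and random calibration samples are placeholders for the saved RNN model and real calibration data from the talk:

```python
import numpy as np
import tensorflow as tf

# Placeholder for the trained model; the talk loads a saved RNN model here.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])

def representative_dataset():
    # Small set of unlabeled samples used to calibrate activation min-max ranges.
    for _ in range(100):
        yield [np.random.rand(1, 8).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]       # enable quantization
converter.representative_dataset = representative_dataset  # calibration data
tflite_model = converter.convert()  # convert and quantize in one call
```

For a model saved on disk, `tf.lite.TFLiteConverter.from_saved_model(path)` replaces `from_keras_model`, with the rest of the workflow unchanged.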
As expected, there is a four times model size reduction, because the static weights are quantized to 8 bits. Performance-wise, there is a two to four times speedup on CPU and a more than 10 times speedup on DSP and TPU. Those numbers are consistent with the numbers from other operators.

So here are the main takeaways. TensorFlow now supports RNN/LSTM quantization. It is a turnkey solution through the post-training API. It enables smaller, faster, and more energy-efficient execution that can run on DSP and TPU. There are already production models that use this quantization, and please check the link for more details on the use cases. Looking forward, our next step will be to expand quantization to other recurrent neural networks, such as GRU and SRU. We also plan to add Quantization Aware Training for RNNs. Now I'll hand it over to my colleague Pulkit. Thank you.

PULKIT BHUWALKA: Thanks. Thanks, Jian. Hi, my name is Pulkit. I work on the model optimization toolkit. And let's talk about-- the clicker doesn't seem to be working. Sorry, can we go back a slide? Yes. Quantization Aware Training. So Quantization Aware Training is a training-time technique for improving the accuracy of quantized models. The way it works is that we introduced