Optimize your models with TF Model Optimization Toolkit (TF Dev Summit '20)

[MUSIC PLAYING]

JIAN LI: Hello, everyone. My name is Jian. I'm a software engineer on the TensorFlow team. Today, my colleague Pulkit and I will be talking about the TensorFlow Model Optimization Toolkit. Model optimization means transforming your machine learning models to make them efficient to execute: faster computation as well as lower memory, storage, and battery usage. It is focused on inference rather than training. Because of these benefits, optimization can unlock use cases that are otherwise impossible. Examples include speech recognition, face unlock, object detection, music recognition, and many more.

The Model Optimization Toolkit is a suite of TensorFlow and TensorFlow Lite tools that make it simple to optimize your model. Optimization is an active research area with many techniques, and our goal is to prioritize the ones that generalize across model architectures and across various hardware accelerators. There are two major techniques in the toolkit: quantization and pruning. Quantization simulates float calculations in lower-bit arithmetic, and pruning forces some of the connections to zero. Today we are going to focus on quantization, and we'll briefly talk about pruning.

Now let's take a closer look at quantization. Quantization is a general term for techniques that reduce the numerical precision of static parameters and execute the operations in lower precision. The precision reduction makes the model smaller, and lower-precision execution makes the model faster. Let's dig a bit more into how we perform quantization. As a concrete example, imagine we have a tensor with float values. In most cases, we are wasting most of the representation space on the float number line. If we can find a linear transformation that maps the float values onto int8, we can reduce the model size by a factor of four. Computations can then be carried out between int8 values, and that is where the speedup comes from.

There are two main approaches to quantization: post-training and during training. Post-training quantization operates on an already-trained model and is built on top of the TensorFlow Lite converter. Quantization during training performs additional weight fine-tuning, and since training is required, it is built on top of the TensorFlow Keras API. The different techniques offer a trade-off between ease of use and model accuracy. The easiest technique to use is dynamic range quantization, which doesn't require any data. There can be some accuracy loss, but we get a two to three times speedup. Because floating-point calculation is still needed for the activations, it is only meant to run on CPU. If we want extra speedup on CPU or want to run the model on hardware accelerators, we can use integer quantization. It uses a small set of unlabeled calibration data to collect the min-max ranges of the activations. This removes the floating-point calculations from the compute graph, so there is a speedup on CPU. More importantly, it allows the model to run on hardware accelerators such as DSPs and TPUs, which are faster and more energy efficient than CPUs. And if accuracy is a concern, we can use Quantization Aware Training to fine-tune the weights. It has all the benefits of integer quantization, but it requires training.

Now let's have an operator-level breakdown of post-training quantization. Dynamic range quantization is fully supported, and integer quantization is supported for most of the operators.
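To make the two post-training paths just described concrete, here is a minimal sketch using the TensorFlow Lite converter. The saved-model path, input shape, and random calibration data are placeholders; a real model needs its own representative samples.

```python
import numpy as np
import tensorflow as tf

# Dynamic range quantization: no data needed. Weights are stored in int8,
# activations are still computed in float, so this path targets CPU.
converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_range_model = converter.convert()

# Integer quantization: a small unlabeled calibration set provides the
# min-max ranges for activations so the graph can run in 8-bit integers.
def representative_dataset():
    for _ in range(100):
        # Dummy samples; shape and dtype must match the model's input.
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Optionally restrict the model to int8 ops so it can target DSPs and TPUs.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
integer_model = converter.convert()
```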
The missing piece is recurrent neural network support, and that blocks use cases such as speech and language, where context is needed. To unblock those use cases, we have recently added recurrent neural network quantization and built a turnkey solution through the post-training API. RNN models built with the Keras API in TensorFlow 2 can be converted and quantized with the post-training API.

This slide shows the end-to-end workflow in the post-training setup. We create the TensorFlow Lite converter and load the saved RNN model. We then set the post-training optimization flags and provide calibration data. After that, we call the convert method to convert and quantize the model. This is the exact same API and workflow as for models without RNNs, so there is no API change for end users.

Let's take a look at the challenges of RNN quantization. Quantization is a lossy transformation. An RNN cell has a memory state that persists across multiple time steps, so quantization errors can accumulate in both the layer direction and the time direction. An RNN cell contains many calculations, and determining the number of bits and the scales is a global optimization problem. Also, quantized operations are restricted by hardware capabilities; some operations are not allowed on certain hardware platforms.

We solved these challenges and created a quantization spec for RNNs. The full spec is quite complicated, and this slide shows it by zooming in on one of the LSTM gates. As I mentioned, there are many calculations in one cell. To balance performance and accuracy, we keep calculations in 8 bits as much as possible and only go to higher bits when accuracy requires it. As you can see from the diagram, the matrix-related operations are in 8 bits, and the element-wise operations are a mixture of 8 and 16 bits. Please note that the use of higher bits is only internal to the cell; the input and output activations of the RNN cell are all 8 bits.

Now that we have seen the details of RNN quantization, let's look at accuracy and performance. This table shows some published accuracy numbers on a few data sets. It is a speech recognition model that consists of 10 quantized LSTM layers. As you can see, the integer-quantized model has the same accuracy as the dynamic-range-quantized model, and the accuracy loss is negligible compared with the float baseline. Also, this is a pruned model, so RNN quantization works with pruning as well. As expected, there is a four times model size reduction, because the static weights are quantized to 8 bits. Performance-wise, there is a two to four times speedup on CPU and a more than ten times speedup on DSP and TPU, so those numbers are consistent with the numbers for other operators.

So here are the main takeaways. TensorFlow now supports RNN/LSTM quantization. It is a turnkey solution through the post-training API. It enables smaller, faster, and more energy-efficient execution that can run on DSPs and TPUs. There are already production models that use this quantization; please check the link for more details on the use cases. Looking forward, our next step will be to expand quantization to other recurrent neural networks, such as GRU and SRU. We also plan to add Quantization Aware Training for RNNs.
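As an illustration of the turnkey workflow described above, here is a hedged sketch: a toy Keras LSTM model pushed through the same post-training path, with the optimization flag and calibration data set on the converter. The architecture, shapes, and random calibration samples are illustrative only.

```python
import numpy as np
import tensorflow as tf

# A toy model containing an LSTM layer: 20 time steps, 40 features per step.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(20, 40)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
# ... train the model as usual before converting ...

def representative_dataset():
    for _ in range(100):
        # Dummy calibration samples matching the model's input shape.
        yield [np.random.rand(1, 20, 40).astype(np.float32)]

# The exact same post-training API quantizes the LSTM along with the rest.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
quantized_lstm_model = converter.convert()
```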
Now I'll hand it over to my colleague Pulkit. Thank you.

PULKIT BHUWALKA: Thanks, Jian. Hi, my name is Pulkit. I work on the model optimization toolkit, and let's talk about Quantization Aware Training.

Quantization Aware Training is a training-time technique for improving the accuracy of quantized models. The way it works is that we introduce some of the errors that actually happen during quantized inference into the training process, which helps the trainer learn around those errors and produce a more accurate model. Now let's try to get a sense of why this is needed in the first place. We know that quantized models run in lower precision; that makes quantization a lossy process, and it leads to an accuracy drop. And while quantized models are super fast and we want them, nobody wants an inaccurate model. So the goal is to get the best of both worlds, and that's why we have this system.

To get a sense of why these losses get introduced: one reason is that once a model is quantized, its parameters are in lower precision. In a sense, you have coarser information, fewer buckets of information, so there is an information representation loss. The other problem is computation loss: when you do these computations, you are adding two coarse values instead of finer-grained ones. Typically, during matrix-multiplication-type operations, even if you are computing in int8, you accumulate the values into int32 and then rescale them back to int8, so you have that rescaling loss. Also, when we run quantized models during inference, various inference optimizations get applied to the graph, so the training graph and the inference graph can be subtly different, which can also introduce some of these errors.

And how do we recover the lost accuracy? For starters, we make the training graph as similar as possible to the inference graph to remove those subtle differences. And we introduce the errors that actually happen during inference, so the trainer learns around them and machine learning does its magic. For example, when it comes to mimicking errors, as you can see in the graph here, you take the weights down to lower precision: if your weights are in floating point, you go down to int8 and then back up to floating point. In that sense, you have mimicked what happens during inference when you execute at lower precision. Then you do your computation, and because both your inputs and your weights carry the int8 quantization error, the computation behaves the way it will at inference time. After the computation, you add another fake-quant operation to drop back to lower precision. The other thing is that we model the inference optimizations. For example, if you noticed in the previous slide, the fake-quant operation came after the ReLU activation. That reflects one of the optimizations that happens during inference, where the ReLU gets folded in, and when we construct your training graph we make sure these sorts of optimizations are added in.

Now let's look at the numbers. The numbers are pretty good: as you can see on the slide, we are almost as close as the float baseline on the various vision models we have tried. This is really powerful. You can execute a quantized model that gives you nearly as good accuracy as the float one.
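To make the error-mimicking step concrete, here is a minimal sketch of fake quantization using a built-in TensorFlow op. This is not the toolkit's internal implementation, just the core idea; the weight values and the min/max range are illustrative.

```python
import tensorflow as tf

# Some float weights (illustrative values).
weights = tf.constant([[0.83, -1.47], [2.19, 0.04]], dtype=tf.float32)

# Quantize to 8 bits over a chosen range and immediately dequantize back to
# float. The rounding error introduced here is the same kind of error the
# quantized model sees at inference, so the trainer can learn around it.
fake_quantized = tf.quantization.fake_quant_with_min_max_args(
    weights, min=-1.5, max=2.2, num_bits=8)

print(fake_quantized - weights)  # the simulated quantization error
```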
So what's the value to users? Well, on the one hand you have a simple, almost one-line API that you can use to quantize your model, train it, convert it, and go ahead and execute it. This works great for app developers, ML engineers, and so on. You might want to go one step further, and then we have a slightly more involved API where you can configure your quantization however you want; that's quite useful for ML engineers and some researchers. And if you want to go all the way, you can completely configure the quantization schemes, algorithms, different bit widths, and so on, and that provides very fertile ground for researchers and hardware engineers. So basically, with this API, easy things are easy and hard things are possible.

Let's look at how we do this. This is your standard Keras model: if you want to quantize your entire model, typically you import tensorflow as tf, construct the model, call model.compile and model.fit, and go ahead. Now let's look at what quantizing the model looks like. Pretty much the same thing: you import tensorflow_model_optimization as tfmot, which is the package you bring in, construct your model, quantize the model, and then just continue with compile, fit, and all of that.

Now, you might not want to quantize the entire model. Maybe you want to quantize only a subset of your model, because some parts are more sensitive to quantization losses, or because you want to get the most performance out of certain parts. In that case it is still pretty simple, just slightly different: you use quantize_annotate_layer to mark the layers you want to quantize, then apply it at the end, and you're good to go.

Beyond that, you might want to control the quantization within a layer, for example which weights you want to quantize and how you want to quantize them. In that case the API is also pretty similar. You use quantize_annotate_layer, but along with the layer you also pass in a specific config, and this config tells the infrastructure how to quantize that layer. The rest of the API is the same.

Let's look at how you define this config. The config is largely telling us two things: what within that layer you want to quantize, and how you want to quantize it. So you tell us which weights or which activations to quantize, and you pass us a quantizer, which is an object that encapsulates the algorithm for how to quantize them. We give you a bunch of built-in ones, but you can write your own.

You might also want to quantize your own layer. Say you have a special algorithm, like a fancy convolutional layer that you wrote, and you want to quantize that as well. You do it in almost exactly the same way: you annotate your layer and pass in a config, and this config tells us how to quantize your fancy layer. Again, you tell us what to quantize and how to quantize it. What you notice in this case is that there is a histogram quantizer, which is a special quantizer, and a special quantizer is interesting because it allows you to completely control the strategy you use to quantize your model.
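Here is a hedged sketch of the per-layer configuration path just described, using the public tfmot Keras API with its built-in quantizers. The model, the DenseQuantizeConfig class name, and the quantizer choices are illustrative rather than the toolkit's defaults.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

quantize_annotate_layer = tfmot.quantization.keras.quantize_annotate_layer
quantize_apply = tfmot.quantization.keras.quantize_apply
quantizers = tfmot.quantization.keras.quantizers

class DenseQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
    """Tells the infrastructure what to quantize in a Dense layer and how."""

    def get_weights_and_quantizers(self, layer):
        # Quantize the kernel with an 8-bit symmetric quantizer.
        return [(layer.kernel, quantizers.LastValueQuantizer(
            num_bits=8, symmetric=True, narrow_range=False, per_axis=False))]

    def get_activations_and_quantizers(self, layer):
        # Quantize the activation with a moving-average range.
        return [(layer.activation, quantizers.MovingAverageQuantizer(
            num_bits=8, symmetric=False, narrow_range=False, per_axis=False))]

    def set_quantize_weights(self, layer, quantize_weights):
        layer.kernel = quantize_weights[0]

    def set_quantize_activations(self, layer, quantize_activations):
        layer.activation = quantize_activations[0]

    def get_output_quantizers(self, layer):
        return []

    def get_config(self):
        return {}

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    # Only this layer is annotated, so only it gets quantized.
    quantize_annotate_layer(tf.keras.layers.Dense(64, activation='relu'),
                            quantize_config=DenseQuantizeConfig()),
    tf.keras.layers.Dense(10),
])

# quantize_apply rewrites the annotated layers; the custom config must be
# visible inside quantize_scope while the model is being transformed.
with tfmot.quantization.keras.quantize_scope(
        {'DenseQuantizeConfig': DenseQuantizeConfig}):
    qat_model = quantize_apply(model)
```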
Such a custom quantizer could, for example, use a histogram to determine the range and then quantize, and that's how you would write the algorithm. It's pretty simple: you implement two methods. One is build, where you construct any variables you need. Then, in the call method, we give you a bunch of tensors, you quantize them however you wish, you return the tensors, and we take care of the rest.

And it doesn't end there. We also provide the ability to define your own schemes entirely and specify how each layer should be quantized, going as far as graph transforms. I mentioned earlier that we fold the activation in for you, for example, and you can define your own transforms that describe the manipulations you want to make on the graph.

So in summary, Quantization Aware Training is an API that helps you recover accuracy while getting the benefits of quantization. It's a pretty simple API for easy tasks, but quite flexible if you want to do more complicated things, and it simulates the quantization losses that happen on various backends and schemes, which you can configure. There are cooler things coming up. We released the sparsity training-time API some time back, and now we're working on sparse kernel execution. Then you'll have an end-to-end story: you can train sparse models and execute them on device. You can also use quantization and sparsity together, and that's quite powerful.

So that's the Model Optimization Toolkit: a suite of tools that make your models faster and smaller. Quantization and sparsity are the main techniques we have. You can find us on GitHub at tensorflow/model-optimization. Please file any requests, concerns, bugs, or feedback you have, and we're always working on making those models smaller and faster. Thank you.

[MUSIC PLAYING]