[MUSIC PLAYING] ZONGWEI ZHOU: Good morning and good afternoon, researchers and TensorFlow developers. My name is Zongwei Zhou. I'm from the TensorFlow performance team. Today, I'm going to tell you more about TensorFlow distributed training.

Let me start the talk by sharing a piece of user experience that you might have already encountered. Suppose you have an awesome prototype machine learning model that you worked very hard to make run efficiently on a single host with multiple GPUs. Now it's time to really get it running end to end with some more resources. So you start four of your beefy cloud virtual machines, each with multiple modern GPUs, connected by a fancy 100-gigabit network. With all these resources, you deploy your model and hope to see it run blazingly fast. But wait, why am I only getting 1.5X faster than a single machine while I am using four of them? Yes, I know, this is really frustrating. If you have similar experiences and questions, this talk is for you.

In today's talk, you are going to learn how to scale out your TensorFlow 2 Keras model to multiple machines and multiple GPUs. And using the optimizations we will ship in the upcoming TensorFlow 2.2 release, you are going to dramatically improve your training throughput.

I use one case study, BERT SQuAD, for today's talk. BERT, of course, is the revolutionary NLP model that has been very popular in the past year. I use the fine-tuning task, the SQuAD reading comprehension task, as the example. Your model is given some reading material and questions, it needs to answer the questions, and the answers can then be evaluated for accuracy. The model we use today is available in the TensorFlow 2 Official Model Garden, where you can download it via this link and try it out.

Here is a quick demonstration of the BERT SQuAD training throughput. The first line is the training throughput using TensorFlow 2.1 out of the box. With the three optimizations in TensorFlow 2.2 that I'm going to introduce today, the model now runs about 2.5X faster. Aren't you excited to see these optimizations?

But before diving into them, let me provide some background on how the model is synchronously trained on multiple hosts and multiple GPUs. Here we are leveraging the native TensorFlow distributed training support, MultiWorkerMirroredStrategy. In the figure, we have two GPU devices on two different hosts, and we are using a simple deep learning model with two layers, A and B, each with one variable. Each GPU receives a subset of the training data and computes the forward path using its local copy of the model variables. It then runs through the backward path to calculate the gradients of each layer. After the gradients are calculated, all the devices start communicating among themselves using an allreduce algorithm to aggregate the gradients. After gradient aggregation, each device gets back the same set of aggregated gradients and uses them to update its local variables. We call it synchronous because every device needs to aggregate the gradients, get the same set of aggregated gradients, and update its variables before it can proceed to the forward path of the next training step.
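A minimal sketch of what this multi-worker setup looks like in code, assuming TF 2.2-era APIs; the host addresses, ports, and the tiny two-layer model standing in for the Layer A / Layer B figure are placeholders, not part of the talk:

```python
import json
import os

import tensorflow as tf

# Hypothetical cluster: two workers, one per host. Replace the host:port pairs
# with your own VMs, and set "index" to 1 when running on the second host.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:12345", "10.0.0.2:12345"]},
    "task": {"type": "worker", "index": 0},
})

# MultiWorkerMirroredStrategy implements the synchronous training described
# above: every GPU on every worker holds a copy of the variables, gradients
# are aggregated with allreduce, and all replicas apply the same update.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Stand-in for the two-layer model (Layer A, Layer B) from the figure.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),  # "Layer A"
        tf.keras.layers.Dense(1),                       # "Layer B"
    ])
```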
So with so many phases in deep learning training, how do we actually identify where the bottleneck is? Is it in the forward path, the backward path, gradient aggregation, or variable updates? How do we know where to optimize? Yes, of course, we are going to use the TensorFlow Profiler, which Quimin just introduced to you today.

So let's start with a run of the BERT SQuAD model: log in to one VM, fire up TensorBoard, take a profile, and open the TraceViewer. Here is the trace you will get from the TraceViewer. In this view, you see a pink bar with the number five at the bottom of the profile, which means this is training step number five in your profile, and these are all the operations within this training step. The first five rows are events related to the GPU. Specifically, in the first row, you see the GPU computation, which includes the forward path, the backward path, and the variable updates. You can also see a big blue bar called the NCCL allreduce kernel, which is the gradient aggregation. It's called NCCL because nearly every machine learning framework, including TensorFlow, uses the NVIDIA NCCL allreduce library to aggregate gradients.

By now, you may have already spotted the problem. This is exactly the power of the TensorFlow Profiler, which makes it so visual and obvious. The problem we are facing is that the gradient aggregation can dominate the entire step time. All you see is this big blue box of NCCL allreduce. So how do we resolve this issue and optimize the training performance? Here come the three optimizations.

First optimization: from the profiler, you get the total time spent in NCCL allreduce. And you know your model, so you know the total size of the model variables and gradients. From this information, you can calculate your actual NCCL allreduce throughput. You also know how many machines you are using and how many GPUs each machine has. Using that, and following the NVIDIA NCCL tuning guidelines, you can calculate the ideal NCCL throughput. In this case, we found that the real NCCL allreduce throughput is actually much lower than the ideal throughput. So we know the first optimization is to tune NCCL to fully utilize the underlying cross-host network and improve the performance.

Second optimization: if you look at the big blue bar, it says f32, which means the gradient aggregation is done in the full-precision float32 format. But we know the model can be trained with mixed precision, so the gradients can actually be in a lower precision like float16. Can we aggregate the gradients in float16, which would cut the network data being transferred in half and thus improve the performance?

The third one: if you noticed, the NCCL allreduce exchanges data across the network, and it doesn't use the GPU itself very much, so you see empty space above the blue bar. Can we push the NCCL allreduce forward a little bit, so that it overlaps with the GPU computation happening in the backward path? This way we can reduce the GPU idle time and improve the training step time.

So here are the three ideas, and they all look good. Now I'm going to show how to implement them using TensorFlow 2.2.

First optimization: NCCL throughput tuning. TensorFlow 2.2 ships with the latest NVIDIA NCCL libraries, and we have run a lot of experiments on Google Cloud VMs to identify a set of recommended parameters to help you reach the peak throughput. Users can prepend those parameters when launching their models, such as the NCCL_SOCKET_NTHREADS parameter here, set before the command that runs the model's main.py.
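As a concrete illustration, here is a minimal sketch, assuming the parameter is passed the way NCCL reads it, as an environment variable; the value 8 and the script name are placeholders, not recommendations from the talk:

```python
# Shell form, run the same way on every worker (placeholder script name):
#
#   NCCL_SOCKET_NTHREADS=8 python bert_squad_main.py ...
#
# Or set it from Python before NCCL is initialized (i.e. before the first
# collective runs):
import os

os.environ.setdefault("NCCL_SOCKET_NTHREADS", "8")  # placeholder value

import tensorflow as tf  # import TensorFlow after setting the NCCL parameters
```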
If you have a different network environment than these Cloud VMs, you might need to run the experiments yourself to find the optimal parameters. This is suboptimal, and we are looking to improve it. We are working with NVIDIA to see if we can autotune the NCCL parameters, so that in a future TensorFlow release users won't have to manually test and set any of these parameters. For now, with the optimal NCCL parameters, we see a dramatic 30% to 40% throughput improvement on BERT SQuAD. That's pretty good for just one optimization. So are you excited to see the second optimization? OK, here it comes.

If you remember, we were looking to aggregate the gradients in a lower precision, float16. The prerequisite is that the model is already trained with the Keras mixed precision API, and we have an online tutorial, at this link, covering the technical details and usage of this API. Generally speaking, mixed precision uses two floating-point representations during training, float32 and float16. As you can see from the figure, float32, with more bits, can represent a larger range of numbers with higher precision than float16. But float16 has its own advantage: float16 computation on modern GPUs can be 2 to 4X faster than FP32 computation. So this is really a trade-off between model accuracy and computing efficiency, and the mixed precision API tries to give you the best of both worlds.

To show how mixed precision works under the hood, I will walk through the BERT SQuAD custom training loop code. With this code, users get maximum flexibility to change every aspect of training. Here is the training loop code in TensorFlow 2.1. With mixed precision, your model variables are still kept in float32 for the best model accuracy, but the computation, including the gradient computation here, is all done in float16. After the gradient computation, the gradients are converted back to FP32, and then the gradient updates are applied to the FP32 model variables. The apply_gradients API here implicitly aggregates the FP32 gradients for you. This is why you saw the gradients aggregated in FP32 in the previous profile.

With TensorFlow 2.2, you can change the custom training loop code a bit to enable gradient aggregation in float16. First, we modify the optimizer apply_gradients API: there is a new argument called all_reduce_sum_gradients. When you set this argument to False, it essentially tells the optimizer API not to do the gradient aggregation for you. Instead, you can manually call the gradient aggregation API from the distribution strategy to do the allreduce yourself. And the best part is that you can apply any customized gradient operations before and after this manual allreduce, including what we want to do: gradient aggregation in float16. You simply cast the gradients to FP16 before the allreduce, do the gradient allreduce, and then cast the gradients back to FP32. That is all you need to do.

As you can see, you may end up with a lot of gradient casts flowing between FP32 and FP16, which is OK, because TensorFlow graph optimization will remove these redundant casts under the hood. You are only left with one gradient cast from FP16 to FP32, which you have to do anyway, because you are applying the gradients to the FP32 model variables.
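Putting those pieces together, here is a condensed sketch of such a training step. It follows the outline above but is not the talk's actual BERT code: the toy model, optimizer, and loss are placeholders, loss scaling is omitted for brevity, and it assumes TF 2.2-era spellings (mixed precision under tf.keras.mixed_precision.experimental, strategy.run, and the all_reduce_sum_gradients argument named in the talk, which later releases rename).

```python
import tensorflow as tf

# Keep variables in float32 but run the computation in float16
# (the Keras mixed precision API; TF 2.2-era experimental spelling).
tf.keras.mixed_precision.experimental.set_policy("mixed_float16")

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])  # placeholder model
    optimizer = tf.keras.optimizers.SGD(0.01)

@tf.function
def train_step(dist_features, dist_labels):
    def step_fn(features, labels):
        with tf.GradientTape() as tape:
            predictions = tf.cast(model(features, training=True), tf.float32)
            loss = tf.reduce_mean(tf.square(labels - predictions))
        # Gradients come back in float32 because the variables are float32.
        grads = tape.gradient(loss, model.trainable_variables)

        # 1) Cast the gradients to float16 so the allreduce moves half the bytes.
        fp16_grads = [tf.cast(g, tf.float16) for g in grads]
        # 2) Manually aggregate across all replicas and hosts.
        replica_ctx = tf.distribute.get_replica_context()
        reduced = replica_ctx.all_reduce(tf.distribute.ReduceOp.SUM, fp16_grads)
        # 3) Cast back to float32 before updating the float32 variables.
        fp32_grads = [tf.cast(g, tf.float32) for g in reduced]

        # Tell the optimizer not to aggregate again; we already did the allreduce.
        # (Argument name as given in the talk; later releases call it
        # experimental_aggregate_gradients.)
        optimizer.apply_gradients(
            zip(fp32_grads, model.trainable_variables),
            all_reduce_sum_gradients=False)
        return loss

    return strategy.run(step_fn, args=(dist_features, dist_labels))
```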
So the casts here are mostly for code readability and will not degrade your performance. I also want to make one more point. Advanced users who want to customize gradient operations, including allreduce in FP16, can use the custom training loop to get maximum flexibility. But for the average user who just wants allreduce in FP16 out of the box, we are working on supporting this in Keras Compile and Fit in upcoming releases. So in a future release, you will be able to get allreduce in FP16 out of the box using the Keras Compile and Fit API.

With FP16 allreduce, we now see the BERT SQuAD training throughput increased by a further 35% or so. That puts us at almost 2.2X the throughput of TensorFlow 2.1 so far. But wait, we still have one more optimization.

The third optimization: if you remember, we want to reduce the GPU idle time by overlapping the gradient aggregation with the backward computation on the GPU. Let's look again at the deep learning model figure we saw previously. In this model, after TensorFlow calculates the gradient of Layer B, we can immediately send out the gradient of Layer B for aggregation while the GPU is still calculating the gradient of Layer A. As you can see, the network communication that allreduces the gradients of Layer B now runs in parallel with the gradient calculation of Layer A, so we use both the GPU and the network more efficiently.

To enable this overlap, let's make some more changes to the same custom training loop code from the previous optimization. In TensorFlow 2.2, we introduce collective hints to the distribution strategy allreduce API: by passing a bytes-per-pack argument to the allreduce API, you tell TensorFlow to break your model gradients into multiple packs, and the TensorFlow runtime sends out a pack as soon as that pack of gradients becomes available in the backward computation. So this is as easy as it looks, just give it a bytes-per-pack number.

But the next question is: what is the optimal bytes per pack to achieve the maximum training throughput? In TensorFlow 2.2, users need to do a simple experiment to identify the optimal parameter. NVIDIA provides official NCCL allreduce benchmarks, and users can run these benchmarks on their multi-host setup with their own networking environment, varying the allreduce data size. Typically, as the allreduce data size increases, the NCCL allreduce throughput also increases. But beyond a certain allreduce data size, the NCCL throughput starts to plateau, which means NCCL has reached the limits of your underlying network. The data size where the throughput plateaus is exactly the optimal pack size: it is just large enough to saturate your underlying network. If you set the pack size smaller, each gradient pack cannot fully utilize the network, so you are wasting network bandwidth. If you set it larger, TensorFlow has to wait longer for the first pack of gradients to be ready, which means less overlap. So the optimal pack size is the allreduce data size at which the throughput reaches its plateau. But this still requires some work from the user.
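For illustration, here is a sketch of how the pack-size hint could be passed to the manual allreduce from the earlier sketch. The hint class, the experimental_hints argument, and the 32 MB value follow the talk's description of the TF 2.2-era experimental API and are assumptions; the exact names differ in later releases (the hints were eventually renamed), and the pack size should come from your own benchmark sweep.

```python
import tensorflow as tf

# Hint TensorFlow to break the gradients into packs of roughly this many bytes;
# 32 MB is a placeholder for the value found from your own NCCL benchmark sweep.
hints = tf.distribute.experimental.CollectiveHints(bytes_per_pack=32 * 1024 * 1024)

def aggregate_fp16(grads):
    """FP16 allreduce from the previous sketch, now with a pack-size hint so the
    allreduce of early packs can overlap with the rest of the backward pass."""
    replica_ctx = tf.distribute.get_replica_context()
    fp16_grads = [tf.cast(g, tf.float16) for g in grads]
    reduced = replica_ctx.all_reduce(
        tf.distribute.ReduceOp.SUM, fp16_grads, experimental_hints=hints)
    return [tf.cast(g, tf.float32) for g in reduced]
```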
So in a future release, we are also working on autotuning this pack size, so that users can get the optimal performance out of the box.

By now, we have seen all three optimizations, and together they improve the BERT SQuAD training throughput by 2.5X. These optimizations are especially useful in the public Cloud, where the networking between Cloud VMs is limited. We are also working on more improvements, as I mentioned throughout the talk: supporting Keras Compile and Fit, and autotuning the NCCL parameters and the pack size, all coming in future releases of TensorFlow. So stay tuned for these and more improvements. [MUSIC PLAYING]