QIUMIN XU: Hello, everyone. I am Qiumin. I am a software engineer at Google working on TensorFlow performance. Today I'm very excited to introduce you to our brand new TensorFlow 2 Performance Profiler. We all like speed, and we want our models to run faster. The TensorFlow 2 Profiler can help you improve your model's performance like a professional player. In this talk, we're going to first cover what's new in the TF2 Profiler, and then we'll show you a case study.

I'm a performance engineer, and this is how I used to start my day. In the morning, I would run a model and capture a trace of it. I would gather the profiling results in a spreadsheet to analyze the bottlenecks and optimize the model. We often have gigabytes of traces, and processing all of them manually is boring and time-consuming. After that, we run the model again to check the performance. If the performance is good, hooray, we have done our job. Go grab a coffee. Otherwise, we go back to step one: recapture a profile, gather results, find the cause, fix it, and try again. Repeat this iteration n times until the performance is good. This is a typical day of a performance engineer. Can we make it more productive?

The most repeated work here is gathering the trace information and analyzing the results. We always want to work smarter. At Google, we found a way: we built tools that automatically process the traces, analyze them, and provide automated performance guidance. The tooling does intensive trace analysis, learns from how Google's internal experts tune performance, and automates that expertise for non-expert users. Here's the thing I'm very excited about. We are releasing this most useful set of internal tools today as the TF2 Profiler. The same set of tools has been used extensively inside Google, and we are making it available to the public.

Let me introduce you to the toolset. Today, we will launch eight tools. Four of them are common to CPUs, GPUs, and TPUs, which enables consistent metrics and analysis across different platforms. The first tool is called the Overview Page. It provides an overview of the performance of the workload running on the device. The second tool is the Input Pipeline Analyzer. It is a very powerful tool for analyzing the TensorFlow input pipeline. TensorFlow reads data from files in the input pipeline on demand, and an inefficient input pipeline severely slows down your application. This tool presents an in-depth analysis of your model's input pipeline performance, based on the various performance data collected. At a high level, it tells you whether your program is input bound. If that is the case, the tool can also walk you through the device-side and host-side analysis to debug which stage of the pipeline is the bottleneck. The third tool we are releasing today is called TensorFlow Stats. It presents TensorFlow op statistics in charts and tables. The fourth tool is called the Trace Viewer. It displays a detailed event timeline for in-depth performance debugging. We also provide four tools that are TPU- or GPU-specific. They are all available today in TensorFlow. Please check them out.

Now let's look at the case study. Let's assume we are running an unoptimized ResNet50 model on a V100 GPU. The TF2 Profiler provides a number of ways to capture a profile. In this talk, we will focus on the Keras callback. For other ways of profiling, including sampling mode and programmatic profiling, refer to the TensorFlow docs for more details. Using the Keras TensorBoard callback, we simply need to add one additional argument specifying the profiling range. Setting the argument profile_batch to the range 150 to 160 here indicates that we profile from batch 150 to batch 160, as in the sketch below.
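Here is a minimal sketch of that callback setup. The model, dataset, and log directory are placeholder assumptions for illustration, not taken from the talk; the range form of profile_batch requires TF 2.2 or later.

```python
import numpy as np
import tensorflow as tf

# Placeholder data and model, just to make the sketch self-contained.
x = np.random.rand(1000, 32).astype("float32")
y = np.random.randint(10, size=1000)
train_dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(4)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# The one extra argument the talk refers to: profile batches 150 through 160.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="logs/profile_demo",   # assumed path; point TensorBoard here
    profile_batch="150,160",
)

model.fit(train_dataset, epochs=1, callbacks=[tb_callback])

# Programmatic capture is an alternative (also TF 2.2+):
#   tf.profiler.experimental.start("logs/profile_demo")
#   ...run some steps...
#   tf.profiler.experimental.stop()
```

Note that profile_batch also accepts a single integer to profile just one batch; profiling a small range keeps the trace files manageable.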
Run the model, launch TensorBoard, and go to the Profile plugin. Here's the Performance Overview. Let's begin with the Performance Overview page. It contains three sections: the Performance Summary, the Step-time Graph, and the Recommendation for the Next Step. Let's zoom into each of them.

First, the Performance Summary. It shows the average step time and breaks it down into the time spent on compilation, input, output, kernel launches, and communication. Next is the Step-time Graph. The step time is broken down into compilation time, kernel launch time, compute time, and communication time, and you can see how this breakdown changes over the steps. In this example, there's a lot of red in the chart, which indicates the model is severely input bound.

Next is what I am most excited about: the recommendation provided by our tool. The assessment says your program is highly input bound, because 81.4% of the total step time sampled is waiting for input. Therefore, we should first focus on reducing the input time. The Overview page also recommends which tool you should check out next. In this example, the Input Pipeline Analyzer and the Trace Viewer are the next tools to look at. In addition, it suggests related resources for improving the input pipeline.

Let's follow this recommendation and open the Input Pipeline Analyzer. This is the host-side analysis breakdown provided by the tool. It automatically detects that most of the time is spent on data preprocessing. What should we do next? Our tool tells you what can be done to reduce the preprocessing time. Here it recommends increasing the number of parallel calls in the dataset map, or preprocessing the data offline. If you follow the link on the dataset map, you will see how to do that. Following the guide, we change the sequential map to use parallel calls. Also, don't forget to try the most convenient option, AUTOTUNE, which tunes the value dynamically at runtime; a sketch of the change follows.
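A minimal sketch of that tf.data change, assuming a stand-in preprocess function (the function and dataset here are illustrative, not from the talk):

```python
import tensorflow as tf

# tf.data.experimental.AUTOTUNE; spelled tf.data.AUTOTUNE in newer releases.
AUTOTUNE = tf.data.experimental.AUTOTUNE

def preprocess(x):
    # Stand-in for expensive per-example work (decode, augment, resize, ...).
    return tf.cast(x, tf.float32) * 2.0

dataset = tf.data.Dataset.range(1000)

# Before: sequential map -- elements are processed one at a time on the host.
slow = dataset.map(preprocess)

# After: parallel map -- AUTOTUNE picks the parallelism level at runtime.
fast = (dataset
        .map(preprocess, num_parallel_calls=AUTOTUNE)
        .prefetch(AUTOTUNE))  # overlap preprocessing with model execution
```

Prefetching is a common companion optimization: it lets the input pipeline prepare the next batch while the accelerator works on the current one.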
After this optimization, let's capture a new profile. Now the red is gone from the step-time graph, and the model is no longer input bound. Checking the Performance Summary again, we now get a 5x speedup. The Overview page now recommends differently. It says your program is not input bound, because only 0.1% of the total step time sampled is waiting for input, so you should instead focus on reducing other time. Here's another thing we can do. Looking at the other recommendations, the model is computing entirely in 32-bit precision. If you replace that with 16-bit, you can get a 10x speedup; a hedged sketch of this change closes out this transcript.

This release is just the beginning, and we have more features coming. We are working on Keras-specific analysis and multi-worker GPU analysis. Stay tuned. We also welcome your feedback; please let us know and contribute your ideas. The TensorFlow 2 Profiler is the tool you need for investigating TF2 performance. It works on CPU, GPU, and TPU. Here are more things to read: a tutorial, a guide, and the GitHub source code. There are also two more related talks on performance tuning this afternoon. They are super exciting; don't miss them. Finally, I want to thank everyone who worked on this project. You are super amazing teammates.
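As referenced above, here is a minimal mixed-precision sketch. The layers are placeholders; the talk only says to switch from 32-bit to 16-bit, and the Keras mixed-precision API shown here is one way to do that (in its TF 2.4+ form; earlier releases used tf.keras.mixed_precision.experimental.set_policy).

```python
import tensorflow as tf

# Run compute in float16 where safe while keeping variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
    # Keep the final activation in float32 for numeric stability.
    tf.keras.layers.Activation("softmax", dtype="float32"),
])
```

On GPUs such as the V100, float16 math can run on Tensor Cores, which is where much of the speedup comes from.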