QIUMIN XU: Hello, everyone.
I am Qiumin.
I am a software engineer at Google working
on TensorFlow Performance.
Today I'm very excited to introduce you
to our brand new TensorFlow 2 Performance Profiler.
We all like speed, and we want our models to run faster.
The TensorFlow 2 Profiler can help you improve your model's performance like a pro.
In this talk, we're going to first talk
about what's new in TF2 Profiler,
and then we'll show you a case study.
I'm a performance engineer, and this
is how I used to start my day.
In the morning, I would run a model and capture a trace of it.
I would gather the profiling results
in a spreadsheet to analyze the bottlenecks
and optimize the model.
We also have gigabytes of traces, and processing all of them manually is boring and time-consuming. After that, we run the model again to check the performance.
If the performance is good, hooray, we have done our job. Go and grab a coffee. Otherwise, we go back to step one: recapture a profile, gather the results, find the cause, fix it, and try again. We repeat this iteration n times until the performance is good.
This is a typical day of a performance engineer.
Can we make it more productive?
The most repeated work here is gathering the trace information and analyzing the results. We always want to work smarter. At Google, we found a way to build tools that automatically process all the traces, analyze them, and provide automated performance guidance. The toolset does intensive trace analysis, learns from how Google's internal experts tune performance, and automates it for non-expert users.
Here's the thing I'm very excited about. We are releasing this most useful set of internal tools today as the TF2 Profiler. The same set of tools has been used extensively inside Google, and we are making it available to the public.
Let me introduce you to the toolset.
Today, we will launch eight tools.
Four of them are common to CPUs, GPUs, and TPUs.
This enables consistent metrics and analysis
across different platforms.
The first tool is called Overview Page.
This tool provides an overview of the performance
of the workload running on the device.
The second tool is the Input Pipeline Analyzer. It is a very powerful tool for analyzing the TensorFlow input pipeline. TensorFlow reads data from files through the input pipeline on demand, and an inefficient input pipeline can severely slow down your application.
This tool presents an in-depth analysis of your model input
pipeline performance, based on various performance
data collected.
At the high level, this tool tells you
whether your program is input bound.
If that is the case, the tool can also walk you through the device-side and host-side analysis to debug which stage of the pipeline is the bottleneck.
The third tool we released today is called TensorFlow Stats.
TensorFlow Stats presents TensorFlow op statistics in charts and tables.
The fourth tool we released today is called Trace Viewer.
The Trace Viewer displays a detailed event timeline for in-depth performance debugging.
We also provide four tools that are TPU or GPU specific.
They are all available today in TensorFlow. Please check them out.
Now let's look at the case study.
Let's assume that we are running an unoptimized ResNet50 model on a V100 GPU.
TF2 Profiler provides a number of ways to capture a profile.
In this talk, we will focus on Keras callback.
To check out other ways of profiling, including sampling mode and programmatic profiling, refer to the TensorFlow docs for more details.
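As a rough sketch of those two other options (the log directory, toy model, and data below are placeholders, not from the talk):

```python
import tensorflow as tf

# Tiny stand-in model and data, just to make the sketch runnable.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
model.compile(optimizer="adam", loss="mse")
x = tf.random.normal([320, 32])
y = tf.random.normal([320, 10])

# Programmatic profiling: trace exactly the region you care about.
# "logs/profile_demo" is a hypothetical log directory.
tf.profiler.experimental.start("logs/profile_demo")
model.fit(x, y, batch_size=32, epochs=1)
tf.profiler.experimental.stop()

# Sampling mode: start a profiler server inside the running job, then
# trigger an on-demand capture from the TensorBoard Profile tab.
tf.profiler.experimental.server.start(6009)
```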
Using the Keras TensorBoard callback, we simply need to add one additional line specifying the profiling range. Setting the argument profile_batch to (150, 160) here indicates that we profile from batch 150 to batch 160.
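Here is a minimal sketch of that callback setup, assuming a toy model and dataset in place of the ResNet50 training loop; the log directory is hypothetical, and depending on your TensorFlow version profile_batch may also be given as the string '150,160':

```python
import tensorflow as tf

# Toy model and dataset standing in for the real ResNet50 setup.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax")])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([6400, 32]),
     tf.random.uniform([6400], maxval=10, dtype=tf.int64)),
).batch(32)  # 200 batches per epoch, so batches 150-160 exist

# The one additional line from the talk: profile batches 150 through 160.
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="logs/profile_demo",
                                             profile_batch=(150, 160))

model.fit(dataset, epochs=1, callbacks=[tb_callback])
```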
Run the model, launch TensorBoard, and go to the Profile plugin. Here's the Performance Overview. Let's take a look at the Overview Page. It contains three sections: the Performance Summary, the Step-time Graph, and the Recommendation for the Next Step.
Let's zoom into each of them.
First, let's look at the performance summary.
It shows the average step time and breaks it down into the time spent on compilation, input/output, kernel launches, and communication.
Next is the Step-time Graph. We can see the step time broken down into compilation time, kernel launch, compute, and communication, and you can see how this breakdown changes over the steps. In this example, there's a lot of red in this chart, which indicates the model is severely input bound.
The next is what I feel most excited about.
This is the recommendation provided by our tool.
The assessment: your program is highly input bound, because 81.4% of the total step time sampled is waiting for input.
Therefore, we should first focus on reducing the input time.
The Overview Page also provides a recommendation on which tool you should check out next. In this example, the Input Pipeline Analyzer and the Trace Viewer are the next tools to look at. In addition, the tool suggests related useful resources to check out to improve the input pipeline.
Let's follow this recommendation and check out the Input
Pipeline Analyzer tool.
This is the host-side analysis breakdown provided by the tool. It automatically detects that most of the time is spent on data preprocessing.
What should we do next?
Our tool actually tells you what can be done next to reduce the data preprocessing time.
This is what is recommended by our tool.
You may increase the number of parallel calls
in the dataset map or process the data offline.
If you follow the link on the dataset map, you will see how to do that. Following the guide, we change the sequential map to use parallel calls, as shown in the sketch below. Also, don't forget to try the most convenient autotune option, which tunes the value dynamically at runtime.
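A minimal sketch of that change, with parse_fn as a hypothetical stand-in for the real preprocessing step (the prefetch call is a closely related tf.data optimization, added here as a suggestion rather than from the talk):

```python
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE  # tf.data.AUTOTUNE in newer releases

def parse_fn(x):
    # Stand-in for the real decode/augmentation step.
    return tf.cast(x, tf.float32) / 255.0

dataset = tf.data.Dataset.range(1000)

# Before: dataset.map(parse_fn) processes elements sequentially.
# After: run the map with parallel calls, tuned dynamically at runtime.
dataset = dataset.map(parse_fn, num_parallel_calls=AUTOTUNE)

# Prefetching also overlaps host-side preprocessing with device compute.
dataset = dataset.prefetch(AUTOTUNE)
```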
After this optimization, let's capture a new profile.
Now you can see the red is all gone from the step-time graph, and the model is no longer input bound. Checking the Performance Summary again, we now get a 5x speedup.
Overview page now recommends differently.
It says your program is not input bound, because only 0.1% of the total step time sampled is waiting for input. Therefore, you should instead focus on reducing other time.
Here's another thing we can do. If you look at the other recommendations, the model is using 32-bit floats throughout. If you replace them with 16-bit floats, you can get a 10x speedup.
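One way to act on this recommendation is Keras mixed precision. This is a sketch using the experimental API from the TensorFlow releases around the time of this talk; newer releases expose the same idea as tf.keras.mixed_precision.set_global_policy('mixed_float16'):

```python
import tensorflow as tf
from tensorflow.keras.mixed_precision import experimental as mixed_precision

# Compute in float16 while keeping variables in float32.
policy = mixed_precision.Policy("mixed_float16")
mixed_precision.set_policy(policy)

# Toy model; keep the final softmax in float32 for numeric stability.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(10),
    tf.keras.layers.Activation("softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```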
This release is just the beginning,
and we have more features upcoming.
We are working on Keras-specific analysis and multi-worker GPU analysis.
Stay tuned.
We also welcome your feedback; please let us know and contribute your ideas.
TensorFlow 2 Profiler is the tool
you need for investigating TF2 performance.
It works on CPU, GPU, and TPU.
Here are more things to read: a tutorial, a guide, and the GitHub source code. There are also two more related talks on performance tuning this afternoon. They are super exciting, so don't miss them.
Finally, I want to thank everyone
who worked on this project.
You are super amazing teammates.