Inside TensorFlow: tf.data + tf.distribute

  • JIRI SIMSA: Hi, everyone.

  • My name is Jiri.

  • I'm a software engineer on the TensorFlow team.

  • And today, I'm going to be talking to you about tf.data

  • and tf.distribute, which are TensorFlow's APIs for input

  • pipeline and distribution strategy, respectively.

  • To set the stage for what I'm going to be talking about,

  • let's think about what are the basic building

  • blocks for a machine learning workflow?

  • Machine learning operates over data.

  • It runs some computation.

  • And it uses some sort of hardware to do this task.

  • This hardware can either be a single CPU on your laptop.

  • Or possibly it can be on your workstation that

  • has either one or multiple accelerators,

  • either GPUs or TPUs, attached to it.

  • But you can also run the computation

  • across a large number of machines

  • that each have one or multiple accelerators attached to it.

  • Now, let's talk about how the machine learning building

  • blocks are reflected in the APIs

  • that TensorFlow provides.

  • So for the data handling part of the machine learning task,

  • TensorFlow provides a tf.data API.

  • It's the input pipeline API for TensorFlow.

  • For the computation itself, such as supervised learning,

  • TensorFlow offers a number of different both high level

  • and low level APIs.

  • You might be familiar with Keras or Estimators--

  • they've been mentioned in earlier talks today--

  • as well as lower level APIs for building custom training loops.

  • And finally, to hide the hardware details

  • of your computation, TensorFlow provides a tf.distribute API,

  • which allows you to create your input pipeline

  • and model in a way that's agnostic to the environment

  • in which it's going to execute.

  • So kind of thinking that your program is

  • going to run, perhaps, on a single device,

  • and with minimal changes being able to deploy it

  • on a large set of different devices and possibly

  • different machine learning architectures.

  • In this talk, I'm going to talk about the tf.data input

  • pipeline API.

  • And then, in the second part, I'm

  • also going to talk about the tf.distribute, the distribution

  • strategy API.

  • I'm not going to talk about Keras, and Estimator,

  • and other APIs for the modeling itself,

  • as that has been covered in previous talks.

  • So without further ado, let's get

  • started with tf.data, which is TensorFlow's input pipeline API.

  • So let's ask ourselves a question.

  • Why do we need an input pipeline API in the first place?

  • Why don't we just load the data in memory,

  • maybe in our Python program as a NumPy array,

  • and pass it into a Keras model?

  • Well, there is actually a number of good reasons

  • why we need an API or why using one will benefit us.

  • First of all, the data might not fit into memory.

  • For example, the ImageNet data set

  • is 140 gigabytes of data, which does not necessarily

  • fit into memory on every laptop or workstation.

  • The data itself might also require randomized

  • preprocessing, which means that we cannot preprocess everything

  • ahead of time offline and then have the data ready

  • for training.

  • We actually need to have an input pipeline that

  • performs the preprocessing, such as, in the case of ImageNet,

  • perhaps image cropping or randomized image distortions

  • or transformations on the fly as we're

  • running the machine learning computation.

  • Having an input pipeline API as an abstraction might also

  • allow us to, in the runtime of this API,

  • implement things in a way that allows the computation

  • to efficiently utilize the underlying hardware.

  • And I'm actually going to spend a fair amount of the first part

  • of my talk talking about how to efficiently utilize

  • the hardware through the tf.data input pipeline abstraction.

  • Last, but not least, which is something that ties the tf.data

  • API to the tf.distribute API, using an input pipeline API

  • abstraction allows us to decouple

  • the task of loading and preprocessing of the data

  • from the task of distributing the computation.

  • We are using the abstraction, which

  • allows you to create your input pipeline assuming

  • it's going to run on one place.

  • And then the distribution strategy

  • will somehow distribute the data without you

  • having to worry about the fact that the input pipeline might

  • actually be evaluated in multiple places in parallel.

  • So for those reasons, we created tf.data, TensorFlow's input

  • pipeline API.

  • And the way I like to think about tf.data is that an input

  • pipeline created through tf.data

  • is an ETL process.

  • What I mean by that is the E, T, and L

  • stand for different parts of the input pipeline stages.

  • E stands for Extract.

  • This is the stage in which we read the data,

  • either from memory or from local or remote storage.

  • And we possibly parse the file format

  • that the data is stored in.

  • Perhaps it's compressed.

  • Then the T, the Transform stage, in this stage,

  • we perform either domain specific or domain

  • agnostic transformations.

  • So the domain specific transformations

  • are specific to the type of data we're dealing with.

  • So, for instance, text vectorization,

  • image transformation, or temporal video sampling

  • are examples of domain specific transformations.

  • While domain agnostic transformations

  • include things like shuffling of your data

  • during training or batching.

  • That is combining multiple elements

  • into a single higher dimensional element.

  • And, finally, the last stage of the input pipeline, Loading,

  • pertains to efficiently transferring

  • the data onto the accelerator, which is either a GPU or TPU.

  • What I should point out here is that, traditionally, the input

  • pipeline portion of your machine learning computation

  • happens on a CPU.

  • Because some of the operations are naturally

  • only possible on the CPU, which leaves the GPU and TPU

  • resources available for your machine

  • learning specific computations, such as your ML models.

  • This puts extra pressure

  • on the efficiency with which the input pipeline performs.

  • And the reason for that is--

  • which is what I'm trying to illustrate here

  • with the graph--

  • is that over time the rate at which CPUs perform

  • has plateaued, while the computational power of GPUs

  • and TPUs, thanks to recent hardware advances,

  • continues to accelerate at an exponential rate, which

  • opens up this performance gap between a raw CPU

  • and GPU/TPU processing power available in a single machine.

  • And that can-- the consequence of this

  • could be that the CPU part of your machine learning

  • computation, namely the input pipeline,

  • can be a bottleneck of your computation.

  • So it's really important that the CPU input pipeline performs

  • as efficiently as it can.

  • So let's take a look at an example of what a tf.data-based

  • input pipeline actually looks like.

  • Here, I'm using an example of how a common image

  • processing input pipeline might look.

  • We're first creating a data set using the TFRecordDataset

  • operation.

  • It's a data set constructor that takes a set of file names

  • or a set of file patterns and produces

  • elements that are stored in those files

  • in a sequence-like manner.

  • And once you create a data set, you

  • can chain transformations onto the data set,

  • thus creating new types of data sets.

  • A very common one and very powerful

  • one is the map transformation, which

  • allows you to apply an arbitrary processing on the elements

  • of the data set.

  • And this preprocessing can be expressed

  • as a function that ends up being traced

  • using the mechanisms available in TensorFlow,

  • meaning this function that is being used to transform

  • elements of the data set is executed as a data flow

  • graph, which has important implications

  • for the performance and how the runtime can actually

  • execute this function.

  • And the last thing that I illustrate here

  • is the batch transformation, which

  • combines multiple elements of the input data set

  • and produces a single element as an output that

  • has a higher dimension, which is a common practice for training

  • efficiency.
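
Concretely, a minimal sketch of the pipeline just described might look like the following; the file names and the tf.Example schema are hypothetical stand-ins for your own data:

```python
import tensorflow as tf

# Hypothetical TFRecord files and feature schema; adapt to your own records.
filenames = ["/path/to/train-00000.tfrecord", "/path/to/train-00001.tfrecord"]

def parse_and_preprocess(serialized_example):
    # This function is traced and executed as a TensorFlow data flow graph.
    features = tf.io.parse_single_example(
        serialized_example,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize(image, [224, 224])
    return image, features["label"]

dataset = tf.data.TFRecordDataset(filenames)   # Extract: read the records
dataset = dataset.map(parse_and_preprocess)    # Transform: parse and preprocess
dataset = dataset.batch(32)                    # combine 32 elements into one batch
```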

  • Now one thing that's not illustrated here,

  • but actually does happen under the hood inside

  • of the tf.data runtime, is that for certain combinations

  • of transformations, tf.data provides more efficient

  • fused implementations.

  • For instance, if a map transformation is followed

  • by a batch transformation, we actually have a highly

  • efficient C++ based implementation

  • for the combination of the two that can give you up to 2x

  • speed up in the performance of your input pipeline.

  • And that happens kind of magically behind the scenes.

  • And the important bit that I want to highlight here

  • is that the user doesn't need to worry about it.

  • The user here doesn't really need

  • to do anything with respect to optimizing the performance.

  • They focus on creating an input pipeline with the functional

  • preprocessing in mind.

  • And once you create the data set that you would like,

  • you can pass it into TensorFlow high level API such as Keras

  • or Estimator, which all support data set abstraction

  • as an input for the data.

  • So let's talk a bit more about the input pipeline performance.

  • If you were to implement the input pipeline

  • in naive fashion using CPU for the input pipeline processing

  • or data preparation and the GPU and TPU for the training

  • computation, you might end up in a situation

  • like is illustrated on the slide where

  • at any given point in time you're

  • only utilizing one of two resources available to you.

  • And you could probably tell that this seems rather inefficient.

  • Well, a common technique that can

  • be used to make this style of computation more efficient

  • is called software pipelining.

  • And the idea is that while you're

  • working on the current training step

  • on a GPU or a TPU, you've already

  • started preprocessing data for the next training

  • step on the CPU.

  • And thus, you overlap the computation

  • that happens on the two devices or two

  • resources available to you.

  • Achieving the effect of software pipelining in tf.data

  • is pretty straightforward.

  • All you do is you chain a .prefetch transformation

  • to a particular point in your input pipeline.

  • And the effect of doing that will

  • be that the producer of the data up to that point

  • will be decoupled from the consumer of the data,

  • in this case, the Keras model.

  • And the two will be operating independently,

  • coordinating through an internal buffer.

  • And this will have the desired effect of software

  • pipelining that I illustrated in the previous slide.
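
Building on the sketch above, the one-line change for software pipelining would look roughly like this:

```python
# Chaining prefetch decouples the producer (the pipeline up to this point)
# from the consumer (e.g. a Keras model): the next batch is prepared on the
# CPU while the current one is being consumed by the accelerator.
dataset = dataset.prefetch(buffer_size=1)  # buffer one batch ahead
```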

  • Another opportunity for improving

  • the performance of your input pipeline

  • is to parallelize the transformation.

  • So the top part of this diagram illustrates

  • that we're using sequential processing for applying the map

  • transformation of the individual elements of the batch

  • that we are then going to create.

  • But there is no reason that you need

  • to do that unless there would, in effect, be some sort of data

  • or control dependency.

  • But commonly, there is not.

  • And in that case, you can parallelize

  • and overlap the preprocessing of all the individual elements

  • out of which we're going to create the batch.

  • So let's take a look at how we would

  • do that using the tf.data API.

  • And similar to the software pipelining idea,

  • this is pretty straightforward.

  • You simply add a single argument,

  • num_parallel_calls, to the map transformation,

  • which indicates to the tf.data runtime

  • that it should, in fact, preprocess

  • elements of the input data set in parallel.

  • Important bit here is that the user doesn't really

  • need to worry about the threading

  • or multiprocessing and use complicated Python APIs

  • or be aware of things like the global interpreter lock.

  • It just happens inside of the tf.data runtime,

  • which is implemented in C++.

  • And thus, it sidesteps the complexities

  • that the user would need to go through.
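
As a sketch, the parallel version of the earlier pipeline only adds the num_parallel_calls argument (the value 4 is arbitrary and for illustration only):

```python
# Same pipeline as before, but elements are now preprocessed in parallel
# inside the C++ tf.data runtime.
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(parse_and_preprocess, num_parallel_calls=4)
dataset = dataset.batch(32)
dataset = dataset.prefetch(buffer_size=1)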

  • And a last best practice for optimizing

  • the performance of an input pipeline

  • is that of parallel extraction.

  • So similar to the parallel transformation,

  • where in that case the sequential mapping of the data

  • might have been the bottleneck, another potential source

  • of a bottleneck of your input pipeline

  • is the sequential nature with which data is being read.

  • If you're just reading elements from a file one file at a time,

  • the I/O could actually be a bottleneck

  • of your input pipeline.

  • And the answer to that, well, you

  • don't have to do that sequentially.

  • You can do it in parallel.

  • And to do that using the tf.data API, well,

  • this time it's not a one line change.

  • It's a two line change.

  • And so what changes is that we're

  • going to replace the TFRecordDataset

  • source with two lines.

  • The first line uses a list_files transformation,

  • which creates a data set that is going to contain all the file

  • names that the particular pattern that we specify

  • evaluates to.

  • And then we're going to apply the interleave transformation

  • to this data set, which takes a user defined function,

  • which is a data set factory operating on the inputs--

  • in this case, file names--

  • and producing data sets-- in this case,

  • TF record data sets for that particular file name.

  • And specifying the num_parallel_calls

  • argument determines how many files we should

  • be reading in parallel at any given point in time.
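
A sketch of the two-line change described here, again using a hypothetical file pattern:

```python
# Parallel extraction: list the files matching a pattern and read several
# of them in parallel, interleaving their records.
files = tf.data.Dataset.list_files("/path/to/train-*.tfrecord")
dataset = files.interleave(
    tf.data.TFRecordDataset,   # dataset factory: file name -> TFRecordDataset
    cycle_length=4,            # interleave records from up to 4 files
    num_parallel_calls=4)      # read those files in parallel
dataset = dataset.map(parse_and_preprocess, num_parallel_calls=4)
dataset = dataset.batch(32)
dataset = dataset.prefetch(buffer_size=1)
```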

  • Now, I kind of cheated up to this point in my presentation.

  • Because I said, well, the user doesn't really

  • have to worry about performance and the aspects

  • of their environment.

  • And it turns out that in order to choose

  • optimal values for these num_parallel_calls arguments

  • or the buffer size for prefetch, you actually

  • have to understand your environment.

  • At least that's how it used to be, historically, when

  • this API was first introduced.

  • And over the past year or so, we actually

  • worked on lifting this restriction

  • and making the performance of tf.data great out of the box.

  • And the way this is achieved is that

  • instead of manually specifying what the right buffer

  • size or the right number of parallel calls

  • should be for these different transformations,

  • you can actually specify this special constant called

  • tf.data.experimental.AUTOTUNE.

  • And if you do that, this will indicate to the tf.data runtime

  • that you want to delegate the task of choosing

  • the optimal level of parallelism or buffer size

  • to the tf.data runtime.

  • And it will do that on your behalf.
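
A sketch of the same pipeline with the tuning decisions delegated to the runtime (the constant lived under tf.data.experimental in the TF 2.0-era API):

```python
# Delegate parallelism and buffer-size choices to the tf.data runtime.
AUTOTUNE = tf.data.experimental.AUTOTUNE  # spelled tf.data.AUTOTUNE in later releases

dataset = tf.data.Dataset.list_files("/path/to/train-*.tfrecord").interleave(
    tf.data.TFRecordDataset, cycle_length=4, num_parallel_calls=AUTOTUNE)
dataset = dataset.map(parse_and_preprocess, num_parallel_calls=AUTOTUNE)
dataset = dataset.batch(32)
dataset = dataset.prefetch(buffer_size=AUTOTUNE)
```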

  • I should mention that autotuning, at this point,

  • is enabled by default. But you still

  • have to specify the constant if you actually

  • want to indicate which of these knobs should be autotuned.

  • You can also disable autotuning if you would like

  • to try to do this manually.

  • And the mechanism for disabling autotuning is tf.data.Options.

  • The tf.data.Options is an object that

  • specifies global options that should be used for your input

  • pipeline.

  • And besides controlling autotuning,

  • it can also be used to control things

  • like static optimizations that are not

  • enabled by default, because they are not always

  • a win, such as map vectorization or map parallelization,

  • or, for instance, specifying whether your input pipeline is

  • allowed to produce elements out of order; by default,

  • your input pipeline will be deterministic.

  • The options object also allows you to, for example,

  • collect statistics about data in your input pipeline.

  • And for the performance experts in the audience,

  • it also allows you to fine tune threading parameters of tf.data

  • internals.

  • And the way you would use tf.data.Options

  • is that once you create your data set,

  • you also create an instance of the options object

  • and set whatever options that you're interested in.

  • In this example, I'm turning the map_parallelization

  • optimization on.

  • And then, importantly, you associate

  • the options object with the data set

  • using the with_options transformation, which,

  • similar to all the other transformations that I talked

  • about up to this point, returns back a new data set that

  • now has the options applied.
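
A minimal sketch of this usage; the option attribute names shown are from the TF 2.0-era API and have been renamed in later releases:

```python
# Configure global options for the input pipeline.
options = tf.data.Options()
options.experimental_optimization.map_parallelization = True  # opt-in static optimization
options.experimental_deterministic = False  # allow elements to be produced out of order

dataset = dataset.with_options(options)  # returns a new dataset with the options applied
```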

  • Last thing pertaining to tf.data that I would like to talk about

  • is the TensorFlow data sets project.

  • So up to this point, I've been talking about just core tf.data

  • API, which can be used by our users to create input pipelines

  • starting from raw data.

  • However, for a lot of common existing data sets,

  • this is a repetitive task.

  • And machine learning beginners or novice users especially

  • do not necessarily want to do that to get

  • started with machine learning.

  • And to make it easier to onboard new users,

  • as well as to use existing data sets,

  • the TensorFlow Datasets project provides canned data

  • sets that are ready to be used with the rest of TensorFlow.

  • The way you could use TensorFlow data sets

  • project is once you import it as a module,

  • you can, for example, list, through the list_builders call,

  • the set of available data sets.

  • And I think, at this point, there is something like 60

  • plus different data sets spanning text, image, audio,

  • and video, that are supported through the TensorFlow data

  • sets project.

  • Then, through the load command, using the name

  • as the identifier of the data set

  • you would like to load and optionally

  • the split argument, which you can use to identify whether you

  • want the training or the test portion of the data

  • set, you get back an instance of a tf.data data set

  • that can be immediately used with your model.

  • Or you could, because it's a tf.data data set instance,

  • you can optionally apply some custom transformations to it,

  • such as, in this case, shuffling and batching.

  • Or, if you would like to just inspect

  • what's inside of the data set, you

  • could do so using a simple Python-like iteration where

  • you can print the elements of the data set.
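
A minimal sketch of this workflow, using "mnist" purely as an example dataset name:

```python
import tensorflow_datasets as tfds

print(tfds.list_builders())  # names of all available canned datasets

# Load the training split; as_supervised=True yields (feature, label) tuples.
dataset = tfds.load("mnist", split="train", as_supervised=True)

# Because it is an ordinary tf.data dataset, we can chain transformations...
dataset = dataset.shuffle(1024).batch(32)

# ...or inspect a few elements with plain Python iteration.
for images, labels in dataset.take(1):
    print(images.shape, labels.shape)
```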

  • So this concludes the first part of my talk

  • in which I talked about tf.data.

  • And in the second part of my talk,

  • we're going to talk about the distribution strategy API.

  • So similar to the first part of our talk, where we asked ourselves,

  • why do we need an input pipeline API?

  • Let's start by asking ourselves, why

  • do we need to do distributed training?

  • Why do we need distribution strategy API?

  • Well, it turns out that if we do training in one machine,

  • on one device, it can take a pretty long time.

  • This graph illustrates that by showing the accuracy achieved

  • by the ResNet model over time on the ImageNet

  • data set using a single GPU.

  • And you can see that it takes close to 90 hours

  • to get to accuracy around 75%, while the most performant

  • implementations of the same model, or deployments

  • of the same model actually take less than 10 minutes using

  • an amazing amount of resources parallelizing this computation.

  • What going down from 87 hours to 10 minutes enables

  • is that you can actually experiment

  • with ideas very quickly as opposed

  • to starting an experiment and waiting for one or two

  • days before you can do the next iteration.

  • And I think this is game changing.

  • So I hope I convinced you that distributing your computation,

  • if it takes a very long time, is a very good idea.

  • So let's talk about how you do that with TensorFlow's

  • distribution strategy API.

  • There are three main goals that the distribution strategy

  • API has.

  • First of all, it should be easy to use.

  • What this means is that it should be possible for you

  • to create your input pipeline and your model assuming

  • that it's going to run on one device and then,

  • with minimal code changes, be able to deploy

  • to different architectures, either multiple GPUs

  • on your workstation or possibly even a cluster of workstations

  • that either have GPUs or TPUs attached to them.

  • It should also provide great out of box performance.

  • This means that the performance that you get out

  • of using distribution strategy should

  • be close to the performance you would get if you were manually

  • targeting a specific architecture

  • with your implementation.

  • And finally, it should be versatile.

  • So it should support different types of architectures,

  • different types of hardware, and different types of APIs

  • for your input pipeline or model.

  • The use cases for the distribution strategy API

  • can be roughly categorized as follows,

  • ranging from the simplest to perhaps the most advanced.

  • So the simplest one is you have a model that

  • uses either the Keras or Estimator API.

  • And you would like to distribute it.

  • And this is what we are going to cover in this talk.

  • The second one is you have a model that you

  • used lower level TensorFlow APIs to create a custom training

  • loop.

  • And you would like to distribute it.

  • And we're also going to cover that in this talk.

  • Now, the last two, the more advanced ones,

  • namely making a layer, library, or infrastructure

  • distribution-aware-- so, for example, how

  • would you make something like Keras distribution aware?

  • Or, how would you make a new strategy,

  • where strategy is something that is an abstraction that

  • hides or decouples the model and input

  • pipeline from the particular architecture?

  • Those two use cases we will not cover in this talk.

  • But they're covered by guides and tutorials

  • on the TensorFlow web site.

  • So in case that you would like to learn more,

  • I direct your attention to the TensorFlow website.

  • So let's start by talking about the use case

  • where you have a model that's created

  • either in Keras or in Estimator.

  • And you would like to distribute it using the distribution

  • strategy.

  • And in this section, I'm also going

  • to introduce the distribution strategies that are actually

  • available in TensorFlow.

  • So the first strategy that's available, called

  • mirrored strategy, is one that allows you to distribute

  • your program across multiple GPUs attached

  • to a single worker.

  • And the particular implementation

  • of this strategy uses something called

  • all-reduce synchronous training, where the synchronous part

  • means that all of the devices will be performing steps

  • in a lock-step, so in a coordinated fashion.

  • While the all-reduce portion pertains

  • to how the different devices exchange information

  • about the local updates that they collect in each step.

  • To shed a little more light onto how the all-reduce algorithm

  • works, on this slide, I illustrate

  • what happens in the all-reduce algorithm

  • when you have three GPUs that each perform a single step that

  • updates a mirrored version of three variables.

  • So each of the boxes, the blue, the green, and the pink,

  • corresponds to a variable that, in a single step,

  • receives different updates on different devices.

  • And once the step is performed on all the devices,

  • we can propagate the updates in a circular fashion

  • between the different devices.

  • And at that point, all of the devices

  • will have all of the updates from all the devices for all

  • of the variables, requiring N minus one transfers for N

  • devices.

  • And then, once all the updates have been collected,

  • a reduce function can be used to combine

  • the updates to a single global value

  • where the common reduced functions are either

  • a sum or an average of those updates.
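
As a toy, plain-Python illustration of what the reduce step computes (not of how a ring all-reduce actually moves the data), consider three devices that each hold local updates for the same three variables:

```python
# Each "device" holds local gradient updates for three mirrored variables.
local_updates = {
    "device_0": [1.0, 2.0, 3.0],
    "device_1": [4.0, 5.0, 6.0],
    "device_2": [7.0, 8.0, 9.0],
}

num_variables = 3
# After the exchange, every device applies the same reduced value (here: a sum)
# for each variable, which keeps the mirrored variables identical everywhere.
reduced = [sum(updates[v] for updates in local_updates.values())
           for v in range(num_variables)]

print(reduced)  # [12.0, 15.0, 18.0]
```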

  • And with that knowledge in mind, a single step

  • of a synchronous training can be illustrated

  • on this example, where let's assume we

  • have a model with two layers.

  • And each layer has two variables.

  • And the variables are mirrored on each device.

  • We have two devices.

  • Now, in the forward pass, data is

  • propagated through the layers.

  • And then in a backward pass, the gradients for the variables

  • are computed.

  • And at that point, the updates to the two variables

  • on the different devices might be different.

  • Because we actually used a different piece

  • of data on each device.

  • And at that point, it's where we use the all-reduce algorithm

  • to share the updates on each device with each other,

  • and thus achieving a global state across the two devices.

  • And this is what synchronous training refers to.

  • Now, to-- let's take a look at how you would actually

  • go about using a mirrored strategy with the Keras

  • and the Estimator APIs.

  • So to create an instance of a mirrored strategy, you can--

  • there's a couple of different factories, a default one or one

  • where you can explicitly name the devices

  • that you would like to create the mirrored strategy for.

  • I believe that the default is if you don't specify it,

  • it's going to be all the GPUs attached to your worker.

  • And you can also optionally specify arguments

  • for the all-reduce algorithm through the cross_device_ops

  • argument of the MirroredStrategy constructor.
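
For reference, these constructor variants might look roughly like this (device names are illustrative):

```python
# Default: mirror across all GPUs attached to this worker.
strategy = tf.distribute.MirroredStrategy()

# Explicitly name the devices to mirror across.
strategy = tf.distribute.MirroredStrategy(devices=["/gpu:0", "/gpu:1"])

# Customize how the all-reduce is performed via cross_device_ops.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
```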

  • Now, how would we use mirrored strategy or any other strategy,

  • for that matter, with Keras API?

  • Well, here's a common or a simple example

  • of a Keras model for ResNet 50 with a stochastic gradient

  • descent optimizer.

  • We create the model.

  • We specify the optimizer.

  • And then we use the compile and fit APIs to perform training

  • over our training data set, which is an instance of tf.data

  • data set.

  • Now, this runs on a single machine,

  • possibly using a local GPU.

  • In case we have multiple GPUs, we

  • can simply define an instance of MirroredStrategy

  • and then make sure that all of the model creation

  • is wrapped inside of a strategy.scope.

  • And with these two lines, your program

  • will now be able to run on all the GPUs

  • available on the worker.

  • And the key here is that the strategy.scope

  • will take care of variable creation inside of your model,

  • making sure that all the variables are mirrored

  • on the different GPU devices.

  • And the body of the strategy.scope

  • will be distribution aware.
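
Putting those two lines together with the Keras example, a minimal sketch might look like this, assuming train_dataset is a tf.data dataset of image/label batches:

```python
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Variables created here are mirrored on every GPU covered by the strategy.
    model = tf.keras.applications.ResNet50(weights=None)
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
    model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer)

model.fit(train_dataset, epochs=10)  # train_dataset: assumed tf.data dataset
```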

  • So recall that one of the goals for the distribution strategy

  • API was that it provides great out of box performance.

  • So on this slide, I would like to convince you

  • that it does, at least for mirrored strategy

  • on the ResNet 50.

  • So what we're looking at here is the performance

  • of a ResNet 50 based training using Keras,

  • running TensorFlow 2.0 on Google Cloud.

  • The vertical axis of the graph plots images per second.

  • And the horizontal axis ranges the number of GPUs from one

  • to two to eight.

  • And we can see that using mirrored strategy

  • achieves close to linear scaling,

  • starting with a single GPU achieving roughly 1,250

  • images per second to eight GPUs achieving close to 10,000

  • images per second.

  • Now, we've covered the Estimator-- sorry,

  • the Keras API usage with distribution strategy.

  • Let's also cover the Estimator API usage.

  • So this is a common example of how you would use the Estimator

  • API for your training.

  • Namely, you define a classifier using the Estimator constructor

  • that you provision with a model function.

  • And then, through the train call,

  • you specify an input function, which can return, for instance,

  • a tf.data data set.

  • And it performs the training.

  • In order to parameterize the Estimator API with a strategy,

  • all you need to do is, again, to create

  • an instance of the strategy, in this case, MirroredStrategy,

  • and pass it in through the RunConfig option

  • into the Estimator API.

  • And once that happens, the RunConfig

  • with a strategy will make sure

  • that the model function is created once per replica.

  • And replica, in this case, refers to the GPU.

  • So you're going to have copies of the model on each GPU

  • as well as of all the variables inside of the model.

  • And you will perform the all-reduce synchronous training

  • across the multiple GPUs.
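
A minimal sketch of the Estimator variant, assuming model_fn and input_fn are defined elsewhere:

```python
strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=strategy)

# The model function will be instantiated once per replica (i.e. per GPU).
classifier = tf.estimator.Estimator(model_fn=model_fn, config=config)
classifier.train(input_fn=input_fn)
```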

  • So distributing your computation across multiple GPUs

  • on a single machine can get you up to an N-times speedup,

  • where N is the number of accelerators attached

  • to your machine.

  • But there is a physical limit to how many

  • accelerators you can have.

  • And to go beyond that limit, the natural next thing

  • is to actually use multiple machines, each with one

  • or multiple accelerators.

  • And that's what the multi-worker mirrored strategy

  • is intended to help you with.

  • And it's very similar to the mirrored strategy.

  • The only difference is that instead of distributing

  • your computation over GPUs on a single machine,

  • it distributes the computation over GPUs on many machines.

  • And it performs the all-reduce computation

  • not just across GPUs on a single workstation,

  • but across the GPUs on all the different workstations.

  • And it does so through TensorFlow collective ops,

  • which allows you to actually send

  • data in a broadcast fashion between the different

  • TensorFlow workers.

  • The way you would use this API is

  • similar to the mirrored strategy API.

  • So you can create a default instance

  • of a MultiWorkerMirroredStrategy.

  • Or you can specify a specific CollectiveCommunication

  • algorithm to be used.

  • Unlike the mirrored strategy, you also

  • need to specify information about the different workers

  • that are participating inside of your computation.

  • And this is done through a JSON-encoded string that

  • identifies the host and ports of your different workers

  • as well as task types.
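
A sketch of this setup; the host:port values are illustrative, and each worker would set its own task index:

```python
import json
import os

# Cluster description and this worker's own task, provided via TF_CONFIG.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},  # index 1 on the second worker
})

strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

# Or pick a specific collective communication implementation:
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.NCCL)
```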

  • The third strategy that I'm going to talk about,

  • and that's available in the TensorFlow distribution

  • strategy API is the TPU strategy.

  • And this one is very similar to the mirrored strategy.

  • The main difference is that it allows

  • you to perform the all-reduce synchronous training on TPUs,

  • which are the hardware accelerators made by Google

  • specifically for TensorFlow.

  • But at this point, there are also

  • other frameworks that are capable of leveraging them.

  • And you can do so through the Google Cloud platform.

  • And unlike the mirrored strategy,

  • it uses the cross_replica_sum to perform the all-reduce

  • on TPUs, which is something that's

  • a difference between GPUs and TPUs.

  • And you can use this strategy for training

  • on a single TPU or an entire pod,

  • which is a term that refers to a set of TPU cores

  • in a topology.

  • To use a TPU strategy is a little more complicated.

  • And it's also somewhat of an area of active development,

  • which the experimental portions of the API refer to.

  • But the high level idea is that you create a TPU cluster

  • resolver, which allows you to gather information

  • about your TPU hardware.

  • And then you create the TPU strategy

  • with this cluster resolver argument,

  • which then allows the TPU strategy to be

  • aware of the TPU hardware location and specifics.
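
A sketch of the setup just described; the TPU name is a placeholder, and the exact call sequence has shifted between TF 2.x releases:

```python
# "my-tpu-name" is a placeholder for your TPU's name or address.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu-name")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

strategy = tf.distribute.experimental.TPUStrategy(resolver)
```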

  • And so up to this point, I've been

  • talking about synchronous training, where

  • all the devices in your training loop

  • are performing or operating in a lock-step, one step at a time.

  • An alternative to synchronous training,

  • which might be suitable for certain types of machine

  • learning tasks, is so-called asynchronous training,

  • where the different devices or different workers

  • in your set of workers performing your computation

  • are actually running at different rates.

  • And one of the architectures that

  • enables asynchronous training is a so-called parameter server

  • and worker architecture where your machines

  • have one of two roles, parameter server tasks or worker tasks.

  • The parameter server tasks are where global variable state

  • is stored and either updated or fetched

  • from by the individual workers.

  • While the workers perform the machine learning

  • computation one step at a time, but not necessarily

  • at the same rate.

  • And this architecture can be targeted

  • for your machine learning program using the parameter

  • server strategy.

  • You create it using this factory.

  • And similar to the multi-worker strategy,

  • you need to specify information about the workers and the types

  • of tasks that the worker machines should play,

  • namely the worker task or the parameter server task.

  • And again, this is done through the TF_CONFIG

  • environment variable.
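
A sketch of this setup, with an illustrative cluster of two workers and one parameter server task (TF 2.0-era experimental API):

```python
import json
import os

# Each machine sets its own task type ("worker" or "ps") and index.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["host1:12345", "host2:12345"],
        "ps": ["host3:12345"],   # parameter server task
    },
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.experimental.ParameterServerStrategy()
```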

  • And a last strategy that I want to talk about

  • and that's available through the distribution strategy API

  • is the central storage strategy.

  • This is a special case of the parameter server strategy

  • where there is a single parameter server.

  • And its role is being fulfilled by a CPU

  • of the machine on which the other devices reside.

  • And the benefit of this strategy is

  • that any single GPU might not be able to fit

  • all the embeddings, all the variable states inside of them.

  • But the CPU might.

  • And in cases where this is a good fit,

  • the central storage strategy is available.

  • And this is how you would create one.
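
For example, a minimal sketch (TF 2.0-era experimental API):

```python
# Variables are kept on the machine's CPU; compute is replicated on its GPUs.
strategy = tf.distribute.experimental.CentralStorageStrategy()
```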

  • And that concludes the part talking about the Keras

  • and Estimator API support, as well as

  • the enumeration of the different types of strategies

  • that are available in the tf.distribute strategy API.

  • And in the last part of my talk, I'm

  • going to talk about how would you

  • go about distributing a model that you created using a custom

  • training loop, which is effectively a model

  • created out of lower level TensorFlow APIs?

  • The prerequisite for your custom training loop

  • to be distributable using the distribution strategy API

  • is that it has to adhere to the following programming model.

  • In particular, as far as data sources are concerned,

  • your variables may be read from any replica.

  • But the input data that's used for training

  • will be sharded, meaning divided into disjoint sets that

  • will be accessed exclusively by one replica.

  • Each replica performs computation on its own shard of the data.

  • And then the computation is combined using a reduction.

  • So, in essence, this programming model

  • is that of all-reduce synchronous training.

  • But it can be implemented using lower level TensorFlow APIs.

  • So let's take a look at how an example of a custom training

  • loop distributed through a distribution strategy

  • would look like.

  • So we create an instance of a distribution strategy.

  • And then we create a data set using your own create data set

  • method that takes a batch size.

  • The important bit here is that the batch size

  • should be the global batch size, that is a batch size that you

  • choose independently of the number of replicas or devices

  • on which you are going to run your computation.

  • And it's going to be the responsibility

  • of the distribution strategy API to actually divide

  • this global batch size into per replica batch sizes.

  • And this is done through the experimental_distribute_dataset

  • invocation, which wraps the, quote unquote, sequential, data

  • set in what's called a distributed data set.

  • But as far as the custom training loop is concerned,

  • there is no difference between the two.

  • And your model, similar to the Keras API usage,

  • should be created under the strategy.scope, which

  • means that all the variables must be created

  • under this scope so that they're properly mirrored

  • across the different replicas.
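
A minimal sketch of this setup; create_dataset and make_model are hypothetical helpers standing in for your own input pipeline and model code:

```python
strategy = tf.distribute.MirroredStrategy()

GLOBAL_BATCH_SIZE = 64  # chosen independently of the number of replicas
dataset = create_dataset(GLOBAL_BATCH_SIZE)        # hypothetical helper
dist_dataset = strategy.experimental_distribute_dataset(dataset)

with strategy.scope():
    model = make_model()                           # hypothetical helper
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
    # Per-example losses; we reduce across the global batch ourselves later.
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)
```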

  • As an alternative to delegating the distribution

  • of your data set to TF distribution strategy,

  • you can also use an alternative API, distribute dataset

  • from function, which gives the user the control

  • to decide what portions of the data set

  • should be distributed on which replica

  • and how by, instead of providing a data set,

  • you provide a data set factory, which takes as input

  • the distribution strategy context,

  • which has information such as the particular replica

  • index or the total number of replicas.
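
A sketch of that alternative, continuing the example above (create_dataset remains a hypothetical helper):

```python
def dataset_fn(input_context):
    # Derive the per-replica batch size from the global one and shard the
    # data so that each input pipeline sees a disjoint slice. In practice you
    # would usually shard by files, before the expensive preprocessing.
    batch_size = input_context.get_per_replica_batch_size(GLOBAL_BATCH_SIZE)
    d = create_dataset(batch_size)
    return d.shard(input_context.num_input_pipelines,
                   input_context.input_pipeline_id)

dist_dataset = strategy.experimental_distribute_datasets_from_function(dataset_fn)
```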

  • And then in the rest of my presentation,

  • we're going to take a look at how you would actually

  • build a custom training loop in a kind of bottom up fashion.

  • So the first thing, the lowest building block

  • is the logic that performs a single training

  • step on a replica.

  • And here's an example of how you would do that.

  • So you would use a GradientTape, perform some computation,

  • and then, with the help of the tape,

  • compute gradients, apply them to model variables,

  • and return the loss.

  • And this is something that happens on a single replica.
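
A sketch of such a per-replica step, using the model, optimizer, and loss object created under strategy.scope() in the earlier sketch:

```python
def replica_step(inputs):
    # Runs independently on each replica, on that replica's slice of the batch.
    images, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        per_example_loss = loss_object(labels, predictions)
        # Scale by the global batch size so that summing across replicas
        # yields the correct average loss.
        loss = tf.reduce_sum(per_example_loss) / GLOBAL_BATCH_SIZE
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
```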

  • Now, to tie this computation across--

  • that happens on different replicas

  • together in a single training epoch,

  • you can enumerate the individual elements of the data set

  • using Python iteration.

  • And then use the run API of the distribution strategy

  • with the replica step function and the input

  • to collect the loss for that particular replica.

  • And combine the individual losses

  • using the reduce call with a particular reduce operation.

  • And at that point, you could do any per-step processing

  • inside of this for loop.

  • For example, you could print the loss.

  • But you could do other types of computations here as well.

  • Now, one thing you might notice is

  • that this train_epoch function has a tf.function decorator.

  • The effect of this decorator is that TensorFlow will interpret

  • this Python function as a graph computation,

  • optionally using autograph to convert Python idioms,

  • such as the Python iteration of our data set,

  • into equivalent graph building methods.

  • And the reason we recommend using tf.function decorator

  • for your train_epoch here is that it will generally result

  • in much better performance.

  • Because the entire training epoch

  • will be executing as a data flow graph as

  • opposed to a Python function.
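
Tying this together, a sketch of the epoch function might look like the following; note that in the TF 2.0-era API the per-replica call was spelled strategy.experimental_run_v2, while later releases call it strategy.run:

```python
@tf.function  # trace the whole epoch into a data flow graph
def train_epoch(dist_dataset):
    total_loss = 0.0
    for inputs in dist_dataset:
        # Run the step on every replica, then combine the per-replica losses.
        per_replica_losses = strategy.run(replica_step, args=(inputs,))
        total_loss += strategy.reduce(
            tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)
        # Any per-step processing (e.g. logging the loss) could go here.
    return total_loss
```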

  • Now, the last step of your custom training loop

  • is the iteration over multiple epochs, which

  • is pretty straightforward.

  • And this just illustrates how you

  • do that, and optionally inserting per epoch processing

  • inside of the outer loop, such as checkpointing your model

  • or running an eval of the model.
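
And a sketch of that outer loop, with checkpointing as the per-epoch processing:

```python
NUM_EPOCHS = 10  # illustrative value
checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)

for epoch in range(NUM_EPOCHS):
    epoch_loss = train_epoch(dist_dataset)
    print("epoch", epoch, "loss", float(epoch_loss))
    checkpoint.save("/tmp/ckpt")  # illustrative checkpoint prefix
```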

  • So before I end, I want to give you

  • an overview of what's supported in TF 2.0 beta

  • as far as distribution strategy is concerned.

  • And this is a screenshot from the TensorFlow website.

  • So you can either take a picture now,

  • or you can also go to the website.

  • In the first column, we see the three types

  • of model building APIs, namely Keras, Estimator,

  • and custom training loop.

  • And then on the top row, we have the different types

  • of strategies.

  • And, as you can see, the Estimator API

  • is well supported across different types of strategies

  • while the other combinations are supported or on the way.

  • Most of them are targeting the release candidate (RC) of 2.0

  • for availability.

  • And that brings me to the end of my talk.

  • Thank you very much for your attention.

  • In case-- so throughout the talk,

  • I've been sharing links to different tutorials.

  • All the tutorials can be found on the TensorFlow web site

  • under the resources link shown here.

  • And in case you have any questions

  • or you would like to request a feature or report issues,

  • our GitHub repository is the correct forum for that.

  • So thank you very much for your attention.

  • [APPLAUSE]

  • [MUSIC PLAYING]
