SPEAKER: We've been sort of ambitious
here in terms of all the different kinds of distribution
we want to support.
In particular, I hear a lot from people who think of tf.distribute.Strategy
as a way to get access to an all-reduce programming model,
but really we are trying to do something that's much broader in scope.
That introduces a number of challenges and makes the programming model
that users of tf.distribute.Strategy can use a bit more restrictive.
And I'll go into that a little bit more.
But I believe it's the right move,
because it will allow not just a variety of different use cases,
but also a broad range of possible future work.
What I am going to talk about is how easy it is to use
and how we've made it so that you
can switch between distribution strategies,
because the strategy code is not really entangled with user
code, with a few exceptions, which I'll go into.
We've also done a lot of work to get very good performance.
But I am really not going to talk about that today.
It's very helpful to go through what these different training
architectures look like.
So you sort of have some idea of the range of different use
cases that we're trying to accommodate on the distribution
side.
Right now, we are doing only data parallelism in tf.distribute.Strategy.
That means we are dividing up our input into a bunch of different pieces.
The model itself is replicated across maybe 100 devices
or workers.
And there is some shared view of the model variables
that all of those model copies will update via some mechanism
that is specific to the particular strategy.
So the oldest approach here is parameter servers and workers,
which was what was used by DistBelief prior to TensorFlow's
creation and was the model assumed by TensorFlow libraries
prior to tf.distribute.Strategy.
For example, Estimator pretty much assumes a parameter server sort of model.
And here what is going on is that each worker task
is going to be communicating with the parameter server tasks.
Each variable lives on a single parameter server task.
You might have multiple parameter server tasks, though,
if you have lots of variables or they're big;
you would otherwise need more of them to avoid a bottleneck.
This model was very well-suited to the early uses of TensorFlow at Google,
which were characterized by the fact that we had huge quantities of CPUs
in a data center.
And we had large embedding matrices, and we would do very sparse lookups
in those large embedding matrices.
The main thing that's different about the tf.distribute.Strategy version,
ParameterServerStrategy, is that it has much better support
for multiple GPUs on a machine.
We call this a between-graph strategy, which is sort of a term
from TensorFlow 1, where we talk a lot more about graphs.
In TensorFlow 2 land, we don't really have a better term for this.
It's just that each worker is going to be running its own main function
that operates independently and schedules operations on itself
and the parameter server tasks that it's communicating with.
This allows these workers to run asynchronously.
And that combination of between-graph execution and asynchrony
makes it easy to tolerate failures.
So you can run preemptible workers at lower priority or whatever.
As long as your parameter servers stay up,
you can keep going without really any interruption.
There is a lot of RPC communication here,
and this sometimes shows up as a performance blip
at the beginning of a step when it's reading variables,
particularly if the runtime isn't clever about the order
in which it reads the variables.
That can happen because there are some operations it can do
on all the variables, but it really needs the variables
for the first layers first.
So we've seen some performance hiccups as a result of that.
Moving on to what I call the new central storage strategy,
which I think is still in review; basically, I'm going to be talking
about the vision of where we want to be.
This is ParameterServerStrategy writ small.
It restricts everything to a single machine,
and now we're distributing not across machines so much
as across the accelerators within a machine.
But again, the implementation is almost identical; it's just that
the configuration is different: all of the variables are stored
as a single copy on the CPU.
This is also known as PSCPU in the TF CNN benchmarks terminology.
So on one machine, there's no need to have multiple clients
or to worry about asynchrony.
This can all run synchronously, so that's a little bit different.
And you might use this even on one machine, if you have large embeddings
that won't fit on your GPU, or if your particular model or hardware
combination is particularly well-suited to this.
I think the TF CNN benchmarks have this giant matrix of what's
the best way for any particular model and hardware combination.
Something to experiment with.
Probably the thing tf.distribute.Strategy is most well-known for, though,
is all-reduce sync training.
So this is great where you have a good connection between all
of your devices, because they're going to be operating basically
in lockstep.
We're going to do one training step across a whole job.
And it's all going to be coordinated
via a single client.
And so this is used both for TPUs and for multiple GPUs
within one machine, using MirroredStrategy or TPUStrategy.
And in this strategy, we mirror the variables of the model
onto each of the devices.
So let's say you had a variable, call it A, in your model.
We're going to replicate that, with each device
having its own copy of the variable locally
so that it can just read it with no delay.
Together, these form a conceptual mirrored variable;
there's actually a MirroredVariable object that we return to the user
instead of these component variables, to store as their model's
member variables.
And then, we keep these in sync by applying identical updates.
And this is where we're going to need some way of communicating
between the different devices in order
to make sure those updates are all the same, which brings us
to all-reduce, which is a standard CS algorithm for communicating
between devices in a network-efficient manner.
So here, all means we're going to be communicating
from every device to every device.
And the reduce means we're going to do basically a sum or mean.
There's some other things like max and min
and so forth that you can also do.
But this is really where the synchronization between the devices happens,
and where we've also spent a significant amount of work
adding new capabilities to the runtime.
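A toy illustration of the semantics only, not of the efficient ring or hierarchical implementations; the values here are made up:

```python
import tensorflow as tf

# Each replica contributes a tensor of the same shape...
per_replica_values = [tf.constant([1.0, 2.0]),   # from replica 0
                      tf.constant([3.0, 4.0])]   # from replica 1

# ...and after a sum all-reduce, every replica holds the same reduced result.
reduced = tf.add_n(per_replica_values)            # [4.0, 6.0]
result_on_each_replica = [reduced] * len(per_replica_values)
```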
So how does synchronous training actually look?
So let's say we have a model, two layers on two devices.
And each of those layers has a variable.
So there are really four variable components here,
because we're keeping separate copies of each
of the two variables on each of the two devices.
Each device gets a subset of the training data
and does a forward pass, using just those local copies
of the variables.
And then, in the backward pass, we, again, compute
gradients using those local variable values.
And then, we take those gradients and aggregate them
by using an all-reduce that communicates
a single aggregated gradient value out to all the replicas.
And each replica then applies that gradient
to its local variable.
Since the all-reduce produces the same value
in all the replicas, the updates are all the same.
And the values stay in sync across all
the different replicas.
So the next forward pass here can start immediately.
You know, there's no delay waiting for the values,
because by the end of the step we have local copies
of the updated values for all variables.
And furthermore, we can get some parallelism here
by overlapping the all-reduce of the gradients for one layer
with the computation of the gradients of other layers.
So these can overlap all the way down the line,
keeping both the network communication and the computation parts
of whatever hardware you have busy at the same time.
That's great for throughput and performance.
And we observe that on many, many models, this performs well.
And we want to scale this up to multi-machine.
And members of our team-- [INAUDIBLE] and [INAUDIBLE]--
have made collective ops and a strategy for doing this
on multiple machines using these new collective ops.
So we have multi-worker mirrored strategy that implements this.
This is a little bit experimental.
We're working on it.
But it works today.
You can use it.
And it employs these new collective ops, again in a between-graph fashion,
so that each worker is only scheduling the ops that are running
on that worker, which is good for scalability.
But again, it's the same model as the mirrored strategy, where
everything's running in lockstep, synchronizing
the gradients on every step.
We have a couple of different all-reduce implementations.
I think it can use NCCL.
But also you can do a ring all-reduce within a machine,
or across machines, very similar to the multi-GPU situation.
Or you can aggregate within each machine,
communicate across machines, and then broadcast,
in a sort of hierarchical way.
And in different situations, one will perform better than the other.
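A minimal sketch of setting that up, using the experimental API names of the time; the hostnames are placeholders, and cluster membership comes from the TF_CONFIG environment variable:

```python
import json
import os

import tensorflow as tf

# Each worker sets TF_CONFIG with the full cluster spec and its own task index.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {'worker': ['worker0.example.com:2222',
                           'worker1.example.com:2222']},
    'task': {'type': 'worker', 'index': 0},
})

# Optionally pick the collective all-reduce implementation (AUTO, RING, NCCL).
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
    tf.distribute.experimental.CollectiveCommunication.AUTO)
```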
The last strategy is OneDeviceStrategy, which we currently don't expose.
But it's good for when you want to be able to supply a strategy
to a function that's going to open that strategy's scope,
and you want it to also work in a non-distributed context.
So now, I'm going to talk a little bit about things
we don't have today but maybe are coming, possibly
depending upon interest.
So if you would like us to prioritize some of this stuff,
let us know.
We have some GitHub issues, where
we can collect use cases and interest in these things.
So once upon a time, now deprecated,
there was the SyncReplicasOptimizer,
which combined the parameter server style of variable storage
with synchronized variable updates.
You might also want to have a sort of hybrid strategy,
where you have mirrored variables and all-reduce
for most of your variables, but if you have a large embedding that
won't fit on your GPU, maybe you just put that on the CPU,
as in the central storage strategy.
So we have GitHub issues tracking both
of those possible features.
Similarly, model parallelism is something where we have some
infrastructure in place so that we can eventually add it,
but it is not there yet.
The idea is that you would specify a number of logical devices,
and then manually place particular ops or layers or whatever
onto those logical devices.
That computation would then be spread across a number of actual
physical devices equal to the number of logical devices
times the number of replicas.
Now today, if you want to do model parallelism,
I'm going to recommend Mesh-TensorFlow.
It actually works and is out there.
And it has a different model of doing model parallelism, where
instead it's splitting operations
across lots of devices.
And in general, I think that's a good fit for backprop training,
because you have to keep all these intermediate values
from the forward step around in order to do the backward step.
If you just work it out, that's, I think, a better fit
for that type of training.
However, there is one case where
the other sort of parallelism is really natural,
which would be to do input pre-processing
on a separate job.
And that is a good fit, because there's no gradients involved
in input processing, so no need to hold
on to your intermediate values from the forward pass.
If you're interested in this, another GitHub issue--
express your interest.
Explain your constraints.
There are some questions for when we implement this-- for example,
do we need to support complete reproducibility
or deterministic allocation?
Or maybe we could just have a bunch of queues and threads
running as fast as they can,
and you get the records in the order that they come.
It would be good to know, if you're interested in this, what you need.
OK.
So that puts us into the next stage.
How do these strategies actually work under the covers?
When we were doing this, we took a very simple view:
we basically tried to figure out what the code looked like
written with MirroredStrategy and with ParameterServerStrategy.
And we saw what changed; anything that changed
had to be the responsibility of the strategy.
And what you learn doing this exercise
is that keeping state and updating state are the things that change.
So when you switch strategies, things like variables,
batch norm updates, metrics-- all those things need some little bit
of support to work well in a distributed setting.
And so we need a little bit of help to say, OK,
we're about to compute an update for a variable;
this is something you're going to need to do specially.
So what we've done is made the TF library, in general,
responsible for including any changes needed in order
to identify what is going to have to change when you distribute it.
We can't really, at this point, just take the sea of ops
that you get from maybe saving your model to disk in a GraphDef
and magically understand the intent behind all those ops
in order to efficiently distribute it.
We really need those hints, which the library provides
by calling strategy APIs.
But the good news is that almost all of that
is restricted to TensorFlow libraries,
and in many cases just base classes and not subclasses.
And I'll get more into that later.
So the simplest case is if you're using something like Keras or Estimator,
where we control your training loop in the TensorFlow library.
In that case, we can make the experience very easy.
But we are very interested in enabling new use cases
where the user wants to control the training loop.
And in that case, you might need to do a little bit more.
But hopefully, we also gave you new capabilities and lots
of good performance.
And so you'll be happy to add a few things to your code
in order to distribute your custom training loop.
I will also talk a bit about what to do if you're a TensorFlow developer
and you want to make your library work with distribution strategy
because it somehow interacts with state that needs to be distributed.
I'm not going to talk directly about making a new strategy.
If you want to make a new strategy, you're going to need,
at this time, pretty heavy involvement of the [INAUDIBLE]
team.
And so talk to us.
We'll help you.
Right now, we don't expect there to be a huge number of them.
We have reached out to the Horovod team.
And we have had some initial work
on making a strategy with them.
So there are different API surfaces for each of these use cases.
If you're making a new strategy, you have the whole of the library;
but there are much smaller slices in these other cases.
So in the simplest case of Keras and Estimator,
you just basically need to know the constructor
of these strategies and this scope thing.
If you just add these two lines to your Keras training loop
using compile and fit, then-- modulo bugs and feature requests
and things like making everything work--
the intent is that it just works.
You get mirroring for free.
You get a big performance win.
Basically, when you put everything inside the strategy scope,
you're selecting this strategy, saying it's the current strategy
and the one that should be used for everything.
The most important part of that is taking control of variable creation.
So when you are saying tf.keras.applications.ResNet50,
it's creating a bunch of variables.
Or maybe it waits until you call model.compile or fit.
But whenever it creates the variables, it's inside the strategy scope,
and so we control the variable creation.
We can use mirrored variables instead of regular variables.
And all of these Keras libraries and library function calls
have been made distribute-aware.
So really, there's not much else for the user to do.
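As a rough sketch, here is what those two lines look like around a small Keras model; the model, optimizer, and dataset are illustrative placeholders rather than anything from the slides:

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print('Number of replicas:', strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created by these layers become mirrored variables.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='sgd', loss='mse')

# The dataset is batched by the *global* batch size; Keras and the strategy
# handle splitting each batch across replicas.
features, labels = np.random.rand(256, 10), np.random.rand(256, 1)
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(32)
model.fit(dataset, epochs=2)
```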
With Estimator, it's again about two lines--
most of the API is just the constructor.
So all we really need to do is pass that distribution strategy
to the Estimator's RunConfig, either for training or eval or both.
And when the Estimator calls the user's model function,
it will call it once per replica inside the strategy scope automatically,
and as a result, it will use mirrored variables.
It also uses a special variable creator,
because we're going to call this function once per replica
and we want to make sure it uses the same variables each time--
maybe different components of the mirrored variables,
but the same mirrored variables.
And Estimator has been extended to know how to merge the results
of all of these model function calls into a single, coherent answer.
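A minimal sketch of the Estimator version, with a placeholder model_fn; the real point is just the train_distribute and eval_distribute fields of RunConfig:

```python
import tensorflow as tf

def model_fn(features, labels, mode):
    # Placeholder model function; called once per replica inside the scope.
    logits = tf.keras.layers.Dense(1)(features)
    loss = tf.reduce_mean(tf.square(logits - labels))
    optimizer = tf.compat.v1.train.GradientDescentOptimizer(0.01)
    train_op = optimizer.minimize(
        loss, global_step=tf.compat.v1.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

strategy = tf.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(
    train_distribute=strategy,   # distribute training
    eval_distribute=strategy)    # and/or distribute evaluation
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
```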
So this sort of is an introduction to--
or a good time to talk about one of the concepts
that we have inside distribution strategy, which is mirrored
versus per-replica values.
Mirrored values are the same across all replicas.
So mirrored variables are an example
of this, where we go through a lot of effort
to make sure that they are in sync.
But there are other things that can be mirrored.
For example, the output of reduction--
when we aggregate the gradients, those
will also be mirrored across all the replicas.
And in general, when we update a mirrored variable,
we need to update it with mirrored values.
So we know that the updates are all the same.
Per-replica values are almost everything else--
basically, all the things that are going to be different on every replica,
like the different inputs, the different activations, and so forth.
And we collect these values, either mirrored or per-replica,
into container types that have a single value per replica,
like mirrored variables.
An example of one is something we actually can hand to the user--
for example, the mirrored variables will be stored in the model
as the model's member variables.
And we've done some operator overloading so that
in the cases where the operations are safe to do
on the mirrored variables, we will do them for you.
Not all of the operations will be safe, and I'll go into that later.
And then, there's this shared variable creator:
in addition to intercepting variable creation
in order to create mirrored variables instead of regular variables,
we want to make sure that each call to the model function
produces the same variables.
There's a number of heuristics that we use to make sure
that when you create a variable in each model call,
we return the same variable as was created in the first model call.
So going on to custom training loops,
we're going to now start getting a little bit more
into the programming model that distribution strategy expects
all callers to conform to.
It's a little more exposed when you're writing a custom training loop.
You start from data sources, which are typically variables or datasets--
I guess they could also be constants,
which is not the most interesting example.
Then each replica-- again, this is data parallelism--
is going to be running a computation on that data.
And then we have to somehow combine the results,
and we have basically one tool here, which is a reduction,
or potentially concatenation too,
but that turns out to be not the common case.
Now, what do you do with that reduction?
If you are using an optimizer, you are going to add up all the gradients
you get from all the different replicas,
and then broadcast that reduced value to where all the variables are.
Now, in the mirrored strategy case,
the variables are in the same place as the computation,
so that just becomes an all-reduce.
And then we use the result to update the variables.
Another really common thing is people want the average loss
or something like that.
So we just take the reduced value, return it to the user, print it out.
Great.
A sort of new capability here when you're using distribution strategy
is to broadcast the aggregate value from all the replicas
back to all the replicas and then do some more computation.
So hopefully, this is going to unlock some doors
by allowing more complicated distributed
algorithms from researchers.
So we'll hopefully see the sort of MPI style of distributed
computation a lot more now that distribution strategy is
available.
So this is just a picture representation
of what I was just talking about.
Now, the dashed arrows are per-replica values--
activations and so forth--
that can be the input to a reduce in order
to become a mirrored value, which, as I said,
can be used to update a variable or just be returned to the user.
Now, I'm going to have several slides
here, where I'm going to go in detail to an example of using
the custom training loop.
There's going to be a little bit here
that's going to be future work.
But it's all in the plan.
This example is going to take a few slides.
So I can go into some detail.
But I'm going to show you how to make a custom training loop.
Like in the Keras example before, we create a strategy,
open its scope.
Not every operation actually has to be inside the scope.
But it's much simpler if we just put everything inside the scope
since that works.
And that's just a simple rule.
And in the future, you won't have
to worry about what goes in and what goes out.
Just put everything inside--
works great.
So the first thing we do in this example is create a dataset.
And we're going to pass it the global batch size,
just like in the Keras case.
It's the strategy's job to split that across replicas.
Now for now, we need users to explicitly wrap their data
sets using a strategy method we call
experimental_distribute_dataset.
In the future, we'll do this automatically
for any data set iterated inside a strategy scope.
If the automatic splitting algorithm
is inappropriate for whatever reason,
you can manually specify how to split your data
set, using a function that takes an input context
and returns a data set with a per-replica batch size.
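A minimal sketch of both options, using the experimental API names from this era (they may be renamed later); the dataset contents are placeholders:

```python
import tensorflow as tf

global_batch_size = 64
strategy = tf.distribute.MirroredStrategy()

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 10]), tf.random.normal([1024, 1])))
dataset = dataset.shuffle(1024).batch(global_batch_size)
# Explicitly wrap the dataset; the strategy splits each global batch
# across replicas.
dist_dataset = strategy.experimental_distribute_dataset(dataset)

# Alternatively, control the splitting yourself with a dataset function
# that receives an InputContext and batches at the per-replica size.
def dataset_fn(input_context):
    batch_size = input_context.get_per_replica_batch_size(global_batch_size)
    ds = tf.data.Dataset.from_tensor_slices(
        (tf.random.normal([1024, 10]), tf.random.normal([1024, 1])))
    ds = ds.shard(input_context.num_input_pipelines,
                  input_context.input_pipeline_id)
    return ds.batch(batch_size)

dist_dataset_2 = strategy.experimental_distribute_datasets_from_function(
    dataset_fn)
```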
So just like in the Keras case, again, the scope
controls variable creation.
So whenever you create your model,
it's best to do that inside the scope
so that any variables will be created
using the policy dictated by the strategy.
Originally, we tried making the Keras loss
classes automatically scale the loss values according
to the number of replicas.
We found that that did lead to some user confusion.
So for now, we've switched to requiring users
to explicitly specify a NONE reduction
and do the reduction as part of a later step
that you'll see in a future slide.
Or alternatively, you can just use any tensor
to tensor function directly.
In addition, optimizers have been made distribute-aware.
I'll talk about that in detail later.
So here, we define the function with the main computation
of our training loop that we're going to perform every step.
This function will be called once per replica, at least
in the mirrored strategy case.
And the model function may create variables, at least on the first call.
That's important, because that's frequently something Keras will do
if the input shape was unavailable at the time the model was created.
But this is fine since we're going
to be running this inside the strategy scope.
And variables will still use the strategy's policy.
Here's where we're going to average
the loss using the global batch size.
And that's a good policy independent of how many
replicas you have or whatever.
For regularization losses, we use the scale regularization
loss API, which divides by the number of replicas
so that when you add up across all the replicas,
you are going to get something that scales
with just the variables, not how many replicas you have.
By having an explicit call, we hope
to reduce the confusion that we saw with our earlier approach,
where we tried to automatically scale losses
by the number of replicas on user's behalf.
We're going to compute a gradient using ordinary TensorFlow 2 APIs.
This gradient is going to be local to each replica
and then passed to the optimizer, which is distribute-aware
and is going to deal with aggregating gradients, among other things,
which I will go into in detail later.
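Putting those pieces together, a sketch of what such a replica step might look like; it assumes the strategy, global_batch_size, and model from the earlier sketches, and uses tf.nn.compute_average_loss and tf.nn.scale_regularization_loss for the scaling just described:

```python
import tensorflow as tf

# Assumes `strategy`, `global_batch_size`, and a Keras `model` created inside
# strategy.scope(), as in the earlier sketches.
with strategy.scope():
    loss_object = tf.keras.losses.MeanSquaredError(
        reduction=tf.keras.losses.Reduction.NONE)  # no implicit reduction
    optimizer = tf.keras.optimizers.SGD(0.01)

def replica_step(inputs):
    features, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        per_example_loss = loss_object(labels, predictions)
        # Average by the *global* batch size, not the per-replica size.
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=global_batch_size)
        if model.losses:
            # Regularization losses are divided by the number of replicas so
            # the sum across replicas matches the non-distributed value.
            loss += tf.nn.scale_regularization_loss(tf.add_n(model.losses))
    gradients = tape.gradient(loss, model.trainable_variables)
    # The optimizer is distribute-aware: it aggregates the per-replica
    # gradients before applying them.
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return per_example_loss
```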
So those first two slides of the custom training loop
were demonstrating computation in cross-replica mode.
And this last slide was computation in replica mode.
In replica mode, we're operating on ordinary tensors.
And we can use the full TensorFlow
API to specify the computation that
is going to be repeated on each replica device.
Cross-replica mode-- instead, you
are operating on aggregate values, which
are maps from the replica to the tensor or variable
on that particular replica.
In the future, we're going to actually add a logical device component
in order to support model parallelism,
where you actually split a model across multiple logical devices.
We also have an update mode that is
going to be used inside the optimizer
to update each variable.
It's going to run code on whatever devices the variable resides on.
In ParameterServerStrategy, this will
be a single device, but maybe a different device
for each variable, whereas in mirrored strategy,
this will run all on the same devices as the computation.
Moving on with our example, here we're
going to train a whole epoch.
We currently recommend running this in graph mode,
which we get with the tf.function decorator there at the top.
We have tests, though, that verify that our API works in eager mode.
You'll likely want the performance of running
different replicas in parallel.
If you want to do some per-step processing that requires eager,
we recommend using tf.py_function.
So we're going to iterate over our data set.
Right now we are depending upon the call
to experimental distribute data set from the earlier slide
to split the data set across all the replicas.
The plan is to do this automatically
whenever you iterate inside a strategy scope.
Note that this is particularly tricky
in the multi-worker case.
In the one machine case, this is just splitting each batch.
But with multiple machines, we want
to do some sort of decentralized splitting
so you're not getting the same input on different workers.
In the body of the loop, we're going
to transition between cross-replica and replica mode,
which involves explicitly using strategy APIs.
The replica step function from the earlier slide
will be called in replica mode, once per replica
on different input shards.
There's a tricky situation here when we
are at the end of a data set.
So we don't have enough data to fill the batches
on all replicas.
In that situation, we need to pad the input with batch size
zero inputs to make sure all replicas perform
the same number of steps.
This way, all-reduce doesn't freak out
waiting for something from a replica that
isn't running a step at all.
Note that the all-reduce operations that we're
going to be doing inside the step are on gradients,
and those gradients are going to have the same shape as the variables,
not a dimension that depends on the batch size.
On those replicas where there's a batch size zero input,
we're going to have a zero gradient,
but at least it'll be the right shape.
experimental_run_v2 returns a per-replica value combining
the return value of replica_step from each replica.
In this case, each replica is returning a vector
of per-example loss values with size equal to the per-replica batch size,
and we then use the reduce API to average the loss
into an ordinary tensor.
By specifying axis equals zero, it
will average across the batch dimension
and across all the replicas to convert a global batch of loss
values into a single scalar.
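A sketch of that epoch function, assuming the strategy, dist_dataset, and replica_step names from the earlier sketches; experimental_run_v2 is the API name as of this talk:

```python
@tf.function
def train_epoch(dist_dataset):
    total_loss = tf.constant(0.0)
    num_batches = tf.constant(0)
    for dist_inputs in dist_dataset:
        # Run the step once per replica; returns a per-replica value.
        per_replica_losses = strategy.experimental_run_v2(
            replica_step, args=(dist_inputs,))
        # Average over the batch dimension (axis=0) and across replicas,
        # yielding an ordinary scalar tensor.
        total_loss += strategy.reduce(
            tf.distribute.ReduceOp.MEAN, per_replica_losses, axis=0)
        num_batches += 1
    return total_loss / tf.cast(num_batches, tf.float32)
```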
Lastly, here is a simple, standard outer loop.
We're going to iterate through all the epochs
that we're executing.
It runs outside of the tf.function in eager mode,
so you have a lot of flexibility to run whatever logic you want.
For example, you could put early stopping logic here.
You can also checkpoint or maybe eval after each epoch.
This is completely straightforward,
since mirrored variables implement the checkpoint saving protocol,
so they save the same way as normal variables.
We have tests that verify that the resulting checkpoints can
be loaded by a non-distributed model and vice versa.
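A sketch of that outer loop with checkpointing, again assuming the names from the earlier sketches; the checkpoint directory and epoch count are arbitrary:

```python
# Mirrored variables implement the checkpoint saving protocol, so this
# saves the same way as in the non-distributed case.
checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(checkpoint, './ckpts', max_to_keep=3)

for epoch in range(10):
    mean_loss = train_epoch(dist_dataset)
    print('Epoch', epoch, 'loss', float(mean_loss))
    manager.save()
    # Early stopping logic, eval, etc. could go here.
```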
So I talked about using strategy.reduce at the end,
after the experimental run call.
There are some alternatives. strategy.concat--
not quite implemented yet--
but it's another way of getting values out
in a way that doesn't really depend upon how it was split up
into different pieces for the data parallel computation.
You might also want to just get the results on the local worker.
And that's really important if you're going to do
some further computation and you're in one of these
between-graph settings where you have multiple main loops,
and you don't want two main loops using the data from any other worker.
The last thing I'm going to cover is making a library
that needs to be distribution-aware because it operates on [INAUDIBLE],
and what APIs you might use.
So the first, easiest thing to know about is tf.distribute.get_strategy,
which is how you get a strategy.
And the important thing to know about it
is that it always returns you something implementing the strategy API,
even if you're not in a strategy scope,
because there is a default strategy that does something
moderately sensible even when you're not inside some strategy scope
with a specific configuration.
So distribution-strategy-aware code is just code
that uses this get_strategy API and does its work via those APIs.
And most of the work is already done for you for the normal cases,
as long as you're just implementing a new metric, optimizer, or loss:
you just have to inherit from the base class
that has done all of the work to be distribution-enabled.
There are new capabilities available to you, though,
if you want to be distribution-aware,
and I'll talk a little bit about those new APIs and options.
But first, I'm just going to sort of explain the implementation.
For losses, we provide helpers for per-example
and regularization losses that are distribute-aware.
If you can supply the global batch size,
there is no actual need to do anything distribute-specific,
and we can just scale the loss by the value that
is constant for all batches, including
the last partial batch, and weigh each example equally.
Otherwise, we compute the per-replica batch size
from the tensor shape and scale it by the number of replicas
from the current strategy to get the global batch size.
This might be slightly wrong in that it'll
weight the last batch slightly more
but is the best we can do without knowing
the global batch size.
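A sketch of the scaling logic just described, written as a standalone helper rather than the actual library code; the helper name is made up:

```python
import tensorflow as tf

def scale_loss_for_distribution(per_example_loss, global_batch_size=None):
    """Illustrative helper: average a per-example loss for distributed training."""
    if global_batch_size is None:
        # Fall back to per-replica batch size times the number of replicas.
        strategy = tf.distribute.get_strategy()  # default strategy if no scope
        per_replica_batch = tf.shape(per_example_loss)[0]
        global_batch_size = per_replica_batch * strategy.num_replicas_in_sync
    return tf.reduce_sum(per_example_loss) / tf.cast(
        global_batch_size, per_example_loss.dtype)
```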
So now going into Optimizer, where we made much bigger changes.
We're going to look at the apply_gradients call,
which is passed parallel lists of gradients and variables.
And again, we get the strategy,
and then we do this thing called a merge call.
Now, merge_call is the sort of secret weapon we developed
at the beginning of creating distribution strategy.
It's basically our main tool for doing things that cross replicas
when you're inside a per-replica call.
So we can do synchronization or communication inside a merge call.
The way it works is that when the mirrored strategy is running
something on each replica, it's actually running
each of those functions in a separate thread.
So each thread corresponds to one replica,
but we only run one replica thread at a time.
We use [INAUDIBLE] and so forth so that the first replica runs
until completion or until it gets to a merge call.
If it gets to a merge call, we say, OK, pause that thread,
and run the next replica thread until it gets up
to the same point in the code, to that merge call.
Then we repeat that until we've gotten
all of the replica threads up to that merge call point.
We have args from each thread, and now we aggregate all the args
produced on all those threads into per-replica values.
Then we call the function that was passed to the merge call once,
with these aggregate values from all the merge calls.
That function can now do things like reductions and whatever else
across all those replicas, and whatever is returned by that function
is then returned as the return value of all of the merge calls.
And then we resume the first replica thread
until it finishes, and so on, and so forth.
So this distributed_apply function is the argument to the merge call,
the thing that runs inside it.
It's only executed once, even though apply_gradients
gets called for each replica.
And the grads-and-vars value here is now,
instead of being a list of variables and a list of gradients,
a list of mirrored variables and a list of per-replica gradients.
Now, we want those per-replica gradients
to be aggregated across all the replicas,
so we do a reduction where we add them up.
And this batch_reduce_to will reduce all of the gradients at once,
but it'll put each gradient, after aggregation,
on the devices where the corresponding variable lives.
So in the mirrored case, this is an all-reduce,
but in the parameter server case, each gradient might be going
to a variable living on a different parameter server, potentially.
So it takes the variable as the destination,
so it can know where to put the aggregated gradient value.
And then with those reduced gradients, we call update,
which calls this apply-gradient-to-update-variable function
once per device where the variable is.
The gradient values here are now mirrored values,
and update validates that they are mirrored
so that we can be sure that the update is
the same across all copies of the variable.
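A very rough sketch of that pattern, using the public merge_call, batch_reduce_to, and update APIs; this is not the real optimizer code, just the shape of it, and it applies a plain SGD update:

```python
import tensorflow as tf

class SketchSGD(object):
    """Illustrative sketch of the merge_call / batch_reduce_to / update pattern."""

    def __init__(self, learning_rate=0.01):
        self._lr = learning_rate

    def apply_gradients(self, grads_and_vars):
        # Runs once per replica, in replica mode.
        grads_and_vars = list(grads_and_vars)
        replica_ctx = tf.distribute.get_replica_context()
        return replica_ctx.merge_call(self._distributed_apply,
                                      args=(grads_and_vars,))

    def _distributed_apply(self, strategy, grads_and_vars):
        # Runs once, in cross-replica mode; gradients are per-replica values.
        # batch_reduce_to sums each gradient and places the result where the
        # corresponding variable lives.
        reduced_grads = strategy.extended.batch_reduce_to(
            tf.distribute.ReduceOp.SUM, grads_and_vars)
        update_ops = []
        for grad, (_, var) in zip(reduced_grads, grads_and_vars):
            # Run the update on the device(s) where the variable resides.
            update_ops.append(strategy.extended.update(
                var, lambda v, g: v.assign_sub(self._lr * g), args=(grad,)))
        return update_ops
```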
So this is that sort of subset of the programming model--
the state transition diagram here is restricted
to just normal variable updates.
This introduces another concept, which
is the sort of locality of the devices
where all these things are running.
Replica devices are the devices where we're
doing most of the computation.
Variable devices are devices where the variable lives,
which may be one or many, depending
on if it's mirrored or not.
And the reduce_to API is the bridge that
gets us from one to the other.
We also have this non-slot concept that is needed
in order to match the behavior of the Adam optimizer.
So we have the standard pattern that is generally how we update state.
The merge_call takes tensors from each replica
and produces per-replica containers.
Then reduce_to turns those per-replica containers
into aggregate values that are mirrored.
And then we update, using the mirrored values to update the variables.
And we know, because they have the mirrored type,
that the updates are identical.
So I see that we're a little bit low on time,
so I'm just going to breeze through this.
This is about the fact that we've overloaded the operations
on mirrored variables so that, for example,
assign operations will do all of those steps for you
as long as you set an aggregation.
That aggregation ends up being the reduce operation
that we do.
One of the new things you can do now with distribution strategy, though,
is opt into a different model for how we're going to update the variables.
Instead of syncing on write, like we do for mirrored variables,
we're going to sync on read, which means these variables are
going to be totally independent at write time.
We're going to keep writing to them, assuming reads are rare.
And then when we read, that's when we do the reduction
to aggregate the value across all the different variables
on the different replicas.
These aren't trainable, but they're really great
for at least a couple of cases we've seen.
One is metrics, and another is batch norm statistics
that are used only in eval.
So we have the synchronization and aggregation arguments
that we've added as new APIs in order to access these new capabilities.
You can set, for example, synchronization to ON_READ
in order to get this behavior for metrics and batch norm statistics.
And the variable aggregation can be set for both mirrored
and sync-on-read variables; it lets you say what reduction to do
when you do assignment operations.
You don't need to set the aggregation if you're only
using optimizers, though--
optimizers don't rely on this.
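A minimal sketch of creating such variables with these arguments inside a strategy scope:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    # A mirrored variable with an explicit aggregation, so direct
    # assign_add calls in replica code know how to combine the updates.
    counter = tf.Variable(
        0.0, trainable=False,
        aggregation=tf.VariableAggregation.SUM)

    # A sync-on-read variable: independent per-replica writes, aggregated
    # (here by SUM) only when it is read in cross-replica mode.
    running_total = tf.Variable(
        0.0, trainable=False,
        synchronization=tf.VariableSynchronization.ON_READ,
        aggregation=tf.VariableAggregation.SUM)
```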
I'm just going to breeze through this.
OK, so here's an example.
Metrics just change the add_weight method to change the defaults
so that SUM is the default aggregation [INAUDIBLE]
and ON_READ is the default synchronization.
And then when you implement a subclass of it,
you'll see we just compute the numerator and denominator
in the update method, and those use variables created with add_weight.
That just updates the per-replica values within each replica,
and we may be doing a parallel eval on a bunch of different replicas.
Then at the end, we call this result function,
and only then do we aggregate across replicas.
So we get a total numerator and a total denominator, divide,
and we get a value.
And you'll notice that we didn't have to use anything strategy-specific:
it's purely operator overloading, and all the logic needed
for distribution-enabling the subclass is in the parent class.
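A sketch of what such a metric subclass might look like; the class itself is made up, but the point is that nothing in it is strategy-specific:

```python
import tensorflow as tf

class SketchMean(tf.keras.metrics.Metric):
    """Illustrative metric: per-replica numerator/denominator, aggregated on read."""

    def __init__(self, name='sketch_mean', **kwargs):
        super(SketchMean, self).__init__(name=name, **kwargs)
        # The Metric base class defaults add_weight to ON_READ synchronization
        # and SUM aggregation, so nothing distribution-specific is needed here.
        self.total = self.add_weight('total', initializer='zeros')
        self.count = self.add_weight('count', initializer='zeros')

    def update_state(self, values, sample_weight=None):
        # Updates only the local (per-replica) variable components.
        values = tf.cast(values, self.dtype)
        self.total.assign_add(tf.reduce_sum(values))
        self.count.assign_add(tf.cast(tf.size(values), self.dtype))

    def result(self):
        # Reading the sync-on-read variables aggregates across replicas.
        return tf.math.divide_no_nan(self.total, self.count)
```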
You could imagine layers also wanting to use all-reduce.
Here's some code that will do it.
This does introduce synchronization across replicas
and could be used optionally for batch norm layers.
Batch norm layers are one of the very few cases where you have
interaction between batch elements,
and so if you don't have enough examples on each replica,
you want to communicate across replicas
to get good batch norm statistics.
They also use sync-on-read for the statistics used in eval,
as mentioned earlier.
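A sketch of the kind of thing such a layer could do, using ReplicaContext.all_reduce; this assumes equal per-replica batch sizes and is not the actual batch norm code:

```python
import tensorflow as tf

def global_batch_mean(x):
    """Illustrative: mean over the global batch via an explicit all-reduce."""
    local_mean = tf.reduce_mean(x, axis=0)
    ctx = tf.distribute.get_replica_context()
    if ctx is None:
        # In cross-replica mode there is nothing to all-reduce over.
        return local_mean
    # Average the per-replica means across all replicas.
    return ctx.all_reduce(tf.distribute.ReduceOp.MEAN, local_mean)
```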
A few more obscure methods-- we generally hide these in strategy.extended
so users don't have to see them.
variable_created_in_scope is used for error checking by Keras.
colocate_vars_with is used for slot variables in optimizers.
That's it.
Sorry I don't have a great conclusion,
except for if you have more to ask about this,
we have a component label on GitHub called dist-strat.
Ask away.
[APPLAUSE]
AUDIENCE: I have a question.
Say I want to implement a custom layer.
When I call add_weights, what do I
need to care about to make it work
with distribution strategy?
SPEAKER: Usually nothing.
The defaults are all set up in the base class
so that unless there's something weird about your layer
where like, it has interaction between different batch
elements, like in the batch norm case,
your layer probably just doesn't have to care.
AUDIENCE: I see.
SPEAKER: Because all that logic is in the base class.
AUDIENCE: I see.
And is mirrored strategy only have--
does it only have mirrored variables?
SPEAKER: Right now, you will get mirrored variables by default
if you use mirrored strategy, but you
can opt into these weird sync-on-read variables
by setting options that are not about default value.
AUDIENCE: OK, is optimizer.iterations [INAUDIBLE] a sync-on-read variable?
SPEAKER: I think it's supposed to be a mirrored variable.
AUDIENCE: It is a mirrored variable.
SPEAKER: It should be.
It should be mirrored variable, unless there's a bug.
AUDIENCE: I see.
AUDIENCE: It is a mirrored variable.
SPEAKER: OK, great.
We have our expert here to tell us.
AUDIENCE: Cool.
AUDIENCE: Do parameter servers work with accelerators?
Because you can imagine in the limit
that your accelerators have really, really
high interconnect, which I know a lot of them
do and are moving towards, that like,
mirrored would be too conservative.
And you'd like to say, round-robin your convolution
variables around the accelerators.
Like, can you do that, or is that planned?
SPEAKER: If you have exotic ideas, we can talk to you.
Right now, we are more focused on more basic use cases.
For accelerators with good interconnect,
we see that the all-reduce style has been more efficient.
The problem you might run into, though, is either running out of memory,
or, if your steps take widely different times--
maybe because you're doing an RNN with widely different step lengths--
that the synchronization overhead of doing everything in lockstep
might be a high cost.
AUDIENCE: In particular, I was thinking
about the memory of having [INAUDIBLE] copy on
[INAUDIBLE].
SPEAKER: Yeah, it's sort of hard to avoid that.
You can do model parallelism to reduce the memory,
but Mesh-TensorFlow is probably a better path there.