Distributed TensorFlow (TensorFlow @ O'Reilly AI Conference, San Francisco '18)

  • PRIYA GUPTA: Let's begin with the obvious question.

  • Why should one care about distributed training?

  • Training complex neural networks with large amounts of data

  • can often take a long time.

  • In the graph here, you can see training

  • the ResNet 50 model on a single but powerful GPU

  • can take up to four days.

  • If you have some experience running complex machine

  • learning models, this may sound rather familiar to you.

  • Bringing down your training time from days to hours

  • can have a significant effect on your productivity

  • because you can try out new ideas faster.

  • In this talk, we're going to talk

  • about distributed training, that is running training in parallel

  • on multiple devices such as CPUs, GPUs, or TPUs

  • to bring down your training time.

  • With the techniques that we'll talk about in this talk,

  • you can bring down your training time from weeks or days

  • to hours with just a few lines of change of code

  • and some powerful hardware.

  • To achieve these goals, we're pleased to introduce

  • the new distribution strategy API.

  • This is an easy way to distribute your TensorFlow

  • training with very little modification to your code.

  • With the distribution strategy API, you no longer

  • need to place ops or parameters on specific devices,

  • and you don't need to restructure a model in a way

  • that the losses and gradients get aggregated correctly

  • across the devices.

  • Distribution strategy takes care of all of that for you.

  • So let's go over the key goals of distribution

  • strategy.

  • The first one is ease of use.

  • We want you to make minimal code changes in order

  • to distribute your training.

  • The second is to give great performance out of the box.

  • Ideally, the user shouldn't have to

  • change or configure any settings to get the most performance out

  • of their hardware.

  • And third we want distribution strategy

  • to work in a variety of different situations,

  • so whether you want to scale your training

  • on different hardware like GPUs or TPUs

  • or you want to use different APIs like Keras or estimator

  • or if you want to run different distribution architectures

  • like synchronous or asynchronous training,

  • we want one distribution strategy to be useful for you

  • in all these situations.

  • So if you're just beginning with machine learning,

  • you might start your training with a multi-core CPU

  • on your desktop.

  • TensorFlow takes care of scaling onto a multi-core CPU

  • automatically.

  • Next, you may add a GPU to your desktop

  • to scale up your training.

  • As long as you build your program with the right CUDA

  • libraries, TensorFlow will automatically

  • run your training on the GPU and give you a nice performance

  • boost.

  • But what if you have multiple GPUs on your machine,

  • and you want to use all of them for your training?

  • This is where distribution strategy comes in.

  • In the next section, we're going to talk

  • about how you can use distribution strategy to scale

  • your training to multiple GPUs.

  • First, we'll look at some code to train the ResNet 50

  • model without any distribution.

  • We'll use the Keras API, which is the recommended TensorFlow

  • high-level API.

  • We begin by creating some datasets

  • for training and validation using the TF data API.

  • For the model, we'll simply reuse

  • the ResNet 50 that's prepackaged with Keras and TensorFlow.

  • Then we create an optimizer that we'll be using in our training.

  • Once we have these pieces, we can compile the model providing

  • the loss and optimizer and maybe a few other things

  • like metrics, which I've omitted in the slide here.

  • Once a model's compiled, you can then begin your training

  • by calling model dot fit, providing the training

  • dataset that you created earlier, along with how many

  • epochs you want to run the training for.

  • Fit will train your model and update the model's variables.

  • Then you can call evaluate with the validation dataset

  • to see how well your training did.
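
Putting those pieces together, a rough, self-contained sketch of the single-device setup just described might look like the following. The random arrays are stand-ins for a real ImageNet input pipeline, and the hyperparameters are illustrative only:

```python
# Sketch of the undistributed Keras ResNet-50 training described above,
# written against the TF 1.x-era Keras API used in this talk.
import numpy as np
import tensorflow as tf

def make_dataset(images, labels, batch_size=8):
    # tf.data input pipeline, as recommended in the talk.
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    return ds.shuffle(64).repeat().batch(batch_size)

# Stand-in data so the sketch runs end to end; replace with real ImageNet data.
train_dataset = make_dataset(np.random.rand(64, 224, 224, 3).astype("float32"),
                             np.random.randint(0, 1000, 64))
valid_dataset = make_dataset(np.random.rand(16, 224, 224, 3).astype("float32"),
                             np.random.randint(0, 1000, 16))

# Reuse the ResNet-50 that is prepackaged with Keras in TensorFlow.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)
optimizer = tf.keras.optimizers.SGD(lr=0.1, momentum=0.9)

# Compile with a loss and optimizer (metrics omitted, as on the slide).
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer)

# Train, then evaluate with the validation dataset.
model.fit(train_dataset, epochs=1, steps_per_epoch=4)
model.evaluate(valid_dataset, steps=2)
```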

  • So given this code to run your training

  • on a single machine or a single GPU,

  • let's see how we can use distribution strategy

  • to now run it on multiple GPUs.

  • It's actually very simple.

  • You need to make only two changes.

  • First, create an instance of something called

  • mirrored strategy, and second, pass the strategy instance

  • to the compile call with the distribute argument.

  • That's it.

  • That's all the code changes you need

  • to now run this code on multiple GPUs using distribution

  • strategy.
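
Sketched against the TF 1.x contrib API that this talk targets (the exact module path is an assumption; the distribute argument to compile is as described in the talk), the two changes amount to roughly:

```python
# Change 1: create a MirroredStrategy instance (under tf.contrib.distribute
# in the 1.11-era releases discussed here).
strategy = tf.contrib.distribute.MirroredStrategy()

# Change 2: pass the strategy to compile via the distribute argument.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              distribute=strategy)

# fit/evaluate stay exactly the same as in the single-device sketch above.
model.fit(train_dataset, epochs=1, steps_per_epoch=4)
```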

  • Mirror strategy is a type of distribution strategy API

  • that we introduced earlier.

  • This API is available in the TensorFlow 1.11 release,

  • which will be out very shortly.

  • And in the bottom of the slide, we've

  • linked to a complete example of training [INAUDIBLE]

  • with Keras and multiple GPUs that you can try out.

  • With mirror strategy, you don't need

  • to make any changes to your model code or your training

  • loop, so it makes it very easy to use.

  • This is because we've changed many underlying components

  • of TensorFlow to be distribution aware.

  • So this includes the optimizer, batch norm layers, metrics,

  • and summaries are all now distribution aware.

  • You don't need to make any changes to your input pipeline

  • as well as long as you're using the recommended TF data APIs.

  • And finally saving and checkpointing work

  • seamlessly as well.

  • So you can save with no distribution strategy or with one

  • strategy and restore with another seamlessly.

  • Now that you've seen some code on how

  • to use mirror strategy to scale to multiple GPUs,

  • let's look under the hood a little bit

  • and see what mirror strategy does.

  • In a nutshell, mirror strategy implements data parallelism

  • architecture.

  • It mirrors the variables on each device, e.g. each GPU,

  • and hence the name mirror strategy,

  • and it uses AllReduce to keep these variables in sync.

  • And using these techniques, it implements

  • synchronous training.

  • So that's a lot of terminology.

  • Let's unpack each of these a bit.

  • What is data parallelism?

  • Let's say you have N workers or N devices.

  • In data parallelism, each device runs the same model

  • and computation, but for a different subset

  • of the input data.

  • Each device computes the loss and gradients

  • based on the training samples that it sees.

  • And then we combine these gradients

  • and update the model's parameters.

  • The updated model is then used in the next round

  • of computation.

  • As I mentioned before, mirror strategy mirrors the variables

  • across the different devices.

  • So let's say you have a variable A in your model.

  • It'll be replicated as A0, A1, A2, and A3

  • across the four different devices.

  • And together these four variables

  • form a single conceptual variable

  • called a mirrored variable.

  • These variables are kept in sync by applying identical updates.

  • A class of algorithms called AllReduce

  • can be used to keep variables in sync

  • by applying identical gradient updates.

  • AllReduce algorithms can be used to aggregate the gradients

  • across the different devices, for example,

  • by adding them up and making them available on each device.

  • It's a fused algorithm that can be very efficient

  • and reduce the overhead of synchronization by quite a bit.

  • There are many versions of

  • AllReduce algorithms available based

  • on the communication available between the different devices.

  • One common algorithm is what is known as ring all-reduce.

  • In ring all-reduce, each device sends a chunk of its gradients

  • to its successor on the ring and receives another chunk

  • from its predecessor.

  • There are a few more such rounds of gradient exchanges,

  • and at the end of these exchanges,

  • each device has received a combined

  • copy of all the gradients.
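
To make the ring algorithm concrete, here is a small NumPy simulation of ring all-reduce for summing gradients. It is a sketch of the idea only, not anything TensorFlow ships:

```python
import numpy as np

def ring_all_reduce(grads):
    """grads: one equal-length 1-D gradient array per device.
    Returns a list in which every device holds the element-wise sum."""
    n = len(grads)
    # Each device splits its gradient into n chunks.
    chunks = [list(np.array_split(np.array(g, dtype=float), n)) for g in grads]

    # Phase 1 (reduce-scatter): in each of n-1 steps, device d sends one chunk
    # to its successor on the ring, which adds it into its own copy.
    for step in range(n - 1):
        sends = [(d, (d - step) % n, chunks[d][(d - step) % n].copy())
                 for d in range(n)]
        for d, idx, payload in sends:
            chunks[(d + 1) % n][idx] += payload

    # Phase 2 (all-gather): n-1 more rounds in which each device forwards the
    # fully reduced chunk it just received, so everyone ends up with all of them.
    for step in range(n - 1):
        sends = [(d, (d + 1 - step) % n, chunks[d][(d + 1 - step) % n].copy())
                 for d in range(n)]
        for d, idx, payload in sends:
            chunks[(d + 1) % n][idx] = payload

    return [np.concatenate(c) for c in chunks]

# Quick check: four devices, each with its own gradient vector.
grads = [np.arange(8.0) + d for d in range(4)]
reduced = ring_all_reduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```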

  • Ring all-reduce also uses network bandwidth optimally

  • because it ensures that both the upload and download bandwidth

  • at each host is fully utilized.

  • We have a team working on fast implementations of all

  • reduce for various network topologies.

  • Some hardware vendors such as NVIDIA

  • provide specialized implementations

  • of all-reduce for their hardware, for example,

  • NVIDIA [INAUDIBLE].

  • The bottom line is that AllReduce can be fast

  • when you have multiple devices on a single machine

  • or a small number of machines with strong connectivity.

  • Putting all these pieces together,

  • mirror strategy uses mirrored variables and all

  • reduce to implement synchronous training.

  • So let's see how that works.

  • Let's say you have two devices, device 0 and 1,

  • and your model has two layers, A and B. Each layer has

  • a single variable.

  • And as you can see, the variables

  • are replicated across the two devices.

  • Each device receives one subset of the input data,

  • and it computes the forward pass using its local copy

  • of the variables.

  • It then computes a backward pass and computes the gradients.

  • Once the gradients are computed on each device,

  • the devices communicate with each other

  • using all reduce to aggregate the gradients.

  • And once the gradients are aggregated,

  • each device updates its local copy of the variables.

  • So in this way, the devices are always kept in sync.

  • The next forward pass doesn't begin

  • until each device has received a copy of the combined gradients

  • and updated its variables.
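
As a toy illustration of one such synchronous step (plain NumPy arithmetic, not TensorFlow code; grad_fn is a hypothetical per-device gradient function):

```python
import numpy as np

def sync_step(replica_weights, shards, grad_fn, lr=0.1):
    """One synchronous data-parallel step.
    replica_weights: identical per-device copies of the weights.
    shards: one input batch per device.  grad_fn(w, batch) -> gradient array."""
    # Each device computes gradients on its own shard of the data.
    local_grads = [grad_fn(w, batch) for w, batch in zip(replica_weights, shards)]
    # All-reduce: every device ends up with the same combined gradient.
    combined = sum(local_grads)
    # Identical updates keep the mirrored variables in sync across devices.
    return [w - lr * combined for w in replica_weights]
```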

  • All reduce can further optimize things and bring down

  • your training time by overlapping computation

  • of gradients at lower layers in the network with transmission

  • of gradients at the higher layers.

  • So in this case, you can see that

  • you can compute the gradients of layer A

  • while you're transmitting the gradients for layer B.

  • And this can further reduce your training time.

  • So now that we've seen how mirror strategy looks

  • under the hood, let's look at what type of performance

  • and scaling you can expect when using

  • mirror strategy with multiple GPUs.

  • We use a ResNet 50 model with the ImageNet dataset

  • for our benchmarking.

  • It's a very popular benchmark for performance measurement.

  • And we use NVIDIA Tesla V100 GPUs on Google Cloud.

  • And we use a batch size of 128 per GPU.

  • On the x-axis here, you can see the number of GPUs,

  • and on the y-axis, you can see the images per second processed

  • during training.

  • As you can see, as we increase the number of GPUs

  • from one to two to four to eight,

  • the images per second processed is

  • close to doubling every time.

  • In fact, we're able to achieve 90% to 95% scaling out

  • of the box.

  • Note that these numbers were obtained by using the ResNet 50

  • model that's available in our official model garden repository,

  • and currently it uses the estimator API.

  • We're working on Keras performance actively.

  • So far, we've talked a lot about scaling onto multiple GPUs.

  • What about cloud TPUs?

  • TPU stands for Tensor Processing Unit.

  • These are custom ASICs, designed and built by Google

  • especially for accelerating machine learning workloads.

  • In the picture here, you can see the various generations

  • of TPUs.

  • On the top left, you can see TPU v1.

  • In the middle you can see Cloud TPU v2,

  • which is now generally available in Google Cloud.

  • And on the right side you can see

  • TPU v3, which was just announced at Google I/O a few months ago

  • and is now available in alpha.

  • And in the bottom of the slide, you

  • can see a TPU pod, which is a number of cloud TPUs

  • that are interconnected to each other using a custom network.

  • TPU pods are also now available in alpha.

  • So if you want to learn more about TPUs,

  • please attend Frank's talk tomorrow on cloud TPUs.

  • In this talk, we're just going to focus

  • on how you can use distribution strategy to scale

  • your training on TPUs.

  • It's actually very similar to what we just

  • saw with mirror strategy, but instead we'll

  • use TPU strategy this time.

  • So first you create an instance of a TPU cluster resolver

  • and give it the name of your cloud TPU resource.

  • Then you pass the cluster resolver to the TPU strategy constructor

  • along with another argument called steps per run, which

  • I'll come back to in a bit.

  • That's it.

  • Once you have the strategy instance,

  • you can pass it to your compile call as before,

  • and your training will now run on cloud TPUs.
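
A hedged sketch of that TPU path, using the TF 1.x contrib names of that era (treat the exact constructor signatures as assumptions; "my-tpu" is a made-up Cloud TPU name, and model, optimizer, and train_dataset are the ones from the earlier ResNet 50 sketch):

```python
import tensorflow as tf

# Resolve the Cloud TPU resource by name.
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu="my-tpu")

# steps_per_run batches several steps together to amortize overheads.
strategy = tf.contrib.distribute.TPUStrategy(resolver, steps_per_run=100)

# Same compile/fit as before, just with a different strategy.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              distribute=strategy)
model.fit(train_dataset, epochs=1, steps_per_epoch=100)
```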

  • So as you can see, distribution strategy

  • makes it really easy to switch between different types

  • of hardware.

  • This API will be available in the next TensorFlow

  • release, which is 1.12.

  • And in the bottom of the slide, we've

  • provided a link to training ResNet 50

  • with the estimator API using TPU strategy.

  • So let's talk a little bit about what TPU strategy does.

  • TPU strategy implements the same architecture

  • as mirror strategy.

  • That is it implements data parallelism

  • with synchronous training.

  • As for the cores on a TPU: there are eight cores

  • on a single cloud TPU.

  • And these cores are connected via fast interconnects.

  • And this means that you can do AllReduce really fast

  • to aggregate the gradients.

  • Coming back to the steps per run parameter

  • from the previous slide, for most models

  • the computation time of a single step

  • is small compared to the sum of the communication overheads.

  • So it makes sense to run multiple steps at a time

  • to amortize these overheads.

  • So setting this number to a high value like 100

  • will give you the best performance out of the TPUs.

  • The TPU teams are working on reducing these overheads so

  • that in the future you may not need to specify

  • this argument anymore.

  • And finally you can also use TPU strategy

  • to scale to cloud TPU pods, which are, as I mentioned,

  • in alpha release right now.

  • TPU pods consist of many Cloud TPUs interconnected

  • via a fast network.

  • And this means that AllReduce across the different Cloud TPUs

  • in the pod can be really fast as well.

  • So that's all about cloud TPUs.

  • I'll hand it off to my colleague Magnus to talk about scaling

  • onto multi-node with GPUs.

  • MAGNUS HYTTSTEN: Thank you.

  • So that was how we scale on multiple GPUs

  • on a single node.

  • What about multiple nodes, that is, when we have multiple computers?

  • Because the fact is that even though you

  • can cram in a lot of GPU cards, for example,

  • on a single computer, sooner or later, if you

  • do massive amounts of training, you

  • will need to consider an architecture where

  • you can scale out to multiple nodes as well.

  • So this is an example where we see four worker nodes with four

  • GPU cards in each of them.

  • In terms of

  • multi-node support, we currently have

  • support for premade estimators as of TensorFlow 1.11,

  • which is subject to be released shortly.

  • And we are working very, very hard

  • with some awesome developers to get this support into Keras

  • as well.

  • So you should be aware that Keras support will

  • be there as soon as possible.

  • However, if you do want to use Keras

  • with a multi-node distribution strategy,

  • you can actually achieve that using a little trick that's

  • available in Keras, a function called

  • model to estimator.

  • tf.keras.estimator.model_to_estimator takes a Keras

  • model as an argument and actually

  • returns an estimator that you can

  • use for multi-node training.

  • So how do we set up a multi-node training environment

  • in the first place?

  • This was a really, really difficult problem

  • up until the technology that's open source now

  • called Kubernetes was released.

  • And so even though you can set up multi-node training

  • with TensorFlow without running Kubernetes,

  • it will certainly help to use Kubernetes as the orchestration

  • platform to fire up multiple nodes.

  • And Kubernetes is available in most clouds:

  • GCP, and I think AWS and others as well.

  • So how does that work?

  • Well, a Kubernetes cluster contains a set of nodes.

  • So in this particular picture, you can see three nodes.

  • Each of them is a worker node.

  • And what TensorFlow requires in order for this to work

  • is that each of these nodes has an environment variable called

  • TF_CONFIG defined.

  • So every single node that you have in your cluster

  • needs to have this variable defined.

  • And in this TF_CONFIG, you have two parts: first of all,

  • the cluster part, which defines all

  • of the hosts that participate in the distributed training,

  • all the nodes in your cluster.

  • And the second one is really to specify

  • who am I. What is my identity within this cluster?

  • So you can see the task index here is 0,

  • so this worker is host 1 at its port.

  • An index of 1

  • means host 2 at its port, and 2 means

  • host 3 at its port.

  • So that's how you need to configure your cluster in order

  • to do this.
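
For illustration, a TF_CONFIG for a three-worker cluster might be set like this (host names and ports are made up):

```python
import json
import os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        # Every node in the cluster, so all workers agree on the membership.
        "worker": ["host1:5000", "host2:5000", "host3:5000"]
    },
    # "Who am I": this particular node is worker 0, i.e. host1:5000.
    "task": {"type": "worker", "index": 0}
})
```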

  • So it is really cumbersome to go around to all

  • of the nodes and actually provide the specific

  • configuration.

  • So how do you configure this?

  • Kubernetes provides an excellent way

  • of doing that through its deployment configuration,

  • the yaml file, so you can actually

  • distribute the configuration, the environment variables

  • to set on the respective nodes.

  • So how do we integrate that with TensorFlow?

  • Well, it's part of the initial support.

  • And this is just one way of doing it.

  • There are multiple ways, but this

  • is one way that we've tested.

  • You can use a template engine called Jinja.

  • And you create a file called a Jinja file,

  • and there is actually such a file

  • available in the tensorflow/ecosystem

  • repository (note, not the TensorFlow repository itself).

  • This is the ecosystem.

  • There will be a directory under that repository called

  • distribution_strategy that

  • contains useful functions to use with distribution strategies.

  • So you can use this file as a template

  • in order to automatically generate

  • the deployment dot yaml for the Kubernetes cluster.

  • So what would that look like for a configuration like this

  • where we have three nodes?

  • Well, it's really, really simple.

  • The only thing you need to do in this file--

  • the Jinja file-- is the highlighted configuration

  • up here.

  • You set the worker replicas to three nodes.

  • The rest is just code that you keep for all of the executions

  • you setup to do.

  • Make sense?

  • So this is actually a macro that populates TF_CONFIG based

  • on this parameter up here.

  • So that's very simple, but what about the code?

  • We've now configured the Kubernetes cluster

  • to be able to do this distributed

  • training with TensorFlow, but there are also

  • some stuff we need to do with the code

  • as we had for the single node as well.

  • So it's approximately the same as for the single-node,

  • multi-GPU configuration.

  • So this is the estimator lingo.

  • So I provide a config here.

  • You see the run config?

  • It's just a standard estimator construct.

  • And I set the train_distribute parameter to TF

  • dot contrib dot distribute collective AllReduce strategy, so not

  • mirrored strategy for multi-node configuration.

  • It's collective AllReduce strategy.

  • And then I specify the number of GPUs

  • I have available for each of these workers that I

  • have in my cluster.

  • And that's it.

  • Given that I have that config object,

  • I can just put that as part of the config parameter

  • when I do the conversion from Keras over to an estimator.

  • And I now have

  • multi-node training, with multiple GPUs in each of the nodes,

  • configured for TensorFlow.
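
A hedged sketch of that multi-worker estimator setup, using the contrib names mentioned in the talk (exact module paths and constructor arguments may differ by release; model is the Keras model from the earlier sketch):

```python
import tensorflow as tf

# Collective all-reduce across workers; 4 GPUs available on each worker.
distribution = tf.contrib.distribute.CollectiveAllReduceStrategy(
    num_gpus_per_worker=4)

config = tf.estimator.RunConfig(train_distribute=distribution)

# Convert the Keras model into an estimator that can train multi-node.
estimator = tf.keras.estimator.model_to_estimator(keras_model=model,
                                                  config=config)
```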

  • And so let's look at this collective AllReduce strategy

  • because that's something different than what

  • we talked about previously with a mirrored strategy.

  • So what is that thing?

  • Well, it is specifically designed for multiple worker

  • nodes.

  • And it's essentially based on mirrored strategy,

  • but it adds functionality in order

  • to deal with multi-host or multi-workers in my cluster.

  • And the good thing about this is that it automatically

  • selects the best algorithm for doing

  • the AllReduce across this cluster.

  • So what does that mean?

  • What kind of algorithms do we have

  • for doing AllReduce in a multi-node configuration?

  • Well, one of them is very simple,

  • very similar to what we have for a single node, which

  • is ring all-reduce, in which case the gradients

  • just travel across the nodes,

  • and the devices perform one overall ring-reduce across multiple hosts

  • and GPUs.

  • So essentially the same as for single node.

  • It's just that they are traversing hosts

  • with all of the penalties, of course, associated

  • with doing that, depending on the interconnect

  • between these hosts.

  • Another algorithm is hierarchical all-reduce.

  • I think that's a really complicated English word.

  • And what happens here is that we essentially

  • pass all of the variables up to a single GPU card

  • on the respective hosts.

  • See that?

  • The slide is missing an arrow or two over

  • here.

  • Never mind that.

  • They're all supposed to send this stuff to GPU 0 and GPU 1.

  • And then we do an AllReduce across the nodes there.

  • And the GPUs performing that operation then

  • propagate the result back to the individual GPUs

  • within their own node.

  • So depending on network and other characteristics

  • of your setup and hardware, one of these solutions

  • would work very well.

  • And the thing with collective AllReduce strategy

  • is that it will automatically detect the best algorithm

  • to use in your distributed cluster.

  • So that was multi-node, multi-accelerator cards

  • within the nodes.

  • There are also other ways to scale to multiple nodes

  • with TensorFlow.

  • And one of them-- how many of you

  • are familiar with parameter server strategy?

  • Parameter servers?

  • This is the classical way of how you do TensorFlow distributed

  • training.

  • And eventually, this way,

  • the classical way, is something you should not continue to do.

  • Once we roll out

  • distribution strategies, that's the way to go.

  • So what I'm describing here is essentially the parameter

  • server strategy, but instead of describing it

  • in the old classical way of doing TensorFlow,

  • I'm going to describe how to do it

  • with distribution strategies.

  • Does that make sense?

  • Yeah.

  • If you didn't understand that and you haven't used the old way,

  • just don't worry about it.

  • Just listen to what I have to say here.

  • To recap what the parameter server strategy is,

  • it's essentially a strategy where we have shared storage.

  • We have a number of worker nodes,

  • and they're working on batches from that shared storage.

  • They're working completely independently.

  • Well, not completely, as we'll see shortly.

  • But they are working independently

  • calculating gradients based on batches.

  • And then we have a number of parameter servers.

  • So these workers, when they are finished with the batch,

  • they send it up to the parameter servers.

  • The parameter servers, they have the updates

  • from the other workers, so they calculate

  • the average of the gradients and then pass

  • all of those variables down to the workers.

  • So it's not synchronous.

  • These workers will get updates

  • on the variables in an asynchronous fashion, which

  • has good sides and bad sides.

  • The good side is that one worker can go down,

  • and the other workers can still execute as normal.

  • That's the way this works.

  • So how can we set this up in a distributed strategy cluster?

  • Well, it's real easy.

  • Instead of just specifying the worker replicas

  • in the Jinja file, we also specify

  • the ps_replicas parameter.

  • So that's the number of parameter servers

  • that we have in our Kubernetes cluster.

  • So that is the Kubernetes setup.

  • Now what about the code?

  • So that's also really easy.

  • You saw the run config--

  • the config parameter previously.

  • Instead of using the collective AllReduce strategy--

  • I got that right this time-- collective AllReduce strategy,

  • you use the parameter server strategy.

  • See that?

  • So it's just another type there.

  • You still specify the number of GPUs per worker,

  • you pass the config object to the

  • Keras model to estimator function

  • call, and you're all done.
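
The same sketch as before, with the parameter server strategy swapped in (again, the contrib paths and arguments are assumptions based on the talk):

```python
import tensorflow as tf

# Asynchronous training with parameter servers instead of all-reduce.
distribution = tf.contrib.distribute.ParameterServerStrategy(
    num_gpus_per_worker=4)

config = tf.estimator.RunConfig(train_distribute=distribution)
estimator = tf.keras.estimator.model_to_estimator(keras_model=model,
                                                  config=config)
```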

  • So very, very few lines of code need

  • changing, even though we're talking

  • about a massively different way of doing distributed

  • TensorFlow training.

  • There is one more configuration that we are working on.

  • I think we will have a release of this in 1.11, at least

  • one you can try out.

  • That is a really, really cool setup

  • where you actually run distributed training

  • from your laptop.

  • And in this particular case, you have all of your model training

  • code here.

  • So forget about parameter servers for a moment.

  • Now we're back to multiple workers and AllReduce here.

  • The only thing you fire up on these workers

  • is the tf_std_server.py script

  • or whatever variant of that you want

  • to use because this code is available also

  • in the TensorFlow ecosystem repository.

  • So you can go check it out how we did it

  • for this normal setup, and you can change it

  • to whatever way you want.

  • The thing is that this script and the installation

  • on the workers don't have the model program at all.

  • So when we fire up the model training from our laptop

  • or workstation here, it will distribute that model over

  • to those.

  • So if you have any changes to your model code,

  • you can just make it locally, and it will automatically

  • distribute that out to all of the workers.

  • Now you may say, oh, that's a hassle because now I've

  • got to install this script on all the workers.

  • And you do not have to do that because the only thing you do

  • is just specify the script parameter in the Jinja file

  • that you've seen a couple of times now--

  • and we have the same number of workers here--

  • and that means that the scripts will actually

  • start on all of these nodes.

  • So what we're talking about here is the capability

  • to fire up a Kubernetes cluster with an arbitrary

  • number of nodes.

  • Without any installation of code,

  • you can use a local laptop, and it will automatically

  • distribute the model and the training

  • to all of these worker nodes just

  • by having these two lines here.

  • What about the code?

  • So again, we have the run config here.

  • And this time, we're going to set

  • a parameter called experimental_distribute

  • to the distribute config.

  • And as part of distribute config,

  • we are going to embed a collective AllReduce

  • strategy with, as we saw before, the number of GPUs

  • we have per worker.

  • But the distributed config requires one more parameter,

  • and that is the remote cluster.

  • The master node here

  • needs to know the cluster to which it should

  • send all the model code, to those servers that

  • are waiting there for the model code to be shared.

  • Make sense?

  • So you've got to specify that parameter.

  • Then you finish up your config object

  • and pass it to model to estimator as the config argument.

  • And as you've seen before, it's just

  • a couple of lines of difference between

  • these different configurations.
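
A hedged sketch of that standalone-client configuration (DistributeConfig, experimental_distribute, and remote_cluster follow the talk's description of the TF 1.x contrib API; the cluster spec is made up):

```python
import tensorflow as tf

distribution = tf.contrib.distribute.CollectiveAllReduceStrategy(
    num_gpus_per_worker=4)

config = tf.estimator.RunConfig(
    experimental_distribute=tf.contrib.distribute.DistributeConfig(
        train_distribute=distribution,
        # The cluster of remote workers that the model code is sent to.
        remote_cluster={"worker": ["host1:5000", "host2:5000", "host3:5000"]}))

estimator = tf.keras.estimator.model_to_estimator(keras_model=model,
                                                  config=config)
```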

  • That's really it for TensorFlow multi-node training.

  • So let's summarize what we've talked about here today.

  • First of all, we went through the single node distribution

  • strategy setup.

  • We talked about the mirrored strategy for multiple GPUs

  • within a single node, and we talked about the TPU strategy

  • to distribute work to the TPUs.

  • We also went through the AllReduce algorithm,

  • which is used by distribution strategy

  • to be able to do this single node distribution.

  • Then we talked about multi-node distribution,

  • talked about using Kubernetes to distribute TensorFlow training

  • using these Jinja files that compile, or translate, over

  • to the yaml file for deployment.

  • We talked about the AllReduce using

  • collective AllReduce strategy.

  • We talked about the parameter server setup

  • with distribution strategy.

  • And then finally we talked about distributed training

  • from a standalone client distributing the model code

  • over to the workers.

  • Work in progress.

  • So most of the stuff that we talked about today, you'll

  • find under TensorFlow dot contrib dot distribute, the distribution strategy module;

  • you'll find that in the TensorFlow repository.

  • But as part of 1.11, many of these things

  • that we talked about you will be able to start to use.

  • If you really want to try it out, you can also check out the nightly builds

  • and see how far we've gone.

  • But 1.11 should be out shortly.

  • We are working on performance still.

  • This is always going to be something

  • that we're going to work on to match state of the art.

  • As you saw with single node multi-GPU,

  • we've achieved 90% to 95% scaling

  • performance on the GPU cards.

  • We're continuously working on trying

  • to improve this for all of the different configurations

  • we've talked about.

  • TPU strategy will be available as part of 1.12 with Keras.

  • Right now it's only available in estimator.

  • But remember we have the estimator trick, model

  • to estimator within Keras, so you can actually

  • take a Keras model, convert to an estimator,

  • and still use TPU strategy; that will be part of 1.11.

  • Multi-worker GPU support is something we're also

  • working on as I said.

  • So that means that in

  • native Keras code, we can actually

  • specify multi-worker and GPU support and also

  • Eager execution.

  • How many of you are familiar with Eager execution?

  • Got to check that out.

  • That's a really important feature of TensorFlow.

  • So if you're not using Eager, you

  • should definitely stop using anything else

  • and start using Eager.

  • The entire getting started experience of TensorFlow

  • is based on Eager mode, and we will

  • have great performance bridges between Eager execution

  • and graph mode execution and all of this distribution.

  • So the entire architecture builds on this,

  • so you should check it out.

  • Eager execution is also something

  • we're working on so you can directly in Eager execution

  • mode utilize multiple GPU cards in multiple nodes

  • in the same way that we discussed in the setup.

  • And then when we have multi-worker GPUs,

  • obviously if one fails, and we're talking

  • about these AllReduce synchronous gradient updates,

  • we do have a question of fault tolerance.

  • So that's something we're looking into

  • to build into this, so we have more resilience with respect

  • to faults.

  • So another summary, what did we talk about today?

  • We talked about the distribution API, which is very easy to use.

  • It's the new way of doing distributed TensorFlow

  • training.

  • Forget about anything that you did before.

  • Start to learn about how to do this.

  • We talked about distribution strategies

  • having great performance right out of the box.

  • We saw the scaling between 1 and 8 GPUs

  • on a Kubernetes cluster.

  • And then we looked at how it can scale across GPUs--

  • different accelerators, GPUs as well as TPUs,

  • single node as well as multi-node and TPU pod.

  • And that's it.

  • You should definitely take a picture of this slide

  • because this slide summarizes all of the resources

  • that we had.

  • And with that, we are done.

  • Thank you very much for listening to this.

  • If you have any questions--
