Mesh-TensorFlow: Model Parallelism for Supercomputers (TF Dev Summit '19)

  • [MUSIC PLAYING]

  • NOAM SHAZEER: Hi.

  • My name is Noam Shazeer.

  • I'm going to tell you about Mesh-TensorFlow,

  • which is a system we built for training giant models on TPU

  • pods.

  • So we're going to talk about data-parallelism

  • and model-parallelism, and also about why

  • I want to train giant models and why we need

  • model parallelism to do it.

  • Then I'll tell you about the Mesh-TensorFlow system.

  • So first data-parallelism-- this is how, roughly, everybody

  • trains neural networks if you have distributed hardware.

  • This is what we use on TPU pods at Google.

  • And generally, you put your entire model on every device,

  • split up the training batch into a lot of little chunks,

  • one on each device.

  • Run it.

  • Then add up all the gradients on the parameters

  • across all the devices and do your update.
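
To make the pattern above concrete, here is a minimal sketch of synchronous data-parallel training, simulated in plain NumPy on one machine. The `allreduce` helper and the tiny linear model are illustrative stand-ins, not part of any real distributed runtime.

```python
import numpy as np

def allreduce(values):
    # Stand-in for the collective: sum every device's contribution and
    # hand the identical result back to each device.
    total = sum(values)
    return [total] * len(values)

def grad(w, x, y):
    # Gradient of mean squared error for a toy linear model x @ w.
    return 2.0 * x.T @ (x @ w - y) / len(x)

num_devices = 4
rng = np.random.default_rng(0)

# Every device holds a full, replicated copy of the parameters.
w_init = rng.normal(size=(8, 1))
params = [w_init.copy() for _ in range(num_devices)]

for step in range(10):
    x = rng.normal(size=(32, 8))
    y = rng.normal(size=(32, 1))
    # Split the training batch into chunks, one per device.
    xs, ys = np.array_split(x, num_devices), np.array_split(y, num_devices)
    # Each device runs forward/backward on its own chunk...
    grads = [grad(params[d], xs[d], ys[d]) for d in range(num_devices)]
    # ...then gradients are summed across devices (the allreduce)
    # and every device applies the same update.
    for d, g in enumerate(allreduce(grads)):
        params[d] = params[d] - 0.01 * g / num_devices
```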

  • This works really well.

  • The communication is fast on lots of different types of networks.

  • So this is roughly what everyone's doing.

  • The great things about it are it's universal.

  • You can use any model architecture.

  • It's fast to compile because you're writing SPMD code.

  • You're writing code for what one device is doing,

  • and then you can compile it and send it

  • to every device which is using the same code.

  • And you get roughly full utilization

  • because every device is doing roughly the same thing.

  • So if they're all similar, then nobody

  • is waiting for anybody else.

  • And the communication is fast.

  • The only problem is that you cannot train giant models.

  • Because your entire model has to fit on every device.

  • So why do I want to train giant models?

  • Well, I like working on language modeling.

  • There are lots of important applications, machine

  • translation, question answering, dialogue, sentiment analysis,

  • lots of interesting things to do with language.

  • And we find that quality improves with model size.

  • So a bigger model tends to know more about the world,

  • understand things better, and give you

  • overall better results.

  • There's plenty of data out there to train a giant model.

  • Just download the text of the web, common crawl, whatever,

  • and you've got billions to trillions

  • of words of training data.

  • And in fact, you can train one big model

  • and then fine tune it to do lots of different things.

  • There's been a lot of research--

  • at OpenAI, and BERT at Google, on transfer learning and language.

  • So it's a great candidate for building giant models.

  • Now, as an example, I trained a transformer language model

  • with roughly 100 million parameters

  • on the text of Wikipedia.

  • The Abraham Lincoln article was in the dev set,

  • in the held out set, and I told it to generate a random Abraham

  • Lincoln article.

  • And it looks roughly grammatical.

  • It remembers somehow that he's a politician,

  • American politician.

  • There's plenty it doesn't know about the world,

  • like who Abraham Lincoln was, or that America

  • doesn't have a prime minister, and lots of other stuff.

  • But if you make a similar model,

  • just bigger-- here it is with five billion parameters instead

  • of 100 million.

  • And now it seems to have picked up a lot more about Abraham

  • Lincoln.

  • Roughly half of that stuff is correct.

  • But you know--

  • [LAUGHTER]

  • --mostly it's fake news.

  • But there are more important applications out there

  • than generating fake news.

  • But this is just a nice demonstration

  • that model size is important.

  • What would a model look like with a trillion parameters?

  • We have not done that yet.

  • But we hope to do that soon.

  • OK.

  • So if all the parameters will not fit on one core,

  • we need to do something called model-parallelism, which

  • means that we're splitting the model itself

  • between different devices.

  • And that should let us train really large models,

  • and it should also be very good for inference latency,

  • because now the computation for one example

  • can be split across multiple devices.

  • The problem is it's very tricky to design

  • these kinds of algorithms.

  • How do people tend to do it now?

  • Well, you use device placement.

  • You say this operation is going on this device.

  • This operation is going on that device.

  • TensorFlow makes it easy to do that.
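
For reference, "device placement" in graph-mode (1.x-era) TensorFlow looks roughly like the sketch below; the device strings and layer sizes are only illustrative.

```python
import tensorflow as tf  # TensorFlow 1.x graph-mode style

x = tf.placeholder(tf.float32, shape=[None, 1024])

# Pin each piece of the model to a particular device by hand.
with tf.device("/device:GPU:0"):
    h1 = tf.layers.dense(x, 4096, activation=tf.nn.relu)

with tf.device("/device:GPU:1"):
    h2 = tf.layers.dense(h1, 4096, activation=tf.nn.relu)

with tf.device("/device:GPU:2"):
    logits = tf.layers.dense(h2, 32000)  # e.g. the softmax layer lives elsewhere
```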

  • Still, it's tricky to design an efficient algorithm.

  • And you end up with a giant graph,

  • if you're generating enough operations

  • to go on 2,000 different cores.

  • Here's an example of some model-parallelism

  • by device placement from Google Neural Machine

  • Translation.

  • They had eight LSTMs, which they distributed

  • across eight different GPUs.

  • The softmax layer they put somewhere else.

  • And you need some kind of interesting pipelining

  • to keep all the GPUs busy.

  • It works, but it's a lot of work to get this thing to work right.

  • Now we're going to take a totally different approach,

  • which is going to be inspired by what works well

  • about synchronous data-parallelism.

  • So we will have every processor involved in every operation.

  • We're going to use SPMD style programming, where

  • you'll have the same program on every device.

  • And it's going to use collective communication, like allreduce,

  • just like data-parallelism.

  • And our library for doing this is called Mesh-TensorFlow.

  • We should be able to implement data-parallelism,

  • model-parallelism.

  • We should be able to split in different dimensions--

  • like splitting an image or video spatially--

  • or any sort of combination of these things.

  • And we're targeting hardware where

  • you have a homogeneous set of similar processors,

  • ideally well-connected like a TPU pod.

  • We've got these two dimensional supercomputers at Google

  • that we've been using.

  • And you're going to view your set of processors

  • as an n-dimensional mesh.

  • It doesn't have to correspond to a physical n-dimensional mesh.

  • You could view a two dimensional mesh of processors

  • as a one dimensional mesh.

  • But, of course, performance will depend on those considerations.

  • So how does this all work?

  • Well, in data-parallelism, you can view it

  • as splitting the batch dimension of the computation across all

  • your processors.

  • So any tensor that has a batch dimension

  • is going to be split across all of the processors.

  • And any tensor that does not have a batch dimension, meaning

  • the parameters, gets fully replicated.

  • Now we're going to do the same thing,

  • but for model-parallelism, we will choose

  • different dimensions to split.

  • So maybe dimensions representing the sizes of hidden layers--

  • and we will decide to split those dimensions

  • across the set of processors.

  • And the communication will happen--

  • usually, an operation will not involve communication.

  • But some operations will involve collective communication,

  • like allreduce-- particularly, when you're

  • reducing out split dimensions.

  • And this is going to be somewhat similar to how

  • things work in synchronous data-parallelism.

  • So let's do an example, a simple three layer neural network.

  • Input layer X, hidden layer H, output layer

  • Y, and we have two weight matrices,

  • W and V. The data-parallel way to do

  • this is that we are going to split anything with a batch

  • dimension, meaning the activations X, H, and Y.

  • So all of those tensors are split evenly across processors.

  • And each tensor that does not have a batch dimension, W

  • and V, will be replicated across every processor.

  • So here it's showing what Processor 0 is doing,

  • what Processor 1 is doing.

  • They're both doing something roughly similar,

  • except they have different halves of the activation.

  • You don't see any communication in the forward pass.

  • But if you were to see the backward pass where you're

  • computing the parameter gradients,

  • you would see some matmuls where the split dimension b gets

  • reduced out, and there would be some allreduces in there.
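
As a hedged illustration of that layout, here is the three-layer example simulated with NumPy and two "processors": X, H, and Y are split along the batch dimension, W and V are replicated, the forward pass needs no communication, and the partial parameter gradients get summed, standing in for the allreduce.

```python
import numpy as np

rng = np.random.default_rng(0)
b, d_x, d_h, d_y = 8, 4, 16, 4           # batch, input, hidden, output sizes

X = rng.normal(size=(b, d_x))
W = rng.normal(size=(d_x, d_h))           # replicated on every processor
V = rng.normal(size=(d_h, d_y))           # replicated on every processor

# Split everything that has a batch dimension across the 2 processors.
X0, X1 = np.split(X, 2)

# Forward pass: no communication, each processor works on its own shard.
H0, H1 = np.maximum(X0 @ W, 0), np.maximum(X1 @ W, 0)
Y0, Y1 = H0 @ V, H1 @ V

# Backward pass (sketch): the matmul producing dV reduces out the split
# batch dimension, so the partial gradients must be allreduced (summed).
dY0, dY1 = np.ones_like(Y0), np.ones_like(Y1)   # pretend upstream gradients
dV = H0.T @ dY0 + H1.T @ dY1                    # the allreduce across processors
```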

  • So now, instead of splitting the batch dimension b,

  • let's split the size of the hidden layer, dimension H.

  • So we do that.

  • And now X and Y, the input and output, are fully replicated.

  • Because they do not have an H dimension.

  • But the hidden layer H is split because it

  • does have an H dimension.

  • And the parameter matrices, W and V,

  • also have an H dimension.

  • So they're split.

  • So again, you have a parallel computation on the two

  • processors, and you see an allreduce communication

  • when you're computing Y because we're reducing out

  • the split dimension H.
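
And here is the same toy network with the hidden dimension split instead, again simulated with NumPy: X and Y are replicated, W and V are split along the hidden dimension, and computing Y requires summing the per-processor partial results, which stands in for the allreduce.

```python
import numpy as np

rng = np.random.default_rng(0)
b, d_x, d_h, d_y = 8, 4, 16, 4

X = rng.normal(size=(b, d_x))             # replicated: no hidden dimension
W = rng.normal(size=(d_x, d_h))
V = rng.normal(size=(d_h, d_y))

# Split the tensors that have a hidden dimension across the 2 processors.
W0, W1 = np.split(W, 2, axis=1)           # columns of W
V0, V1 = np.split(V, 2, axis=0)           # rows of V

# Each processor computes its shard of H with no communication.
H0, H1 = np.maximum(X @ W0, 0), np.maximum(X @ W1, 0)

# Computing Y reduces out the split hidden dimension, so each processor
# produces a partial Y and the partials are allreduced (summed).
Y = H0 @ V0 + H1 @ V1

# Same answer as the unsplit computation.
assert np.allclose(Y, np.maximum(X @ W, 0) @ V)
```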

  • We didn't have to split H. Instead,

  • we could have split the dimension of X and Y.

  • So in that case, you would have a different pattern of which

  • tensors are split and which ones are replicated.

  • And you'd have communication in different places.

  • And if you want to get really fancy,

  • let's do data-parallelism and model-parallelism at once.

  • We're going to split dimension b,

  • the batch across one axis of our two dimensional supercomputer.

  • And we're going to split the hidden layer

  • H across the other axis of our mesh of processors.

  • So now, we have different tensors being either split

  • in one dimension and replicated in the other,

  • or since tensor H has both of those dimensions,

  • it ends up tiled among all the processors.

  • And there are going to be allreduce

  • communications in there, but not across all the processors.

  • They'll be partitioned allreduces, just

  • across rows or just across columns.
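
To sketch the combined case, the snippet below simulates a 2x2 mesh in NumPy: the batch is split across mesh rows, the hidden dimension across mesh columns, H ends up tiled across all four processors, and the allreduce for Y only sums across the two processors in the same row.

```python
import numpy as np

rng = np.random.default_rng(0)
b, d_x, d_h, d_y = 8, 4, 16, 4

X = rng.normal(size=(b, d_x))
W = rng.normal(size=(d_x, d_h))
V = rng.normal(size=(d_h, d_y))

X_rows = np.split(X, 2)                   # batch split across mesh rows
W_cols = np.split(W, 2, axis=1)           # hidden split across mesh columns
V_rows = np.split(V, 2, axis=0)

Y_rows = []
for r in range(2):
    # Processor (r, c) holds one tile of H: its batch shard and its
    # hidden shard, so H is tiled across all four processors.
    partials = [np.maximum(X_rows[r] @ W_cols[c], 0) @ V_rows[c]
                for c in range(2)]
    # Partitioned allreduce: sum only across the columns within this row.
    Y_rows.append(sum(partials))

Y = np.concatenate(Y_rows)                # Y is split along the batch only
assert np.allclose(Y, np.maximum(X @ W, 0) @ V)
```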

  • So the general case is to give all of the tensor dimensions names.

  • And we define the layout of the computation

  • as a map from tensor dimensions to mesh dimensions,

  • saying which tensor dimensions get split

  • across which mesh dimensions.

  • For example, in the previous slide

  • we had the batch tensor dimension

  • split across processor rows, and the hidden-size dimension

  • split across processor columns.
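
In that notation, the layout from the slide could be written as a list of (tensor-dimension, mesh-dimension) pairs; the names and mesh sizes below are illustrative.

```python
# A 2D mesh of 32 processors, and a layout mapping tensor dimensions
# to mesh dimensions (names and sizes are illustrative).
mesh_shape = [("rows", 4), ("cols", 8)]
layout_rules = [("batch", "rows"),        # batch dimension split across rows
                ("hidden", "cols")]       # hidden-size dimension across columns
```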

  • We did this for our transformer machine translation /

  • language model, and here are

  • the layouts we used for data-parallelism,

  • model-parallelism, and combined parallelism.

  • For the model-parallelism, it works

  • to split the size of the vocabulary,

  • the size of the feed forward hidden layer,

  • and the number of attention heads.

  • And if you do that, you end up splitting up

  • all of your communication very nicely

  • and get a nice model parallel algorithm.

  • And that can also be combined with data-parallelism

  • by splitting the batch across the other dimension

  • of the supercomputer.
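
A hedged sketch of what such a combined Transformer layout could look like in the same notation; the dimension names here are assumptions for illustration, not the exact ones used in the talk.

```python
# Combined data- and model-parallelism for a Transformer on a 2D mesh.
# Dimension names are illustrative, not the exact ones from the talk.
mesh_shape = [("data_axis", 16), ("model_axis", 8)]
layout_rules = [
    ("batch", "data_axis"),      # data-parallelism
    ("vocab", "model_axis"),     # split the vocabulary
    ("d_ff", "model_axis"),      # split the feed-forward hidden layer
    ("heads", "model_axis"),     # split the attention heads
]
```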

  • Now, picking a good layout is, for now, something

  • that you need a well-trained human to do.

  • You need to make sure that all of your expensive operations

  • are split.

  • You're not allowed to split two dimensions of the same tensor

  • across the same dimension of the mesh.

  • And depending on what dimensions you chop up,

  • it may result in more or less communication.

  • So, for example, data-parallelism-- you'd like the batch to be big.

  • Otherwise, you're going to get a lot of communication.

  • Similarly, if you're splitting up a hidden layer,

  • you might want that layer to be really big.

  • So how do you use Mesh-TensorFlow?

  • Well, download our open source repository.

  • You build a graph in Python, much like a regular TensorFlow

  • graph except that you're using named dimensions.

  • You define what your mesh is and how it maps

  • to your physical processors.

  • You define your layout of what gets split across what.

  • And then Mesh-TensorFlow turns your Mesh-TensorFlow graph

  • into part of a TensorFlow graph.

  • And you still use TensorFlow for anything else

  • you want to use it for, like the data pipelines and everything

  • else.
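
Put together, using the library looks roughly like the sketch below, which follows the pattern in the Mesh-TensorFlow README at the time (a data-parallel layout on four GPUs via the placement backend); exact function names and arguments may have changed, so treat it as a sketch rather than a reference.

```python
import mesh_tensorflow as mtf
import tensorflow as tf  # TensorFlow 1.x

# Build a Mesh-TensorFlow graph with named dimensions.
graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")
batch = mtf.Dimension("batch", 128)
io = mtf.Dimension("io", 512)
hidden = mtf.Dimension("hidden", 4096)

tf_x = tf.random.normal([128, 512])
x = mtf.import_tf_tensor(mesh, tf_x, shape=[batch, io])
w = mtf.get_variable(mesh, "w", [io, hidden])
v = mtf.get_variable(mesh, "v", [hidden, io])
h = mtf.relu(mtf.einsum([x, w], output_shape=[batch, hidden]))
y = mtf.einsum([h, v], output_shape=[batch, io])

# Describe the mesh of processors and the layout: which tensor
# dimensions get split across which mesh dimensions.
devices = ["gpu:0", "gpu:1", "gpu:2", "gpu:3"]
mesh_shape = [("all", 4)]
layout_rules = [("batch", "all")]         # plain data-parallelism
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    mesh_shape, layout_rules, devices)

# Lower the Mesh-TensorFlow graph into an ordinary TensorFlow graph.
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
tf_y = lowering.export_to_tf_tensor(y)
```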

  • So far we've trained transformer models with up to five

  • billion parameters on entire TPU pods,

  • getting good performance out of the thing.

  • And these giant models give state

  • of the art quality on some benchmark tasks,

  • like language modeling and machine translation,

  • not surprisingly.

  • Bigger models are better.

  • Lots of people are finding that out.

  • And in the future, we would like to try even bigger models,

  • I think, with some well-placed sparsity.

  • We would have the computation to train models

  • with a trillion parameters.

  • We've tried up to a couple hundred billion for now.

  • And it runs.

  • So next thing is to see if we can get a trillion parameter

  • model to run and give us great quality.

  • And this would be useful for other things,

  • like low-latency inference and situations

  • where you have giant inputs that you want to process.

  • And for now, what works?

  • Well, we're emitting SPMD code for TPU, including Cloud TPU.

  • So this runs nicely on TPU pods.

  • And for CPU and GPU, it's still emitting the old-fashioned

  • device placement code.

  • So it runs, but it's not as scalable.

  • Everything is out there on GitHub

  • and runs with TensorFlow 1-- not yet with TensorFlow 2.

  • And then, in the future, we want to use this

  • for different types of models.

  • It would be great to automate the process of choosing

  • a distributed layout.

  • Because then you wouldn't need to know much about Mesh-TensorFlow,

  • and it would just figure out how to distribute your computation

  • for you.

  • And we welcome contributions to the open source code

  • or contact us.

  • And I'd like to thank all of my collaborators,

  • the authors of our paper.

  • I'd also like to thank the TensorFlow and XLA

  • teams for a lot of technical support

  • and help with all of this, implementing what

  • we needed to be implemented.

  • And everything's out there in our open source repository.

  • Thank you.

  • [APPLAUSE]

  • [MUSIC PLAYING]
