[MUSIC PLAYING]

NOAM SHAZEER: Hi. My name is Noam Shazeer. I'm going to tell you about Mesh-TensorFlow, which is a system we built for training giant models on TPU pods. So we're going to talk about data-parallelism and model-parallelism, and also about why I want to train giant models and why we need model-parallelism to do it. Then I'll tell you about the Mesh-TensorFlow system.

So first, data-parallelism-- this is how, roughly, everybody trains neural networks if you have distributed hardware. This is what we use on TPU pods at Google. Generally, you put your entire model on every device, split the training batch into a lot of little chunks, one on each device, and run it. Then you add up the gradients on the parameters across all the devices and do your update. This works really well. The communication is fast on lots of different types of networks. So this is roughly what everyone's doing. The great things about it are that it's universal-- you can use any model architecture. It's fast to compile because you're writing SPMD code: you write the code for what one device is doing, and then you can compile it and send it to every device, which all run the same code. And you get roughly full utilization, because every device is doing roughly the same thing, so nobody is waiting for anybody else. And the communication is fast. (A minimal sketch of this pattern appears further below.) The only problem is that you cannot train giant models, because your entire model has to fit on every device.

So why do I want to train giant models? Well, I like working on language modeling. There are lots of important applications-- machine translation, question answering, dialogue, sentiment analysis, lots of interesting things to do with language. And we find that quality improves with model size. A bigger model tends to know more about the world, understand things better, and give you overall better results. There's plenty of data out there to train a giant model-- just download the text of the web, Common Crawl, whatever, and you've got billions to trillions of words of training data. And in fact, you can train one big model and then fine-tune it to do lots of different things. There's been a lot of research-- at OpenAI, and with BERT at Google-- on transfer learning and language. So it's a great candidate for building giant models.

Now, as an example, I trained a transformer language model with roughly 100 million parameters on the text of Wikipedia. The Abraham Lincoln article was in the dev set, the held-out set, and I told the model to generate a random Abraham Lincoln article. It looks roughly grammatical. It remembers somehow that he's a politician, an American politician. There's plenty it doesn't know about the world, like who Abraham Lincoln was, or that America doesn't have a prime minister, and lots of other stuff. But if you make a similar model, just bigger-- here it is with five billion parameters instead of 100 million-- now it seems to have picked up a lot more about Abraham Lincoln. Roughly half of that stuff is correct. But you know--

[LAUGHTER]

--mostly it's fake news. There are more important applications out there than generating fake news, but this is a nice demonstration that model size is important. What would a model look like with a trillion parameters? We have not done that yet. But we hope to do that soon. OK.
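[Editor's note: to make the data-parallel recipe above concrete, here is a minimal sketch in plain NumPy, simulating the devices with an ordinary Python loop. The model, sizes, and learning rate are illustrative assumptions, not part of the talk.]

```python
# A minimal sketch of synchronous data-parallelism (illustrative only):
# replicate the parameters, split the batch, and allreduce the gradients.
import numpy as np

num_devices, batch, d_in = 4, 64, 32
rng = np.random.default_rng(0)

# Every "device" holds a full replica of the parameters.
w = np.zeros((d_in, 1))
replicas = [w.copy() for _ in range(num_devices)]

# Split the training batch into one chunk per device.
x = rng.normal(size=(batch, d_in))
y = rng.normal(size=(batch, 1))
x_chunks, y_chunks = np.split(x, num_devices), np.split(y, num_devices)

# Each device computes gradients on its own chunk
# (here: a least-squares loss for a linear model).
local_grads = []
for w_rep, xc, yc in zip(replicas, x_chunks, y_chunks):
    err = xc @ w_rep - yc
    local_grads.append(2.0 * xc.T @ err / len(xc))

# "Allreduce": average the per-device gradients, then apply the identical
# update on every replica so all copies of the model stay in sync.
grad = sum(local_grads) / num_devices
replicas = [w_rep - 0.1 * grad for w_rep in replicas]
```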
So if all the parameters won't fit on one core, we need model-parallelism, which means splitting the model itself between different devices. That should let us train really large models, and it should also be very good for inference latency, because now the computation for one example can be split across multiple devices. The problem is that it's very tricky to design these kinds of algorithms.

How do people tend to do it now? Well, you use device placement. You say this operation goes on this device, that operation goes on that device. TensorFlow makes it easy to do that. Still, it's tricky to design an efficient algorithm, and you end up with a giant graph if you're generating enough operations to go on 2,000 different cores. Here's an example of model-parallelism by device placement from Google Neural Machine Translation. They had eight LSTM layers, which they distributed across eight different GPUs; the softmax layer they put somewhere else. And you need some kind of interesting pipelining to keep all the GPUs busy. It works, but it's a lot of work to get it to work right.

Now we're going to take a totally different approach, inspired by what works well about synchronous data-parallelism. We will have every processor involved in every operation. We're going to use SPMD-style programming, where you have the same program on every device, and it's going to use collective communication, like allreduce, just like data-parallelism. Our library for doing this is called Mesh-TensorFlow. We should be able to implement data-parallelism, model-parallelism, splitting in other dimensions-- like splitting an image or video spatially-- or any combination of these things. And we're targeting hardware where you have a homogeneous set of similar processors, ideally well-connected, like a TPU pod. We've got these two-dimensional supercomputers at Google that we've been using. You're going to view your set of processors as an n-dimensional mesh. It doesn't have to correspond to a physical n-dimensional mesh-- you could view a two-dimensional mesh of processors as a one-dimensional mesh-- but, of course, performance will depend on those considerations.

So how does this all work? Well, you can view data-parallelism as splitting the batch dimension of the computation across all your processors. Any tensor that has a batch dimension is split across all of the processors, and any tensor that does not have a batch dimension, meaning the parameters, gets fully replicated. Now we're going to do the same thing, but for model-parallelism we will choose different dimensions to split-- maybe dimensions representing the sizes of hidden layers-- and we will decide to split those dimensions across the set of processors. Usually, an operation will not involve communication, but some operations will involve collective communication, like allreduce-- particularly when you're reducing out split dimensions. And this is going to be somewhat similar to how things work in synchronous data-parallelism.

So let's do an example, a simple three-layer neural network: input layer X, hidden layer H, output layer Y, and two weight matrices, W and V. The data-parallel way to do this is that we split anything with a batch dimension, meaning the activations X, H, and Y. So all of those tensors are split evenly across processors.
And each tensor that does not have a batch dimension-- W and V-- will be replicated across every processor. So here it's showing what Processor 0 is doing and what Processor 1 is doing. They're both doing something roughly similar, except they have different halves of the activations. You don't see any communication in the forwards pass. But if you were to look at the backwards pass, where you're computing the parameter gradients, you would see some matmuls where the split dimension b gets reduced out, and there would be some allreduces in there.

Now, instead of splitting b, let's split the size of the hidden layer, dimension h. So we do that, and now X and Y, the input and output, are fully replicated, because they do not have an h dimension. But the hidden layer H is split, because it does have an h dimension. And the parameter matrices W and V also have an h dimension, so they're split. Again, you have a parallel computation on the two processors, and you see an allreduce communication when you're computing Y, because we're reducing out the split dimension h. We didn't have to split h. Instead, we could have split the dimension of X and Y. In that case, you would have a different pattern of which tensors are split and which ones are replicated, and you'd have communication in different places.

And if you want to get really fancy, let's do data-parallelism and model-parallelism at once. We're going to split dimension b, the batch, across one axis of our two-dimensional supercomputer, and we're going to split the hidden layer dimension h across the other axis of our mesh of processors. So now we have different tensors being either split in one dimension and replicated in the other, or-- since tensor H has both of those dimensions-- split across all the processors. And there are going to be allreduce communications in there, but not across all the processors. They'll be partitioned allreduces, just across rows or just across columns.

So the general case is: give all of the tensor dimensions names, and define the layout of the computation as a map from tensor dimensions to mesh dimensions, saying which tensor dimensions get split across which mesh dimensions. For example, in the previous slide we had the batch tensor dimension split across processor rows, and the hidden-size dimension split across processor columns. We did this to our transformer, a machine translation / language model, and here are the layouts we used for data-parallelism, model-parallelism, and combined parallelism. For the model-parallelism, it works to split the size of the vocabulary, the size of the feed-forward hidden layer, and the number of attention heads. If you do that, you end up splitting all of your computation very nicely and get a nice model-parallel algorithm. And that can be combined with data-parallelism by splitting the batch across the other dimension of the supercomputer.

Now, picking a good layout is, for now, something that you need a well-trained human to do. You need to make sure that all of your expensive operations are split. You're not allowed to split two dimensions of the same tensor across different dimensions of the mesh. And depending on which dimensions you chop up, you may end up with more or less communication. For example, with data-parallelism you'd like the batch to be big; otherwise, you're going to get a lot of communication. Similarly, if you're splitting up a hidden layer, you might want that layer to be really big.
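[Editor's note: here is a rough sketch of the three-layer example written against the open-source Mesh-TensorFlow API in TensorFlow 1.x style. It follows the project's README, but the dimension sizes, the 2x2 mesh, and the exact call signatures are assumptions and may differ between versions.]

```python
# A hedged sketch of the X -> H -> Y example with named dimensions and a
# combined data/model-parallel layout. Sizes are illustrative assumptions.
import tensorflow as tf
import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Named tensor dimensions: "batch" and "hidden" are the ones we will split.
batch_dim = mtf.Dimension("batch", 512)
io_dim = mtf.Dimension("io", 1024)         # the dimension shared by X and Y
hidden_dim = mtf.Dimension("hidden", 4096)

# Import an ordinary tf.Tensor into the mesh, attaching named dimensions.
tf_x = tf.zeros([512, 1024])
x = mtf.import_tf_tensor(mesh, tf_x, shape=[batch_dim, io_dim])

# Parameters W and V carry the "hidden" dimension, so the layout below splits them.
w = mtf.get_variable(mesh, "w", shape=[io_dim, hidden_dim])
v = mtf.get_variable(mesh, "v", shape=[hidden_dim, io_dim])

# H = relu(X W); Y = H V. Reducing out a split dimension inserts an allreduce.
h = mtf.relu(mtf.einsum([x, w], output_shape=[batch_dim, hidden_dim]))
y = mtf.einsum([h, v], output_shape=[batch_dim, io_dim])

# A 2x2 mesh of processors: split "batch" across rows (data-parallelism)
# and "hidden" across columns (model-parallelism).
mesh_shape = [("rows", 2), ("cols", 2)]
layout_rules = [("batch", "rows"), ("hidden", "cols")]
```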
So how do you use Mesh-TensorFlow? Well, download our open-source repository. You build a graph in Python, much like a regular TensorFlow graph, except that you're using named dimensions. You define what your mesh is and how it maps to your physical processors, and you define your layout of what gets split across what. Then Mesh-TensorFlow turns your Mesh-TensorFlow graph into part of a TensorFlow graph, and you still use TensorFlow for anything else you want to use it for, like the data pipelines and everything else. (A sketch of this workflow appears at the end of this transcript.)

So far we've trained transformer models with up to five billion parameters on entire TPU pods, getting good performance out of them. These giant models give state-of-the-art quality on some benchmark tasks, like language modeling and machine translation-- not surprisingly. Bigger models are better; lots of people are finding that out. In the future, we would like to try even bigger models, I think with some well-placed sparsity. We would have the computation to train models with a trillion parameters. We've tried up to a couple hundred billion for now, and it runs. So the next thing is to see if we can get a trillion-parameter model to run and give us great quality. And this would be useful for other things, like low-latency inference and situations where you have giant inputs that you want to process.

For now, what works? Well, we're emitting SPMD code for TPU, including Cloud TPU, so this runs nicely on TPU pods. For CPU and GPU, it's still emitting the old-fashioned device-placement code, so it runs, but it's not as scalable. Everything is out there on GitHub and runs with TensorFlow 1-- not yet with TensorFlow 2. In the future, we want to use this for different types of models, and it would be great to automate the process of choosing a distributed layout, because then you wouldn't need to know much about Mesh-TensorFlow, and it would just figure out how to distribute your computation for you. We welcome contributions to the open-source code-- or contact us.

I'd like to thank all of my collaborators, the authors of our paper. I'd also like to thank the TensorFlow team and the XLA team for a lot of technical support and help with all of this, implementing what we needed. And everything's out there in our open-source repository. Thank you.

[APPLAUSE]

[MUSIC PLAYING]
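[Editor's note: as a rough illustration of the workflow described in the talk (build a Mesh-TensorFlow graph, describe the mesh and layout, then lower to an ordinary TensorFlow graph), here is a sketch that continues from the earlier three-layer example. It uses the device-placement backend mentioned for CPU/GPU; the device list, loss, and learning rate are assumptions, and the exact signatures follow the project's README and may differ between versions.]

```python
# Continues from the earlier sketch (graph, mesh, w, v, y, mesh_shape,
# layout_rules are defined there). All names here are illustrative.
import mesh_tensorflow as mtf

loss = mtf.reduce_mean(y * y)                  # stand-in objective
w_grad, v_grad = mtf.gradients([loss], [w, v])
update_w = mtf.assign(w, w - w_grad * 0.001)
update_v = mtf.assign(v, v - v_grad * 0.001)

# Describe how the logical 2x2 mesh maps onto physical devices. On TPU pods
# an SPMD mesh implementation is used instead; this placement-based backend
# is the one the talk mentions for CPU/GPU.
devices = ["gpu:0", "gpu:1", "gpu:2", "gpu:3"]   # assumed device names
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    mesh_shape, layout_rules, devices)

# Lowering turns the Mesh-TensorFlow graph into part of a TensorFlow graph;
# the rest of the TF program (data pipelines, session, etc.) is unchanged.
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
tf_y = lowering.export_to_tf_tensor(y)
tf_update_ops = [lowering.lowered_operation(update_w),
                 lowering.lowered_operation(update_v)]
```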