[MUSIC PLAYING]

NOAM SHAZEER: Hi. My name is Noam Shazeer. I'm going to tell you about Mesh-TensorFlow, which is a system we built for training giant models on TPU pods. So we're going to talk about data-parallelism and model-parallelism, and also about why I want to train giant models and why we need model-parallelism to do it. Then I'll tell you about the Mesh-TensorFlow system.

So first, data-parallelism-- this is how, roughly, everybody trains neural networks if you have distributed hardware. This is what we use on TPU pods at Google. Generally, you put your entire model on every device, split the training batch into a lot of little chunks, one on each device, and run it. Then you add up the gradients on the parameters across all the devices and do your update. This works really well. The communication is fast on lots of different types of networks. So this is roughly what everyone's doing. The great things about it are that it's universal-- you can use any model architecture. It's fast to compile because you're writing SPMD code: you write the code for what one device is doing, and then you can compile it and send it to every device, which all run the same code. And you get roughly full utilization, because every device is doing roughly the same thing, so nobody is waiting for anybody else. And the communication is fast. (A minimal sketch of this pattern appears further below.) The only problem is that you cannot train giant models, because your entire model has to fit on every device.

So why do I want to train giant models? Well, I like working on language modeling. There are lots of important applications-- machine translation, question answering, dialogue, sentiment analysis, lots of interesting things to do with language. And we find that quality improves with model size. A bigger model tends to know more about the world, understand things better, and give you overall better results. There's plenty of data out there to train a giant model-- just download the text of the web, Common Crawl, whatever, and you've got billions to trillions of words of training data. And in fact, you can train one big model and then fine-tune it to do lots of different things. There's been a lot of research-- at OpenAI, and with BERT at Google-- on transfer learning and language. So it's a great candidate for building giant models.

Now, as an example, I trained a transformer language model with roughly 100 million parameters on the text of Wikipedia. The Abraham Lincoln article was in the dev set, the held-out set, and I told the model to generate a random Abraham Lincoln article. It looks roughly grammatical. It remembers somehow that he's a politician, an American politician. There's plenty it doesn't know about the world, like who Abraham Lincoln was, or that America doesn't have a prime minister, and lots of other stuff. But if you make a similar model, just bigger-- here it is with five billion parameters instead of 100 million-- now it seems to have picked up a lot more about Abraham Lincoln. Roughly half of that stuff is correct. But you know--

[LAUGHTER]

--mostly it's fake news. There are more important applications out there than generating fake news, but this is a nice demonstration that model size is important. What would a model look like with a trillion parameters? We have not done that yet. But we hope to do that soon. OK.
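[Editor's note: to make the data-parallel recipe above concrete, here is a minimal sketch in plain NumPy, simulating the devices with an ordinary Python loop. The model, sizes, and learning rate are illustrative assumptions, not part of the talk.]

```python
# A minimal sketch of synchronous data-parallelism (illustrative only):
# replicate the parameters, split the batch, and allreduce the gradients.
import numpy as np

num_devices, batch, d_in = 4, 64, 32
rng = np.random.default_rng(0)

# Every "device" holds a full replica of the parameters.
w = np.zeros((d_in, 1))
replicas = [w.copy() for _ in range(num_devices)]

# Split the training batch into one chunk per device.
x = rng.normal(size=(batch, d_in))
y = rng.normal(size=(batch, 1))
x_chunks, y_chunks = np.split(x, num_devices), np.split(y, num_devices)

# Each device computes gradients on its own chunk
# (here: a least-squares loss for a linear model).
local_grads = []
for w_rep, xc, yc in zip(replicas, x_chunks, y_chunks):
    err = xc @ w_rep - yc
    local_grads.append(2.0 * xc.T @ err / len(xc))

# "Allreduce": average the per-device gradients, then apply the identical
# update on every replica so all copies of the model stay in sync.
grad = sum(local_grads) / num_devices
replicas = [w_rep - 0.1 * grad for w_rep in replicas]
```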
So if all the parameters won't fit on one core, we need model-parallelism, which means splitting the model itself between different devices. That should let us train really large models, and it should also be very good for inference latency, because now the computation for one example can be split across multiple devices. The problem is that it's very tricky to design these kinds of algorithms.

How do people tend to do it now? Well, you use device placement. You say this operation goes on this device, that operation goes on that device. TensorFlow makes it easy to do that. Still, it's tricky to design an efficient algorithm, and you end up with a giant graph if you're generating enough operations to go on 2,000 different cores. Here's an example of model-parallelism by device placement from Google Neural Machine Translation. They had eight LSTM layers, which they distributed across eight different GPUs; the softmax layer they put somewhere else. And you need some kind of interesting pipelining to keep all the GPUs busy. It works, but it's a lot of work to get it to work right.

Now we're going to take a totally different approach, inspired by what works well about synchronous data-parallelism. We will have every processor involved in every operation. We're going to use SPMD-style programming, where you have the same program on every device, and it's going to use collective communication, like allreduce, just like data-parallelism. Our library for doing this is called Mesh-TensorFlow. We should be able to implement data-parallelism, model-parallelism, splitting in other dimensions-- like splitting an image or video spatially-- or any combination of these things. And we're targeting hardware where you have a homogeneous set of similar processors, ideally well-connected, like a TPU pod. We've got these two-dimensional supercomputers at Google that we've been using. You're going to view your set of processors as an n-dimensional mesh. It doesn't have to correspond to a physical n-dimensional mesh-- you could view a two-dimensional mesh of processors as a one-dimensional mesh-- but, of course, performance will depend on those considerations.

So how does this all work? Well, you can view data-parallelism as splitting the batch dimension of the computation across all your processors. Any tensor that has a batch dimension is split across all of the processors, and any tensor that does not have a batch dimension, meaning the parameters, gets fully replicated. Now we're going to do the same thing, but for model-parallelism we will choose different dimensions to split-- maybe dimensions representing the sizes of hidden layers-- and we will decide to split those dimensions across the set of processors. Usually, an operation will not involve communication, but some operations will involve collective communication, like allreduce-- particularly when you're reducing out split dimensions. And this is going to be somewhat similar to how things work in synchronous data-parallelism.

So let's do an example, a simple three-layer neural network: input layer X, hidden layer H, output layer Y, and two weight matrices, W and V. The data-parallel way to do this is that we split anything with a batch dimension, meaning the activations X, H, and Y. So all of those tensors are split evenly across processors.
And each tensor that does not have a batch dimension-- W and V-- will be replicated across every processor. So here it's showing what Processor 0 is doing and what Processor 1 is doing. They're both doing something roughly similar, except they have different halves of the activations. You don't see any communication in the forwards pass. But if you were to look at the backwards pass, where you're computing the parameter gradients, you would see some matmuls where the split dimension b gets reduced out, and there would be some allreduces in there.

Now, instead of splitting b, let's split the size of the hidden layer, dimension h. So we do that, and now X and Y, the input and output, are fully replicated, because they do not have an h dimension. But the hidden layer H is split, because it does have an h dimension. And the parameter matrices W and V also have an h dimension, so they're split. Again, you have a parallel computation on the two processors, and you see an allreduce communication when you're computing Y, because we're reducing out the split dimension h. We didn't have to split h. Instead, we could have split the dimension of X and Y. In that case, you would have a different pattern of which tensors are split and which ones are replicated, and you'd have communication in different places.

And if you want to get really fancy, let's do data-parallelism and model-parallelism at once. We're going to split dimension b, the batch, across one axis of our two-dimensional supercomputer, and we're going to split the hidden layer dimension h across the other axis of our mesh of processors. So now we have different tensors being either split in one dimension and replicated in the other, or-- since tensor H has both of those dimensions-- split across all the processors. And there are going to be allreduce communications in there, but not across all the processors. They'll be partitioned allreduces, just across rows or just across columns.

So the general case is: give all of the tensor dimensions names, and define the layout of the computation as a map from tensor dimensions to mesh dimensions, saying which tensor dimensions get split across which mesh dimensions. For example, in the previous slide we had the batch tensor dimension split across processor rows, and the hidden-size dimension split across processor columns. We did this to our transformer, a machine translation / language model, and here are the layouts we used for data-parallelism, model-parallelism, and combined parallelism. For the model-parallelism, it works to split the size of the vocabulary, the size of the feed-forward hidden layer, and the number of attention heads. If you do that, you end up splitting all of your computation very nicely and get a nice model-parallel algorithm. And that can be combined with data-parallelism by splitting the batch across the other dimension of the supercomputer.

Now, picking a good layout is, for now, something that you need a well-trained human to do. You need to make sure that all of your expensive operations are split. You're not allowed to split two dimensions of the same tensor across different dimensions of the mesh. And depending on which dimensions you chop up, you may end up with more or less communication. For example, with data-parallelism you'd like the batch to be big; otherwise, you're going to get a lot of communication. Similarly, if you're splitting up a hidden layer, you might want that layer to be really big.
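[Editor's note: here is a rough sketch of the three-layer example written against the open-source Mesh-TensorFlow API in TensorFlow 1.x style. It follows the project's README, but the dimension sizes, the 2x2 mesh, and the exact call signatures are assumptions and may differ between versions.]

```python
# A hedged sketch of the X -> H -> Y example with named dimensions and a
# combined data/model-parallel layout. Sizes are illustrative assumptions.
import tensorflow as tf
import mesh_tensorflow as mtf

graph = mtf.Graph()
mesh = mtf.Mesh(graph, "my_mesh")

# Named tensor dimensions: "batch" and "hidden" are the ones we will split.
batch_dim = mtf.Dimension("batch", 512)
io_dim = mtf.Dimension("io", 1024)         # the dimension shared by X and Y
hidden_dim = mtf.Dimension("hidden", 4096)

# Import an ordinary tf.Tensor into the mesh, attaching named dimensions.
tf_x = tf.zeros([512, 1024])
x = mtf.import_tf_tensor(mesh, tf_x, shape=[batch_dim, io_dim])

# Parameters W and V carry the "hidden" dimension, so the layout below splits them.
w = mtf.get_variable(mesh, "w", shape=[io_dim, hidden_dim])
v = mtf.get_variable(mesh, "v", shape=[hidden_dim, io_dim])

# H = relu(X W); Y = H V. Reducing out a split dimension inserts an allreduce.
h = mtf.relu(mtf.einsum([x, w], output_shape=[batch_dim, hidden_dim]))
y = mtf.einsum([h, v], output_shape=[batch_dim, io_dim])

# A 2x2 mesh of processors: split "batch" across rows (data-parallelism)
# and "hidden" across columns (model-parallelism).
mesh_shape = [("rows", 2), ("cols", 2)]
layout_rules = [("batch", "rows"), ("hidden", "cols")]
```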
So how do you use Mesh-TensorFlow? Well, download our open-source repository. You build a graph in Python, much like a regular TensorFlow graph, except that you're using named dimensions. You define what your mesh is and how it maps to your physical processors, and you define your layout of what gets split across what. Then Mesh-TensorFlow turns your Mesh-TensorFlow graph into part of a TensorFlow graph, and you still use TensorFlow for anything else you want to use it for, like the data pipelines and everything else. (A sketch of this workflow appears at the end of this transcript.)

So far we've trained transformer models with up to five billion parameters on entire TPU pods, getting good performance out of them. These giant models give state-of-the-art quality on some benchmark tasks, like language modeling and machine translation-- not surprisingly. Bigger models are better; lots of people are finding that out. In the future, we would like to try even bigger models, I think with some well-placed sparsity. We would have the computation to train models with a trillion parameters. We've tried up to a couple hundred billion for now, and it runs. So the next thing is to see if we can get a trillion-parameter model to run and give us great quality. And this would be useful for other things, like low-latency inference and situations where you have giant inputs that you want to process.

For now, what works? Well, we're emitting SPMD code for TPU, including Cloud TPU, so this runs nicely on TPU pods. For CPU and GPU, it's still emitting the old-fashioned device-placement code, so it runs, but it's not as scalable. Everything is out there on GitHub and runs with TensorFlow 1-- not yet with TensorFlow 2. In the future, we want to use this for different types of models, and it would be great to automate the process of choosing a distributed layout, because then you wouldn't need to know much about Mesh-TensorFlow, and it would just figure out how to distribute your computation for you. We welcome contributions to the open-source code-- or contact us.

I'd like to thank all of my collaborators, the authors of our paper. I'd also like to thank the TensorFlow team and the XLA team for a lot of technical support and help with all of this, implementing what we needed. And everything's out there in our open-source repository. Thank you.

[APPLAUSE]

[MUSIC PLAYING]
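[Editor's note: as a rough illustration of the workflow described in the talk (build a Mesh-TensorFlow graph, describe the mesh and layout, then lower to an ordinary TensorFlow graph), here is a sketch that continues from the earlier three-layer example. It uses the device-placement backend mentioned for CPU/GPU; the device list, loss, and learning rate are assumptions, and the exact signatures follow the project's README and may differ between versions.]

```python
# Continues from the earlier sketch (graph, mesh, w, v, y, mesh_shape,
# layout_rules are defined there). All names here are illustrative.
import mesh_tensorflow as mtf

loss = mtf.reduce_mean(y * y)                  # stand-in objective
w_grad, v_grad = mtf.gradients([loss], [w, v])
update_w = mtf.assign(w, w - w_grad * 0.001)
update_v = mtf.assign(v, v - v_grad * 0.001)

# Describe how the logical 2x2 mesh maps onto physical devices. On TPU pods
# an SPMD mesh implementation is used instead; this placement-based backend
# is the one the talk mentions for CPU/GPU.
devices = ["gpu:0", "gpu:1", "gpu:2", "gpu:3"]   # assumed device names
mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
    mesh_shape, layout_rules, devices)

# Lowering turns the Mesh-TensorFlow graph into part of a TensorFlow graph;
# the rest of the TF program (data pipelines, session, etc.) is unchanged.
lowering = mtf.Lowering(graph, {mesh: mesh_impl})
tf_y = lowering.export_to_tf_tensor(y)
tf_update_ops = [lowering.lowered_operation(update_w),
                 lowering.lowered_operation(update_v)]
```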