FRANK CHEN: So hi everyone. I'm Frank, and I work on the Google Brain team working on TensorFlow. And today, for the first part of this talk, I'm going to talk to you about accelerating machine learning with Google Cloud TPUs.

So the motivating question here is, why is Google building accelerators? I'm always hesitant to predict this, but if you look at the data, the end of Moore's law has been playing out for the past 10 or 15 years. We don't really see the 52% year-on-year growth in single-threaded performance that we saw from the late 1980s through the early 2000s anymore. Single-threaded CPU performance is now growing at maybe 3% to 5% per year. So what this means is that I can't just wait 18 months for my machine learning models to train twice as fast. That doesn't work anymore.

At the same time, organizations are dealing with more data than ever before. People are uploading hundreds and hundreds of hours of video to YouTube every minute. People are leaving product reviews on Amazon. People are using chat systems such as WhatsApp. People are talking to personal assistants, and so on and so forth. So more data is being generated than ever before, and organizations are just not really equipped to make sense of it and use it properly.

And the third thread is that, at the same time, we have this sort of exponential increase in the amount of compute needed by these machine learning models. This comes from a very interesting blog post by OpenAI. In late 2012, when deep learning was first becoming useful, we had AlexNet and we had Dropout, which used a fair amount of computing power, but not that much compared to late 2017, when DeepMind published AlphaGo Zero and AlphaZero. Between those papers, over about six or seven years, compute demand increased by 300,000 times. So this puts a huge strain on companies' compute infrastructure.

So what does this all mean? The end of Moore's law, plus this exponential increase in compute requirements, means that we need a new approach for doing machine learning. And at the same time, of course, everyone still wants to train machine learning models faster and cheaper. So that's why Google is building specialized hardware.

Now, the second question you might be asking is, what sort of accelerators is Google building? From the title of my talk, you know that Google is building a type of accelerator that we call Tensor Processing Units, which are specialized ASICs designed for machine learning. This is the first generation of our TPUs, which we introduced back in 2015 at Google I/O. The second generation, now called Cloud TPU v2, we introduced at Google I/O last year. These Cloud TPU v2 boards can then be combined into pods called Cloud TPU v2 Pods. And at Google I/O this year, we introduced the third generation of Cloud TPUs. It has gone from air cooled to liquid cooled, and of course you can link a bunch of them up into a pod configuration as well.

So what are the differences between these generations of TPUs? The first version was designed for inference only, and it did about 92 teraops of int8. The second generation does both training and inference. It operates on floating point numbers, it does about 180 teraflops, and it has about 64 gigs of HBM. And the third generation of TPUs is a big leap in performance. Now we are doing 420 teraflops.
And we doubled the amount of memory, so now it's 128 gigs of HBM. And again, it does both training and inference.

And of course, we see the same sort of progress with Cloud TPU Pods as well. Our 2017 pods did about 11.5 petaflops, that is, 11,500 teraflops of compute, with 4 terabytes of HBM. And our new generation of pods does over 100 petaflops with 32 terabytes of HBM. The new generation of pods is also liquid cooled, with a new chip architecture.

So that's all well and good, but really, what we are looking for here is not just peak performance, but cost-effective performance. Take this very commonly used image recognition model called ResNet-50. If you train it on, again, a very common dataset called ImageNet, we achieve about 4,100 images per second on real data. We also achieve that while getting state-of-the-art final accuracy, in this case 93% top-5 accuracy on the ImageNet dataset. And we can train this ResNet model in about 7 hours and 47 minutes. This is actually a huge improvement. If you look at the original paper by Kaiming He and others where they introduced the ResNet architecture, they took weeks and weeks to train one of these models. And now, with one Cloud TPU, we can train it in 7 hours and 47 minutes.

And of course, these things are available on Google Cloud. For that training run, if you pay for the resource on demand, it's about $36. And if you pay for it using Google Cloud's preemptible instances, it's about $11. So it's getting pretty cheap to train.

And of course, we want that cost-effective performance at scale. So if you train the same model, ResNet-50, on a Cloud TPU v2 Pod, you get something like 219,000 images per second of training performance. You get the same final accuracy. And training time goes from about eight hours down to about eight minutes. So again, that's a huge improvement. And this gets us into the region where you can just go train a model, go get a cup of coffee, come back, and see the results. So it gets into almost interactive levels of machine learning research and development.

So that's great. Then the next question would be, how do these accelerators work? Today we are going to zoom in on the second generation of Cloud TPUs. So again, this is what it looks like. This is one entire Cloud TPU board that you see here. And the first thing you want to know is that Cloud TPUs are really network-attached devices. If I want to use a Cloud TPU on Google Cloud, I go to the Google Cloud Console and I create a Cloud TPU. Then I create a Google Compute Engine VM, and on that VM I just have to install TensorFlow. So literally, I do pip install tensorflow, and then I can start writing code. There are no drivers to install. You can use a clean Ubuntu image, or you can use the machine learning images that we provide. So it's really very simple to get started with.

Each Cloud TPU is connected to a host server with 32 lanes of PCI Express. The thing to note here is that the TPU itself is an accelerator, so you can think of it like a GPU. It can't run Linux by itself. It's connected to the host server over those 32 lanes of PCI Express so that we can transfer training data in and get results back out quickly.
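To make the "pip install TensorFlow and start writing code" step a bit more concrete, here is a rough sketch of how you point TensorFlow at a Cloud TPU from that Compute Engine VM. The exact API has changed across TensorFlow releases; this uses the tf.distribute interface from more recent versions, and the TPU name 'my-tpu' is just a placeholder for whatever name you gave the TPU in the Cloud Console, not something from the talk.

    import tensorflow as tf

    # 'my-tpu' is a placeholder for the Cloud TPU's name in your project.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='my-tpu')
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    # Anything built under the strategy scope is replicated across the eight
    # TPU cores on the board; training data still flows in from the host
    # over the network and PCI Express.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation='relu'),
            tf.keras.layers.Dense(10),
        ])
        model.compile(
            optimizer='adam',
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))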
And of course, you can see clearly on this board that there are four fairly large heat sinks. Underneath each heat sink is a Cloud TPU chip.

So zooming in on the chip, here's a very simplified diagram of the chip layout. As you can see, each chip has two cores, and each core is connected to 16 gigabytes of HBM. And there are very fast interconnects that connect these chips to the other chips on the board and across the entire pod. Each core does about 22.5 teraflops, and each core consists of a scalar unit, a vector unit, and a matrix unit. We are operating mostly on float32s, with one exception.

So zooming in on the matrix unit, this is where all the dense matrix math and dense convolutions happen. The matrix unit is implemented as a 128 by 128 systolic array that does bfloat16 multiplies and float32 accumulates. So there are two terms here that you might not be familiar with, bfloat16 and systolic arrays, and I'm going to go through each of them in turn.

So here's a brief guide to floating point formats. If you are doing machine learning training and inference today, you're probably using fp32, or what's called single-precision IEEE floating point. In this case, you have one sign bit, eight exponent bits, and 23 mantissa bits. That allows you to represent a range of numbers from about 10 to the negative 38 to about 10 to the 38, so it's a fairly wide range of numbers.

In recent years, people have been trying to train neural networks in fp16, or what's called half-precision IEEE floating point. People on TensorFlow and across the industry have been trying to make this work well and seamlessly, but the truth of the matter is that you have to make modifications to many models for them to train properly if you're only using fp16, mainly because of issues like managing gradients, or having to do loss scaling, all sorts of things. And the reason is that the range of representable numbers for fp16 is much narrower than for fp32. The range here is just from about 6 times 10 to the negative 8 to about 65,000. So that's a much narrower range of numbers.

So what did the folks at Google Brain do? We came up with a floating point format called bfloat16. bfloat16 is just like float32, except we drop the last 16 bits of mantissa. That gives you the same sign bit and the same exponent bits, but only 7 bits of mantissa instead of 23. In this way we can represent the same range of numbers, just at much lower precision. And it turns out that you don't need all that much precision for neural network training, but you do actually need all of the range.

And then the second term is systolic arrays. Rather than trying to describe what a systolic array is, I will just show you a little animation I put together. In this case, we are computing a very simple matrix-times-vector product, y equals w times x, where w is a 3-by-3 matrix and x is a three-element vector. And we are processing x with a batch size of three.
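To make those two ideas a bit more concrete, here is a small NumPy sketch (my own illustration, not code from the talk) that emulates bfloat16 by truncating a float32 to its top 16 bits, and then uses it for the same 3-by-3 matrix times batch-of-three computation as the animation, with low-precision multiplies feeding a float32 accumulator, the multiply/accumulate split described for the matrix unit. Real hardware rounds rather than truncates and streams the operands through the array rather than looping, so this only approximates the numerics, not the dataflow.

    import numpy as np

    def to_bfloat16(x):
        # Emulate bfloat16 by keeping only the top 16 bits of each float32:
        # the sign bit, all 8 exponent bits, and 7 of the 23 mantissa bits.
        # The representable range matches float32; only precision is lost.
        x = np.array(x, dtype=np.float32)
        bits = np.atleast_1d(x).view(np.uint32)
        out = (bits & np.uint32(0xFFFF0000)).view(np.float32)
        return out.reshape(x.shape)

    print(to_bfloat16(3.14159265))  # ~3.140625: small precision loss
    print(to_bfloat16(1e38))        # still finite: bfloat16 keeps float32's range
    print(np.float16(1e38))         # inf: the same value overflows IEEE fp16

    # y = w x for a 3x3 matrix w and a batch of three input vectors,
    # accumulated one partial product at a time: bfloat16 multiplies
    # feeding a float32 accumulator.
    w = np.random.randn(3, 3).astype(np.float32)
    x = np.random.randn(3, 3).astype(np.float32)  # three column vectors

    y = np.zeros((3, 3), dtype=np.float32)
    for k in range(3):
        y += to_bfloat16(w[:, k:k + 1]) * to_bfloat16(x[k:k + 1, :])

    print(y)      # reduced-precision-multiply result
    print(w @ x)  # full float32 reference; the two agree closely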