  • FRANK CHEN: So hi everyone.

  • I'm Frank.

  • And I work on the Google Brain team working on TensorFlow.

  • And today for the first part of this talk,

  • I'm going to talk to you about accelerating machine learning

  • with Google Cloud TPUs.

  • So the motivation question here is, why is Google

  • building accelerators?

  • I'm always hesitant to predict this,

  • but if you look at the data, the end of Moore's law has been going on

  • for the past 10 or 15 years, where we don't really see

  • the 52% year-on-year growth in single-threaded performance

  • that we saw from the late 1980s through the early 2000s

  • anymore, where now single-threaded performance

  • for CPUs is really growing at a rate of about maybe 3% or 5%

  • per year.

  • So what this means is that I can't just

  • wait 18 months for my machine learning models

  • to train twice as fast.

  • This doesn't work anymore.

  • At the same time, organizations are

  • dealing with more data than ever before.

  • You have people uploading hundreds and hundreds

  • of hours of video every minute to YouTube.

  • People are leaving product reviews on Amazon.

  • People are using chat systems, such as WhatsApp.

  • People are talking to personal assistants,

  • and so on and so forth.

  • So more data is generated than ever before.

  • And organizations are just not really

  • equipped to make sense of it or to use it properly.

  • And the third thread is that at the same time,

  • we have this sort of exponential increase

  • in the amount of compute needed by these machine learning

  • models.

  • This is a very interesting blog post by OpenAI.

  • In late 2012, deep learning was first becoming useful.

  • We had AlexNet and Dropout, which used a fair amount of computing power,

  • but not that much compared to late 2017, when

  • DeepMind published the AlphaGo Zero and AlphaZero papers.

  • Between those two points, in about six or seven years,

  • we see the compute demand increase by about 300,000 times.

  • So this puts a huge strain on companies'

  • compute infrastructure.

  • So what does this all mean?

  • The end of Moore's law plus this sort of exponential increase

  • in compute requirements means that we need a new approach

  • for doing machine learning.

  • At the same time, of course, everyone still

  • wants to do their machine learning

  • training faster and cheaper.

  • So that's why Google is building specialized hardware.

  • Now, the second question you might be asking

  • is, what sort of accelerators is Google building?

  • So from the title of my talk, you

  • know that Google is building a type of accelerator

  • that we call Tensor Processing Units, which are really

  • specialized ASICs designed for machine learning.

  • This is the first generation of our TPUs,

  • which we introduced back in 2015 at Google I/O.

  • The second generation of TPUs,

  • now called Cloud TPU version 2, we introduced

  • at Google I/O last year.

  • And then these Cloud TPU version 2's

  • can be combined into pods called Cloud TPU v2 Pods.

  • And of course, at Google I/O this year,

  • we introduced the third generation of Cloud TPUs,

  • which went from air cooled to liquid cooled.

  • And of course, you can link a bunch of them

  • up into a pod configuration as well.

  • So what are the differences between these generations

  • of TPUs?

  • So the first version of TPUs was really

  • designed for inference only.

  • So it did about 92 teraops of int8.

  • The second generation of TPUs does both training

  • and inference.

  • It operates on floating point numbers.

  • It does about 180 teraflops.

  • And it has about 64 gigs of HBM.

  • And the third generation of TPUs

  • is a big leap in performance.

  • So now we are doing 420 teraflops.

  • And we doubled the amount of memory.

  • So now it's 128 gigs of HBM.

  • And again, it does training and inference.

  • And of course, we see the same sort of progress

  • with Cloud TPU Pods as well.

  • Our 2017 pods did about 11.5 petaflops.

  • That is 11,500 teraflops of compute

  • with 4 terabytes of HBM.

  • And our new generation of pods does over 100 petaflops

  • with 32 terabytes of HBM.

  • And of course, the new generation of pods

  • is also liquid cooled.

  • We have a new chip architecture.
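
  • As a rough cross-check of those pod numbers, here is a quick back-of-the-envelope calculation; it assumes a v2 pod is built from 64 of the 180-teraflop, 64-gigabyte boards above, which is an assumption rather than a figure quoted in the talk.

```python
# Back-of-the-envelope check, assuming 64 boards per Cloud TPU v2 pod.
boards = 64
print(boards * 180 / 1000)  # 11.52 -> "about 11.5 petaflops" of compute
print(boards * 64 / 1024)   # 4.0   -> "4 terabytes" of HBM
```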

  • So that's all well and good, but really,

  • what we are looking for here is not just peak performance,

  • but cost effective performance.

  • So take this very commonly used image recognition model,

  • called ResNet 50.

  • If you train it on, again, a very common dataset

  • called ImageNet, we achieve about 4,100 images

  • per second on real data.

  • We also achieve that while getting state of the art

  • final accuracy numbers.

  • So in this case, it's 93% top 5 accuracy

  • on the ImageNet dataset.

  • And we can train this ResNet model

  • in about 7 hours and 47 minutes.

  • And this is actually a huge improvement.

  • If you look at the original paper by Kaiming He

  • and others where they introduce the ResNet architecture,

  • they took weeks and weeks to train one of these models.

  • And now with one TPU, we can train it

  • in 7 hours and 47 minutes.

  • And of course, these things are available on Google Cloud.

  • So for that training run,

  • if you pay for the resource on demand, it's about $36.

  • And if you pay for it using Google Cloud's

  • preemptible instances, it is about $11.

  • So it's getting pretty cheap to train.
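
  • As a quick sanity check on those figures, here is the arithmetic, assuming on-demand and preemptible Cloud TPU v2 rates of roughly $4.50 and $1.35 per hour (assumed rates, not figures quoted in the talk).

```python
# Rough cost check for a 7 hour 47 minute ResNet-50 training run.
# Assumed prices: ~$4.50/hr on demand, ~$1.35/hr preemptible (Cloud TPU v2).
hours = 7 + 47 / 60
print(round(hours * 4.50, 2))  # ~35.0, i.e. "about $36" on demand
print(round(hours * 1.35, 2))  # ~10.5, i.e. "about $11" preemptible
```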

  • And of course, we want to do the cost effective performance

  • at scale.

  • So if you train the same model, ResNet 50,

  • on a Cloud TPU version 2 Pod, you

  • are getting something like 219,000 images per second

  • of training performance.

  • You get the same final accuracy.

  • And training time goes from about eight hours

  • to about eight minutes.

  • So again, that's a huge improvement.

  • And this gets us into the region where we can just

  • iterate quickly: you can just go train a model,

  • go get a cup of coffee, come back,

  • and then you can see the results.

  • So it gets into almost interactive levels

  • of machine learning, of being able to do machine learning

  • research and development.
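
  • Putting those throughput numbers side by side gives a rough sense of the scale-up:

```python
# Throughput and wall-clock comparison from the figures quoted above.
single_tpu_imgs_per_sec = 4_100
pod_imgs_per_sec = 219_000
speedup = pod_imgs_per_sec / single_tpu_imgs_per_sec
print(round(speedup))                       # ~53x more throughput
single_tpu_minutes = 7 * 60 + 47
print(round(single_tpu_minutes / speedup))  # ~9 minutes, roughly the "about eight minutes" quoted
```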

  • So that's great.

  • Then the next question will be, how do these accelerators work?

  • So today we are going to zoom in on the second generation

  • of Cloud TPUs.

  • So again, this is what it looks like.

  • This is one entire Cloud TPU board that you see here.

  • And the first thing that you want to know

  • is that Cloud TPUs are really network-attached devices.

  • So if I want to use a Cloud TPU on Google Cloud, what happens

  • is that I create it.

  • I go to the Google Cloud Console,

  • and I create a Cloud TPU.

  • And then I create a Google Compute Engine VM.

  • And then on the VM, I just have to install TensorFlow.

  • So literally, I have to do pip install tensorflow.

  • And then I can start writing code.

  • There are no drivers to install.

  • You can use a clean Ubuntu image.

  • You can use the machine learning images that we provide.

  • So it's really very simple to get started with.
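
  • As an illustration of how little setup is involved, here is a minimal sketch of connecting to a Cloud TPU from that VM, assuming TensorFlow 2.x and a TPU named my-tpu; the name and the TPUStrategy-based flow are illustrative, not what was shown on stage.

```python
import tensorflow as tf

# Point TensorFlow at the network-attached Cloud TPU by name or gRPC address.
# "my-tpu" is a placeholder for whatever name you gave the TPU in the console.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# A distribution strategy that replicates computation across the TPU cores.
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Any Keras model built here has its variables placed on the TPU.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
```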

  • So each TPU is connected to a host server

  • with 32 lanes of PCI Express.

  • So the thing to note here

  • is that the TPU itself is an accelerator.

  • So you can think of it like a GPU.

  • It doesn't run standalone.

  • You can't run Linux on it by itself.

  • So it's connected to the host server

  • by 32 lanes of PCI Express to make sure that we

  • can transfer training data in

  • and get our results back out quickly.

  • And of course, you can see on this board clearly

  • there are four fairly large heat sinks.

  • Underneath each heat sink is a Cloud TPU chip.

  • So zooming in on the chip, so here's

  • a very simplified diagram of the chip layout.

  • So as you can see, each chip has two cores

  • and is connected to 16 gigabytes of HBM.

  • And there are very fast interconnects

  • that connect these chips to other chips on the board

  • and across the entire pod.

  • So each core does about 22.5 teraflops.

  • And each core consists of a scalar unit, a vector unit,

  • and a matrix unit.

  • And we are operating mostly on float32s, with one exception.
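
  • Those per-core numbers line up with the board-level figure quoted earlier:

```python
# 4 chips per board x 2 cores per chip x ~22.5 teraflops per core
chips_per_board, cores_per_chip, tflops_per_core = 4, 2, 22.5
print(chips_per_board * cores_per_chip * tflops_per_core)  # 180.0, the Cloud TPU v2 board number
```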

  • So zooming in on the matrix unit, this

  • is where all the dense matrix math and dense convolution

  • happens.

  • So the matrix unit is implemented as a 128

  • by 128 systolic array that does bfloat16 multiplies

  • and float32 accumulates.

  • So there are two terms here that you might not be familiar with,

  • bfloat16 and systolic arrays.

  • So I'm going to go through each of these in turn.

  • So here's a brief guide to floating point formats.

  • So if you are doing machine learning training and inference

  • today, you're probably using fp32,

  • or what's called single-precision IEEE

  • floating point format.

  • So in this case, you have one sign bit,

  • eight exponent bits, and 23 mantissa bits.

  • And that allows you to represent a range of numbers from 10

  • to the negative 38 to about 10 to the 38.

  • So it's a fairly wide range of numbers that you can represent.
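
  • For the curious, here is a small sketch of how you can pull those three fields out of a float32 yourself; the helper function is purely illustrative.

```python
import struct

def float32_fields(x):
    """Split an IEEE float32 into its sign, exponent, and mantissa fields."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31                  # 1 bit
    exponent = (raw >> 23) & 0xFF     # 8 bits, biased by 127
    mantissa = raw & 0x7FFFFF         # 23 bits
    return sign, exponent, mantissa

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-2.5))   # (1, 128, 2097152): -1.25 x 2**1
```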

  • So in recent years, people have been

  • trying to train neural networks on fp16,

  • or what's called the half-precision IEEE floating point format.

  • And people at TensorFlow and across the industry

  • have been trying to make this work well and seamlessly,

  • but the truth of the matter is you

  • have to make some modifications to many models

  • for it to train properly if you're only using fp16,

  • mainly because of issues like managing gradients,

  • where you have to do loss scaling, all sorts of things.

  • And the reason is because the range

  • of representable numbers for fp16

  • is much narrower than for fp32.

  • So the range here is just from about 6 times

  • 10 to the negative 8 to about 65,000.

  • So that's a much narrower range of numbers.
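
  • To make the underflow problem concrete, here is a tiny illustration with made-up values:

```python
import numpy as np

# A smallish gradient-like value squared underflows to zero in float16,
# because anything below roughly 6e-8 is not representable.
g16 = np.float16(1e-5)
print(g16 * g16)              # 0.0 in float16
print(np.float32(1e-5) ** 2)  # ~1e-10 in float32, no underflow
```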

  • So what did the folks at Google Brain do?

  • So what Google Brain did is that we came up

  • with a floating point format called bfloat16.

  • So what bfloat16 is, it is like float32,

  • except we drop the last 16 bits of mantissa.

  • So this results in the same sign bit, the same exponent bits,

  • but only 7 bits of mantissa instead of 23 bits.

  • In this way, we can represent the same range of numbers, just

  • at a much lower precision.

  • And it turns out that you don't need all that much precision

  • for neural network training, but you do actually

  • need all the range.
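
  • Here is a small sketch of both halves of that trade-off, using TensorFlow's bfloat16 dtype with made-up values:

```python
import tensorflow as tf

# Range: 1e30 is far outside float16's range but well inside bfloat16's,
# since bfloat16 keeps float32's 8 exponent bits.
big = tf.constant(1e30, dtype=tf.float32)
print(tf.cast(big, tf.float16).numpy())    # inf: overflows float16
print(tf.cast(big, tf.bfloat16).numpy())   # ~1e30: same range as float32

# Precision: only 7 mantissa bits, so fine detail is rounded away.
x = tf.constant(1.2345678, dtype=tf.float32)
print(tf.cast(x, tf.bfloat16).numpy())     # ~1.234: coarser, but enough for training
```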

  • And then the second term is systolic arrays.

  • So rather than trying to describe

  • what a systolic array is, I will just show you

  • a little animation I made up.

  • So in this case, we are doing

  • a very simple matrix-times-vector computation.

  • We're computing y equals W times

  • x, where W is a 3-by-3 matrix and x

  • is a three-element vector.

  • And we are computing y with a batch size of three.
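
  • Since the animation is not reproduced here, the following is a minimal functional sketch of that computation, with the multiply-accumulates grouped the way a weight-stationary systolic array would perform them; it is not a cycle-accurate model, and the example matrices are made up.

```python
import numpy as np

# Weights stay fixed in the array ("weight stationary"): cell (i, j) holds W[i, j].
# Activations stream through, and each cell multiplies and adds to a partial sum.
W = np.array([[1., 2., 3.],
              [4., 5., 6.],
              [7., 8., 9.]])
X = np.array([[1., 0., 2.],   # batch of three input vectors, one per column
              [0., 1., 1.],
              [1., 1., 0.]])

def systolic_matvec(W, X):
    rows, inner = W.shape
    _, batch = X.shape
    Y = np.zeros((rows, batch))
    for b in range(batch):          # each input vector flows through the array
        for i in range(rows):       # each row of cells produces one output element
            partial = 0.0
            for j in range(inner):  # one multiply-accumulate per cell
                partial += W[i, j] * X[j, b]
            Y[i, b] = partial
    return Y

print(systolic_matvec(W, X))
print(W @ X)                        # same result from a plain matrix multiply
```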