  • FRANK CHEN: So hi everyone.

  • I'm Frank.

  • And I work on the Google Brain team working on TensorFlow.

  • And today for the first part of this talk,

  • I'm going to talk to you about accelerating machine learning

  • with Google Cloud TPUs.

  • So the motivation question here is, why is Google

  • building accelerators?

  • I'm always hesitant to predict this,

  • but if you look at the data,

  • the end of Moore's law has been going on

  • for the past 10 or 15 years, where we don't really see

  • the 52% year-on-year growth in single-threaded performance

  • that we saw from the late 1980s through the early 2000s

  • anymore, where now single-threaded performance

  • for CPUs is really growing at a rate of about maybe 3% or 5%

  • per year.

  • So what this means is that I can't just

  • wait 18 months for my machine learning models

  • to train twice as fast.

  • This doesn't work anymore.

  • At the same time, organizations are

  • dealing with more data than ever before.

  • You have people uploading hundreds and hundreds

  • of hours of video every minute to YouTube.

  • People are leaving product reviews on Amazon.

  • People are using chat systems, such as WhatsApp.

  • People are talking to personal assistants

  • and so on and so forth.

  • So more data is generated than ever before.

  • And organizations are just not really

  • equipped to make sense of it or to use it properly.

  • And the third thread is that at the same time,

  • we have this sort of exponential increase

  • in the amount of compute needed by these machine learning

  • models.

  • This is a very interesting blog post by OpenAI.

  • In late 2012, when deep learning was first becoming useful,

  • we had AlexNet, and we had

  • Dropout, which used a fair amount of computing power,

  • but not that much compared to late 2017, when

  • DeepMind published AlphaGo Zero and AlphaZero.

  • Between those papers, in about six or seven years,

  • we see the compute demand increase by 300,000 times.

  • So this puts a huge strain on companies'

  • compute infrastructure.

  • So what does this all mean?

  • The end of Moore's law plus this sort of exponential increase

  • in compute requirements means that we need a new approach

  • for doing machine learning.

  • At the same time, of course, everyone still

  • wants to do compute, do machine learning,

  • training faster and cheaper.

  • So that's why Google is building specialized hardware.

  • Now, the second question you might be asking

  • is, what sort of accelerators is Google building?

  • So from the title of my talk, you

  • know that Google is building a type of accelerator

  • that we call Tensor Processing Units, which are really

  • specialized ASICs designed for machine learning.

  • This is the first generation of our TPUs

  • we introduced back in 2015 at Google

  • I/O. The second generation of TPUs,

  • now called Cloud TPU version 2, was introduced

  • at Google I/O last year.

  • And then these Cloud TPU version 2's

  • can be combined into pods called Cloud TPU v2 Pods.

  • And of course, at Google I/O this year,

  • we introduced the third generation of Cloud TPUs.

  • It went from air cooled

  • to liquid cooled.

  • And of course, you can link a bunch of them

  • up into a pod configuration as well.

  • So what are the differences between these generations

  • of TPUs?

  • So the first version of TPUs, it was really

  • designed for inference only.

  • So it did about 92 teraops of int8.

  • The second generation of TPUs does both training

  • and inference.

  • It operates on floating point numbers.

  • It does about 180 teraflops.

  • And it has about 64 gigs of HBM.

  • And the third generation of TPUs,

  • it's a big leap in performance.

  • So now we are doing 420 teraflops.

  • And we doubled the amount of memory.

  • So now it's 128 gigs of HBM.

  • And again, it does training and inference.

  • And of course, we see the same sort of progress

  • with Cloud TPU Pods as well.

  • Our 2017 pods did about 11.5 petaflops.

  • That is 11,500 teraflops of compute

  • with 4 terabytes of HBM.

  • And our new generation of pods does over 100 petaflops

  • with 32 terabytes of HBM.

  • And of course, the new generation of pods

  • is also liquid cooled.

  • We have a new chip architecture.

  • So that's all well and good, but really,

  • what we are looking for here is not just peak performance,

  • but cost effective performance.

  • So take this very commonly used image recognition model,

  • called ResNet 50.

  • If you train it on, again, a very common dataset

  • called ImageNet, we achieve about 4,100 images

  • per second on real data.

  • We also achieve that while getting state of the art

  • final accuracy numbers.

  • So in this case, it's 93% top 5 accuracy

  • on the ImageNet dataset.

  • And we can train this ResNet model

  • in about 7 hours and 47 minutes.

  • And this is actually a huge improvement.

  • If you look at the original paper by Kaiming He

  • and others where they introduce the ResNet architecture,

  • they took weeks and weeks to train one of these models.

  • And now with one TPU, we can train it

  • in 7 hours and 47 minutes.

  • And of course, these things are available on Google Cloud.

  • So for that training run,

  • if you pay for the resource on demand, it's about $36.

  • And if you pay for it using Google Cloud's

  • preemptible instances, it is about $11.

  • So it's getting pretty cheap to train.

  • And of course, we want to do the cost effective performance

  • at scale.

  • So if you're training the same model, ResNet 50,

  • on a Cloud TPU version 2 Pod, you

  • are getting something like 219,000 images per second

  • of training performance.

  • You get the same final accuracy.

  • And training time goes from about eight hours

  • to about eight minutes.

  • So again, that's a huge improvement.

  • And this gets us into the region of we can just

  • iterate on-- you can just go train a model,

  • go get a cup of coffee, come back,

  • and then you can see the results.

  • So it gets into almost interactive levels

  • of machine learning, of being able to do machine learning

  • research and development.

  • So that's great.

  • Then the next question will be, how do these accelerators work?

  • So today we are going to zoom in on the second generation

  • of Cloud TPUs.

  • So again, this is what it looks like.

  • This is one entire Cloud TPU board that you see here.

  • And the first thing that you want to know

  • is that Cloud TPUs are really network-attached devices.

  • So if I want to use a Cloud TPU on Google Cloud, what happens

  • is that I create it.

  • I go to the Google Cloud Console,

  • and I create a Cloud TPU.

  • And then I create a Google Compute Engine VM.

  • And then under VM, I just have to install TensorFlow.

  • So literally, I have to run pip install tensorflow.

  • And then I can start writing code.

  • I don't have drivers to install.

  • You can use a clean Ubuntu image.

  • You can use the machine learning images that we provide.

  • So it's really very simple to get started with.

  • So each TPU is connected to a host server

  • with 32 lanes of PCI Express.

  • So the thing here to note

  • is that the TPU itself is an accelerator.

  • So you can think of it like a GPU.

  • It doesn't run an operating system--

  • you can't run Linux on it by itself.

  • So it's connected to the host server

  • by 32 lanes of PCI Express to make sure that we

  • can transfer training data in.

  • We can get our results back out quickly.

  • And of course, you can see on this board clearly

  • there are four fairly large heat sinks.

  • Underneath each heat sink is a Cloud TPU chip.

  • So zooming in on the chip, so here's

  • a very simplified diagram of the chip layout.

  • So as you can see, each chip has two cores

  • and is connected to 16 gigabytes of HBM.

  • And there are very fast interconnects

  • that connect these chips to other chips on the board

  • and across the entire pod.

  • So each core does about 22.5 teraflops.

  • And each core consists of a scalar unit, a vector unit,

  • and a matrix unit.

  • And we are operating mostly on float32s, with one exception.

  • So zooming in on the matrix unit, this

  • is where all the dense matrix math and dense convolution

  • happens.

  • So the matrix unit is implemented as a 128

  • by 128 systolic array that does bfloat16 multiplies

  • and float32 accumulates.

  • So there are two terms here that you might not be familiar with,

  • bfloat16 and systolic arrays.

  • So I'm going to go through each of these in turn.

  • So here's a brief guide to floating point formats.

  • So if you are doing machine learning training and inference

  • today, you're probably using fp32,

  • or what's called single-precision IEEE

  • floating point format.

  • So in this case, you have one sign bit,

  • eight exponent bits, and 23 mantissa bits.

  • And that allows you to represent a range of numbers from 10

  • to the negative 38 to about 10 to the 38.

  • So it's a fairly wide range of numbers that you can represent.

  • So in recent years, people have been

  • trying to train neural networks on fp16,

  • or what's called half-precision IEEE floating point format.

  • And people at TensorFlow and across the industry

  • have been trying to make this work well and seamlessly,

  • but the truth of the matter is you

  • have to make some modifications to many models

  • for it to train properly if you're only using fp16,

  • mainly because of issues like managing gradients,

  • or having to do loss scaling, all sorts of things.

  • And the reason is because the range

  • of representable numbers for fp16

  • is much narrower than for fp32.

  • So the range here is just from about 6 times 10

  • to the negative 8 up to about 65,000.

  • So that's a much narrower range of numbers.

  • So what did the folks at Google Brain do?

  • So what Google Brain did is that we came up

  • with a floating point format called bfloat16.

  • So what bfloat16 is, it is like float32,

  • except we drop the last 16 bits of mantissa.

  • So this results in the same sign bit, the same exponent bits,

  • but only 7 bits of mantissa instead of 23 bits.

  • In this way we can represent the same range of numbers, just

  • at a much lower precision.

  • And it turns out that you don't need all that much precision

  • for neural network training, but you do actually

  • need all the range.
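
Since bfloat16 is just float32 with the low 16 mantissa bits dropped, a tiny NumPy sketch (purely illustrative, with made-up helper names, not anything shown in the talk) makes the idea concrete:

```python
import numpy as np

# Toy illustration (not TPU code): bfloat16 keeps float32's sign and exponent
# bits but only the top 7 mantissa bits, so a float32 can be approximated by
# truncating its low 16 bits. (Real conversions typically round rather than
# truncate; truncation is just the simplest way to see the idea.)
def float32_to_bfloat16_bits(x):
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits >> 16).astype(np.uint16)          # sign + exponent + 7 mantissa bits

def bfloat16_bits_to_float32(b):
    return (np.asarray(b, dtype=np.uint32) << 16).view(np.float32)

x = np.float32(3.14159265)
approx = bfloat16_bits_to_float32(float32_to_bfloat16_bits(x))
print(x, "->", approx)   # same dynamic range as float32, only ~2-3 decimal digits of precision
```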

  • And then the second term is systolic arrays.

  • So rather than trying to describe

  • what a systolic array is, I will just show you

  • a little animation I made up.

  • So in this case, we are computing

  • a very simple matrix-times-vector computation.

  • So you're computing y equals w times

  • x, where w is a 3-by-3 matrix and x

  • is a three-element vector.

  • And we are computing with a batch size of three.

  • So we have already loaded all the weights

  • into the matrix unit.

  • And if we start the first clock cycle,

  • you'll see that the first element of the first vector

  • is loaded into the matrix unit.

  • And then we multiply the element at position (1, 1)

  • of w with the first element of the first vector.

  • In the second clock cycle, what happens

  • is that more weights are loaded.

  • So we are doing more multiplications.

  • At the same time, we are pushing the results

  • from the previous round of multiplication [INAUDIBLE].

  • So in the case of the yellow box

  • right there, we are not just doing the multiplication.

  • We are also summing the result of the multiplication that

  • happens within the box with the result from the box

  • to the left of it.

  • And then this continues.

  • As you can see, you are utilizing a lot more compute

  • now until you get the outputs out.

  • So what this effectively is, is a 2D field of compute.

  • So it allows us to put a lot of compute units

  • within a very small amount of chip area.

  • So it optimizes the cost of the chip,

  • because the bigger the chip,

  • the higher the cost.

  • And with a chip architecture that's also built

  • for pipelining-- that is we can fill the--

  • so in this previous example, we only had a batch size of three.

  • But if you have bigger batch sizes,

  • if your chip architecture is built for this,

  • you can just always fill the matrix units.

  • And this means that we get very high throughput for our matrix

  • multiplications, which is really at the heart of a lot

  • of these deep learning models.
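
As a toy illustration of the wavefront schedule in the animation, here is a cycle-by-cycle NumPy emulation of a weight-stationary systolic array (a sketch for intuition only; the real matrix unit is 128 by 128 and does bfloat16 multiplies with float32 accumulates):

```python
import numpy as np

# Toy, cycle-by-cycle emulation of a weight-stationary systolic array.
# Weights stay resident in the array; each input vector is fed in as a
# skewed "wavefront", and partial sums accumulate as they flow across a row.
def systolic_matmul(W, X):
    n, batch = W.shape[0], X.shape[1]
    Y = np.zeros((n, batch), dtype=np.float32)
    for b in range(batch):                  # one input vector at a time
        for cycle in range(2 * n - 1):      # cycles needed to drain the pipeline
            for i in range(n):              # row of the systolic array
                j = cycle - i               # skewed arrival: which cell fires this cycle
                if 0 <= j < n:
                    Y[i, b] += W[i, j] * X[j, b]   # multiply, then accumulate from the left
    return Y

W = np.random.randn(3, 3).astype(np.float32)   # 3x3 weights, as in the animation
X = np.random.randn(3, 3).astype(np.float32)   # batch of three input vectors (columns)
assert np.allclose(systolic_matmul(W, X), W @ X, atol=1e-5)
```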

  • So OK, cool.

  • How do I use these accelerators?

  • So our recommendation is that you start with our Cloud TPU

  • Reference Models.

  • These are high performance, open source models.

  • They are licensed under, I think, the Apache license.

  • They implement very common and also cutting-edge

  • model architectures.

  • Internally, we test them for performance and accuracy.

  • And you can use these and get up and running really quickly.

  • And you can modify them as needed.

  • So you can train and run them, of course, on sample data,

  • on your own data, and so on and so forth.

  • And we have a lot of reference models.

  • So I gave you examples of ResNet 50 and other image recognition

  • networks, but you can also do things

  • like machine translation, language modeling, speech

  • recognition, image generation.

  • We have all these models just as sample models for cloud

  • TPUs if you want to get started with them.

  • Great.

  • So remember these?

  • Remember those pods?

  • It turns out for a lot of our models,

  • we have not only optimized them for single TPUs,

  • we've also optimized for TPU pods.

  • For instance, take the ResNet 50 example

  • that I quoted performance figures for earlier.

  • In this case, you've got training on a single Cloud TPU.

  • This is really literally all you do.

  • You download the-- you start a TPU.

  • You download TensorFlow.

  • You clone the Git repository.

  • And then you just basically call Python, and just

  • say, point it to the TPU.

  • Point it where our data is.

  • Tell me what the batch size is.

  • Tell me how many steps you want to train for.

  • And then bam.

  • Off you go.

  • It turns out that training on the Cloud TPU Pod

  • is not that different.

  • Instead of starting a Cloud TPU, you start a Cloud TPU pod.

  • And really, the only things you have to modify

  • is the name of the TPU, the training batch

  • size, and the number of training steps.

  • So in this case, the reference model

  • for ResNet 50 uses fairly recent techniques,

  • such as the LARS optimizer and label smoothing

  • to achieve the target accuracy so

  • that you don't have to re-implement all these changes.

  • We have already done it for you.

  • So a lot of our reference models scale up from one

  • TPU all the way to a pod.

  • So of course, you aren't limited to reference models.

  • So when you build your own models, of course,

  • you build them with TensorFlow.

  • And when you build models with TensorFlow,

  • there are really two things that you have to think about.

  • There is the thing that most people focus their energy on,

  • which is the network architecture itself, which

  • is running on the accelerator.

  • But a lot of what people neglect is the input pipeline.

  • So basically, moving our training data,

  • reading them, decompressing them, parsing them,

  • performing data augmentation, and batching them,

  • and then sending it into the accelerators.

  • A lot of people don't think about this as a problem,

  • but really, for these sort of high performance accelerators,

  • this can limit performance, because if your training

  • pipeline is slow, then the accelerator is just idle half the time.

  • So phase one, build an input pipeline.

  • So this is a very simple input pipeline for ResNet 50.

  • So you have an input function.

  • You list a bunch of files.

  • You shuffle them.

  • You repeat them.

  • And then you send it out.
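
As a rough sketch of that kind of naive tf.data pipeline (the file pattern and parse function are placeholders, not the actual slide code), it might look something like this:

```python
import tensorflow as tf

def parse_example(serialized):
    # Decode one TFRecord into an (image, label) pair; details omitted.
    features = tf.parse_single_example(serialized, {
        "image/encoded": tf.FixedLenFeature([], tf.string),
        "image/class/label": tf.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(features["image/encoded"], channels=3)
    image = tf.image.resize_images(image, [224, 224])
    return image, features["image/class/label"]

def input_fn(params):
    batch_size = params["batch_size"]
    files = tf.data.Dataset.list_files("gs://my-bucket/train-*")   # list a bunch of files
    dataset = files.flat_map(tf.data.TFRecordDataset)              # read them one at a time
    dataset = dataset.map(parse_example)                           # decode sequentially
    dataset = dataset.shuffle(1024).repeat()                       # shuffle and repeat
    return dataset.batch(batch_size, drop_remainder=True)          # batch and send it out
```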

  • So this is great.

  • Guess what the performance of this is.

  • This is 150 images per second.

  • So even if you run this on the Cloud TPU,

  • you're getting 150 images per second for training, which

  • is not great, because Cloud TPUs can do 4,000 images per second.

  • So what do you do?

  • You have a bottleneck.

  • So how do you improve performance?

  • You find the bottleneck.

  • You optimize the bottleneck.

  • And of course, you repeat until you

  • get the desired performance.

  • And Cloud TPUs actually provide a fairly comprehensive set

  • of profiling tools.

  • So in this case, this is TensorBoard.

  • So you can bring up a profile of what's happening on your TPU,

  • on the host, and so on and so forth.

  • And then you can see that, oh, there are large gaps.

  • So this means that the TPU is idle waiting for data.

  • And this is not great.

  • So a simplified representation

  • of what's happening on TensorBoard right now

  • is something like this.

  • So in this case, we have an extract,

  • a transform, and a load.

  • And then we have the training on the accelerator.

  • And they are all happening sequentially.

  • And this is not great, right?

  • Because what is really happening here

  • is that you're leaving the CPU idle.

  • And you're leaving the accelerator idle.

  • And these two things are the biggest cost factors

  • in your training pipeline.

  • So what you really want to do is to do something like this.

  • You're overlapping every single step.

  • And you are utilizing all of the expensive bits in your computer

  • to the fullest extent.

  • So the accelerator is 100% utilized.

  • The CPU is only idle slightly.

  • And the disk is idle, but that's fine.

  • And to do pipelining is really easy.

  • So you just have to really modify one thing.

  • So you see the second to last line.

  • You just add a call to dataset.prefetch.

  • And this just ensures that everything above

  • is pipelined with the accelerator training.

  • And of course, you also want to do parallel reads,

  • because reading from many files is faster than

  • reading from one.

  • And there are many other techniques

  • that I won't go into today, because I don't have time.

  • So you can use sloppy interleave, fused dataset

  • operators.

  • We have a good performance guide on the TensorFlow website

  • that tells you how you can optimize your input pipelines.

  • I encourage you to take a look.

  • But this is sort of a partially optimized input pipeline.

  • It's slightly longer than our simple one,

  • but this does over 2,000 images per second.
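
A hedged sketch of what such a partially optimized pipeline might look like, using the parallel-interleave and fused map-and-batch operators mentioned above (the tuning numbers and parse_example function are placeholders, not the actual sample code):

```python
import tensorflow as tf

# Sketch: parallel reads, parallel parsing, and prefetching added to the
# naive pipeline above. Assumes parse_example is defined as in the earlier sketch.
def input_fn(params):
    batch_size = params["batch_size"]
    files = tf.data.Dataset.list_files("gs://my-bucket/train-*", shuffle=True)
    # Read from many files in parallel instead of one at a time.
    dataset = files.apply(tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset, cycle_length=8, sloppy=True))
    dataset = dataset.shuffle(1024).repeat()
    # Fused map+batch: decode and augment on several CPU threads, then batch.
    dataset = dataset.apply(tf.contrib.data.map_and_batch(
        parse_example, batch_size=batch_size,
        num_parallel_batches=8, drop_remainder=True))
    # Overlap input processing with training running on the accelerator.
    return dataset.prefetch(2)
```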

  • And if you want the fully optimized image pipeline,

  • please take a look at our TPU sample code.

  • OK.

  • Cool.

  • Now comes the fun part, building your model.

  • So the first way you can build your model

  • is actually with Keras.

  • So we have experimental Keras integration available

  • starting with TensorFlow 1.11, which will be coming out

  • in about two to three weeks.

  • So you can write your models in Keras as per normal.

  • And the only real thing that you have to modify

  • is basically create what's called a cluster resolver,

  • give it a name, create a distribution strategy,

  • and call the keras_to_tpu_model function.

  • And this will transform your model

  • to something that's compatible for the TPU.

  • And then after that, you can just

  • do the simple sort of model.compile, model.fit,

  • and all the Keras goodness that you know and love.
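
A minimal sketch of that TF 1.11 flow, assuming a toy Keras model and a placeholder TPU name ("my-tpu"); the surrounding setup is illustrative rather than the talk's actual code:

```python
import tensorflow as tf

# A toy Keras model; the real model and TPU name ("my-tpu") are placeholders.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Create a cluster resolver pointing at the Cloud TPU, and a TPU distribution strategy.
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
strategy = tf.contrib.tpu.TPUDistributionStrategy(resolver)

# Convert the Keras model into a TPU-compatible model, then compile and fit as usual.
tpu_model = tf.contrib.tpu.keras_to_tpu_model(model, strategy=strategy)
tpu_model.compile(optimizer=tf.train.GradientDescentOptimizer(0.01),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# tpu_model.fit(x_train, y_train, batch_size=1024, epochs=5)
```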

  • And in TensorFlow 1.12, which is the release after this,

  • we are going to make it even easier.

  • So you don't even have to call keras_to_tpu_model anymore.

  • You can just call model.compile directly.

  • And then this will work.

  • Great.

  • You don't want to use Keras.

  • You want to use something lower level.

  • So we also have a solution for that.

  • You can use something called TensorFlow Distribution

  • Strategy.

  • I think there was a talk about Distribution Strategy

  • yesterday.

  • So if you missed that, I think the video will be online soon.

  • So you should take a look at that.

  • So in this case, this is using the estimator

  • of Distribution Strategy.

  • So you can write your model function

  • like you see on the left.

  • You can write your input function

  • like you see on the top right.

  • And again, the only thing you really have to modify

  • is a couple lines.

  • Again, create a cluster resolver, create a TPU strategy.

  • And then you can just pass it in to the estimator

  • through the train_distribute config option.

  • So this will let it work on TPUs.
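
A rough sketch of that Estimator flow, where model_fn, input_fn, the TPU name, and the exact contrib symbols are assumptions based on the TF 1.x APIs of this era rather than the talk's slides:

```python
import tensorflow as tf

# Sketch only: model_fn and input_fn are the user-defined functions described
# above; the tf.contrib symbol names and arguments are assumptions for TF 1.x.
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu="my-tpu")
strategy = tf.contrib.distribute.TPUStrategy(resolver, steps_per_run=100)

# Hand the strategy to the Estimator through its RunConfig.
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn=input_fn, max_steps=10000)
```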

  • So that's all great.

  • And so are people using these TPUs?

  • People are, in fact.

  • So here's a case study of an architecture search project

  • that's done by a group from Stanford and MIT.

  • So they did parallel runs using hundreds and hundreds

  • of cloud TPUs from the TensorFlow Research Cloud

  • Program, which is where we are providing 1,000 free TPUs

  • to academic researchers.

  • So if you're academic researchers,

  • I encourage you to look into this program.

  • So each blue dot in this image is a run on a TPU training

  • an ImageNet-scale convolutional RNN.

  • So each run used to take hours and hours

  • to train on other hardware, but on TPUs,

  • because they have access to so many TPUs,

  • they can do hundreds and hundreds of these runs.

  • So what they were trying to do was

  • that they were trying to search for a model that was a better

  • fit for the data that you record, say,

  • if you put electrodes in my brain

  • and look at what my visual cortex is trying

  • to do when I look at things.

  • So they are trying to find analogs,

  • trying to find a neural network that

  • was a closer analog to the primate visual cortex.

  • So it turns out that-- so here's a diagram of the space

  • that they were searching.

  • And it turns out that across a population

  • of many different models, they found

  • that the red connections were sort of selected

  • for by the search versus the others.

  • And what happens is that they went back

  • and compared the models to some of the signals

  • that the biologists were recording,

  • and they found that the convolutional RNNs were

  • a much better fit for neural signals,

  • for instance, in V4 and in IT, than other

  • [INAUDIBLE], like convolutional or feed-forward models

  • that you see in the literature today.

  • So this is a really new and exciting direction

  • that a research group was able to do from scratch with access

  • to lots of compute.

  • So you can not only train models on TPUs,

  • you can search for them basically automatically, too.

  • And so, finally, of course, Cloud TPU version 2

  • is generally available on Google Cloud today.

  • If you want to learn more about them,

  • go to cloud.google.com/tpu to get started.

  • All right.

  • So now Alex will present some new functionality

  • that lets you write the accelerator code more easily.

  • Alex.
