Cloud TPUs (TensorFlow @ O'Reilly AI Conference, San Francisco '18)


  • FRANK CHEN: So hi everyone.

  • I'm Frank.

  • And I work on the Google Brain team working on TensorFlow.

  • And today for the first part of this talk,

  • I'm going to talk to you about accelerating machine learning

  • with Google Cloud TPUs.

  • So the motivation question here is, why is Google

  • building accelerators?

  • I'm always hesitant to predict this,

  • but if you look at the data,

  • the end of Moore's law has been playing out

  • for the past 10 or 15 years. We no longer see

  • the 52% year-on-year growth in single-threaded performance

  • that we saw from the late 1980s through the early 2000s.

  • Single-threaded performance for CPUs is now growing

  • at a rate of maybe 3% to 5% per year.

  • So what this means is that I can't just

  • wait 18 months for my machine learning models

  • to train twice as fast.

  • This doesn't work anymore.

  • At the same time, organizations are

  • dealing with more data than ever before.

  • You have people uploading hundreds and hundreds

  • of hours of video every minute to YouTube.

  • People are leaving product reviews on Amazon.

  • People are using chat systems, such as WhatsApp.

  • People are talking to personal assistants,

  • and so on and so forth.

  • So more data is generated than ever before.

  • And organizations are just not really

  • equipped to make sense of them to use them properly.

  • And the third thread is that at the same time,

  • we have this sort of exponential increase

  • in the amount of compute needed by these machine learning

  • models.

  • This is a very interesting blog post by OpenAI.

  • In late 2012, deep learning was first becoming useful.

  • We had AlexNet, and we had

  • dropout, which used a fair amount of computing power,

  • but not that much compared to late 2017, when

  • DeepMind published AlphaGo Zero and AlphaZero.

  • Between those two points, in about six or seven years,

  • compute demand increased by about 300,000 times.

  • So this puts a huge strain on companies'

  • compute infrastructure.

  • So what does this all mean?

  • The end of Moore's law plus this sort of exponential increase

  • in compute requirements means that we need a new approach

  • for doing machine learning.

  • At the same time, of course, everyone still

  • wants to do compute, do machine learning,

  • training faster and cheaper.

  • So that's why Google is building specialized hardware.

  • Now, the second question you might be asking

  • is, what sort of accelerators is Google building?

  • So from the title of my talk, you

  • know that Google is building a type of accelerator

  • that we call Tensor Processing Units, which are really

  • specialized ASICs designed for machine learning.

  • This is the first generation of our TPUs

  • we introduced back in 2015 at Google

  • I/O. The second generation of TPUs,

  • now called Cloud TPU version 2, was introduced

  • at Google I/O last year.

  • And then these Cloud TPU version 2's

  • can be combined into pods called Cloud TPU v2 Pods.

  • And of course, at Google I/O this year,

  • we introduced the third generation of cloud TPUs.

  • We've gone from air cooled

  • to liquid cooled.

  • And of course, you can link a bunch of them

  • up into a pod configuration as well.

  • So what are the differences between these generations

  • of TPUs?

  • So the first version of TPUs was really

  • designed for inference only.

  • It did about 92 teraops of int8.

  • The second generation of TPUs does both training

  • and inference.

  • It operates on floating point numbers.

  • It does about 180 teraflops.

  • And it has about 64 gigs of HBM.

  • And the third generation of TPUs

  • is a big leap in performance.

  • So now we are doing 420 teraflops.

  • And we doubled the amount of memory.

  • So now it's 128 gigs of HBM.

  • And again, it does training and inference.

  • And of course, we see the same sort of progress

  • with Cloud TPU Pods as well.

  • Our 2017 pods did about 11.5 petaflops.

  • That is 11,500 teraflops of compute

  • with 4 terabytes of HBM.

  • And our new generation of pods does over 100 petaflops

  • with 32 terabytes of HBM.
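
As a rough sanity check, and assuming a v2 pod is built from 64 of the 180-teraflop devices described above (that device count is an assumption here, not something stated in the talk), the headline pod numbers line up:

    64 devices x 180 teraflops = 11,520 teraflops, or about 11.5 petaflops
    64 devices x 64 GB of HBM  = 4,096 GB, or about 4 TB of HBM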

  • And of course, the new generation of pods

  • is also liquid cooled.

  • We have a new chip architecture.

  • So that's all well and good, but really,

  • what we are looking for here is not just peak performance,

  • but cost effective performance.

  • So take this very commonly used image recognition model,

  • called ResNet 50.

  • If you train it on, again, a very common dataset

  • called ImageNet, we achieve about 4,100 images

  • per second on real data.

  • We also achieve that while getting state of the art

  • final accuracy numbers.

  • So in this case, it's 93% top 5 accuracy

  • on the ImageNet dataset.

  • And we can train this ResNet model

  • in about 7 hours and 47 minutes.

  • And this is actually a huge improvement.

  • If you look at the original paper by Kaiming He

  • and others where they introduce the ResNet architecture,

  • they took weeks and weeks to train one of these models.

  • And now with one TPU, we can train it

  • in 7 hours and 47 minutes.
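
As a rough cross-check of that training time, assuming the standard ResNet-50 schedule of about 90 epochs over the roughly 1.28 million ImageNet training images (the epoch count is an assumption, not something stated in the talk):

    90 epochs x 1.28 million images ≈ 115 million images
    115 million images / 4,100 images per second ≈ 28,000 seconds ≈ 7.8 hours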

  • And of course, these things are available on Google Cloud.

  • So for this training run,

  • if you pay for the resource on demand, it's about $36.

  • And if you pay for it using Google Cloud's

  • preemptible instances, it is about $11.

  • So it's getting pretty cheap to train.

  • And of course, we want to do the cost effective performance

  • at scale.

  • So if you're training the same model, ResNet 50,

  • on a Cloud TPU version 2 Pod, you

  • are getting something like 219,000 images per second

  • of training performance.

  • You get the same final accuracy.

  • And training time goes from about eight hours

  • to about eight minutes.

  • So again, that's a huge improvement.

  • And this gets us into the region where you can

  • iterate quickly: you can just go train a model,

  • go get a cup of coffee, come back,

  • and then you can see the results.

  • So it gets into almost interactive levels

  • of machine learning, of being able to do machine learning

  • research and development.

  • So that's great.

  • Then the next question will be, how do these accelerators work?

  • So today we are going to zoom in on the second generation

  • of Cloud TPUs.

  • So again, this is what it looks like.

  • This is one entire Cloud TPU board that you see here.

  • And the first thing that you want to know

  • is that Cloud TPUs are really network-attached devices.

  • So if I want to use a Cloud TPU on Google Cloud, what happens

  • is that I create it.

  • I go to the Google Cloud Console,

  • and I create a Cloud TPU.

  • And then I create a Google Compute Engine VM.

  • And then on that VM, I just have to install TensorFlow.

  • So literally, I just do pip install tensorflow.

  • And then I can start writing code.

  • I don't have drivers to install.

  • You can use a clean Ubuntu image.

  • You can use the machine learning images that we provide.

  • So it's really very simple to get started with.

  • So each TPU is connected to a host server

  • with 32 lanes of PCI Express.

  • So the thing here to note

  • is that the TPU itself is an accelerator.

  • You can think of it like a GPU.

  • It doesn't run standalone.

  • You can't run Linux on it by itself.

  • So it's connected to the host server

  • by 32 lanes of PCI Express to make sure that we

  • can transfer training data in.

  • We can get our results back out quickly.

  • And of course, you can see on this board clearly

  • there are four fairly large heat sinks.

  • Underneath each heat sink is a Cloud TPU chip.

  • So zooming in on the chip, so here's

  • a very simplified diagram of the chip layout.

  • So as you can see, each chip has two cores

  • and is connected to 16 gigabytes of HBM.

  • And there are very fast interconnects

  • that connect these chips to other chips on the board

  • and across the entire pod.

  • So each core does about 22.5 teraflops.

  • And each core consists of a scalar unit, a vector unit,

  • and a matrix unit.

  • And we are operating mostly on float32s, with one exception.
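
Those per-core numbers are consistent with the per-device figure quoted earlier, assuming 4 chips per board and 2 cores per chip:

    22.5 teraflops per core x 2 cores per chip x 4 chips per board = 180 teraflops per device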

  • So zooming on a matrix unit, this

  • is where all the dense matrix math and dense convolution

  • happens.

  • So the matrix unit is implemented as a 128

  • by 128 systolic array that does bfloat16 multiplies

  • and float32 accumulates.

  • So there are two terms here that you might not be familiar with,

  • bfloat16 and systolic arrays.

  • So I'm going to go through each of these in turn.

  • So here's a brief guide to floating point formats.

  • So if you are doing machine learning training and inference

  • today, you're probably using fp32,

  • or what's called single-precision IEEE

  • floating point format.

  • So in this case, you have one sign bit,

  • eight exponent bits, and 23 mantissa bits.

  • And that allows you to represent a range of numbers from 10

  • to the negative 38 to about 10 to the 38.

  • So it's a fairly wide range of numbers that you can represent.

  • So in recent years, people have been

  • trying to train neural networks on fp16,

  • or what's called half-precision IEEE floating point format.

  • And people at TensorFlow and across the industry

  • have been trying to make this work well and seamlessly,

  • but the truth of the matter is you

  • have to make some modifications to many models

  • for it to train properly if you're only using fp16,

  • mainly because of issues like managing gradients,

  • or you have to do loss scaling, all sorts of things.

  • And the reason is because the range

  • of representable numbers for fp16

  • is much narrower than for fp32.

  • So the range here is just from about 6 times

  • 10 to the negative 8 to about 65,000.

  • So that's a much narrower range of numbers.
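
Here is a tiny NumPy illustration of that narrower range; the snippet is just a sketch, but the thresholds are standard IEEE fp16 properties:

```python
import numpy as np

# fp16 overflows to infinity well below values that fp32 handles comfortably.
print(np.float16(70000.0))   # inf  (fp16 max is about 65,504)
print(np.float32(70000.0))   # 70000.0

# Small values (for example, gradients) underflow to zero in fp16.
print(np.float16(1e-8))      # 0.0  (the smallest fp16 subnormal is about 6e-8)
print(np.float32(1e-8))      # 1e-08
```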

  • So what did the folks at Google Brain do?

  • So what Google Brain did is that we came up

  • with a floating point format called bfloat16.

  • So what bfloat16 is, it is like float32,

  • except we drop the last 16 bits of mantissa.

  • So this results in the same sign bit, the same exponent bits,

  • but only 7 bits of mantissa instead of 23 bits.

  • In this way we can represent the same range of numbers, just

  • at a much lower precision.

  • And it turns out that you don't need all that much precision

  • for neural network training, but you do actually

  • need all the range.
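
A small sketch of that trade-off using TensorFlow's bfloat16 dtype (written as TF 2.x-style eager code, which postdates the talk; the values are just illustrations):

```python
import tensorflow as tf

# bfloat16 keeps roughly the same exponent range as float32...
big = tf.cast(tf.constant(1e38, dtype=tf.float32), tf.bfloat16)
print(big)   # still finite (bfloat16 max is about 3.4e38); the same value overflows fp16

# ...but with only 7 mantissa bits, so it is much coarser than float32.
fine = tf.cast(tf.constant(1.001, dtype=tf.float32), tf.bfloat16)
print(fine)  # rounds to 1.0, because steps near 1.0 are about 1/128 apart
```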

  • And then the second term is systolic arrays.

  • So rather than trying to describe

  • what a systolic array is, I will just show you

  • a little animation I made up.

  • So in this case, we are doing

  • a very simple matrix times vector computation.

  • We're computing y equals w times

  • x, where w is a 3-by-3 matrix and x

  • is a three-element vector.

  • And we are computing with a batch size of three.

  • So we have already loaded all the weights

  • into the matrix unit.

  • And if we start the first clock cycle,

  • you'll see that the first element of the first vector

  • is loaded into the matrix unit.

  • And then we multiply position (1, 1)

  • of w with the first element of the first vector.

  • In the second clock cycle, what happens

  • is that more input elements are loaded.

  • So we are doing more multiplications.

  • At the same time, we are pushing the results

  • from the previous round of multiplication [INAUDIBLE].

  • So that in the case of the yellow box

  • right there, we are not just doing the multiplication.

  • We are also summing the result of the multiplication that

  • happens within the box with the result from the box

  • to the left of it.

  • And then this continues.

  • As you can see, you are utilizing a lot more compute

  • now until you get the outputs out.

  • So what this effectively is, is a 2D field of compute.

  • So it allows us to put a lot of compute units

  • within a very small amount of chip area.

  • This optimizes the cost of the chip, because the bigger

  • the chip, the higher the cost.

  • And the chip architecture is also built

  • for pipelining.

  • In the previous example, we only had a batch size of three.

  • But if you have bigger batch sizes,

  • and your chip architecture is built for this,

  • you can just always keep the matrix units full.

  • And this means that we get very high throughput for our matrix

  • multiplications, which is really at the heart of a lot

  • of these deep learning models.
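
Here is a toy NumPy simulation of that weight-stationary idea: weights stay resident in the cells, inputs stream in one skewed diagonal per clock cycle, and partial sums flow down the columns. It is purely illustrative (it computes Y = X W rather than the talk's y = Wx, but the mechanics are the same), not how the chip is actually programmed:

```python
import numpy as np

B, K, N = 3, 3, 3                          # batch size, input dim, output dim
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(B, K)).astype(float)   # activations
W = rng.integers(0, 5, size=(K, N)).astype(float)   # pre-loaded weights

# Weight-stationary systolic array: cell (k, n) permanently holds W[k, n].
# Activations enter skewed in time; partial sums flow down each column and
# a finished output pops out of the bottom row.
Y = np.zeros((B, N))
psum = np.zeros((K, N))                    # partial sum currently held in each cell

for t in range(B + K + N):                 # enough cycles to drain the pipeline
    for k in reversed(range(K)):           # bottom-up, so cycle t-1 psums are read before overwrite
        for n in range(N):
            b = t - k - n                  # which batch element is passing through this cell now
            if 0 <= b < B:
                incoming = psum[k - 1, n] if k > 0 else 0.0
                psum[k, n] = incoming + W[k, n] * X[b, k]
                if k == K - 1:             # bottom of the column: the dot product is complete
                    Y[b, n] = psum[k, n]

print(np.allclose(Y, X @ W))               # True: same answer as a plain matmul
```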

  • So OK, cool.

  • How do I use these accelerators?

  • So our recommendation is that you start with our Cloud TPU

  • Reference Models.

  • These are high performance, open source models.

  • They are licensed under, I think, the Apache license.

  • They implement very common and also cutting-edge

  • model architectures.

  • Internally, we test them for performance and accuracy.

  • And you can use these and get up and running really quickly.

  • And you can modify them as needed.

  • So you can train and run them, of course, on sample data,

  • on your own data, and so on and so forth.

  • And we have a lot of reference models.

  • So I gave you examples of ResNet 50 and other image recognition

  • networks, but you can also do things

  • like machine translation, language modeling, speech

  • recognition, image generation.

  • We have all these models just as sample models for cloud

  • TPUs if you want to get started with them.

  • Great.

  • So remember these?

  • Remember those pods?

  • It turns out for a lot of our models,

  • we have not only optimized them for single TPUs,

  • we've also optimized for TPU pods.

  • For instance, take the ResNet 50 example

  • that I quoted performance figures for earlier.

  • In this case, you've got training on a single Cloud TPU.

  • This is really literally all you do.

  • You start a TPU.

  • You download TensorFlow.

  • You clone the Git repository.

  • And then you just basically call Python, and just

  • say, point it to the TPU.

  • Point it where our data is.

  • Tell me what the batch size is.

  • Tell me how many steps you want to train for.

  • And then bam.

  • Off you go.

  • It turns out that training on the Cloud TPU Pod

  • is not that different.

  • Instead of starting a Cloud TPU, you start a Cloud TPU pod.

  • And really, the only things you have to modify

  • is the name of the TPU, the training batch

  • size, and the number of training steps.

  • So in this case, the reference model

  • for ResNet 50 uses fairly recent techniques,

  • such as the LARS optimizer and label smoothing

  • to achieve the target accuracy so

  • that you don't have to re-implement all these changes.

  • We have already done it for you.

  • So a lot of our reference models scale up from one

  • TPU all the way to a pod.

  • So of course, you aren't limited to reference models.

  • So when you build your own models, of course,

  • you build them with TensorFlow.

  • And when you build models with TensorFlow,

  • there are really two things that you have to think about.

  • There is the thing that most people focus their energy on,

  • which is the network architecture itself, which

  • is running on the accelerator.

  • But what a lot of people neglect is the input pipeline.

  • So basically, moving our training data:

  • reading it, decompressing it, parsing it,

  • performing data augmentation, and batching it,

  • and then sending it into the accelerators.

  • A lot of people don't think about this as a problem,

  • but really, for these sort of high performance accelerators,

  • this limits performance, because if your training

  • pipeline is slow, then the accelerator is just idle half the time.

  • So phase one, build an input pipeline.

  • So this is a very simple input pipeline for ResNet 50.

  • So you have an input function.

  • You list a bunch of files.

  • You shuffle them.

  • You repeat them.

  • And then you send it out.
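
The slide code isn't reproduced here, but a minimal tf.data pipeline of the kind being described might look like the sketch below (TF 2.x-style names; the GCS path and the feature keys are hypothetical placeholders):

```python
import tensorflow as tf

def parse_and_augment(record):
    # Placeholder parser: decode one image/label pair from a TFRecord.
    features = tf.io.parse_single_example(record, {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    })
    image = tf.image.resize(tf.io.decode_jpeg(features["image"], channels=3), [224, 224])
    return image, features["label"]

def input_fn(params):
    # Naive version: everything below runs sequentially on the host CPU.
    files = tf.data.Dataset.list_files("gs://my-bucket/train-*")   # hypothetical path
    dataset = tf.data.TFRecordDataset(files)
    dataset = dataset.shuffle(buffer_size=1024).repeat()
    dataset = dataset.map(parse_and_augment)
    dataset = dataset.batch(params["batch_size"], drop_remainder=True)
    return dataset
```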

  • So this is great.

  • Guess what the performance of this is.

  • This is 150 images per second.

  • So even if you run this on the Cloud TPU,

  • you're getting 150 images per second for training, which

  • is not great, because Cloud TPUs can do 4,000 images per second.

  • So what do you do?

  • You have a bottleneck.

  • So how do you improve performance?

  • You find the bottleneck.

  • You optimize the bottleneck.

  • And of course, you repeat until you

  • get the desired performance.

  • And Cloud TPUs actually provide a fairly comprehensive set

  • of profiling tools.

  • So in this case, this is TensorBoard.

  • So you can bring up a profile of what's happening on your TPU,

  • on the host, and so on and so forth.

  • And then you can see that, oh, there are large gaps.

  • So this means that the TPU is idle waiting for data.

  • And this is not great.

  • So a simplified representation

  • of what's happening on TensorBoard right now

  • is something like this.

  • So in this case, we have an extract,

  • a transform, and a load.

  • And then we have the training on the accelerator.

  • And they are all happening sequentially.

  • And this is not great, right?

  • Because what is really happening here

  • is that you're leaving the CPU idle.

  • And you're leaving the accelerator idle.

  • And these two things are the biggest cost factors

  • in your training pipeline.

  • So what you really want to do is to do something like this.

  • You're overlapping every single step.

  • And you are utilizing all of the expensive bits in your computer

  • to the fullest extent.

  • So the accelerator is 100% utilized.

  • The CPU is only idle slightly.

  • And the disk is idle, but that's fine.

  • And to do pipelining is really easy.

  • So you just have to really modify one thing.

  • So you see the second to last line.

  • You just add dataset.prefetch.

  • And this just ensures that everything above

  • is pipelined with accelerator training.

  • And of course, you also want to do parallel reads,

  • because reading from many files is faster than

  • reading from one.

  • And there are many other techniques

  • that I won't go into today, because I don't have time.

  • So you can use sloppy interleave, fused dataset

  • operators.

  • We have a good performance guide on the TensorFlow website

  • that tells you how you can optimize your input pipelines.

  • I encourage you to take a look.

  • But this is sort of a partially optimized input pipeline.

  • It's slightly longer than our simple one,

  • but this does over 2,000 images per second.
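
A sketch of what those optimizations look like in tf.data terms, reusing the parse_and_augment placeholder from the earlier sketch (again TF 2.x-style APIs rather than the contrib operators named in the talk; the tuned version lives in the TPU sample code):

```python
import tensorflow as tf

def input_fn(params):
    files = tf.data.Dataset.list_files("gs://my-bucket/train-*", shuffle=True)  # hypothetical path
    # Parallel reads: interleave records from many files at once.
    dataset = files.interleave(
        tf.data.TFRecordDataset,
        cycle_length=16,
        num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.shuffle(buffer_size=10000).repeat()
    # Parse and augment many records in parallel, then batch.
    dataset = dataset.map(parse_and_augment,
                          num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.batch(params["batch_size"], drop_remainder=True)
    # Overlap host-side preprocessing with training on the accelerator.
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    return dataset
```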

  • And if you want the fully optimized input pipeline,

  • please take a look at our TPU sample code.

  • OK.

  • Cool.

  • Now comes the fun part, building your model.

  • So the first way you can build your model

  • is actually with Keras.

  • So we have experimental Keras integration available

  • starting with TensorFlow 1.11, which will be coming out

  • in about two to three weeks.

  • So you can write your models in Keras as per normal.

  • And the only real thing that you have to modify

  • is basically create what's called a cluster resolver,

  • give it a name, create a distribution strategy,

  • and call the keras_to_tpu_model function.

  • And this will transform your model

  • to something that's compatible for the TPU.

  • And then after that, you can just

  • do the simple sort of model.compile, model.fit,

  • and all the Keras goodness that you know and love.
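
A rough sketch of that flow using the TF 1.11-era experimental APIs just described; the TPU name and the toy model are placeholders, and the exact module paths may differ slightly by release:

```python
import tensorflow as tf

# Build a Keras model as per normal.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Point at the TPU, create a distribution strategy, and convert the model.
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu="my-tpu")   # hypothetical name
strategy = tf.contrib.tpu.TPUDistributionStrategy(resolver)
tpu_model = tf.contrib.tpu.keras_to_tpu_model(model, strategy=strategy)

# From here on it is the usual Keras workflow.
tpu_model.compile(optimizer=tf.train.GradientDescentOptimizer(0.01),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
# tpu_model.fit(train_data, train_labels, batch_size=1024, epochs=1)
```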

  • And in TensorFlow 1.12, which is the release after this,

  • we are going to make it even easier.

  • So you don't even have to call keras_to_tpu_model anymore.

  • You can just call model.compile directly.

  • And then this will work.

  • Great.

  • You don't want to use Keras.

  • You want to use something lower level.

  • So we also have a solution for that.

  • You can use something called TensorFlow Distribution

  • Strategy.

  • I think there was a talk about Distribution Strategy

  • yesterday.

  • So if you missed that, I think the video will be online soon.

  • So you should take a look at that.

  • So in this case, this is using the Estimator

  • with Distribution Strategy.

  • So you can write your model function

  • like you see on the left.

  • You can write your input function

  • like you see on the top right.

  • And again, the only thing you really have to modify

  • is a couple lines.

  • Again, create a cluster resolver, create a TPU strategy,

  • and then you can just pass it in to the estimator

  • through train_distribute.

  • So this will let it work on TPUs.
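
A minimal sketch of that wiring, with a toy model_fn and a hypothetical TPU name; the strategy class moved between contrib and core across releases, so treat the exact module path as approximate:

```python
import tensorflow as tf

def model_fn(features, labels, mode, params):
    # Tiny stand-in model; a real model_fn would build the full network here.
    logits = tf.keras.layers.Dense(10)(features)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu="my-tpu")   # hypothetical name
strategy = tf.contrib.distribute.TPUStrategy(resolver)                    # module path approximate

# Hand the strategy to the Estimator via RunConfig's train_distribute.
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
# estimator.train(input_fn=input_fn, steps=1000)
```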

  • So that's all great.

  • And so are people using these TPUs?

  • People are, in fact.

  • So here's a case study of an architecture search project

  • that's done by a group from Stanford and MIT.

  • So they did parallel runs using hundreds and hundreds

  • of cloud TPUs from the TensorFlow Research Cloud

  • Program, which is where we are providing 1,000 free TPUs

  • to academic researchers.

  • So if you're an academic researcher,

  • I encourage you to look into this program.

  • So each blue dot in this image is a run on a TPU training

  • an ImageNet-scale convolutional RNN.

  • So each run used to take hours and hours

  • to train on other hardware, but on TPUs,

  • because they have access to so many TPUs,

  • they can do hundreds and hundreds of these runs.

  • So what they were trying to do was

  • that they were trying to search for a model that was a better

  • fit for the data that you record, say,

  • if you put electrodes in my brain

  • and look at what my visual cortex is trying

  • to do when I look at things.

  • So they are trying to find analogs,

  • trying to find a neural network that

  • was a closer analog to the primate visual cortex.

  • So here's a diagram of the space

  • that they were searching.

  • And it turns out that across a population

  • of many different models, they found

  • that the red connections were sort of selected for

  • by the search versus the others.

  • And what happens is that they went back

  • and compared the models to some of the signals

  • that the biologists were recording,

  • and they found that the convolutional RNNs were

  • a much better fit for neural signals,

  • for instance, in V4 and in IT, than other

  • [INAUDIBLE], like convolutional or feed-forward models

  • that you see in the literature today.

  • So this is a really new and exciting direction

  • that a research group was able to do from scratch with access

  • to lots of compute.

  • So you can not just train models on TPUs,

  • you can search for them basically automatically, too.

  • And so, finally, Cloud TPU version 2

  • is generally available today

  • on Google Cloud.

  • If you want to learn more about them,

  • go to cloud.google.com/tpu to get started.

  • All right.

  • So now Alex will present some new functionality

  • that lets you write the accelerator code more easily.

  • Alex.
