[MUSIC PLAYING]
MARTIN GORNER: Hi, everyone, and thank you for being here
at 8:30 in the morning, and welcome to this session
about TPUs and TPU pods.
So those are custom made accelerators
that Google has designed to accelerate machine learning
workloads.
And before Kaz and I tell you everything about them,
I would like to do something.
Of course, this is live, so you want to see a live demo.
And I would like to train with you here, onstage, using
a TPU pod, one of those big models
that used to take days to train.
And we'll see if we can finish the training
within this session.
So let me start the training.
I will come back to explaining exactly what I'm doing here.
I'm just starting it.
Run all cells.
Seems to be running.
OK, I'm just checking.
I'm running this on a 128 core TPU pod.
So one of the things you see in the logs here
is that I have all my TPUs appearing.
0, 1, 2, and so on, all the way up to 128.
All right, so this is running.
I'm happy with it.
Let's hear more about TPUs.
So first of all, what is this piece of silicon?
And this is the demo that I've just launched.
It's an object detection demo that
is training on a wildlife data set of 300,000 images.
Why wildlife?
Because I can show you cute pandas.
And I can show you cute electronics as well.
So this is a TPU v2.
And we have a second version now, a TPU v3.
Those are fairly large boards.
It's large like this, roughly.
And as you can see, they have four chips on them.
Each chip is dual core, so each of these boards
has 8 TPU cores on them.
And each core has two units.
One is a vector processing unit,
which is a fairly standard general-purpose,
data-oriented processor.
What makes this special for machine learning
is the matrix multiply unit.
TPUs have a built-in hardware-based matrix
multiplier that can multiply two 128 by 128 matrices in one go.
So what is special about this architecture?
There are two tricks that we used
to make it fast and efficient.
The first one is, I would say, semi-standard.
It's reduced precision.
When you train neural networks, reducing the precision
from 32-bit floating points to 16-bit
is something that people quite frequently do,
because neural networks are quite resistant to the loss
of precision.
Actually, it even happens sometimes
that the noise that is introduced by reduced precision
acts as a kind of regularizer and helps with convergence.
So sometimes you're even lucky when you reduce precision.
But then, as you see on this chart, float16 and float32,
the floating point formats, they don't have the same number
of exponent bits, which means that they
don't cover the same range.
So when you take a model and downgrade all your float32s
into float16s, you might get into underflow or overflow
problems.
And if it is your model, it's usually not
so hard to go in and fix.
But if you're using code from GitHub
and you don't know where to fix stuff,
this might be very problematic.
So that's why on TPUs, we chose a different--
actually we designed a different floating point
format called bfloat16.
And as you can see, it's actually
exactly the same as float32 with just the fractional bits
cut off.
So the point is it has exactly the same number
of exponent bits, exactly the same range.
And therefore, usually, it's a drop-in replacement
for float32 when you reduce precision.
So typically for you, there is nothing
to do on your model to benefit from the speed of reduced
precision.
The TPU will do this automatically, on chip,
in hardware.
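As a quick, illustrative check of those ranges (a minimal sketch; the printed values are approximate):

    # Illustrative check of the dynamic ranges discussed above.
    # float16 has fewer exponent bits, so its range is much smaller;
    # bfloat16 keeps the float32 exponent and only drops fraction bits.
    import tensorflow as tf

    print(tf.float32.max)    # ~3.4e38
    print(tf.bfloat16.max)   # ~3.4e38, same exponent range as float32
    print(tf.float16.max)    # 65504, much smaller range: overflow risk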
And the second trick is architectural.
It's the design of this matrix multiply unit.
So that you understand how this works,
try to picture, in your head, how to perform a matrix
multiplication.
And one result, one element of the resulting
matrix--try to remember your math from school--is a dot product.
A dot product of one row of the first matrix and one column
of the second matrix.
Now what is a dot product?
A dot product is a series of multiply-accumulate operations,
which means that the only operation you
need to perform a matrix multiplication
is multiply and accumulate.
And multiply-accumulate in 16 bits,
because we're using bfloat16 reduced precision.
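To make that point concrete, here is a plain-Python sketch (not TPU code) showing that a matrix multiplication really is nothing but multiply-accumulate operations:

    # A matrix multiply expressed purely as multiply-accumulates:
    # every element of the result is a dot product of a row of A
    # and a column of B, built up with "acc += a * b".
    def matmul(A, B):
        n, k, m = len(A), len(B), len(B[0])
        C = [[0.0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                acc = 0.0
                for p in range(k):
                    acc += A[i][p] * B[p][j]   # multiply-accumulate
                C[i][j] = acc
        return C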
A 16-bit multiply-accumulator is a tiny, tiny piece of silicon.
And you can wire them together as an array, as you see here.
In real life, this would be a 128 by 128 array.
It's called a systolic array.
Systolic comes from the Greek for the heart's contraction,
because data is pumped through it rhythmically, like blood through a heart.
So the way it works is that you load one matrix into the array,
and then you flow the second matrix through the array.
And you'll have to believe me, or maybe
spend a little bit more time with the animation,
by the time the gray dots have finished
flowing through those multiply-accumulators,
out of the right side come all the dot products that
make the resulting matrix.
So it's a one-shot operation.
There are no intermediate values to store anywhere, in memory,
in registers.
All the intermediate values flow on the wires
from one compute unit to the next.
It's very efficient.
And what is more, it's only made of
those tiny 16-bit multiply-accumulators,
which means that we can cram a lot of those into one chip.
128 by 128 is roughly 16,000 multiply-accumulators.
And that's how many you get in one TPU core, and twice that
in the two cores of a chip.
So this is what makes it dense.
Density means power efficiency.
And power efficiency in the data center means cost.
And of course, you want to know how cheap
or how fast these things are.
Some people might remember from last year
I did a talk about what I built, this planespotting model,
so I'm using this as a benchmark today.
And on Google Cloud's AI platform,
it's very easy to get different configurations,
so I can test how fast this trains.
My baseline is this: on a fast GPU, this model
trains in four and a half hours.
But I can also get 5 machines with powerful GPUs
in a cluster.
And on those five machines, five GPUs,
this model will train in one hour.
And I've chosen this number because one hour is exactly
the time it takes for this model to train on one TPU v2.
So the rule of thumb I want you to remember
is that roughly 1 TPU v2, with its 4 chips,
is roughly as fast as five powerful GPUs.
That's in terms of speed.
But as you can see, it's almost three times cheaper.
And that's the point of optimizing the architecture
specifically for neural network workloads.
You might want to know how this works in software as well.
So when you're using TensorFlow, or Keras in TensorFlow,
your TensorFlow Python code
generates a computational graph.
That is how TensorFlow works.
So your entire neural network is represented as a graph.
Now, this graph is what is sent to the TPU.
Your TPU does not execute Python code.
This graph is processed through XLA, the Accelerated Linear
Algebra compiler, and that is how
it becomes TPU microcode to be executed on the TPU.
And one nice side-effect of this architecture
is that if, in your TensorFlow code,
you load your data through the standard tf.data.Dataset
API, as you should, and as is required with TPUs, then
even the data loading part, or image resizing, or whatever
is in your data pipeline, ends up in the graph,
ends up executed on the TPU.
And the TPU will be pulling data from Google Cloud Storage
directly during training.
So that is very efficient.
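As an illustration, a typical input pipeline might look like the following minimal sketch; the bucket path, file pattern, and parsing function are hypothetical placeholders, not the ones used in the demo:

    import tensorflow as tf

    def make_dataset(batch_size):
        # List TFRecord shards in Cloud Storage (placeholder bucket/pattern).
        files = tf.data.Dataset.list_files("gs://my-bucket/train-*.tfrecord")
        ds = files.interleave(tf.data.TFRecordDataset, cycle_length=16)
        # parse_and_resize is a hypothetical function that decodes one record
        # and resizes the image; it would be defined elsewhere.
        ds = ds.map(parse_and_resize,
                    num_parallel_calls=tf.data.experimental.AUTOTUNE)
        # drop_remainder=True gives the fixed batch shapes that TPUs require.
        ds = ds.shuffle(2048).repeat().batch(batch_size, drop_remainder=True)
        return ds.prefetch(tf.data.experimental.AUTOTUNE)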
How do you actually write this with code?
So let me show you in Keras.
And one caveat, this is Keras in TensorFlow 1.14,
which should be out in the next few days.
The API is slightly different in TensorFlow 1.13
today, but I'd rather show you the one that will be--
the new one, as of tomorrow or next week.
So it's only a couple of lines of code.
There is the first line, TPUClusterResolver.
You can call it without parameters on most platforms,
and that finds the connected TPU.
The TPU is a remotely-connected accelerator.
This finds it.
You initialize the TPU and then you use the new distribution
API in TensorFlow to define a TPU strategy based on this TPU.
And then you say with strategy.scope,
and everything that follows is perfectly normal Keras code.
Then you define your model, you compile it,
you do model.fit, model.evaluate, model.predict,
anything you're used to doing in Keras.
So in Keras, it's literally these four lines of code
to add--
to work on a TPU.
And I would like to point out that these four lines of code
also transform your model into a distributed model.
Remember, a TPU, even a single TPU,
is a board with eight cores.
So from the get go it's distributed computing.
And these four lines of code put in place
all the machinery of distributed computing for you.
One parameter to notice.
You see in the TPU strategy, there
is the steps_per_run equals 100.
So that's an optimization.
This tells the TPU, please run 100 batches worth of training
and don't report back until you're finished.
Because it's a network attached accelerator,
you don't want the TPU to be reporting back
after each batch for performance reasons.
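Put together, those four lines plus an ordinary Keras model look roughly like this (a sketch against the TensorFlow 1.14 API as described; exact module paths differ slightly in 1.13, and the small Dense model is just a placeholder):

    import tensorflow as tf

    # 1. Find the network-attached TPU (no arguments needed on most platforms).
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
    # 2. Initialize the TPU system.
    tf.tpu.experimental.initialize_tpu_system(resolver)
    # 3. Define a distribution strategy; steps_per_run=100 means the TPU runs
    #    100 batches before reporting back, which hides network latency.
    strategy = tf.distribute.experimental.TPUStrategy(resolver, steps_per_run=100)

    # 4. Everything inside the scope is perfectly normal Keras code.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])

    # model.fit(...), model.evaluate(...), model.predict(...) work as usual.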
So this is the software.
If you want to write your own code,
I encourage you to do so.
But if you don't, we have a whole library
of TPU optimized models.
So you will find them on the TensorFlow/tpu GitHub
repository.
And there is everything in the image--
in the vision space, in the machine translation,
and language, and NLP space, in speech recognition.
You can even play with GAN models.
The one that we are demoing on stage,
remember we are training the model right now, is RetinaNet.
So this one is an object detection model.
And I like this model, so let me say a few words
about how this works.
In object detection, you put an image, and what you get
is not just the label--
this is a dog, this is a panda--
but you actually get boxes around where those objects are.
Object detection models come in two kinds.
There are one-shot detectors, which
are usually fast but somewhat inaccurate,
and two-stage detectors, which are much more
accurate but much slower.
And I like RetinaNet because they actually
found a trick to make this both the fastest
and the most accurate model that you can
find in object detection today.
And it's a very simple trick.
I'm not going to explain all the math behind it,
but basically in these detection models,
you start with candidate detections.
And then you prune them to find only the detections--
the boxes that have actual objects in them.
And the thing is that all those blue boxes that you see,
there is nothing in them.
So even during training, they will very easily
be classified as nothing to see, move along boxes,
with a fairly small error.
But you've got loads of them, which
means that when you compute the loss of this model, in the loss
you have a huge sum of very small errors.
And that huge sum of very small errors might in the end
be very big and overwhelm the useful signal.
So the two-stage detectors resolve that
by being much more careful about those candidate boxes.
In one-stage detectors, you start
with a host of candidate boxes.
And the trick they found in RetinaNet
is a little mathematical trick on the loss
to make sure that the contribution of all
those easy boxes stays small.
The upshot, it's both fast and accurate.
So let me go back here.
I actually want to say a word about now what I did, exactly,
when I launched this demo.
I guess most of you are familiar with the Google Cloud Platform.
So here I am opening the Google Cloud Platform console.
And in the Google Cloud Platform,
I have a tool called AI platform, which,
for those who know it, has had a facility for running training
jobs and for deploying models behind the REST API
for serving.
But there is a new functionality called Notebooks.
In AI Platform, you can today provision a ready-made, fully installed
notebook for working in--
yeah, so let me switch to this one--
for working either in TensorFlow
or in PyTorch, with GPUs.
It's literally a one click operation.
NEW INSTANCE, I want a TensorFlow instance
with Jupyter notebook installed, and what you get
is an instance that is running, with a link
to open Jupyter.
For example, this one-- and it will open Jupyter,
but it's already open.
So it's asking me to select something else, but it's here.
And here, you can actually work normally
in your Jupyter environment with a powerful accelerator.
You might have noticed that I don't have a TPU
option, actually not here, but here,
for adding an accelerator.
That's coming.
But here I am using Jupyter notebook instances
that are powered by a TPU v3 128-core pod.
How did I do it?
It's actually possible on the command line.
I give you the command line here.
There is nothing fancy about it.
There is one gcloud compute command line
to start the instance and a second gcloud compute command
line to start the TPU.
You provision a TPU just as you would a virtual machine
in Google's cloud.
So this is what I've done.
And that is what is running right now.
So let's see where we are.
Here, it's still running.
As you see in the logs, "enqueue next 100 batches."
And it's training.
We are step 4,000 out of 6,000 roughly.
So we'll check back on this demo at the end of the session.
While preparing this demo to run on stage,
I was also able to run a comparison of how fast
TPU v3s are versus v2s.
In theory, v3s are roughly twice as powerful as v2s,
but that only works if you feed them enough
work to make use of all the hardware.
So here on RetinaNet, you can train
on images of various sizes.
Of course, if you train on smaller images, 256
pixel images, it will be much faster,
in terms of images per second.
And I've tried both--
TPU v2s and v3s.
You see with small images, you get a little bump
in performance from TPU v3s, but nowhere near double.
But as you get to bigger and bigger images,
you are feeding the hardware with more work.
And on 640 pixel images, the speed up you get from TPU v3
is getting close to the theoretical x2 factor.
So for this reason, I am running this demo here
at the 512 pixel image size on a TPU v3 pod.
I'm talking about pods.
But what are these pods, exactly?
To show you more about TPU pods, I
would like to give the lectern to Kaz.
Thank you Kaz.
KAZ SATO: Thank you, Martin.
[APPLAUSE]
So in my part, I'd like to introduce Cloud TPU pods.
What are pods?
It's a large cluster of Cloud TPUs.
The version two pod is now available as public beta, which
provides 11.6 petaflops, with 512 TPU cores.
The next generation version three pod
is also public beta now, which achieves
over 100 petaflops with 2,048 TPU cores.
So those performance numbers are as high as those
of the greatest supercomputers.
So Cloud TPU pods are AI supercomputers
that Google has built from scratch.
But some of you might think, what's
the difference between a bunch of TPU instances and a Cloud
TPU pod?
The difference is the interconnect.
Google has developed ultra high-speed interconnect
hardware, derived from supercomputer technology,
for connecting thousands of TPUs with very low latency.
What does it do for you?
As you can see on the animation, every time you update
a single parameter on a single TPU,
that will be synchronized with all the other thousands
of TPUs, in an instant, by the hardware.
So in short, TensorFlow users can use the whole pod
as a single giant machine with thousands
of TPU cores inside it.
It's as easy as using a single computer.
And because it's an AI supercomputer,
you may wonder whether it also comes at a super high cost.
It does not.
You can get started with using TPU pods
with 32 cores at $24 per hour, without any initial cost.
So you don't have to pay millions of dollars
to build your own supercomputer from scratch.
You can just rent it for a couple of hours from the cloud.
A version three pod can also be provisioned with 32 cores.
That costs only $32 per hour.
For larger sizes, you can ask our service contact
for the pricing.
What is the cost benefit of TPU pods over GPUs?
Here's a comparison result. With a full version two pod,
with 512 TPU cores, you can train the same ResNet-50 model
27 times faster at 38% lower cost.
This shows the clear advantage of TPU pods
over typical GPU-based solutions.
And there are other benefits you could get from the TPU pods.
Let's take a look at eBay's case.
eBay has over 1 billion product listings.
And to make it easier to search specific products
from 1 billion products, they built a new visual search
feature.
And to train the models, they have used 55 million images.
So it's a really large scale training for them.
They used Cloud TPU pods, and eBay
was able to get 100 times faster training,
compared with their existing GPU-based service.
And they also got a 10% accuracy boost.
Why is that?
The TPU itself is not designed to increase accuracy
that much.
But if you can increase the training speed
10 or 100 times, the data
scientists or researchers can run 10 or 100 times more
trial iterations, such as trying out
different combinations of hyperparameters
or different preprocessing steps, and so on.
That ended up giving at least a 10% accuracy boost in eBay's case.
Let's see what kind of TensorFlow code
you would write to get those benefits from TPU pods.
And before taking a look at the actual code,
let's look back at the effort that was required,
in the past, to implement
large-scale distributed training.
Using many GPUs or TPUs for a single
training run is so-called distributed training.
And there are two ways.
One is data parallel and another is model parallel.
Let's talk about the data parallel first.
With data parallel, as you can see on the diagram,
you have to split the training data across the multiple GPU
or TPU nodes.
And all the nodes have to share the same parameter set, the model.
And to do that, you have to set up a cluster of GPUs or TPUs
by yourself.
And you also have to set up a parameter server that
shares all the parameter updates
among all the GPUs or TPUs.
So it's a complex setup.
And in many cases, there is going to be
synchronization overhead.
If you have hundreds or thousands
of TPUs or GPUs in a single cluster,
that overhead becomes huge,
and that limits the scalability.
But with TPU pods, the hardware takes care of it.
The high-speed interconnect synchronizes
every parameter update on a single TPU
with the other thousands of TPUs in an instant,
with very short latency.
So there's no need to set up a parameter server,
and no need to set up a large cluster of GPUs
by yourself.
And you get almost linear scalability
as you add more TPU cores to your training.
And Martin will show you the actual scalability
result later.
And as I mentioned earlier, TensorFlow users
can use the whole TPU pod as a single giant computer
with thousands of TPU cores inside it.
So it's as easy as using a single computer.
For example, if you have Keras code running on a single TPU,
it also runs on 2,000 TPU cores without any changes.
This is exactly the same code Martin showed earlier.
So under the hood, all the complexity of data-parallel
training, such as splitting the training
data across the multiple TPUs, or sharing
the same parameters, is all
taken care of by the TPU pods' interconnect,
the XLA compiler, and the new TPUStrategy
API in TensorFlow 1.14.
The one thing you may want to change is the batch size.
As Martin mentioned, a TPU core has a matrix processor
with a 128 by 128 array of multipliers.
So usually, you will get the best performance
by setting the batch size to 128 times the number of TPU
cores.
So if you have 10 TPU cores, that's going to be 1,280.
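In other words, the rule of thumb is simply the following (the core count here is just an illustrative number):

    # Rule of thumb: global batch size = 128 * number of TPU cores.
    PER_CORE_BATCH = 128                              # matches the 128x128 matrix unit
    tpu_cores = 10                                    # illustrative core count
    global_batch_size = PER_CORE_BATCH * tpu_cores    # 1,280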
The benefit of TPU pods is not only the training speed.
They also enable the training of giant models
by using Mesh TensorFlow.
Data parallel has been a popular way of doing distributed training,
but there's one downside.
It cannot train a big model.
Because all the parameters are shared across all the GPUs
or TPUs, you cannot train a model that doesn't fit
into the memory of a single GPU or TPU.
So there's another way of distributed training, called
model parallel.
With model parallel, you split a giant model
across multiple GPUs or TPUs so that you
can train much larger models.
But that has not been a popular way.
Why?
Because it's much harder to implement.
As you can see in the diagram, you
have to implement all the communication
between the fractions of the model.
It's like stitching the pieces of the model back together.
And again, you have to set up a complex cluster,
and in many cases, the communication
between the model pieces becomes a bottleneck.
If you have hundreds or thousands
of CPU, GPU, or TPU cores, that's
going to be a huge overhead.
Those are the reasons why model parallel has not
been so popular.
To solve those problems, the TensorFlow team
has developed a new library called Mesh TensorFlow.
It's a new way of doing distributed training
with multiple computing nodes,
such as TPU pods, multiple GPUs, or multiple CPUs.
Mesh TensorFlow provides an abstraction layer
that sees those computing nodes as a logical n-dimensional
mesh.
Mesh TensorFlow is now available as open source
code on the TensorFlow GitHub repository.
To see how it works, imagine
you have a simple neural network
like this for recognizing MNIST digits.
This network has a batch size of 512,
a data dimension of 784, one hidden layer
with 100 nodes, and an output of 10 classes.
And if you want to train that network with model
parallelism, you just tell Mesh TensorFlow,
I want to split the parameters across four TPUs,
and that's it.
You don't have to think about how
to implement the communication between the pieces of the split model,
or worry about the communication overhead.
What kind of code would you write?
Here is the code to use model parallelism.
First, you define the dimensions
of both the data and the model.
In this code, you define the batch dimension as 512,
the data dimension as 784,
the hidden layer as 100 nodes, and the output as 10 classes.
And then you define your network
by using Mesh TensorFlow APIs--two sets of weights,
one hidden layer, the logits, and a loss function--
using those dimensions.
Finally, you define how many TPUs or GPUs you have in the mesh,
and which layout rules you want to use.
In this code example, the layout rule maps the hidden layer
dimension across the mesh, splitting the model parameters across the four
TPUs.
And that's it.
Mesh TensorFlow takes this code
and automatically splits the model parameters across the four
TPUs.
And it shares the same training data with all the TPUs.
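For reference, here is a sketch of what that code looks like, closely following the MNIST example in the Mesh TensorFlow repository and adapted to the dimensions quoted here; the input tensors and the device list are illustrative (the library's example uses GPU devices, while a TPU pod uses its SIMD mesh implementation under the hood):

    import mesh_tensorflow as mtf
    import tensorflow as tf

    graph = mtf.Graph()
    mesh = mtf.Mesh(graph, "my_mesh")

    # Dimensions of the data and the model.
    batch_dim = mtf.Dimension("batch", 512)
    io_dim = mtf.Dimension("io", 784)
    hidden_dim = mtf.Dimension("hidden", 100)
    classes_dim = mtf.Dimension("classes", 10)

    # Bring ordinary TF tensors into the mesh (tf_images / tf_labels
    # would come from your input pipeline).
    images = mtf.import_tf_tensor(mesh, tf_images, shape=[batch_dim, io_dim])
    labels = mtf.import_tf_tensor(mesh, tf_labels, shape=[batch_dim])

    # Two sets of weights, one hidden layer, logits, and a loss function.
    w1 = mtf.get_variable(mesh, "w1", [io_dim, hidden_dim])
    w2 = mtf.get_variable(mesh, "w2", [hidden_dim, classes_dim])
    hidden = mtf.relu(mtf.einsum([images, w1], output_shape=[batch_dim, hidden_dim]))
    logits = mtf.einsum([hidden, w2], output_shape=[batch_dim, classes_dim])
    loss = mtf.reduce_mean(mtf.layers.softmax_cross_entropy_with_logits(
        logits, mtf.one_hot(labels, classes_dim), classes_dim))

    # Model parallelism: split the "hidden" dimension across four devices.
    devices = ["gpu:0", "gpu:1", "gpu:2", "gpu:3"]
    mesh_shape = [("all_processors", 4)]
    layout_rules = [("hidden", "all_processors")]
    mesh_impl = mtf.placement_mesh_impl.PlacementMeshImpl(
        mesh_shape, layout_rules, devices)
    lowering = mtf.Lowering(graph, {mesh: mesh_impl})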
You can also combine both data and model parallelism.
For example, you can define a 2D mesh like this,
and use the rows of the mesh for data parallelism
and the columns of the mesh for model parallelism,
so that you get the benefits of both.
And again, it's easy to define with Mesh TensorFlow.
You just map the batch dimension
to the rows and the hidden layer dimension to the columns.
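In Mesh TensorFlow terms, that 2D mesh is just a different mesh shape and layout; here is an illustrative 2x4 mesh of eight devices:

    # Rows of the mesh carry data parallelism (split the "batch" dimension),
    # columns carry model parallelism (split the "hidden" dimension).
    mesh_shape = [("rows", 2), ("cols", 4)]
    layout_rules = [("batch", "rows"), ("hidden", "cols")]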
This is an example of using Mesh
TensorFlow to train a Transformer model.
The Transformer is a very popular language model,
and I won't go deeper into it here.
But as you can see, it's easy to map
each layer of a Transformer model
to the layout rules of Mesh TensorFlow,
so that you can efficiently spread large data and a large model
across hundreds or thousands of TPU cores
by using Mesh TensorFlow.
So what's the benefit?
By using Mesh TensorFlow running on TPU pods,
the Google AI team was able to train language models
and translation models at the billion-word scale.
And they were able to achieve state-of-the-art scores,
as you can see from those numbers.
For those use cases, the larger the model, the better
the accuracy you get.
Model parallelism on TPU pods gives a big advantage
in achieving those state-of-the-art scores.
Let's take a look at another use case of large-scale model
parallelism, called BigGAN.
I won't go deeper into what a GAN is or how it works.
But here's the basic idea.
You have two networks.
One is called the discriminator, D, and the other
is called the generator, G. And you define a loss function
so that D is trained to recognize whether an image is
a fake image or a real image.
And at the same time, the generator is trained
to generate realistic images so that D cannot tell
they are fake.
It's like a minimax game played between those two
networks.
And eventually, you end up with a generator G
that can generate photo-realistic fake images,
artificial images.
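For reference, the minimax game being described is the standard GAN objective (BigGAN itself trains with a hinge-loss variant, but the idea is the same):

    \min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
        + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]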
Let's take a look at the demo video.
So, a small spoiler:
I have already loaded the BigGAN model
that was trained on a TPU pod.
And as you can see, these are all
artificial, synthesized images of high quality.
You can also specify the category
of the generated images, such as ostrich,
so that you generate ostrich images.
These are all synthesized artificial images.
None of them are real.
And because BigGAN has a so-called latent space that
holds the seeds used to generate those images,
you can interpolate between two seeds.
In this example, it is interpolating
between a golden retriever and a Lhasa Apso.
And you can try out different combinations
for the interpolation, such as a West Highland White Terrier
and a golden retriever.
Again, those are all fake images.
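Conceptually, that interpolation is just a straight line between two latent seeds; here is a minimal sketch, where generator() stands in for the trained BigGAN generator and 128 is an assumed latent size:

    import numpy as np

    z0 = np.random.randn(128)          # seed for the first image
    z1 = np.random.randn(128)          # seed for the second image
    for t in np.linspace(0.0, 1.0, num=8):
        z = (1.0 - t) * z0 + t * z1    # linear interpolation in latent space
        image = generator(z)           # hypothetical call to the generator G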
So this BigGAN model was trained on a TPU version three
pod with 512 cores.
And that took 24 to 48 hours.
Why does BigGAN take so many TPU cores and so much time?
The reasons are the model size and the batch size.
The quality of a GAN model
is measured by the Inception Score, or IS.
That represents how strongly an Inception model
thinks those images are real.
The BigGAN paper says that you get
a better IS score when you have
more parameters in the model and when
you use a larger batch size for training.
That means large-scale model parallelism
on hundreds of TPU cores is crucial for the BigGAN model
to increase the quality of the generated images.
So we have seen two use cases:
the BigGAN use case and the language model use cases.
Those are the first applications of model
parallelism on TPU pods.
But they are only the start.
TPU pods are available to everyone from now on,
so we expect to see more and more exciting
use cases coming from new TPU pod users
and their applications.
So that's it for my part.
Back to Martin.
MARTIN GORNER: So now it's time to check on our demo.
Did our model actually train?
Checking here, yeah, it looks like it has finished training.
A saved model has been saved.
So the only thing left to do is
to verify that this model can actually predict something.
So on a second machine I will reload the exact same model.
OK.
I believe that's the one.
And let's go and reload it.
So I'll skip training this time and just go here
to inference and loading.
Whoops, sorry about that.
I just hope the demo gods will be with me today.
All right.
That's because I'm loading the wrong directory.
The demo gods are almost with me.
It's this one where my model has been saved.
All right.
Yes.
Indeed.
It wasn't the same.
Sorry about that.
No training, just inference.
And this time, it looks like my model is loading.
And once it's loaded, I will see if it can actually
detect animals in images, and here we are.
So this leopard is actually a leopard.
This bird is a bird.
The lion is a lion.
This is a very tricky image.
These are not cherry-picked images.
This is a model I have trained on stage, here with you.
No model is perfect.
We will see bad detections, like this one.
But that's a tricky one.
It's artwork.
It's not an actual lion.
The leopard is spot on.
The lion is spot on.
And see that the boxing actually works very well.
The leopard has been perfectly identified in the image.
So let's move to something more challenging.
Even this inflatable artwork lion has been identified,
which is not always the case.
This is a complicated image--
a flock of birds.
So you see it's not seeing all of them.
But all of them at least are birds,
which is a pretty good job.
The leopard is fine.
Oh, and this is the most complex we have.
There is a horse and cattle.
Well, we start seeing a couple of bad detections here.
Of course, that cow is not a pig.
As I said, no model is perfect.
But here the tiger is the tiger, and we
have our two cute pandas.
And those two cute pandas are actually quite difficult,
because those are baby pandas.
And I don't believe that this model
has had a lot of baby animals in its 300,000 images data set.
So I'm quite glad that it managed to find the two pandas.
So moving back, let me finish by giving you
a couple of feeds and speeds on those models.
So here, this model has a ResNet-50 backbone,
plus all the detection layers that produce the boxes.
And we have been training it on a TPU v3 pod with 128 cores.
It did finish in 20 minutes.
You don't have to just believe me for that.
Let me show you.
Here I had a timer in my script.
Yep, 19 minutes and 18 seconds.
So I'm not cheating.
This was live.
But I could also have run this model on a smaller pod.
Actually, I tried on a TPU v2-32.
On this chart, you see the speed on this axis
and the time on this axis.
This is to show you that a TPU v2-32 is actually
a very useful tool to have.
We've been talking about huge models up to now.
But it's debatable whether this is a huge model.
This definitely was a huge model a year ago.
Today, with better tools, I can train it
in an hour on a fairly modest TPU v2 32-core pod.
So even as an individual data scientist,
that is a very useful tool for me to have handy when
I need to do a round of trainings on a model like this,
because someone wants an animal detection model.
And bringing the training down to the one hour
space, or 20 minutes space, allows
me to work a lot faster and iterate a lot faster
on the hyperparameters, on the fine-tuning, and so on.
You see on a single TPU v3, it's the bottom line.
And if we were to train this on a GPU--
so remember our rule of thumb from the beginning.
One TPU v2, roughly five GPUs.
Therefore 1 TPU v3, roughly 10 GPUs.
So the GPU line would be one tenth of the lowest
line on this graph.
I didn't put it because it would barely register there.
That shows you the change of scale
at which you can be training your models using TPUs.
You might be wondering about this.
So as you scale, one thing that might happen
is that you have to adjust your learning rate schedule.
So this is actually the learning rate schedule
I have used to train the model on the 128 core TPU pod.
Just a couple of words, because it might not
be the most usual learning rate schedule you have ever seen.
There is a ramp-up at the beginning.
The second part is exponential decay.
That's fairly standard.
But the ramp up part, that is because we
are starting from ResNet-50, initialized
with pre-trained weights.
But we still leave those weights trainable.
So we are training the whole thing.
It's not transfer learning.
It's just fine tuning of pre-trained ResNet-50.
And when you do that, and you train
very fast, using big batches, as we do here,
the batch size here is 64 times 128.
So it's a very big batch size.
You might actually break those pre-trained weights
in ways that harm your precision.
So that's why it's quite usual to have a ramp up period
to make sure that the network, in its initial training
phases, when it doesn't know what it's doing,
does not completely destroy the information
in the pre-trained weights.
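Schematically, a warm-up-then-decay schedule like the one on the slide can be written as follows; the constants are illustrative, not the ones used in the demo:

    def learning_rate(step, base_lr=0.01, warmup_steps=500,
                      decay_steps=3000, decay_rate=0.5):
        if step < warmup_steps:
            # Linear ramp-up protects the pre-trained ResNet-50 weights
            # while the detection head is still untrained.
            return base_lr * step / warmup_steps
        # Standard exponential decay afterwards.
        return base_lr * decay_rate ** ((step - warmup_steps) / decay_steps)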
So we did it.
We did train this model here on stage in 20 minutes.
And the demo worked, I'm really glad about that.
So this is the end.
What we have seen is TPUs and TPU pods.
Fast, yes.
But above all, cost effective.
A very cost-effective option and a good tool
to have for any data scientist.
Not only for very large models,
but also for what used to be large models in the past
and which are normal models today, such as a ResNet-50
[INAUDIBLE].
They're very useful tools.
And then Cloud TPU pods, where you can actually
enable not only data, but model parallelism,
using this new library called Mesh TensorFlow.
A couple of links here with more information
if you would like to know more.
Yes, you can take a picture.
And if you have more questions, we
will be at the AI ML pod, the red one,
in front of one TPU rack.
So you can see this one live and get
a feel for what kind of computer it is.
And with that, thank you very much.
[APPLAUSE]
[MUSIC PLAYING]
