[MUSIC PLAYING] MARTIN GORNER: Hello. Hi, everyone. So thank you for coming in such great numbers to this TensorFlow session. Apologies, it's quite late in the afternoon. I will need all your brains for this session because today, I want to build a neural network with you. So no, I don't need your actual brains to build on, no brain surgery in this session. But it's a crash course to get developers up to speed on machine learning and deep learning and neural networks. So I need all your attention. The dataset we will be using is a very classical one. It's this one here, hand-written digits. Academia has been working on this dataset for the past 20 years. If you go to the website where it's hosted, you will actually see 20 years of research papers, and that's what we will do together today. We'll work on this dataset, trying to build a network that recognizes these hand-written digits, from the simplest possible network all the way to 99% accuracy. So let's start. Just a question beforehand. Who has done some work with neural networks before? Oh, wow. OK. Quite a few people. So feel free to help me. I hope this will not be too basic for you, and I hope it will at least be a good introduction to TensorFlow. But if you have never done anything with neural networks, that's fine, and I will explain everything from the start. So this is the simplest possible neural network we can imagine to recognize our hand-written digits. The digits come as 28 by 28 pixel images, and the first thing we do is flatten all those pixels into one big vector of pixels; these will be our inputs. Now, we will use exactly 10 neurons. The neurons are the white circles. What a neuron does is always the same thing. A neuron does a weighted sum of all of its inputs, here the pixels. It adds another constant that is called a bias. That's just an additional degree of freedom. And then it feeds this sum through an activation function. That is just a function: number in, transform, number out. We will see several of those activation functions, and the one thing they have in common in neural networks is that they are non-linear. So why 10 neurons? Well, simply because we are classifying those digits into 10 categories. We are trying to recognize a zero, a one, a two, on to the nine. So what we are hoping for here is that one of those neurons will light up and tell us, with a very strong output, that it has recognized an eight here. All right. And for that, since this is a classification problem, we are going to use a very specific activation function, one that researchers tell us works really well on classification problems. It's called softmax, and it's simply a normalized exponential. So what you do is that you compute all those weighted sums, then you raise each one to the exponential. And once you have your 10 exponentials, you compute the norm of this vector and divide the vector by its norm, so that you get values between zero and one. And those values, you will be able to interpret them as probabilities: the probability of this being an eight, a one, or something else. You will be asking, which norm? Any norm, doesn't matter -- the length of the vector. You pick your favorite norm. There are several. Usually, for softmax, we use L1, but L2, which is the Euclidean norm, would work just as well. So what does softmax do, actually? You see, it's an exponential, so it's a very steeply increasing function.
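To make that concrete, here is a minimal NumPy sketch of softmax applied to one vector of ten weighted sums. This is my own illustration, not code from the talk; the input values and the max-subtraction trick for numerical stability are assumptions added for the example.

```python
import numpy as np

def softmax(weighted_sums):
    # Raise each weighted sum to the exponential, then normalize so the
    # ten outputs sum to one and can be read as probabilities.
    exps = np.exp(weighted_sums - np.max(weighted_sums))  # subtract max for stability
    return exps / np.sum(exps)

weighted_sums = np.array([1.2, -0.3, 0.5, 2.8, 0.0, -1.1, 0.7, 0.2, 3.1, -0.4])
print(softmax(weighted_sums))  # the largest sums dominate after exponentiation
```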
It will pull the data apart, increase the differences, and when you normalize the whole vector, you usually end up with one of the values being very close to one and all the other values being very close to zero. So it's a way of pulling the winner out on top without actually destroying the information. All right. So now we need to formalize this using a matrix multiply. I will remind you of what a matrix multiply is, but we will do it not for one image; we are going to do this for a batch of 100 images at a time. So what we have here in my matrix is 100 images, one image per line. The images are flattened, all the pixels on one line. So I take my matrix of weights -- for the time being, I don't know what these weights are, it's just weights, so I'm doing weighted sums -- and I start the matrix multiplication. So I do a weighted sum of all the pixels of the first image. Here it is. And then if I continue this matrix multiply using the second column of weights, I get a weighted sum of all the pixels of the first image for the second neuron, and then for the third neuron, and the fourth, and so on. What is left is to add the biases, just an additional constant. Again, we don't know what it is for the time being. And there is one bias per neuron, that's why we have 10 biases. And now if I continue this matrix multiply, I'm going to obtain these weighted sums for the second image, and the third image, and so on, until I have processed all my images. I would like to write this as a simple formula there. You see there is a problem: X times W, that's a matrix of 10 columns by 100 images, and I have only 10 biases. I can't simply add them together. Well, never mind. We will redefine addition, and it's OK if everybody accepts it. And actually, people have already accepted it. It's called a broadcasting add, and that's the way you do additions in NumPy, for instance, which is the numerical library for Python. The way a broadcasting add works is that if you're trying to add two things which don't match, not the same dimensions, you can't do the addition, so you replicate the small one as much as needed to make the sizes match, and then you do the addition. That's exactly what we need to do here. We have only those 10 biases, and it's the same biases on all the lines. We just need to replicate this bias vector on all the lines, and that's exactly what this generalized broadcasting add does. So we will just write it as a plus. And this is where I wanted to get to. I want you to remember this as the formula describing one layer in a neural network. So let's go through this again. In X, we have a batch of images, 100 images, all the pixels of one image on one line. In W, we have all of our weights for the 10 neurons, all the weights in the system. X times W, those are all of our weighted sums. We add the biases, and then we feed this through our activation function, in this case softmax. The way it works is line by line: we take the 10 values of a line, raise them to the exponential, normalize the line; next line, 10 values, raise them to the exponential, normalize the line, and so on. So what we get in the output is, for each image, 10 values which look like probabilities and which are our predictions. So, of course, we still don't know what those weights and biases are, and that's where the trick is in neural networks. We are going to train this neural network so that it figures out the correct weights and biases by itself. Well, this is how we write this in TensorFlow.
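Here is a hedged reconstruction of that one-layer formula in TensorFlow 1.x, slightly simplified: it assumes the images arrive already flattened to 784 pixels, and the zero initializations and variable names are my own choices for the illustration.

```python
import tensorflow as tf

X = tf.placeholder(tf.float32, [None, 784])  # a batch of flattened 28x28 images
W = tf.Variable(tf.zeros([784, 10]))         # one column of weights per neuron
b = tf.Variable(tf.zeros([10]))              # one bias per neuron

# matmul computes all the weighted sums for the whole batch at once;
# the broadcasting add replicates the 10 biases on every line;
# softmax then normalizes each line into 10 probability-like values.
Y = tf.nn.softmax(tf.matmul(X, W) + b)
```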
You see, not very different. OK. TensorFlow has this nn library for neural networks, which has all sorts of very useful functions for neural networks, for example, softmax and so on. So let's go train. When you train, you've got images, but you know what those images are. So you initialize your weights and biases at random values, and your network will output some probabilities. Since you know what this image is, you can tell it that it's not this, it should be that. So that is called a one-hot encoded vector. It's a not very fancy way of encoding numbers. Basically, here are our numbers from zero to nine. We encode them as 10 bits, all at zero and just one of them is a one, at the index of the number we want to encode. Here, a six. Why? Well, because then it's in the same shape as our predictions, and we can compute a distance between those two. So again, many ways of computing distances. The Euclidean distance, the usual distance, sum of differences squared, would work, not a problem. But scientists tell us that for classification problems, this distance, the cross entropy, works slightly better. So we'll use this one. How does it work? It's the sum across the vectors of the values on the top multiplied by the logarithms of the values on the bottom, and then we add a minus sign because all the values on the bottom are less than one, so all the logarithms are negative. So that's the distance. And of course, we will tell the system to minimize the distance between what it thinks is the truth and what we know to be true. So this we will call our error function, and the training will be guided by an effort to minimize this error function. So let's see how this works in practice. In this little visualization, I'm showing you, over there, my training images. You see it's training, so you see these batches of 100 training images being fed into the system. On a white background, you have the images that have already been correctly recognized by the system. On a red background, images that are still missed. Then, on the middle graph, you see our error function, computed both on the training dataset and on a set of images which we have never seen during training, kept aside for testing. Of course, if you want to test the real-world performance of your neural network, you have to do this on a set of images which you have never seen during training. So here we have 60,000 training images, and I set aside 10,000 test images, which you see in the bottom graph over there. They are a bit small. You see only 1,000 of them here. So imagine, there are nine more screens of pictures like that. But I sorted all the badly recognized ones to the top. So you see all the ones that have been badly recognized, and below are nine screens of correctly recognized images, here after 2,000 rounds of training. So there is a little scale on the side here. It shows you that it's already capable of recognizing 92% of our images with this very simple model, just 10 neurons, nothing else. And that's what you get on the top graph, the accuracy graph, as well. That's simply the percentage of correctly recognized images, both on test and training data. So what else do we have? We have our weights and biases. Those two diagrams are simply percentiles, so they show you the spread of all the weights and biases. And that's just useful to see that they are moving. They both started at zero and they took some values, for the weights between one and minus one, for the biases between two and minus two.
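Going back to the cross-entropy distance described a moment ago, here is how it reads in plain NumPy for a single image. This is my own illustration, not the talk's code; the example prediction values are made up for the sketch.

```python
import numpy as np

# One-hot encoded known label: this image is a six.
label = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0], dtype=np.float32)

# Predictions coming out of softmax: ten values between 0 and 1 that sum to 1.
prediction = np.array([0.01, 0.02, 0.01, 0.03, 0.02, 0.05, 0.80, 0.02, 0.02, 0.02])

# Cross entropy: minus the sum of label[i] * log(prediction[i]).
cross_entropy = -np.sum(label * np.log(prediction))
print(cross_entropy)  # small when the network puts high probability on the true class
```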
It's helpful to keep an eye on those weight and bias diagrams and see that we are not diverging completely. So that's the training algorithm. You give it training images, it gives you a prediction, you compute the distance between the prediction and what you know to be true, and you use that distance as an error function to guide a mechanism that will drive the error down by modifying weights and biases. So now let's write this in TensorFlow. And I'll get more explicit about exactly how this training works. So we need to write this in TensorFlow. The first thing you do in TensorFlow is define variables and placeholders. A variable is a degree of freedom of our system, something we are asking TensorFlow to compute for us through training. So in our case, those are our weights and biases. And we will need to feed in training data. So for this data that will be fed in at training time, we define a placeholder. You see here, X is a placeholder for our training images. Let's look at the shape in brackets. What you have is the shape of this multidimensional matrix, which we call a tensor. So the first dimension is None. It says, I don't know yet; this will be the number of images in a batch, and it will be determined at training time. If we give 100 images, this will be 100. Then 28 by 28 is the size of our images, and one is the number of values per pixel. So that's not useful at all here because we are handling grayscale images; I just put it there in case you wanted to handle color images, in which case that would be three values per pixel. So OK. We have our placeholders, we have our variables, now we are ready to write our model. So that line you see on the top is our model. It's what we have determined to be the line representing one layer of a neural network. The only change is that reshape operation. You remember, our images come in as 28 by 28 pixel images, and we want to flatten them into one big vector of pixels. So that's what reshape does. 784 is 28 by 28. It's all the pixels on one line. All right. I need a second placeholder for the known answers, the labels of my training images, labels like: this is a one, this is a zero, this is a seven, this is a five. And now that I have my predictions and my known labels, I'm ready to compute my error function, which is the cross entropy, using the formula we've seen before. So: the sum across the vector of the elements of the labels multiplied by the logarithms of the elements of the predictions. So now I have my error function. What do I do with it? What you have on the bottom, I won't go into that. That is simply the computation of the percentage of correctly recognized images. You can skip that. OK. Now we get to the actual heart of what TensorFlow will do for you. So we have our error function. We pick an optimizer. There is a full library of them. They have different characteristics. And we ask the optimizer to minimize our error function. So what is this going to do? When you do this, TensorFlow takes your error function and computes the partial derivatives of that error function relative to all the weights and all the biases in the system. That's a big vector because there are lots of weights and lots of biases. How many? W, the weights, is a variable of almost 8,000 values. So this vector we get is what mathematicians call a gradient. And the gradient has one nice property. Who knows what the nice property of the gradient is? It points-- Yeah. Almost. It points up; we add a minus sign and it points down, exactly. Down in which space?
We are in the space of all the weights and all the biases, and the function we are computing is our error function. So when we say down in this space, it means it gives us a direction, in the space of weights and biases, in which to modify our weights and biases in order to make our error function smaller. So that is the training. You compute this gradient and it gives you an arrow. You take a little step along this arrow. Well, you are in the space of weights and biases, so taking a little step means you modify your weights and biases by this little delta, and you get to a location where the error is now smaller. Well, that's fantastic. That's exactly what you want. Then you repeat this using a second batch of training images. And again, using a third batch of training images, and so on. So it's called gradient descent because you follow the gradient to head down. And so we are ready to write our training loop. There is one more thing I need to explain to you about TensorFlow. TensorFlow has a deferred execution model. So everything we wrote up to now, all the tf dot something commands here, does not actually produce values when it is executed. It builds a graph, a computation graph, in memory. Why is that important? Well, first of all, this derivation trick here, the computation of the gradient, that is actually a formal derivation. TensorFlow takes the formula that you give it to define your error function and does a formal derivation on it. So it needs to know the full graph of how you computed this to do that formal derivation. And the second thing it will use this graph for is that TensorFlow is built for distributed computing. And there, as well, to distribute a graph on multiple machines, it helps to know what the graph is. OK. So this is all very useful, but it means for us that we have to go through an additional loop to actually get values from our computations. The way you do this in TensorFlow is that you define a session, and then in the session, you call sess.run on one edge of your computation graph. That will give you actual values, but of course, for this to work, you have to fill in all the placeholders that you have defined with real values. So for this to work, I will need to fill in the training images and the training labels for which I have defined placeholders. And the syntax is simply the train_data dictionary there. You see, the keys of the dictionary, X and Y underscore, are the placeholders that I have defined. And then I can sess.run my training step. I pass in this training data, and that is where the actual magic happens. Just a reminder, what is this training step? Well, it's what you got when you asked the optimizer to minimize your error function. So the training step, when executed, is actually what computes this gradient using the current batch of training images and labels, and follows it a little to modify the weights and biases and end up with better weights and biases. I said a little. I come back to this. What is that learning rate over there? Well, I can't make a big step along the gradient. Why not? Imagine you're in the mountains, and you know where down is. We have senses for that. We don't need to derive anything. We know where down is. And you want to reach the bottom of the valley. Now, if every step you make is a 10-mile step, you will probably be jumping from one side of the valley to the other without ever reaching the bottom.
So if you want to reach the bottom, even if you know where down is, you have to make small steps in that direction, and then you will reach the bottom. Same here: when we compute this gradient, we multiply it by this very small value so as to take small steps and be sure that we are not jumping from one side of the valley to the other. All right. So let's finish our training. Basically, in a loop, we load a batch of 100 training images and labels. We run this training step, which adjusts our weights and biases. And we repeat. All the rest of the stuff on the bottom is just for display. I'm computing the accuracy and the cross entropy on my training data and again on my test data so that I can show you four curves over there. It is just for display. It has nothing to do with the training itself. All right. So that was it. That's the entire code, here on one slide. Let's go through this again. At the beginning, you define variables for everything that you want TensorFlow to compute for you. So here, our weights and biases. You define placeholders for everything that you will be feeding in during training, namely our images and our training labels. Then you define your model. Your model gives you predictions. You can compare those predictions with your known labels, compute the distance between the two, which is the cross entropy here, and use that as an error function. Then you pick an optimizer and you ask the optimizer to minimize your error function. That does all the gradient computation and so on; it gives you a training step. And now, in a loop, you load a batch of images and labels, you run your training step, and you do this in a loop, hoping this will converge, and usually it does. You see here, it did converge, and with this approach, we got 92% accuracy. Small recap of all the ingredients we put in our pot so far. We have a softmax activation function. We have the cross entropy as an error function. And we did this mini-batching thing where we train on 100 images at a time, do one step, and then load another batch of images. So is 92% accuracy good? No, it's horrible. Imagine you're actually using this in production, I don't know, in the post office, you're decoding zip codes. 92%: out of 100 digits, you have eight bad values? No, not usable in production. Forget it. So how do we fix it? Well, deep learning. We'll go deep. We can just stack those layers. How do we do that? Well, it's very simple. Look at the top layer of neurons. It does what we just did. It computes weighted sums of pixels. But we can just as easily add a second layer of neurons that will compute weighted sums of all the outputs of the first layer. And that's how you stack layers to produce a deep neural network. Now we are going to change our activation function. We keep softmax for the output layer because softmax has these nice properties of pulling a winner apart and producing numbers between zero and one. But for the rest, we use a very classical activation function in neural networks. It's called the sigmoid, and it's basically the simplest possible continuous function that goes from zero to one. OK. All right. Let's write this model. So we have now one set of weights and one set of biases per layer. That's why we have five pairs here. And our model will actually look very familiar to you. Look at the first line. It's exactly what we have seen before for one layer of a neural network.
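A hedged reconstruction of what such a five-layer sigmoid model might look like in TensorFlow 1.x. The layer sizes (200, 100, 60, 30) and the truncated-normal initialization are illustrative assumptions, not necessarily what is on the slide.

```python
import tensorflow as tf

X = tf.placeholder(tf.float32, [None, 28, 28, 1])  # batch of grayscale images
XX = tf.reshape(X, [-1, 784])                      # flatten the pixels

# Five pairs of weights and biases, one pair per layer (sizes are assumptions).
K, L, M, N = 200, 100, 60, 30
W1 = tf.Variable(tf.truncated_normal([784, K], stddev=0.1))
B1 = tf.Variable(tf.zeros([K]))
W2 = tf.Variable(tf.truncated_normal([K, L], stddev=0.1))
B2 = tf.Variable(tf.zeros([L]))
W3 = tf.Variable(tf.truncated_normal([L, M], stddev=0.1))
B3 = tf.Variable(tf.zeros([M]))
W4 = tf.Variable(tf.truncated_normal([M, N], stddev=0.1))
B4 = tf.Variable(tf.zeros([N]))
W5 = tf.Variable(tf.truncated_normal([N, 10], stddev=0.1))
B5 = tf.Variable(tf.zeros([10]))

# Each line: weighted sums plus bias, fed through the activation function.
Y1 = tf.nn.sigmoid(tf.matmul(XX, W1) + B1)
Y2 = tf.nn.sigmoid(tf.matmul(Y1, W2) + B2)
Y3 = tf.nn.sigmoid(tf.matmul(Y2, W3) + B3)
Y4 = tf.nn.sigmoid(tf.matmul(Y3, W4) + B4)
Y  = tf.nn.softmax(tf.matmul(Y4, W5) + B5)  # softmax only on the output layer
```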
Now what we do with the output, Y1, is that we use it as the input of the second line, and so on; we chain those. It's just that on the last line, the activation function we use is the softmax. So that's all the changes we made. And we can try to run this again. So oops. This one. Run. Run. Run. And it's coming. Well, I don't like this slope here. It should be shooting up really sharply. It's a bit slow. Actually, I have a solution for that. I lied to you when I said that the sigmoid was the most widely used activation function. That was true in the past, but today, people have invented a new activation function, which is called the ReLU, and this is a ReLU. It's even simpler. It's just zero for all negative values and the identity for all positive values. Now, this actually works better. It has lots of advantages. Why does it work better? We don't know. People tried it, it worked better. [LAUGHTER] I'm being honest here. If you had a researcher here, he would fill your head with equations and prove it, but he would have done those equations after the fact. People already tried it, it worked better. Actually, they got inspiration from biology. It is said, I don't know if it is true, but I heard that the sigmoid was the preferred model of biologists for our actual biological neurons, and that today, biologists think that neurons in our head work more like this. And the guys in computer science got inspiration from that, tried it, and it works better. How much better? Well, this is just the beginning of the training. This is what we get with our sigmoids, just 300 iterations, so just the beginning. And this is what we get with ReLUs. Well, I prefer this. The accuracy shoots up really sharply. The cross entropy goes down really sharply. It's much faster. And actually, here on this very simple problem, the sigmoid would have recovered, it's not an issue, but in very deep networks, sometimes with the sigmoid, you don't converge at all. And the ReLU solves that problem to some extent. So the ReLU it is for most of our layers. OK. So now let's train. Let's do this for 10,000 iterations, five layers, look at that: 98% accuracy. First of all, oh, yeah. We went from 92 to 98 just by adding layers. That's fantastic. But look at those curves. They're all messy. What is all this noise? Well, when you see noise like that, it means that you are going too fast. You're actually jumping from one side of the valley to the other, without ever reaching the bottom of your error function. So we have a solution for that, but it's not just to go slower, because then you would spend 10 times more time training. The solution, actually, is to start fast and then slow down as you train. It's called learning rate decay. We usually decay the learning rate on an exponential curve. So yes, I hear you. It sounds very simple, just this little trick, but let me play you the video of what this does. It's actually quite spectacular. So it's almost there. I should have the end of it on a slide. Yeah, that's it. So this is what we had using a fixed learning rate, and just by switching to a decaying learning rate, look, it's spectacular. All the noise is gone. Just with this little trick -- really, this is not rocket science, it's just going slightly slower towards the end -- all the noise is gone. And look at the blue curve, the training accuracy curve. Towards the end, it's stuck at 100%. So here, for the first time, we built a neural network that was capable of learning our entire training set perfectly.
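As an aside on that learning rate decay: in TensorFlow 1.x it is typically a single call. A minimal sketch follows; the start rate, the floor, and the decay speed here are assumed values for illustration, not the talk's exact numbers.

```python
import math
import tensorflow as tf

# Current training iteration, fed in alongside the images at each step.
step = tf.placeholder(tf.int32)

# Decay the learning rate on an exponential curve: start fast (0.003),
# slow down as training progresses, never going below a small floor (0.0001).
lr = 0.0001 + tf.train.exponential_decay(0.003, step, 2000, 1 / math.e)

# The decaying rate is then handed to the optimizer, for example:
#   train_step = tf.train.AdamOptimizer(lr).minimize(cross_entropy)
```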
The network doesn't make one single mistake in the entire training set, which doesn't mean that it's perfect in the real world. As you see on the test dataset, it has 98% accuracy. But, well, it's something. We got 100% at least on the training data. All right. So we still have something that is a bit bizarre. Look at those two curves. This is our error function. So the blue curve, the training error function, that is what we minimize. OK? So as expected, it goes down. And the error function computed on our test data, at the beginning, well, it follows. That's quite nice. And then it disconnects. So this is not completely unexpected, you know. We are minimizing the training error function. That's what we are actively minimizing. We are not doing anything at all on the test side. It's just a byproduct of the way neural networks work that the training you do on your training data actually carries over to your test data, to the real world. Well, it carries over or it doesn't. So as you see here, until some point, it does, and then there is a disconnect, and it doesn't carry over anymore. You keep optimizing the error on the training data, but it has no positive effect on the test performance, the real-world performance, anymore. So if you see curves like this, you take the textbook, you look it up, it's called overfitting. You look at the solutions, and they tell you: overfitting, you need regularization. OK. Let's regularize. What regularization options do we have? My preferred one is called dropout. It's quite dramatic. You shoot the neurons. No, really. So this is how it works. You take your neural network and pick a probability, let's say 50%. At each training iteration, you will shoot, physically remove from the network, 50% of your neurons. Do the pass, then put them back; next iteration, again, randomly shoot 50% of your neurons. Of course, when you test, you don't test with a half brain-dead neural network; you put all the neurons back. But that's what you do for training. So in TensorFlow, there is a very simple function to do that, which is called dropout, that you apply at the outputs of a layer. What it does is that it takes the probability, and in the output of that layer, it will randomly replace some values by zeros. And, small technicality, it will actually boost the remaining values proportionally so that the average stays constant; that's a technicality. So why does shooting neurons help? Well, first of all, let's see if it helps. So let's recap all the tricks we tried to play with our neural network. This is what we had initially with our five layers using the sigmoid as an activation function. The accuracy got up to 97.9% using five layers. So first, we replaced the sigmoid by the ReLU activation function. You see, it's faster to converge at the beginning, and we actually gained a couple of fractions of a percent of accuracy. But we have these messy curves. So we train slower, using the exponential learning rate decay, and we get rid of the noise, and now we are stable at or above 98% accuracy. But we have that weird disconnect between the error on our test data and the error on our training data. So let us try to add dropout. This is what you get with dropout. And actually, the cross entropy function, the test cross entropy function, the red one over there on the right, has been largely brought under control. You see, there is still some disconnect, but it's not shooting up as it was before. That's very positive. Let's look at the accuracy. No improvement.
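For reference, the dropout just described is one call per layer in TensorFlow 1.x. This is a sketch under my own assumptions: the placeholder name pkeep, the layer sizes, and the keep probability of 0.75 are illustrative, not taken from the talk.

```python
import tensorflow as tf

# Probability of keeping a neuron: fed as e.g. 0.75 during training, 1.0 at test time.
pkeep = tf.placeholder(tf.float32)

X  = tf.placeholder(tf.float32, [None, 784])
W1 = tf.Variable(tf.truncated_normal([784, 200], stddev=0.1))
B1 = tf.Variable(tf.zeros([200]))

Y1  = tf.nn.relu(tf.matmul(X, W1) + B1)
# Randomly zero out some outputs of the layer and boost the survivors by 1/pkeep
# so the expected sum stays constant.
Y1d = tf.nn.dropout(Y1, pkeep)
# Y1d then feeds the next layer, exactly as Y1 did before.
```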
Actually, I'm even amazed that the accuracy hasn't gone down, seeing how brutal this technique is: you shoot neurons while you train. But here, I was very hopeful to get it up. No, nothing. We have to keep digging. So what is overfitting, really? Let's go beyond the simple recipe in the textbook. Overfitting, in a neural network, is primarily what happens when you give it too many degrees of freedom. Imagine you have so many neurons and so many weights in a neural network that it's somehow feasible to simply store all the training images in those weights and biases. You have enough room for that. The neural network could figure out some cheap trick to pattern-match the training images against what it has stored and perfectly recognize your training images, simply because it has stored copies of all of them. Well, if it has enough space to do that, that would not translate into any kind of recognition performance in the real world. And that's the trick about neural networks. You have to constrain their degrees of freedom to force them to generalize. Mostly, when you get overfitting, it is because you have too many neurons. You need to get that number down to force the network to produce generalizations that will then produce good predictions, even in the real world. So either you get the number of neurons down, or you apply some trick, like dropout, that is supposed to mitigate the consequences of too many degrees of freedom. The opposite case of too many neurons is a very small dataset: even if you have only a small number of neurons, if the training dataset is very small, the network can still fit it all in. So that's a general truth in neural networks: you need big datasets for training. And then what happened here? We have a big dataset, 60,000 digits, that's enough. We know that we don't have too many neurons because we added five layers, that's a bit overkill, but I tried, I promise, with four and three and two. And we tried dropout, which is supposed to mitigate the fact that you have too many neurons. And it didn't do anything to the accuracy. So the conclusion we come to is that our network, the way it is built, is inadequate. It's not capable, by its architecture, of extracting the necessary information from our data. And maybe someone here can pinpoint something really stupid we did at the beginning. Someone has an idea? Remember, we have images. Images with shapes, like curves and lines. And we flattened all the pixels into one big vector. So all that shape information is lost. This is terrible. That's why we are performing so badly. We lost all of the shape information. So what is the solution? Well, people have invented a different type of neural network to handle specifically images and problems where shape is important. It's called convolutional networks. Here we go back to the general case of an image, a color image. That's why it has red, green, and blue components. And in a convolutional network, one neuron will still be doing a weighted sum of pixels, but only of a small patch of pixels above its head, only a small patch. And the next neuron will, again, be doing a weighted sum of the small patch of pixels above itself, but using the same weights. OK? That's the fundamental difference from what we have seen before. The second neuron is using the same weights as the first neuron. So we are actually taking just one set of weights and we are scanning the image in both directions using that set of weights and producing weighted sums.
So we scan it in both directions and we obtain one layer of weighted sums. So how many weights do we have? Well, as many weights as we have input values in that little highlighted cube: that's 4 times 4 times 3, which is 48. What? 48? We had 8,000 degrees of freedom in our simplest network with just 10 neurons. How can it work with such a drastic reduction in the number of weights? Well, it won't work. We need more degrees of freedom. How do we do that? Well, we pick a second set of weights and do this again. And we obtain a second -- let's call it a channel -- of values, using different weights. Now, since those are multi-dimensional matrices, it's fairly easy to write those two matrices as one by simply adding a dimension of size two, because we have two sets of values. And this here will be the shape of the weights matrix for one convolutional layer in a neural network. Now, we still have one problem left, which is that we need to bring the amount of information down. At the end, we still want only 10 outputs, with our 10 probabilities, to recognize what this number is. So traditionally, this was achieved by what we call a subsampling layer. I think it's quite useful to understand how this works because it gives you a good feeling for what this network is doing. So basically, we were scanning the image using a set of weights, and during training, these weights will actually specialize into some kind of shape recognizers. There will be some weights that become very sensitive to horizontal lines and some weights that become very sensitive to vertical lines, and so on. So basically, when you scan the image, if you simplify, you get an output which is mostly: I've seen nothing, I've seen nothing, I've seen nothing, oh, I've seen something, I've seen nothing, I've seen nothing, oh, I've seen something. The subsampling basically takes four of those outputs, two by two, and keeps the maximum value. So it retains the biggest signal of "I've seen something" and passes that down to the layer below. But actually, there's a much simpler way of condensing the information. What if we simply play with the stride of the convolution? Instead of scanning the image pixel by pixel, we scan it every two pixels; we jump by two pixels between each weighted sum. Well, mechanically, instead of obtaining 28 by 28 output values, we obtain only 14 by 14 output values. So we have condensed our information. I'm not saying this is better, but it's just simpler. And mostly today, people who build convolutional networks just use convolutional layers and play with the stride to condense the information, because it's simpler. You don't need, this way, to have these subsampling layers. So this is the network that I would like to build with you. Let's go through it. There is a first convolutional layer that uses patches of five by five. I'm reading through the W1 tensor. And we have seen that in this shape, the first two digits are the size of the patch. The third digit is the number of channels it's reading from the input. So here I'm back to my real example. This is a grayscale image. It has one value per pixel. So I'm reading one channel of information. And I will be applying four of those patches to my image, so I obtain four channels of output values. OK? Now, second convolutional layer: this time, my stride is two, so here my outputs become planes of 14 by 14 values. So let's go through it. I'm applying patches of four by four.
I'm reading in four channels of values, because that's what I output in the first layer. And this time, I will be using eight different patches, so I will actually produce eight different channels of weighted sums. Next layer, again, a stride of two. That's why I'm getting down from 14 by 14 to seven by seven. Patches of four by four, reading in eight channels of values, because that's what I had in the previous layer, and outputting 12 channels of values this time, because I use 12 different patches. And now I apply a fully connected layer, so the kind of layer we've seen before. OK? In this fully connected layer -- remember the difference -- each neuron does a weighted sum of all the values in the little cube of values above, not just a patch, all the values. And the next neuron in the fully connected layer does, again, a weighted sum of all the values, using its own weights. It's not sharing weights. That's the normal neural network layer as we have seen before. And finally, I apply my softmax layer with my 10 outputs. All right. So can we write this in TensorFlow? Well, we need one set of weights and biases for each layer. The only difference is that for the convolutional layers, our weights will have this specific shape that we have seen before: two numbers for the filter size, one number for the number of input channels, and one number for the number of patches, which corresponds to the number of output channels that you produce. For our normal layers, we have the weights and biases defined as before. And you see this truncated normal thingy up there? That's just random. OK? It's a complicated way of saying random. So we initialize those weights to random values initially. And now this is what our model will look like. So TensorFlow has this helpful conv2d function. If you give it the weights matrix and a batch of images, it will scan them in both directions. It's just a double loop to scan the image in both directions and produce the weighted sums. So we do those weighted sums, we add a bias, we feed this through an activation function, in this case the ReLU, and those are our outputs. And again, the way of stacking these layers is to feed Y1, the first output, as the input of the next layer. All right. After our three convolutional layers, we need to do a weighted sum, this time of all the values in this seven by seven by 12 little cube. To achieve that, we will flatten this cube into one big vector of values. That's what the reshape here does. And then, two additional lines that you should recognize: those are normal neural network layers as we have seen before. All right. How does this work? So this time, it takes a little bit more time to process, so I have a video. You see, the accuracy is shooting up really fast. I will have to zoom. And the promised 99% accuracy is actually not too far. We're getting there. We're getting there. Are we getting there? We're not getting there. Oh, damn. I'm so disappointed again. I really wanted to bring this to 99% accuracy. We'll have to do something more. 98.9. Dammit, that was so close. All right. Yes. Exactly. This should be your WTF moment. What is that? On the cross entropy loss curve. OK, let me zoom in on it. You see that? That disconnect? Do we have a solution for this? Dropout. Yes. Let's go shoot our neurons. It didn't work last time; maybe this time it will. So actually, what we will do here, it's a little trick. It's almost a methodology for coming up with the ideal neural network for a given situation.
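For reference, the convolutional model just walked through might be written roughly like this in TensorFlow 1.x. The patch sizes, strides, and channel counts follow the description above; the 200-neuron fully connected layer and the initializations are my own assumptions for the sketch.

```python
import tensorflow as tf

X = tf.placeholder(tf.float32, [None, 28, 28, 1])  # grayscale images, 1 value per pixel

# Convolutional weights: [patch_height, patch_width, input_channels, output_channels].
W1 = tf.Variable(tf.truncated_normal([5, 5, 1, 4],  stddev=0.1))
B1 = tf.Variable(tf.ones([4]) / 10)
W2 = tf.Variable(tf.truncated_normal([4, 4, 4, 8],  stddev=0.1))
B2 = tf.Variable(tf.ones([8]) / 10)
W3 = tf.Variable(tf.truncated_normal([4, 4, 8, 12], stddev=0.1))
B3 = tf.Variable(tf.ones([12]) / 10)
# Fully connected layer reads the whole 7x7x12 cube (200 neurons is an assumption).
W4 = tf.Variable(tf.truncated_normal([7 * 7 * 12, 200], stddev=0.1))
B4 = tf.Variable(tf.ones([200]) / 10)
W5 = tf.Variable(tf.truncated_normal([200, 10], stddev=0.1))
B5 = tf.Variable(tf.zeros([10]))

# conv2d scans the image in both directions with the shared weights.
Y1 = tf.nn.relu(tf.nn.conv2d(X,  W1, strides=[1, 1, 1, 1], padding='SAME') + B1)  # 28x28x4
Y2 = tf.nn.relu(tf.nn.conv2d(Y1, W2, strides=[1, 2, 2, 1], padding='SAME') + B2)  # 14x14x8
Y3 = tf.nn.relu(tf.nn.conv2d(Y2, W3, strides=[1, 2, 2, 1], padding='SAME') + B3)  # 7x7x12
YY = tf.reshape(Y3, [-1, 7 * 7 * 12])       # flatten the cube into one big vector
Y4 = tf.nn.relu(tf.matmul(YY, W4) + B4)     # fully connected layer
Y  = tf.nn.softmax(tf.matmul(Y4, W5) + B5)  # 10 output probabilities
```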
What I like doing is to restrict the degrees of freedom until it's apparent that it's not optimal, that it's hurting the performance. Here, I know that I can get about 99%. So I restricted a little bit too much. And from that point, I give it a little bit more freedom and apply dropout to make sure that this additional freedom will not result in overfitting. And that's basically how you obtain a pretty optimal neural network for a given problem. So that's what I have done here. You see, the patches are slightly bigger: six by six, five by five, and four by four, instead of five by five, four by four, and four by four. And I've used a lot more patches: six patches in the first layer, 12 in the second layer, and 24 in the third layer, instead of four, eight, and 12. And I applied dropout in the fully connected layer. So why not in the other layers? I tried both; it's possible to apply dropout in convolutional layers. But actually, if you count the number of neurons, there are a lot more neurons in the fully connected layer, so it's a lot more efficient to be shooting them there. I mean, it hurts a little bit too much to shoot neurons where you have only a few of them. So with this, let's run this again. Again, the accuracy shoots up very fast. I will have to zoom in. Look where the 99% is, and we are above! Yes! [APPLAUSE] Thank you. I promised you we would get above 99, and we are actually quite comfortably above. We get to 99.3%. This time, let's see what our dropout actually did. So this is what we had with the five-layer network and already a few more degrees of freedom, so more patches in each layer. You see, we are already above 99%. But we have this big disconnect between the test and the training cross entropy. Let's apply dropout: boom. The test cross entropy function is brought under control. It's not shooting up as much. And look, this time, we actually had a problem and this fixed it. Just by applying dropout, we got 2/10 of a percent more accuracy. And here, we are fighting for the last percent, between 99 and 100. So getting 2/10 is enormous with just a little trick. All right. So there we have it. We built this network and brought it all the way to 99% accuracy. The Cliff's Notes slide is just a summary. And to finish, since this was mostly about TensorFlow: we also have a couple of pre-trained APIs, which you can use just as APIs if your problem is standard enough to fit into one of the Cloud Vision, Cloud Speech, Natural Language, or Translate APIs. And if you want to run your TensorFlow jobs in the cloud, we also have this Cloud ML Engine service that allows you to execute your TensorFlow jobs in the cloud for training. And, what is even more important, with just the click of a button, you can take a trained model and push it to production behind an API and start serving predictions from that model in the cloud. That might sound like a little technical detail, but from an engineering perspective, it's quite significant that you have a very easy way of pushing something to prod. Thank you. You have the code on GitHub and this slide deck is freely available at that URL. And with that, we have five minutes for questions, if you have any. [APPLAUSE] Thank you. [MUSIC PLAYING]