I wanted to talk a little bit more about deep learning and some of the slightly larger and more interesting architectures that have come along in the last few years. So just a very brief recap, right? We've got videos on this. I'm going to draw my network from the top down this time, so rather than drawing a square input image I'm just going to draw a line, which is the image seen from the top, so you can work your animation magic and sort this all out for me. Brilliant.

So I'm going to be talking about deep learning and convolutional neural networks. A convolutional neural network is one where you have some input, like an image, you filter it using a convolution operation, and then you repeat that process a number of times to learn something interesting about that image, some interesting features, and then you make a classification decision based on it. That's usually what you do, right? So you might decide, well, this one's got a cat in it, or this one's got a dog in it, or this one's got a cat and a dog in it, and that's very exciting.

So, from the top down, because my pen's going to run out of ink if I start trying to draw too many boxes. You've got an input image, and it's usually quite large. So here's an input image, and I'm going to draw it like this. This is from the top: if this is my image, I'm going to go above it and look straight down at it, which looks sort of like that. Does that work? Now, there are three input channels, because of course we usually have red, green and blue, right? So in some sense this is multi-dimensional.

We're going to do our filtering, so I'm going to draw a couple of kernels, let's maybe draw four. We're going to do a convolution operation using this one on here, so it's going to look over all three of these channels, it's going to scan along, and it's going to calculate some kind of feature, like an edge or something like this, and that's going to produce another feature map, right? And now there are four kernels, and each is going to do this, so we're going to have four outputs. Don't worry, I'm not going to draw an 800-layer-deep network this way. Each of these kernels gets to look at all three of the input channels, which is a bit of a quirk of deep learning that maybe isn't explained often enough: the kernels actually have an extra dimension that lets them look across all the channels, so the next layer along will look at all four of these outputs, and so on.

What we also then do, and why not use multiple colours here, is sometimes spatially downsample as well. So we take the maximum of a region of pixels, so that we can make the whole thing smaller and fit it better on our graphics card. We're going to downsample this, so it's going to look like this, and then, okay, I'll just do a yellow one, why not? Can we see yellow on this? We'll soon find out. So let's say there are two kernels here, and you can kind of see it. I think we need to go pink here. Pink? Pink! All right, pink, forget yellow. No yellow on white, that's what I was told when I first started using PowerPoint. I like pink. Yeah, that can work. It kind of looks a bit like the red. So that's going to look at all four of these, and there are two of them, so there are going to be two outputs, right? Just think of it in terms of four inputs, two outputs. So that's going to be sort of like this. I'm just going to go back to my blue and forget the colours now, and you just repeat this process for quite a while, depending on the network.
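As a concrete illustration of the convolve-then-downsample steps just described, here is a minimal sketch in PyTorch (the library is my choice, not something from the video; the channel counts of 3, 4 and 2 follow the drawing, while the kernel size and padding are illustrative assumptions):

```python
import torch
import torch.nn as nn

# A minimal sketch of the stack described above, using the numbers from the
# drawing: 3 input channels (red, green, blue), a convolution with 4 kernels,
# a spatial downsample (max pooling), then a convolution with 2 kernels.
# Kernel size 3 and the padding are illustrative choices, not from the video.
features = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=4, kernel_size=3, padding=1),  # 4 kernels, each spans all 3 channels
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),  # keep the maximum of each 2x2 region: halves the width and height
    nn.Conv2d(in_channels=4, out_channels=2, kernel_size=3, padding=1),  # 2 kernels, each spans all 4 feature maps
    nn.ReLU(),
)

x = torch.randn(1, 3, 256, 256)  # one RGB image, 256 x 256
print(features(x).shape)         # torch.Size([1, 2, 128, 128])
```

Each Conv2d kernel spans every input channel, which is the "extra dimension" mentioned above, and MaxPool2d is the spatial downsampling step.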
There are more advanced architectures, like ResNets, that let this become very, very deep, you know, hundreds of layers sometimes, but for the sake of argument let's just say it's into the dozens usually. So we're going to downsample a bit more, and so on, and then we'll get some kind of final feature vector, hopefully a summary of everything that's in the image, sort of summarised for us. And that's where we do our classification, so we attach a little neural network to this here, and that connects to all of these, and then this is our reading of whether it's a cat or not. That's the idea.

The problem with this is that the number of connections here is fixed. This is the big drawback of this kind of network. You're using this to do a very interesting feature calculation, and then you've got this fixed number of connections: it's always three here, there's always one here. So this always has to be the same size, which means the input also always has to be the same size, let's say 256 pixels by 256 pixels, which is not actually very big. So what tends to happen is that we take the image we're interested in, shrink it to 256 by 256, and put that in. When we train our network, we make a decision early on about what size is appropriate. Of course, that doesn't really make much sense, because we have images of lots of different sizes. Obviously they can't be too big, because we'd run out of RAM, but it would be nice if it was a little bit flexible.

The other issue is that this is actually taking our entire image and summarising it in one value, so all spatial information is lost, right? You can see that the spatial information is getting lower and lower as we go through this network, to the point where all we care about is whether it's a cat, not where the cat is. What if we wanted to find out where the cat was, or segment the cat out, or segment a person, or count the number of people? To do that, this isn't going to work, because it always goes down to one.

So that's kind of a yes or no? Yeah, yes or no. You could have multiple outputs, if it was yes dog, no cat, you know, different outputs. Sometimes, instead of a classification, you output an actual value, like the amount of something, but in this case let's not worry about that now.

You've told me this is an amazing marker, so I'm going to have a go at this. Has anyone ever erased your marker in your videos? I mean, this is a first. Okay, it's working, it's just going to take quite a while because this rubber is tiny. All right, there we go.

All right. So the same input still produces this little feature map, but now, instead of a fixed-size neural network on the end, we're just going to put another convolution of one pixel by one pixel. So it's just a tiny little filter, it's just one by one, and it's going to scan over here and produce an image of exactly the same size, but this, of course, will be looking at all of these features and working out in detail what the object is, so it will have much more information than these ones back here. So this could be outputting a heat map of where the cats are, or where the dogs are, or, you know, the areas of disease in a medical image or something like this. And this is called a fully convolutional network, because there are no longer any fully connected or fixed-size layers in the network.
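Here is a rough sketch, again in PyTorch, of that swap: a fixed-size fully connected head that only accepts one input size, versus a 1x1 convolutional head that scales with the input. The class count of 3 and the layer sizes are illustrative assumptions, not from the video:

```python
import torch
import torch.nn as nn

# Feature extractor stays the same; only the head on the end changes.
features = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(4, 2, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

# Fixed-size head: flatten the 2 x 64 x 64 feature map and fully connect it to
# 3 class scores. This only works if the input is exactly 256 x 256.
fixed_head = nn.Linear(2 * 64 * 64, 3)

# Fully convolutional head: a 1x1 convolution that scans the feature map and
# produces a score for each of the 3 classes at every spatial position.
fcn_head = nn.Conv2d(in_channels=2, out_channels=3, kernel_size=1)

x = torch.randn(1, 3, 256, 256)
big = torch.randn(1, 3, 512, 512)  # double the input size

print(fixed_head(features(x).flatten(1)).shape)  # torch.Size([1, 3])
print(fcn_head(features(x)).shape)               # torch.Size([1, 3, 64, 64])
print(fcn_head(features(big)).shape)             # torch.Size([1, 3, 128, 128])
# fixed_head(features(big).flatten(1))           # would fail: the flattened size no longer matches
```

The 1x1 convolution keeps a grid of class scores rather than collapsing everything to a single answer, which is where the heat-map idea above comes from.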
So normal deep learning in some sense or at least up until so 2014-2015 Predominantly just put a little new network on the end of this. That was a fixed size now We don't do that And the nice thing is if we double the size of this input image, I mean we're using more RAM But this is going to double little double and in the end This will also double and we'll just get the exact same result which is bigger so we can now put in different size images the way this actually works in practice is that when one your deep learning library like Cafe 2 or pi stalks or tensorflow will allocate memory as required So you put in an input image and it goes well, ok with that input image We're going to need to allocate this much RAM to do all this and so the nice thing is that this can now have information On where the objects are as well as what they are picks output. So We'll show a few examples of semantic segmentation on the screen so you can see the kind of thing We're talking about the obvious downside here, which is what I'm going to leave for Another video is that this is very very small, you know maybe this is only a few pixels by a few pixels or something like this or You haven't done that much down sampling and so it's not a very deep network and you haven't learned a whole lot if you are Looking for where is the carrier's image? You have kind of it's down in the bottom left. It would be very very general So it would be you know bit sort of area. Maybe there's something else going on over here It depends on the resolution of this image looks great with different colors in line. But what are you actually using this stuff? Alright, so, I mean we have to extend this slightly, which I'm you know Normally going to postpone for another video because this is too small for us to be practical, right? What we could do is just up up sample this we could use linear or bilinear interpolation to just make this way bigger like this and have a bigger output image and It would still be very low resolution you'd get the rough idea of where something was but it wouldn't be great Right, so you could use this to find Objects that you're looking for. So for example in our lab, we're using this for things like analysis of plants So where are the wheat is how many are there that can be useful in a field to try and work out What the yield or disease problems are going to be you can do it for medical images where the tumors in this image Segmenting x-ray images we're also doing it on human pose estimation and face Estimation so you know, where is the face in this image? Where are the eyes? What shape is the face this kind of thing so you can use this for a huge amount of things? But we're going to need to extend it a little bit more to get the best out of it And the extension we'll call an encoder decoder Network Are you tying it up now? What are you doing? It's not neat enough this there's little bits of unwrapped out bits Bear with me I start on the next video in a minute. Yeah That's as good as it's getting it
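A minimal sketch of the bilinear upsampling step mentioned above, assuming PyTorch and made-up sizes (64x64 up to 256x256):

```python
import torch
import torch.nn.functional as F

# Take the small per-class heat map that the 1x1 convolution produced and
# stretch it back up with bilinear interpolation. Sizes are illustrative.
heatmap = torch.randn(1, 3, 64, 64)  # 3 class scores at each low-resolution position

upsampled = F.interpolate(heatmap, size=(256, 256), mode="bilinear", align_corners=False)
print(upsampled.shape)               # torch.Size([1, 3, 256, 256])
```

It's still the same small number of real measurements spread over more pixels, so the result stays blurry, which is why the extension described at the end, the encoder-decoder network, learns to upsample instead.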