Encoder Decoder Network - Computerphile

  • So where we left it was that we've now got ourselves a fully convolutional network.

  • So it makes no assumptions about the size of the input or the number of parameters.

  • We're going to have it just adapt itself depending on the size of the input, which for images, you could imagine, makes quite a lot of sense.

  • They change size quite a lot, but in most other ways it acts exactly like a normal deep network.

  • We've talked about this before in other videos, like the Deep Dream one.

  • But the deeper you go into the network, the higher-level the information we have on what's going on.

  • It's objects and animals and things, rather than bits of edges.

  • The shallower we are, the less idea we have of what things are.

  • But we also have much higher spatial resolution, because we've got basically the input image size; it's these max pooling layers, mostly, that downsample it every time.

  • What we're doing there is taking a small group of pixels and just choosing the best of them, the maximum, and putting that in the output, and that just halves the size of the image, and halves it again, and halves it again.

  • And you can imagine that if you've got an image of 256 by 256, we might repeat this process four or five times until we've got a very small region.
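
A minimal sketch (my own toy example in PyTorch, not code from the video) of what that max pooling step does: keep only the maximum of each 2×2 block of pixels, halving the width and height every time it is applied.

```python
import torch
import torch.nn.functional as F

# One image, one channel, 4x4 pixels.
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# 2x2 max pooling: take each small group of pixels and keep only the maximum.
pooled = F.max_pool2d(x, kernel_size=2)   # shape becomes 1 x 1 x 2 x 2

print(x[0, 0])
print(pooled[0, 0])

# Repeated pooling halves the resolution each time: 256 -> 128 -> 64 -> 32 -> 16 -> 8.
size = 256
for _ in range(5):
    size //= 2
print(size)
```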

  • It's done for a couple of reasons.

  • One is that we want to be invariant to where things are in the image, which means that if the dog's over to the right, we still want to find it, even if it's over on the left.

  • And so we don't want it to be affected by that.

  • The other issue is that, quite frankly, we don't have enough video RAM; we routinely fill up multiple graphics cards, each of which has 12 gigabytes on it, depending on the situation you're looking at.

  • This is only one dimension I've drawn here, but it's actually two dimensional.

  • If you have the x and y dimensions, you're actually dividing the amount of memory required for that layer by four, and then by four again, and then by four again.

  • And so actually you save an absolutely massive amount of RAM by spatially downsampling.
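
To make the memory argument concrete, here is a back-of-the-envelope calculation (the feature-map size, channel count and float width are my own illustrative numbers, not figures from the video): each 2×2 pooling step halves both x and y, so the number of activations, and hence the RAM for that layer, drops by a factor of four every time.

```python
# Memory for one feature map of 32-bit floats at successive pooling levels.
height = width = 256
channels = 64
bytes_per_value = 4

pixels = height * width
for level in range(4):
    megabytes = pixels * channels * bytes_per_value / 1e6
    side = int(pixels ** 0.5)
    print(f"level {level}: {side}x{side} -> {megabytes:.1f} MB")
    pixels //= 4   # one 2x2 pooling step: /2 in x and /2 in y
```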

  • Without it, we'd be stuck with very small networks indeed; but we've got this problem that, yes, we've worked out there's a cat in the image or something like this, but it's very, very small.

  • It's only a few pixels by a few pixels.

  • We've got a rough idea.

  • There's something going on here.

  • Maybe we could just balloon it up, like a large linear upsampling, and just sort of go, well, it's roughly a cat, but it wouldn't be anything interesting.

  • So I guess the interesting thing happened in 2014, when Jonathan Long proposed a kind of solution to this, which is essentially a smarter upsampling.

  • What we do is we essentially reverse this process.

  • Basically, you have a sort of upsample here, which maybe doubles the size, and then we look over here and we bring in some of this interesting information as well.

  • And then we upsample again, and away we go, so this is now the same size as this, so we can bring in some of this information as well.

  • And when I say "bring it in", I mean literally add these to these, and we can have convolutional layers here to learn that mapping, so we can take nothing from here or everything from here.

  • It doesn't really matter.

  • And finally, we upsample back to the original size.

  • We bring this in here using a sum.
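
Here is a minimal sketch in PyTorch of one of those "smarter upsampling" steps as described: upsample the deep features, literally add the matching earlier features from the encoder, and let a convolution learn the mapping. The class name, channel count and use of bilinear interpolation are my own choices (Long et al.'s fully convolutional networks use learned transposed convolutions for the upsampling), so treat this as an illustration of the idea rather than the exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderStep(nn.Module):
    """Upsample x2, add the skip connection, then learn how to mix them."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, deep_features, skip_features):
        upsampled = F.interpolate(deep_features, scale_factor=2,
                                  mode="bilinear", align_corners=False)
        merged = upsampled + skip_features      # "literally add these to these"
        return F.relu(self.conv(merged))        # a convolutional layer learns the mapping

# Toy shapes: the deep features are 8x8, the encoder's earlier features are 16x16.
deep = torch.randn(1, 64, 8, 8)
skip = torch.randn(1, 64, 16, 16)
out = DecoderStep(64)(deep, skip)
print(out.shape)  # torch.Size([1, 64, 16, 16])
```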

  • Now, what we've actually done is a kind of smart way of making this bigger.

  • I mean, you've got to kind of try to get your head around it, but these features are very sure about what's in the image, but only roughly where it is; these features are at a much higher pixel resolution.

  • They're much more sure, in some sense, of where things are, but not exactly what they are, right?

  • So you could imagine, in an intuitive way, we're saying this is a cat, and down here we've seen some texture or fur.

  • Let's combine them together to outline exactly where the cat is.

  • This is the kind of idea, and you can use this for all kinds of things.

  • So people have used it for segmentation, or what we call semantic segmentation, which is where you label each pixel with a class, depending on what is in that pixel.

  • Traditional segmentation usually meant background and foreground.

  • Now, semantic segmentation means maybe hundreds of classes.

  • For instance, in an image: me standing here with you, the table, the computer, the desk, the window, these kinds of things.

  • There's a huge number of different applications of that kind of thing. On a basic level, you could imagine just trying to find one object specifically in the scene.

  • So just the people: it's either a person or it's background.

  • We don't care what else.

  • Or you could be training this on something like ImageNet with lots and lots of classes.

  • Or, I mean, there's the MS COCO dataset, for example, that has lots and lots of classes.

  • So you're trying to find the airplanes and cars and things, and people do this on street scene segmentation as well.

  • So you could say, look, given this picture of a road, where is the road?

  • Where is the pavement?

  • What's a building, where are the road signs, and actually analyze the entire scene, which is obviously really, really quite powerful.
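
As a small illustration of what "label each pixel with a class" means in practice (a toy example I've made up, not code from the video): a semantic segmentation network produces one score map per class, and the predicted label for each pixel is simply the class with the highest score at that pixel.

```python
import torch

num_classes, height, width = 5, 4, 6              # e.g. road, pavement, building, sign, other
scores = torch.randn(num_classes, height, width)   # the network's per-class score maps

labels = scores.argmax(dim=0)   # one class id per pixel, shape (height, width)
print(labels)
```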

  • The other thing is that you don't have to segment the image; instead of segmenting it, you could just try and find objects.

  • You can say, instead of just outlining where an object is, yes or no, Why don't we try and draw a little heat map of where we think it is?

  • And then we can pinpoint objects, so we can say where the two pupils are on a face, or we can draw around someone's face or their nose or their forehead, so that we can then fit a model to that.
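
One common way of building that kind of heat-map target (my own sketch, not necessarily what the networks mentioned here use) is to place a small Gaussian bump at each keypoint, such as a pupil or the tip of a nose, and train the decoder to reproduce it.

```python
import torch

def keypoint_heatmap(height, width, cx, cy, sigma=2.0):
    """Heat map that peaks at the keypoint (cx, cy) and falls off as a Gaussian."""
    ys = torch.arange(height, dtype=torch.float32).unsqueeze(1)   # column of y coordinates
    xs = torch.arange(width, dtype=torch.float32).unsqueeze(0)    # row of x coordinates
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

heatmap = keypoint_heatmap(64, 64, cx=40.0, cy=22.0)
print(heatmap.shape, heatmap.flatten().argmax())   # peak sits at y=22, x=40 -> index 22*64+40
```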

  • So Aaron was doing this in his network, where he was actually predicting the 3D positional information of a face based just on a picture, and you can go wherever you want with that.

  • We've also been using it for human pose estimation.

  • So where's the right hand?

  • Where's the left hand?

  • What pose is this person currently doing, which obviously, you can imagine, has lots of implications for things like Kinect sensors and sort of interactive games, but also, you know, pedestrian tracking, and loads of other examples of things where it might be useful to know what a person is up to.

  • And finally, we're obviously using it in plant science to try and count objects and localize objects.

  • So where's the disease in this image?

  • Can we produce a heat map that shows exactly where it is? Where are the ears of wheat in this image?

  • Can we count the number of spikelets to get an estimate of how much yield this wheat is producing compared to this wheat?

  • Then we can start to run experiments on, you know, these ones that were water-stressed.

  • Does that mean this one's better?

  • This kind of thing.

  • So this is called an encoder-decoder, because what we're doing is encoding our spatial information into some kind of features of what's going on in the scene.

  • In general, we remove the spatial resolution in exchange for learning more about the scene, and then we bring it back by finding detail from earlier parts of the network and bringing that in as well.

  • That's the decoding stage.
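
Putting the two halves together, here is a deliberately tiny encoder-decoder in PyTorch (my own compressed sketch with made-up layer sizes, not an architecture from the video): the encoder trades spatial resolution for richer features, and the decoder upsamples and adds back the encoder's earlier, high-resolution features before predicting per-pixel class scores.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoderDecoder(nn.Module):
    def __init__(self, in_channels=3, features=32, num_classes=5):
        super().__init__()
        self.enc1 = nn.Conv2d(in_channels, features, 3, padding=1)
        self.enc2 = nn.Conv2d(features, features, 3, padding=1)
        self.dec = nn.Conv2d(features, features, 3, padding=1)
        self.head = nn.Conv2d(features, num_classes, 1)   # per-pixel class scores

    def forward(self, x):
        e1 = F.relu(self.enc1(x))                      # full-resolution features
        e2 = F.relu(self.enc2(F.max_pool2d(e1, 2)))    # encode: half resolution, richer features
        up = F.interpolate(e2, scale_factor=2, mode="nearest")
        d = F.relu(self.dec(up + e1))                  # decode: upsample and add the skip
        return self.head(d)

out = TinyEncoderDecoder()(torch.randn(1, 3, 64, 64))
print(out.shape)   # torch.Size([1, 5, 64, 64])
```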

  • In some sense, it is a little bit like a GAN, in the sense that this is the generator here and this is the discriminator.

  • It's just that you would switch them around, but let's not over-complicate things. And this one lit up, which is maybe paws, and maybe this one lit up because here there were a few lines in a row, and this one is sort of furry texture or something.
