I'd like to talk about face detection
All right. So this is the idea: if you've got a picture with one face in it, or many faces in it,
how do we find those faces?
The standard approach is "Ah, we'll just use deep learning"
Now you can use deep learning to find faces
But actually the approach that everyone uses isn't deep learning and it was developed in the early 2000s
So back before deep learning did everything
You kind of had to come up with these algorithms yourself. Machine learning was still a thing, so people still used machine learning,
but they used it with handcrafted features and small neural networks and other kinds of classifiers
that they tried to use to do these things.
Now, face detection was ongoing research at this time.
In 2001, Paul Viola and Michael Jones came up with this paper here, called
"Rapid Object Detection Using a Boosted Cascade of Simple Features", and this is a very, very good paper.
It's been cited some 17,000 times
And despite the fact that deep learning has kind of taken over everything.
In face detection, this still performs absolutely fine, right
It's incredibly quick and if you've got any kind of camera that does some kind of face detection
It's going to be using something very similar to this, right?
So what does it do? Let's talk about that.
The problem is, right,
There's a few problems with face detection one is that we don't know how big the face is going to be
So it could be very big could be very small, and another is, you know,
Maybe you've got a very high-resolution image. We want to be doing this lots and lots of times a second
So what are we going to do? Look over every tiny bit of the image, lots and lots of times,
with some complicated machine learning that says, you know, "is this a face? is this not a face?"
There's a trade-off between speed and accuracy and false-positives and false-negatives. It's a total mess
It's very difficult to find faces quickly, right? And that's before considering that, you know, we have different ethnic groups,
young, old people, people who've got glasses on, things like this
So all of this adds up to quite a difficult problem, and yet it's not a problem
we worry about anymore because we can do it and we can do it because of these guys
They came up with a classifier that uses very, very simple features: one bit of an image subtracted from another bit of an image.
On its own that's not very good, but if you have
thousands and thousands of those, all giving you a clue that maybe this is a face, you can start to come up with a proper decision.
[offscreen] Is this looking for facial features then? Is it as simple as looking for a nose and an eye and so on?
So no, not really, right. Deep learning kind of does that, right?
It takes edges and other features and it combines them together into objects,
you know, in a hierarchy, and then maybe it finds faces. What this is doing is making very quick decisions about
what it is to be a face. So, for example, if we're just looking at a grayscale image,
Right, my eye is arguably slightly darker than my forehead, right?
in terms of shadowing, and the pupil's darker, and things like this.
So if you just do this bit of image minus this bit of image
My eye is going to produce a different response from this blackboard, right, most of the time
Now, if you do that on its own, that's not a very good classifier, right? It'll get
quite a lot of the faces
But it'll also find a load of other stuff as well where something happens to be darker than something else that happens all the time
so the question is "can we produce a lot of these things all at once and make a decision that way?"
They proposed these very very simple rectangular features
Which are just one part of an image subtracted from another part of an image
So there are a few types of these features. One of them is the two-rectangle feature,
So we have a block of image where we subtract one side from the other side
Theirs is a machine learning-based approach.
Normally, what you would do in machine learning is you would extract --
you can't put the whole image in; maybe there's five hundred faces in this image --
so we put in something we've calculated from the image, some features, and then we use machine learning to try and classify
bits of the image, or the whole image, or something like this. Their contribution was a very quick way to
calculate these features and use them to make a face classification,
To say there is a face in this block of image or there isn't
And the features they use are super simple, right? So they're just rectangular features like this.
So we've got two rectangles next to each other which, you know are some amount of pixels
So maybe it's nine pixels here and nine pixels here, or just one pixel and one pixel, or a hundred pixels and a hundred pixels.
It's not really important.
and we do one subtract the other right?
So essentially we're looking for bits of an image where one bit is darker or brighter than another bit
This is a two rectangle feature. It can also be oriented the other way so, you know like this
We also have three-rectangle features, which are like this, where you're doing sort of maybe the middle subtract the outside, or vice versa.
And we have four-rectangle features, which are going to be kind of finding diagonal, sort of corner things,
So something like this
Even if your image is small right you're going to have a lot of different possible features even of these four types
So this four-rectangle feature could just be one pixel each, or each of these could be half the image; it can scale,
you know, or move around.
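A two-rectangle feature really is just one block of pixels summed and subtracted from the block next to it. Here is a rough sketch of that idea; the patch values and window positions are made up for illustration, and a real implementation would use the integral image trick rather than summing pixels directly:

```python
# Sketch of a two-rectangle Haar-like feature on an 8-bit grayscale
# patch stored as a list of rows. Values and coordinates are invented.

def rect_sum(img, top, left, h, w):
    """Sum of the h-by-w block whose top-left corner is (top, left)."""
    return sum(img[r][c] for r in range(top, top + h)
                         for c in range(left, left + w))

def two_rect_feature(img, top, left, h, w):
    """Left rectangle minus the adjacent right rectangle of an
    (h x 2w) window: responds where one side is brighter than the other."""
    return rect_sum(img, top, left, h, w) - rect_sum(img, top, left + w, h, w)

# A patch that is bright on the left, dark on the right.
patch = [
    [200, 210, 40, 50],
    [190, 205, 45, 35],
]
print(two_rect_feature(patch, 0, 0, 2, 2))  # → 635, a strong response
```

The same window slid over a flat region (blackboard, sky) would give a response near zero, which is exactly the "clue" each feature contributes.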
Brady : What determines that? Mike : Um, so they do all of them, right?
Or at least they look at all of them originally
And they learn which ones are the most useful for finding a face. This, over a whole image of a face, isn't hugely
representative of what a face looks like, right? On no one's face are two corners darker than the other two corners;
that doesn't make sense, right? But maybe over their eye, maybe that makes more sense.
I don't know; that's kind of the idea. So they have a training process which works out
which of these features are useful. The other problem we've got is that, on an image,
Calculating large groups of pixels and summing them up is quite a slow process
So they came up with a really nifty idea called an integral image, which makes this way, way faster.
So let's imagine we have an image
Right, and so think -- consider while we're talking about this that we want to kind of calculate these bits of image
But minus some other bit of image, right? So let's imagine we have an image which is nice and small
It's too small for me to write on but let's not worry about it
Right, and then let's draw in some pixel values. Fast forward. Look at the state of that. That's a total, total shambles.
This is a rubbable-out pen, right? For goodness sake
Right right okay okay so all right so
Let's imagine this is our input image. We're trying to find a face in it
Now I can't see one
But obviously this could be a quite a lot bigger and we want to calculate let's say one of our two rectangle features
So maybe we want to do these four pixels up in the top
Minus the four pixels below it now that's only a few additions : 7 + 7 + 1 + 2
minus 8 + 3 + 1 + 2
But if you're doing this over large sections of image and thousands and thousands of times to try and find faces
That's not gonna work
So what Viola and Jones came up with was this integral image, where we pre-compute
some of this arithmetic, store it in an intermediate form, and then we can calculate
rectangles minus other rectangles really easily.
So we do one pass over the image, and every new pixel is the sum of all the pixels
above and to the left of it, including it, right? So this will be something like this:
so
1 and 1 + 7 is 8 so this pixel is the sum of these two pixels and this pixel is going to be all these three
So that's going to be 12... 14... 23
And now we fast forward while I do a bit of math in my head.
8... 17, maybe I made a mistake somewhere earlier, 24... On a computer this is much, much faster.
The sum of all the pixels is 113. For example, the sum of this 4x4 block is 68.
Now, the reason this is useful (bear with me here)
is that if we want to work out, let's say, the sum of this region, what we do is we take this one,
113 we subtract this one, minus 64
Alright, and this one,
minus 71, and that's taken off all of that and all of that, and then we have to add this bit in, because it's been
taken off twice: so plus 40. All right, so that's four reads. Now, funnily enough, this is a 4 by 4 block,
So I've achieved nothing
But if this was a huge huge image, I've saved a huge amount of time and the answer to this is 18
Which is 6 plus 6 plus 5 plus 1
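The whole trick can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code; the pixel values below are made up rather than the ones on the board, and the table is padded with a zero row and column so the four-read lookup needs no edge cases:

```python
# Minimal integral-image sketch. ii[r][c] holds the sum of all pixels
# above and to the left of (r, c) in the original image, inclusive.

def integral_image(img):
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]  # zero-padded border
    for r in range(h):
        for c in range(w):
            # pixel + sum above + sum to the left - double-counted corner
            ii[r + 1][c + 1] = (img[r][c] + ii[r][c + 1]
                                + ii[r + 1][c] - ii[r][c])
    return ii

def region_sum(ii, top, left, h, w):
    """Sum of any h-by-w rectangle with just four reads,
    no matter how big the rectangle is."""
    return (ii[top + h][left + w] - ii[top][left + w]
            - ii[top + h][left] + ii[top][left])

img = [[7, 7, 1, 2],
       [8, 3, 1, 2],
       [2, 4, 5, 6],
       [1, 6, 6, 5]]
ii = integral_image(img)
print(region_sum(ii, 0, 0, 4, 4))  # → 66, the total of this sample image
print(region_sum(ii, 2, 2, 2, 2))  # → 22, the bottom-right 2x2 block
```

The pass to build `ii` is done once per frame; after that, every rectangle sum, however large, is four reads and three additions, which is what makes evaluating thousands of features per window affordable.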
So the assumption is that I'm not just going to be looking at this picture one time to do this, right?
There's lots of places a face could be I've got to look at lots of combinations of pixels and different regions
So I'm going to be doing huge amounts of pixel addition and subtraction
So let's calculate this integral image once and then use that as a base to do really quick
Adding and subtracting of regions, right?
And so, for example, a four-rectangle region
is going to take something like nine reads and a little bit of addition. It's very simple.
All right. So now how do we turn this into a working face detector? Let's imagine
We have a picture of a face, which is going to be one of my good drawings again
Now, in this particular algorithm, they look at 24 by 24 pixel regions, but they can also scale up and down a little bit.
So let's imagine there's a face here which has, you know eyes, a nose and a mouth right and some hair
Okay, good. Now as I mentioned earlier, there are probably some features that don't make a lot of sense on this
So, for example, if I take my red pen:
subtracting this half of the image from this half isn't going to represent most faces.
Maybe when there's a lot of lighting on one side, but it's not very good at distinguishing
images that have faces in from images that don't have faces in.
So what they do, is they calculate all of the features, right for a 24 by 24 image
They calculate all 180,000 possible combinations of 2, 3, and 4 rectangle features and they work out which one
For a given data set of faces and not faces, which one best separates the positives from the negatives, right?
So let's say you have 10,000 pictures of faces
10,000 pictures of background which one feature best
says "this is a face, this is not a face" Right, bearing in mind
Nothing is going to get it completely right with just one feature
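That selection step, trying every feature and keeping whichever one best separates the faces from the background, can be caricatured in a few lines. Everything here is invented for illustration (the tiny dataset, the exhaustive threshold search), and the actual paper uses AdaBoost to weight examples and combine many such weak classifiers rather than just picking one:

```python
# Toy feature selection: find the (feature, threshold) pair whose simple
# "response >= threshold means face" rule gets the most examples right.

def best_feature(feature_values, labels):
    """feature_values[f][i] = response of candidate feature f on example i;
    labels[i] is True for a face window. Returns (feature index, threshold)."""
    best = None  # (correct count, feature index, threshold)
    for f, values in enumerate(feature_values):
        for t in sorted(set(values)):  # try each observed value as a cut
            correct = sum((v >= t) == lab for v, lab in zip(values, labels))
            if best is None or correct > best[0]:
                best = (correct, f, t)
    _, f, t = best
    return f, t

# Three face windows followed by three background windows.
labels = [True, True, True, False, False, False]
feature_values = [
    [30, 40, 35, 32, 38, 36],   # feature 0: barely separates the classes
    [55, 60, 52,  5, 10,  8],   # feature 1: separates them cleanly
]
print(best_feature(feature_values, labels))  # → (1, 52)
```

With 180,000 candidate features and thousands of training windows, this brute-force scoring is exactly why the four-read rectangle sums matter so much at training time, too.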
So the first one, it turns out, is something like this:
it's a two-rectangle region that works out the difference between the area of the eyes and the area of the cheeks.
So it's saying that on a normal face your cheeks are generally brighter or darker than your eyes.
So what they do is they say, okay
Well, let's start a classifier with just that feature right and see how good it is
This is our first feature, feature number one, and we have a pretty relaxed threshold,
so if there's anything plausible in this region,
we'll let it through, right, which is going to let through all of the faces and a bunch of other stuff as well that we
don't want, right. So this is a yes: that's okay, right? If it's a no, then we immediately
fail that region of image, right? So we've done one test, which, as we know, is about four additions.
So we've said for this region of image if this passes will let it through to the next stage
Right and we'll say okay it definitely could be a face
It's not not-a-face. Does that make sense? Yeah, okay
So let's look at the next feature.
The next feature is this one.
So it's a three-rectangle feature, and it measures the difference between the bridge of the nose and the eyes, right,
which may or may not be darker or lighter. All right, so there's a difference there.
So this is feature number two, so I'm going to draw that in here number two
And if that passes, we go to the next feature. So this is a sort of binary tree; they call it a "degenerate decision tree",
because while a decision tree is a binary tree, this isn't really: you immediately stop here,
you don't go any further. The argument is that
Every time we calculate one of these features it takes a little bit of time
The quicker we can say "no definitely not a face in there", the better. And the only time we ever need to look at all the features
Or all of the good ones is when we think, "okay, that actually could be a face here"
So we have less and less general, more and more specific features going forward, right up to,
I think, about the six thousand features they end up using. All right. So we say: does the first one pass?
Yes. Does the second one pass?
Yes. And we keep going until we get a fail, and if we get all the way to the end and nothing fails,
that's a face, right? And the beauty of this is that,
For the vast majority of the image, there's no computation at all. We just take one look at it, first feature fails
"Nah, not a face". They designed a really good way of adding and subtracting different regions of the image
And then they trained a classifier like this to find the best features and the best order to apply those features
which was a nice compromise between always detecting the faces that are there and false positives and speed right?
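The cascade itself is just a chain of cheap tests with an early exit. The stage functions and thresholds below are entirely made up for illustration; in the real detector each stage is a boosted combination of rectangle-feature responses, but the control flow is the same: reject a window the moment any stage says no.

```python
# Toy attentional cascade: each stage is a cheap, permissive test,
# and only windows that survive every stage are reported as faces.

def cascade(window, stages):
    for stage in stages:
        if not stage(window):
            return False   # early exit: most windows stop at stage one
    return True            # survived every stage: call it a face

# Hypothetical stage tests on made-up feature responses, getting stricter.
stages = [
    lambda w: w["eye_cheek_diff"] > 10,    # cheap and very permissive
    lambda w: w["nose_bridge_diff"] > 25,  # stricter
    lambda w: w["mouth_diff"] > 40,        # stricter still
]

background = {"eye_cheek_diff": 3,  "nose_bridge_diff": 0,  "mouth_diff": 0}
face_like  = {"eye_cheek_diff": 50, "nose_bridge_diff": 60, "mouth_diff": 55}
print(cascade(background, stages))  # → False, rejected by the first stage
print(cascade(face_like, stages))   # → True, passed every stage
```

Because the overwhelming majority of windows in a frame are background, average cost per window stays close to the cost of stage one alone, which is where the real-time speed comes from.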
And at the time, to give you some idea of what the computational technology was like in 2001,
this was presented running on a 700 MHz Pentium III at 15 frames a second,
which was totally unheard of back then. Face detection was the kind of thing you did offline at that time.
So this is a really, really cool algorithm and it's so effective
that you still see it used in, you know, in your camera phone
and in this camera and so on, when you just get a little bounding box around the face and this is still really useful
because you might be doing deep learning on something like face recognition, face ID something like this
But part of that process is firstly working out where the face is, and why reinvent the wheel when this technique works really really well