How Face ID Works... Probably - Computerphile

  • Don't mind me, I'm just shuffling pictures of Computerphile people's faces. It's not at all weird...

  • In the last video we talked about how you find faces quickly in an image.

  • That's, I guess, only half the story nowadays: if you want to unlock your phone with your face, or unlock your computer, or you just want to recognize who's in a picture,

  • that's face recognition, not face detection. We can't just train a classifier;

  • we can't just say, here's 1,500 images of Sean and 1,500 images of Mike,

  • work out what the difference is, because it will do that, and then I'll show it a picture of Steve and it will go

  • "Mike", you know, because it's only got two options. So then we'd have to retrain it, and you'll notice that when you sign up

  • for your phone the first time and it recognizes your face,

  • it doesn't have to train a network, right, because that would take way too long. So how does it do it?

  • Well, the answer is basically that we train a network to distinguish the differences between faces rather than actually recognizing individual faces.

  • I've got some printouts here: this is Max,

  • here's me, Sean and Dave and so on, right? I've got lots and lots of Computerphile hosts in here.

  • So what I could do is I could say, well, here's a picture of Max and here's a picture of Mike,

  • so I have some, you know,

  • convolutional layers or something, and I have a deep network here that goes all the way up to a

  • classification that lights up with Max or Mike.

  • The problem is that when we bring in Sean, everything's ruined. You've pulled a funny face;

  • that doesn't help. The standard way of training a network, which is giving it an image and a label and saying learn to be better

  • at predicting that label, isn't going to work, because we don't know how many people are going to use this system.

  • We can't put them all in,

  • right? Otherwise companies would have been tapping you up for face pictures before they even released the phone. So we say, well,

  • why don't we train a network, instead of saying this is definitely someone, to just say these are their features, right?

  • And hopefully, if it's good at it, it will say that their features are different to someone else's features. That's the idea.

  • So what we're actually doing is training a network to separate people out.

  • Let's say you put me in, and this network that I'm designing has a lot of layers in it, all the way along here,

  • but instead of outputting a single decision as to who this is, it outputs a series of numbers,

  • so let's say a vector of numbers here, like this.

  • It doesn't really matter how many there are for now,

  • and what we're saying is, when we put me in, these numbers need to be different than when we put

  • Max in, or when we put, let's see,

  • who else have we got... Dave, right. When we put Dave in, his numbers come out different to mine, right?

  • And it's those numbers which are kind of like a fingerprint for each person, as in the sketch below. So, how do we do this?

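A minimal sketch of what such an embedding network might look like, written here in PyTorch; the layer sizes, the 160x160 input and the 128-number output are illustrative assumptions rather than what any real phone uses.

```python
# Toy face-embedding network: a face image in, a vector of numbers out.
# Layer sizes and the 128-dimensional output are illustrative assumptions.
import torch
import torch.nn as nn

class FaceEmbedder(nn.Module):
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, embedding_dim)

    def forward(self, x):
        z = self.features(x).flatten(1)            # (batch, 128)
        z = self.head(z)                           # (batch, embedding_dim)
        # Normalizing to unit length keeps distances between people comparable.
        return nn.functional.normalize(z, dim=1)

# One 3-channel 160x160 face crop in, one 128-number "fingerprint" out.
embedder = FaceEmbedder()
fingerprint = embedder(torch.randn(1, 3, 160, 160))
print(fingerprint.shape)                           # torch.Size([1, 128])
```
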
  • Well,

  • we use a special kind of learning, or a special kind of loss function, called a triplet loss.

  • Right, and this is one of the ways you can do it; there are a few.

  • So what we say is, we put in three images at once.

  • So we say,

  • here's two images of me and one image of Dave, and what we want to do is

  • change the network so that when we put these images through, these two are rated as very similar and these two pairs

  • are rated as very different.

  • And actually what we'd usually do is label this one an anchor, this one a positive sample and this one a negative sample.

  • So we're saying the distance between these two has to be very small and the distance between these has to be very large; the sketch below shows the idea.

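A small sketch of that triplet loss, again assuming PyTorch; the 0.2 margin and the random embeddings are made-up values for illustration. (PyTorch also ships a built-in `nn.TripletMarginLoss` that does essentially this.)

```python
# Triplet loss sketch: pull the anchor and positive embeddings together,
# push the anchor and negative apart by at least a margin.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Squared Euclidean distances between embedding vectors.
    d_pos = (anchor - positive).pow(2).sum(dim=1)
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    # Zero loss once the negative is at least `margin` further away than the positive.
    return F.relu(d_pos - d_neg + margin).mean()

# Three batches of embeddings: two of the same person, one of someone else.
anchor   = torch.randn(4, 128)
positive = anchor + 0.05 * torch.randn(4, 128)   # similar to the anchor
negative = torch.randn(4, 128)                   # a different person
print(triplet_loss(anchor, positive, negative))
```
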
  • So let's imagine there were only two numbers out, so we're putting ourselves on a sort of 2D grid, right?

  • So this is variable one and this is variable two that come out of our network, right?

  • So this is our network: my anchor is a picture of me, along with a positive sample and a negative sample, which is Dave,

  • right?

  • So I put them through the network, and what we train it to do is separate out the pictures of me and the pictures of

  • Dave. So I maybe get put over here,

  • so I get a very high value for number two and a very low value for number one, let's say, all right?

  • Dave gets a very high value for number one and a very low value for number two.

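To make that 2D picture concrete, a tiny worked example with made-up coordinates for the anchor, positive and negative:

```python
# Made-up 2D embeddings for the example above: "me" gets a low value for
# variable one and a high value for variable two; Dave is the other way round.
import math

me_anchor   = (0.1, 0.9)    # anchor: a picture of me
me_positive = (0.15, 0.85)  # positive: another picture of me
dave        = (0.9, 0.1)    # negative: a picture of Dave

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

print(dist(me_anchor, me_positive))  # ~0.07, small: same person
print(dist(me_anchor, dave))         # ~1.13, large: different person
```
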
  • And then we start to repeat this process with different pairs of people and different positive and negative samples. So let's say...

  • I mean, why did I shuffle these?

  • Okay, so let's say two pictures of, er, Miles, who didn't make it to my printer, and one picture of Sean, right?

  • So maybe Miles gets put over here near me, which is not so good,

  • but we'll get to that, and then you're put over here like this, and then maybe later on

  • we have two pictures of me and one of Rob, which moves Rob down here a little bit, and then Dave gets put over here,

  • and, you know, Max gets put over here somewhere; negative values are also allowed. And what we're trying to do

  • is make sure that everyone is nicely separated, okay?

  • Now, if you do this for just a few people, what you're actually doing is just classifying them, but if you do this for

  • thousands of different humans, of all different ethnicities and different poses and different lighting conditions, eventually

  • the network is going to start to learn how to... I mean, actually that's not right, because Dave's far away from Dave, right?

  • So hopefully those start to come together,

  • but you've got to train for a long time; something like the loop sketched below, run over a huge training set.

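A rough sketch of how that long training run might be wired up, reusing the hypothetical FaceEmbedder and triplet_loss from the earlier sketches; `faces_by_person` is an assumed mapping from a person to their face images, and real systems choose their triplets far more carefully than this.

```python
# Training-loop sketch over many identities. Assumes `faces_by_person` maps
# each person id to a list of at least two face tensors, and that FaceEmbedder
# and triplet_loss from the sketches above are in scope.
import random
import torch

def sample_triplet(faces_by_person):
    person, other = random.sample(list(faces_by_person), 2)
    anchor, positive = random.sample(faces_by_person[person], 2)
    negative = random.choice(faces_by_person[other])
    return anchor, positive, negative

def train(embedder, faces_by_person, steps=10_000, lr=1e-4):
    optimizer = torch.optim.Adam(embedder.parameters(), lr=lr)
    for _ in range(steps):
        a, p, n = sample_triplet(faces_by_person)
        loss = triplet_loss(embedder(a.unsqueeze(0)),
                            embedder(p.unsqueeze(0)),
                            embedder(n.unsqueeze(0)))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
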
  • And let's not let Steve off the hook: Steve is over here, high value of two, high value of one, whatever.

  • The interesting thing about this is we're not performing a classification; we're performing a dimensionality reduction.

  • We're saying, how do we represent people as just these two numbers, right, or, in the case of actual

  • deployments of this, maybe 128 or 256 numbers, somewhere in this space?

  • When you put my face in I'll appear somewhere, and when you put Steve's face in it'll appear somewhere else, and this actually solves a really

  • nice problem, right? It's called the one-shot learning problem:

  • how do we convince a phone to let me in having only ever seen one picture of my face,

  • which is when I, you know,

  • calibrated it the first time? And the answer is we don't train a neural network to classify me.

  • We just use the existing network that we trained on thousands and thousands of people doing this

  • to put me somewhere on here, and then we record that location. Then when I come in again and try to unlock the phone,

  • does my new image go to the same place in this space as my last one?

  • So let's say I get put over here, with a high value of two and a low value of one.

  • I take another picture of myself on my camera and I come in over here, and it goes, well, that's pretty close,

  • okay, we'll let them unlock the phone. Right, but Max comes in and gets put over here; that's judged as too big a difference and

  • access is denied, right? This is how it works, roughly like the enroll-and-check sketch below.

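A sketch of that enroll-once, check-by-distance step, under the same assumptions as the earlier sketches; the 0.6 threshold is an arbitrary illustrative number, not anything a phone maker has published.

```python
# Enrollment and unlocking: no retraining, just a distance check in embedding space.
import torch

def enroll(embedder, my_face):
    # Run the face through the pre-trained network once and store the result.
    with torch.no_grad():
        return embedder(my_face.unsqueeze(0))

def try_unlock(embedder, stored_fingerprint, new_face, threshold=0.6):
    with torch.no_grad():
        candidate = embedder(new_face.unsqueeze(0))
    distance = torch.dist(stored_fingerprint, candidate).item()
    return distance < threshold   # close enough: unlock; too far: access denied

# enrolled = enroll(embedder, my_first_photo)
# try_unlock(embedder, enrolled, my_new_photo)   # hopefully True
# try_unlock(embedder, enrolled, max_photo)      # hopefully False
```
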
  • And this is really clever, because it means that the actual decision-making process on whether you're allowed in or not

  • is based on just the distance between these numbers, right, which in this case is like 128 numbers. Surely this is

  • susceptible to problems?

  • Yeah.

  • So it is, and this is one of the things that Apple, for example, with their Face ID

  • have... yeah, bear in mind of course they haven't told me how they do it, right?

  • Nor would they, but we can presume it works something like this; they have a depth camera as well.

  • But they will have included in their training set

  • pictures of people in masks and pictures of people with different hair and pictures of people in strange locations and things,

  • so the network learns to ignore those things, right?

  • If you never showed those to the network, you're right, it would just misclassify all the time.

  • That's the problem:

  • if you only train this to separate me and Dave, when you put Steve in, its behavior is going to be undefined,

  • right? But that's kind of how neural networks work; they're often undefined, and you hope that you've put in a good enough training set.

  • But for the vast majority,

  • 99.999% of cases, it works very consistently, and it says, no, they come out over here, which is not close enough,

  • so we're not unlocking the phone.

  • The interesting thing is it's much harder to game this system than it is to game a system based on simple decision-making, right?

  • So yes, you might be able to trick this into unlocking a phone once or twice, right?

  • But if you try and recreate that same process with my face and unlock my phone, for example,

  • maybe you won't have as much luck, because

  • exactly how this network works isn't clear even to the people that trained it,

  • which is kind of its strength in this case, right? Maybe it's security through obscurity, right?

  • Maybe there's a certain thing you can hold up in front of it and it'll always unlock, right?

  • It doesn't seem very likely, but we don't know until we find those things. So

  • it's there as an interesting one for further study.

  • I guess you were mentioning these features here; people will, I'm sure, wonder what they can be. Have we got the hair,

  • we've got... I mean, is that what's going on?

  • Okay, we don't know, right? So we call this a feature space or latent space.

  • It's a kind of space just before

  • classification where you've got features, but what these features in a deep network are, I mean,

  • we've had a look sort of inside a neural network before, and they're kind of a sort of

  • combination of edges and things like this. For something trained on human faces,

  • it's going to be broadly the kind of face-related features, because otherwise it wouldn't work as a trained network.

  • But exactly what it does, we don't know. Does it weight hair as more important than eye color?

  • I don't know, and neither do the people that run it, I expect. They've trained with different haircuts

  • so that they forgo this kind of issue,

  • but of course you have to be careful doing that, right?

  • Because if you can shave your head and still unlock the phone, is that as secure as a phone

  • where you couldn't do that? But then it's not as usable.

  • So that's the other reason they do it. But you get the idea, and you'll have noticed from this two-dimensional space,

  • which I've used just for simplicity, that it becomes difficult to separate out everyone in this space. So who else have we got?

  • So you're in here, you're over here, so maybe your images are here, Rob's images are here,

  • and they start to take up quite a lot of room once you've got a few of them.

  • And that's a bit of a weird one, right? So that goes over here, and Max is over here,

  • so it's getting a bit cluttered, right?

  • So the decision on whether to unlock a phone becomes more difficult, which is why we don't usually do this in two dimensions;

  • we do it in 128 or 256 dimensions, where spacing these things out for many, many people is much, much easier, as the little experiment below suggests.

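A quick illustration, again assuming PyTorch, of why the higher-dimensional space is easier to keep uncluttered: random unit-length embeddings in 128 dimensions sit much further apart than in two.

```python
# Random unit vectors in 2 vs 128 dimensions: the smallest gap between any
# two "people" is tiny in 2D but stays comfortably large in 128D.
import torch

def min_pairwise_distance(num_people, dims):
    points = torch.nn.functional.normalize(torch.randn(num_people, dims), dim=1)
    d = torch.cdist(points, points)         # all pairwise distances
    d.fill_diagonal_(float("inf"))          # ignore each point's distance to itself
    return d.min().item()

print(min_pairwise_distance(1000, 2))       # tiny: 2D gets crowded fast
print(min_pairwise_distance(1000, 128))     # much larger: plenty of room
```
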
  • I would say it's likely that someone on Earth will be able to unlock

  • someone else's phone like this, because they look similar enough to them,

  • but the chances of that person being the one that steals your phone are pretty slim, so I really wouldn't worry about it too much.

