Don't mind me, I'm just shuffling pictures of Computerphile people's faces. It's not at all weird.

In the last video we talked about how you find faces quickly in an image. Nowadays that's only half the story, I guess. If you want to unlock your phone with your face, or unlock your computer, or you just want to recognize who's in a picture, that's face recognition, not face detection.

We can't just train a classifier. We can't just say, "Here's 1,500 images of Sean and 1,500 images of Mike, work out what the difference is," because it will do that, and then I'll say, "Well, here's a picture of Steve," and it will go "Mike!", because it's only got two options. So then we'd have to retrain it. And you'll notice that when you set up your phone for the first time and it learns your face, it doesn't have to train a network, because that would take way too long. So how does it do it? Well, the answer is that we basically train a network to distinguish the differences between faces, rather than to recognize individual faces.

I've got some printed pictures here: this is Max, here's me, Sean, Dave and so on. I've got lots and lots of Computerphile hosts in here. So what I could do is say, well, here's a picture of Max and here's a picture of Mike. I have some convolutional layers or something, a deep network that goes all the way up to a classification that lights up with either Max or Mike. The problem is that when we bring in Sean, everything's ruined, and you pulling a funny face doesn't help. The standard way of training a network, which is giving it an image and a label and saying "learn to be better at predicting that", isn't going to work, because we don't know how many people are going to use this system. We can't put them all in; otherwise companies would have been tapping you up for face pictures before they even released the phone.

So we say, well, why don't we train a network that, instead of outputting identities, just says "these are their features"? And hopefully, if it's good at it, it will say that their features are different from someone else's features. That's the idea.

So what we're actually doing is training a network to separate people out. Let's say you put me in. This network that I'm designing has a lot of layers in it, all the way along here, but instead of outputting a single decision as to who this is, it outputs a series of numbers, a vector of numbers like this. It doesn't matter how many there are for now. What we're saying is that when we put me in, these numbers need to be different from when we put Max in, or when we put in, let's see, who else have we got? Dave. When we put Dave in, his numbers come out different from mine. And it's those numbers that are kind of like a fingerprint for each person.

So how do we do this? Well, we use a special kind of learning, or rather a special kind of loss function, called a triplet loss. This is one of the ways you can do it; there are a few. We put in three images at once: here's two images of me and one image of Dave. What we want to do is change the network so that when we put these through, the two pictures of me are rated as very similar and the pair of me and Dave is rated as very different. Usually we'd label this one the anchor, this one a positive sample and this one a negative sample. So we're saying the distance between the anchor and the positive has to be very small, and the distance between the anchor and the negative has to be very large.

So let's imagine there were only two numbers coming out, so we're putting ourselves on a sort of 2D grid. This is variable one and this is variable two, the two numbers that come out of our network.
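The triplet loss described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the method of any particular product: it uses squared Euclidean distance with a hinge and a margin (the `margin=0.2` default is an illustrative assumption; the transcript does not mention a margin), and instead of back-propagating into a network it moves the three 2D points directly, as in the whiteboard picture. The names `triplet_loss` and `triplet_step` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec = List[float]


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.2) -> float:
    """Zero once the negative is at least `margin` further from the anchor
    (in squared distance) than the positive is; positive otherwise."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def triplet_step(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], lr: float = 0.1,
                 margin: float = 0.2) -> Tuple[Vec, Vec, Vec]:
    """One gradient-descent step on the triplet loss, treating the three
    embeddings as free parameters (a toy stand-in for updating weights)."""
    if triplet_loss(anchor, positive, negative, margin) == 0.0:
        # Hinge is inactive, so the gradient is zero: nothing moves.
        return list(anchor), list(positive), list(negative)
    # Gradients of d_pos - d_neg:
    #   w.r.t. anchor:   2 * (negative - positive)
    #   w.r.t. positive: 2 * (positive - anchor)   -> positive moves towards anchor
    #   w.r.t. negative: 2 * (anchor - negative)   -> negative moves away from anchor
    grad_a = [2 * (n - p) for p, n in zip(positive, negative)]
    grad_p = [2 * (p - a) for a, p in zip(anchor, positive)]
    grad_n = [2 * (a - n) for a, n in zip(anchor, negative)]

    def move(xs: Sequence[float], grads: Sequence[float]) -> Vec:
        return [x - lr * g for x, g in zip(xs, grads)]

    return move(anchor, grad_a), move(positive, grad_p), move(negative, grad_n)


# Two pictures of me that start far apart, and Dave sitting too close to me.
me_anchor, me_positive, dave = [0.0, 1.0], [1.0, 0.0], [0.2, 0.8]
for step in range(8):
    print(f"step {step}: loss = {triplet_loss(me_anchor, me_positive, dave):.4f}")
    me_anchor, me_positive, dave = triplet_step(me_anchor, me_positive, dave)
```

In a trained network the equivalent update would be applied to the weights through backpropagation rather than to the points themselves, so every face the network maps is affected at once.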
So this is our network. My anchor is a picture of me, then there's a positive sample and a negative sample, which is Dave. I put them through the network, and what we train it to do is separate out the pictures of me and the pictures of Dave. So maybe I get put over here: I get a very high value for number two and a very low value for number one, let's say. Dave gets a very high value for number one and a very low value for number two.

And then we repeat this process with different pairs of people, different positives and different negative samples. So let's say... why did I shuffle these? Okay. So let's say two pictures of Rob Miles and one picture of Sean. So maybe Rob gets put over here near me, which is not so good, but we'll get to that, and then you're put over here like this. Then maybe later on we have two pictures of me and one of Rob, which moves Rob down here a little bit, and then Dave gets put over here. And Max gets put over here somewhere; negative values are also allowed. What we're trying to do is make sure that everyone is nicely separated, okay?

Now, if you do this for just a few people, what you're actually doing is just classifying them. But if you do this for thousands of different humans, of all different ethnicities, in different poses and different lighting conditions, eventually the network is going to start to learn how to... I mean, actually, that's not right, because Dave is far away from Dave here. So hopefully those start to come together, but you've got to train for a long time. And let's not let Steve off the hook: Steve is over here, with a high value of two and a high value of one, whatever that means.

The interesting thing about this is that we're not performing a classification, we're performing a dimensionality reduction. We're asking how we represent people as just these two numbers, or, in actual deployments of this, maybe 128 or 256 numbers. Somewhere in this space, when you put my face in, I'll appear, and when you put Steve's face in, he'll appear somewhere else.

And this actually solves a really nice problem. It's called the one-shot learning problem: how do we convince a phone to let me in, having only ever seen one picture of my face, taken when I first calibrated it? The answer is that we don't train a neural network to classify me. We just use the existing network, which we trained on thousands and thousands of people doing this, to put me somewhere on here, and then we record that location. Then, when I come back and try to unlock the phone, the question is: does my new image go to the same place in this space as my last one? So let's say I get put over here, with a high value of two and a low value of one. I take another picture of myself with my camera, I come in over here, and it goes, "Well, that's pretty close, okay, we'll let them unlock the phone." But Max comes in and gets put over here; that's judged to be too big a difference, and access is denied. That's how it works.

And this is really clever, because it means that the actual decision on whether you're allowed in or not is based on just the distance between these numbers, which in this case is something like 128 numbers.

Sure. But this is susceptible to problems?

Yeah, it is, and this is one of the things that Apple, for example, has to handle with Face ID. Bear in mind, of course, they haven't told me how they do it, and nor would they, but we can presume it works something like this. They have a depth camera as well. But they will have included in their training set pictures of people in masks, pictures of people with different hair, pictures of people in strange locations and things, so the network learns to ignore those things.
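The one-shot unlock described above reduces to storing one embedding and comparing distances. Below is a minimal sketch that assumes the embeddings have already been produced by a trained network (there is no network here; the 2D vectors stand in for its output). The `FaceUnlock` class, its `threshold=0.6` default and the `error_rates` helper are all hypothetical. `error_rates` shows how the choice of threshold trades false accepts (Max getting in) against false rejects (me being locked out), using made-up distances.

```python
import math
from typing import List, Optional, Sequence, Tuple


class FaceUnlock:
    """Toy one-shot verifier: enrol one embedding, then accept any later
    embedding that lands within `threshold` (Euclidean distance) of it."""

    def __init__(self, threshold: float = 0.6) -> None:
        self.threshold = threshold
        self.enrolled: Optional[List[float]] = None

    def enroll(self, embedding: Sequence[float]) -> None:
        # One-shot: record a single location in embedding space; no retraining.
        self.enrolled = list(embedding)

    def try_unlock(self, embedding: Sequence[float]) -> bool:
        if self.enrolled is None:
            raise RuntimeError("no face enrolled yet")
        return math.dist(self.enrolled, embedding) <= self.threshold


def error_rates(genuine: Sequence[float], impostor: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false_accept_rate, false_reject_rate) at a distance threshold.
    `genuine`: distances between two images of the same person.
    `impostor`: distances between images of different people."""
    if not genuine or not impostor:
        raise ValueError("need at least one genuine and one impostor distance")
    false_accepts = sum(d <= threshold for d in impostor)
    false_rejects = sum(d > threshold for d in genuine)
    return false_accepts / len(impostor), false_rejects / len(genuine)


phone = FaceUnlock(threshold=0.6)
phone.enroll([0.2, 0.9])               # me: high value of two, low value of one
print(phone.try_unlock([0.25, 0.85]))  # a new photo of me lands close by
print(phone.try_unlock([-0.7, -0.4]))  # Max lands far away

# Made-up distances, purely to show the trade-off as the threshold moves.
same_person = [0.1, 0.2, 0.35, 0.7]
different_people = [0.5, 0.9, 1.2, 1.4]
for t in (0.3, 0.6):
    print(t, error_rates(same_person, different_people, t))
```

A tighter threshold makes false accepts rarer at the cost of rejecting the owner more often, which is why a real system would tune it on a large set of genuine and impostor pairs.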
If you never showed those things to the network, then you're right, it will just misclassify all the time. That's the problem. If you only train this to separate me and Dave, then when you put Steve in, its behavior is going to be undefined. But that's kind of how neural networks work: they're often undefined, and you hope that you've put in a good enough training set. For the vast majority, 99.999% of cases, it works very consistently: it says no, they come out over here, which is not close enough, so we're not unlocking the phone.

The interesting thing is that it's much harder to game this system than it is to game a system based on simple decision-making. So yes, you might be able to trick this into unlocking a phone once or twice. But if you try to recreate that same process with my face and unlock my phone, for example, maybe you won't have as much luck, because exactly how its network works isn't clear, even to the people that trained it. Which is kind of its strength in this case. Maybe it's security through obscurity, right? Maybe there's a certain thing you can hold up in front of it and it'll always unlock. It doesn't seem very likely, but we don't know until we find those things. So it's an interesting one for further study.

I guess you were mentioning these features here. People would assume they're things like what colour eyes we've got, or our hair. Is that what's going on?

Okay, we don't know, right? We call this feature space, or latent space. It's a kind of space just before classification, where you've got features. But these features in a deep network are... well, we've had a look inside a neural network before, and they're sort of combinations of edges and things like this. If it's trained on human faces, it's going to be broadly face-related features, because otherwise it wouldn't work as a trained network. But what exactly does it do?
We don't know. Does it weight hair as more important than eye colour? I don't know, and neither do the people that run it. I expect they've trained it with different haircuts, so that they avoid this kind of issue. But of course you have to be careful doing that, because if you can shave your head and still unlock the phone, is that as secure? But if you couldn't, it wouldn't be usable, so that's the other reason they do it. But you get the idea.

You'll notice from this two-dimensional space, which I've drawn just for simplicity, that it becomes difficult to separate out everyone. So who else have we got? You're in here... you're over here. So maybe your images are here, Rob's images are here. They start to take up quite a lot of room once you've got a few of them. And that's a bit of a weird one, so that goes over here, and Max's are over here. So it's getting a bit cluttered, and the decision on whether to unlock a phone becomes more difficult. That's why we don't usually do this in two dimensions; we do it in 128 or 256 dimensions, where spacing these things out for many, many people is much, much easier.

I would say it's likely that someone on Earth will be able to unlock someone else's phone like this, because they look similar enough to them. But the chances of that person being the one that steals your phone are pretty slim, so I really wouldn't worry about it too much.
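The point about two dimensions getting cluttered can be checked numerically. This sketch assumes, purely for illustration, that each person's embedding is a random point on the unit sphere (real trained embeddings are not random, and unit-length normalisation is an assumption here), and compares the closest pair among 200 "people" in 2 and in 128 dimensions. The function names `random_unit_vector`, `closest_pair_distance` and `crowding` are hypothetical.

```python
import math
import random
from typing import List


def random_unit_vector(dim: int, rng: random.Random) -> List[float]:
    """A uniformly random direction in `dim` dimensions (normalised Gaussian)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def closest_pair_distance(points: List[List[float]]) -> float:
    """Smallest Euclidean distance between any two points (brute force, O(n^2))."""
    return min(
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )


def crowding(num_people: int, dim: int, seed: int = 0) -> float:
    """Closest-pair distance among `num_people` random unit-length 'embeddings'."""
    rng = random.Random(seed)
    people = [random_unit_vector(dim, rng) for _ in range(num_people)]
    return closest_pair_distance(people)


for dim in (2, 128):
    print(f"{dim:>3} dimensions: closest pair is {crowding(200, dim):.3f} apart")
```

In two dimensions the closest pair is forced to be close: 200 points on a circle cannot all be more than about 2π/200 ≈ 0.031 apart. In 128 dimensions random directions are close to orthogonal, so even the closest pair typically sits far apart, which is the extra room that higher-dimensional embeddings provide.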
How Face ID Works... Probably - Computerphile. Posted by 林宜悉 on January 14, 2021.