>> So it's my pleasure to introduce to you Geoff Hinton, who is a pioneer in machine learning and neural nets, and more recently [INDISTINCT] architectures. And I think that's going to be the topic of today. So take it over. >> HINTON: Okay. So I gave a talk here a couple of years ago. The first 10 minutes or so will be an overview of what I said there, and then I'll talk about the new stuff. The new stuff consists of a better learning module. It allows you to learn all sorts of different things better, like learning how images transform, learning how people walk, and learning object recognition. So the basic learning module consists of some variables that represent things like pixels, and these will be binary variables, and some latent variables, which are also going to be binary. And there's bipartite connectivity: visible units connect to hidden units, but there are no connections within a layer. That makes it very easy, if I give you the states of the visible variables, to infer the states of the hidden variables. They're all independent given the visible variables because it's an undirected graph. And the inference procedure just says: the probability of turning on hidden unit hj, given the visible vector v, is the logistic function of the total input it gets from the visible units, so inference is very simple for the hidden variables. Given the hidden variables, we can also infer the visible variables very simply. And if we put some weights on the connections and we want to know what this model believes, we can just go back and forth, inferring all the hidden variables in parallel, then all the visible ones. Do that for a long time, and then you'll see examples of the kinds of things it likes to believe. And the aim of learning is going to be to get it to like to believe the kinds of things that actually happen. So this thing is governed by an energy function, given the weights on the connections.
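The conditional independence Hinton describes makes each inference step a one-liner. Here is a minimal numpy sketch of the two conditionals and the back-and-forth "fantasizing" chain; the function and variable names are mine, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b_h):
    """p(h_j = 1 | v) = sigmoid(total input from the visibles); sample binary states."""
    p = sigmoid(v @ W + b_h)
    return p, (rng.random(p.shape) < p).astype(float)

def sample_visible(h, W, b_v):
    """p(v_i = 1 | h) = sigmoid(total input from the hiddens); symmetric weights."""
    p = sigmoid(h @ W.T + b_v)
    return p, (rng.random(p.shape) < p).astype(float)

def fantasize(v0, W, b_v, b_h, steps=1000):
    """Alternate hidden/visible updates for a long time; the samples you
    end up with show what the model 'likes to believe'."""
    v = v0
    for _ in range(steps):
        _, h = sample_hidden(v, W, b_h)
        _, v = sample_visible(h, W, b_v)
    return v
```

All hidden units are updated in parallel, then all visible units, exactly because each layer is conditionally independent given the other.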
The energy of a visible vector plus a hidden vector is minus the sum, over all connections, of the weight when both the visible and hidden units are active. So I go through the pairs of units that are both active, adding in the weight, and if it's a big positive weight, that's low energy, which is good. So, it's a happy network. This has nice derivatives. If you differentiate the energy with respect to a weight, you get the product of the visible and hidden activities. And that derivative is going to show up a lot in the learning, because that derivative is how you change the energy of a combined configuration of visible and hidden units. The probability of a combined configuration, given the energy function, is e to the minus the energy of that combined configuration, normalized by the partition function. And if you want to know the probability of a particular visible vector, you have to sum over all the hidden vectors that might go with it, and that gives the probability of the visible vector. If you want to change the weights to make this probability higher, you need to lower the energies of combinations of this visible vector with the hidden vectors that like to go with it, and raise the energies of all other combinations, so you decrease the competition. The correct maximum likelihood learning rule, that is, how to change the weights so as to increase the log probability that this network would generate the vector v when you let it just fantasize about the things it likes to believe, has a nice simple form. It's just the difference of two correlations. So even though the gradient depends on all the other weights, it shows up as this difference of correlations. And what you do is you take your data, you activate the hidden units, then you reconstruct, activate, reconstruct, activate. So this is a Markov chain. You run it for a long time, so it forgets where it started. And then you measure the correlation at that far end and subtract it from the correlation at the start.
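The "quicker" version alluded to next is contrastive divergence with a single reconstruction step (CD-1). A hedged numpy sketch of one weight update, the difference of the two correlations; names are mine, and I use real-valued probabilities for the reconstruction statistics, which is a common practical choice rather than something stated in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, lr=0.1):
    """One CD-1 step: W += lr * (<v h>_data - <v h>_reconstruction)."""
    # Up: infer hidden probabilities from the data, sample binary states.
    ph0 = sigmoid(v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Down and up again: reconstruct the visibles, re-infer the hiddens.
    v1 = sigmoid(h0 @ W.T)      # reconstruction (probabilities)
    ph1 = sigmoid(v1 @ W)
    # Difference of correlations, averaged over the batch.
    pos = v0.T @ ph0            # lower the energy of data configurations
    neg = v1.T @ ph1            # raise the energy of fantasized ones
    return W + lr * (pos - neg) / v0.shape[0]
```

When the two correlations match, the update vanishes: the model already believes in the data.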
And what you're really doing is saying: by changing the weights in proportion to the first correlation, I'm lowering the energy of this visible vector with whatever hidden vector it chose; by doing the opposite with the second, I'm raising the energy of the things I fantasize. And so, what I'm trying to do is believe in the data and not believe in what the model believes in. Eventually, this correlation will be the same as that one, and then nothing will happen, because it believes in the data. It turns out that you can get a much quicker learning algorithm where you just go down and up again once and take this difference of correlations. Justifying that is hard, but the main justification is it works and it's quick. The reason this module is interesting, the main reason it's interesting, is you can stack them up. That is, for complicated reasons I'm not going to go into, it works very well to train the module, then take the activities of the feature detectors, treat them as if they were data, and train another module on top of that. So the first module is trying to model what's going on in the pixels by using these feature detectors. And the feature detectors will tend to be highly correlated. The second module is trying to model the correlations among feature detectors. And you can guarantee that if you do it right, every time you go up a level, you get a better model of the data. Actually, you can only guarantee that the first time you go up a level. For further levels, all you can guarantee is that there's a bound on how good your model of the data is, and every time we add another level, that bound improves if we do it right. Having got this guarantee that something good is happening as we add more levels, we then violate all the conditions of the mathematics and just add more levels in a sort of [INDISTINCT] way, because we know good things are going to happen, and then we justify it by the fact that good things do happen.
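The stacking recipe, train a module and then treat its feature activities as data for the next one, can be written as a short loop. A rough sketch, with a tiny CD-1 trainer inlined so it runs on its own; none of the names or hyperparameters come from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=5, lr=0.1):
    """Tiny full-batch CD-1 trainer; returns the weights of one module."""
    W = 0.01 * rng.standard_normal((data.shape[1], n_hidden))
    for _ in range(epochs):
        ph0 = sigmoid(data @ W)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        v1 = sigmoid(h0 @ W.T)
        ph1 = sigmoid(v1 @ W)
        W += lr * (data.T @ ph0 - v1.T @ ph1) / data.shape[0]
    return W

def train_stack(data, layer_sizes):
    """Greedy layer-wise training: each module models the feature
    activities of the module below, as described in the talk."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden)
        weights.append(W)
        x = sigmoid(x @ W)   # treat the hidden activities as the next 'data'
    return weights
```

The guarantee discussed above applies to this greedy procedure: each added layer, trained properly, improves a variational bound on the log probability of the data.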
This allows us to learn many layers of feature detectors entirely unsupervised, just to model the structure of the data. Once we've done that, well, you can't get that accepted at a machine learning conference, because you have to do discrimination to be accepted at a machine learning conference. So once you've done that, you add some decision units to the top, you learn the connections between the top-layer features and the decision units discriminatively, and then, if you want, you can go back and fine-tune all of the connections using backpropagation. That overcomes the limitation of backpropagation, which is that there's not much information in a label, and it can only learn on labeled data. These things can learn on large amounts of unlabeled data. After they've learned, you add these units at the top and backpropagate from the small amount of labeled data, and that's not designing the feature detectors anymore. As you probably know at Google, designing feature detectors is the art of these things, and you'd like to design feature detectors based on what's in the data, not based on having to produce labeled data. So the idea of backpropagation was: design your feature detectors so you're good at getting the right answer. The idea here is: design your feature detectors to be good at modeling whatever is going on in the data. Once you've done that, just fine-tune them slightly so you get the right answer a bit better. But don't try and use the answer to design the feature detectors. And Yoshua Bengio's lab has done lots of work showing that this gives you better minima than just doing backpropagation, and what's more, minima in a completely different part of the space. So just to summarize this section, I think this is the most important slide in the talk, because it says what was wrong with machine learning up to a few years ago. What people in machine learning would try to do was learn the mapping from an image to a label.
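The step of adding decision units and learning just those connections discriminatively amounts to fitting a softmax classifier on the frozen top-layer features. A minimal sketch under that reading; the names and the plain gradient-descent loop are my illustration, not the talk's recipe:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_decision_units(features, labels, n_classes, epochs=200, lr=0.5):
    """Fit softmax weights on fixed features via the cross-entropy gradient.
    The features themselves are not changed; full backprop fine-tuning
    would additionally push gradients down into the feature layers."""
    n, d = features.shape
    W = np.zeros((d, n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        p = softmax(features @ W)
        W -= lr * features.T @ (p - onehot) / n
    return W
```

Because the unsupervised layers already did the hard work of building features, this last supervised step only needs a small amount of labeled data.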
Now, that would be a fine thing to do if you believed that images and labels arose in the following way: there's stuff in the world, the stuff gives rise to images, and then the images give rise to the labels; given the image, the labels don't depend on the stuff. But you don't really believe that. You'd only believe that if the label were something like the parity of the pixels in the image. What you really believe is that the stuff gives rise to images, and then the labels go with the images because of the stuff, not because of the image. So there's a cow in a field and you say "cow". Now, if I just say "cow" to you, you don't know whether the cow is brown or black, or upright or dead, or far away. If I show you an image of the cow, you know all those things. So this is a very high bandwidth path, and this is a very low bandwidth path. The right way to associate labels with images is to first learn to invert the high bandwidth path. And we know we can do that, because vision works, basically. The first time you look at a scene, you see things, and it's not like it might be a cow, it might be an elephant, it might be [INDISTINCT]. Basically, you get it right nearly all the time. And so we can invert that pathway. Having learned to do that, we can then learn what things are called. But you get the concept of a cow not from the name, but from seeing what's going on in the world, and only later do you attach the name from the label. Now, I need to make one slight modification to the basic module: I had binary units as the observables, and now we want linear units with Gaussian noise. So we just change the energy function. The energy now says: I've got a kind of parabolic containment here. Each of these linear visible units has a bias, which is like its mean. It would like to sit at its bias, and moving away from there costs energy. The parabola is the negative log of the Gaussian.
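The modified energy function Hinton is describing verbally is the standard Gaussian-Bernoulli RBM energy. A reconstruction in conventional notation (the symbols are the textbook ones, not necessarily those on the slide):

```latex
E(\mathbf{v},\mathbf{h}) \;=\;
  \sum_{i\in\mathrm{vis}} \frac{(v_i - b_i)^2}{2\sigma_i^2}
  \;-\; \sum_{j\in\mathrm{hid}} c_j h_j
  \;-\; \sum_{i,j} \frac{v_i}{\sigma_i}\, h_j\, w_{ij}
```

The first term is the parabolic containment, the negative log of a Gaussian centered at the bias b_i with standard deviation sigma_i; the last term is the input from the hidden units, with each v_i scaled by its standard deviation.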
And then there's the input that comes from the hidden units. This is just vi hj wij, but the v's have to be scaled by the standard deviation of the Gaussian. If I differentiate that with respect to a visible activity, what I get is the sum over j of hj wij divided by sigma i, and that's like an energy gradient. And what the visible unit does when you reconstruct is compromise between wanting to sit near its bias and wanting to satisfy this energy gradient. It goes to the place where the two gradients are equal and opposite; that's the most likely value, and then you add noise around it. So with that small modification we can now deal with real-valued data with binary latent variables, and we have an efficient learning algorithm that's an approximation to maximum likelihood. And so we can apply it to something. There's a nice speech recognition task that's been well organized by the speech people. There's an old database called TIMIT, and it's got a very well-defined task for phone recognition: you're given a short window of speech, and you have to predict the probability distribution, for the central frame, over the various different phones. Actually, each phone is modeled by a 3-state HMM, sort of beginning, middle, and end, so for each frame you have to predict whether it's the beginning, middle, or end of each of the possible phones; there are 183 of those states. If you give it a good distribution there, so it can focus on the right thing, then all the standard post-processing will give you back where the phone boundaries should be and what your phone error rate is. Some people use tri-phone models; we're using bi-phone models, which aren't quite as powerful. So now we can test how good we are by taking 11 frames of speech, it's 10 milliseconds per frame, but each frame is looking at about 25 milliseconds of speech, and predicting the phone at the middle frame.
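The compromise described here has a closed form: the reconstruction mean is the bias plus the sigma-scaled top-down input, which is where the parabola's gradient exactly cancels the energy gradient from the hidden units. A small sketch of that Gaussian-visible reconstruction step, assuming the standard Gaussian-Bernoulli parameterization (names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruct_gaussian_visibles(h, W, b_v, sigma):
    """Mean of p(v | h): bias plus sigma-scaled input from the hidden units,
    the point where the two gradients are equal and opposite. Then add
    Gaussian noise with each unit's own standard deviation."""
    mean = b_v + sigma * (h @ W.T)
    return mean + sigma * rng.standard_normal(mean.shape)
```

With this one change to the reconstruction (the hidden units stay binary and logistic), the same CD-style learning applies to real-valued data.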
We use the standard speech representation, which is mel-cepstral coefficients. There are 13 of those, plus their differences and differences of differences, 39 coefficients per frame, and we feed them into one of these deep nets. So here's your input: 11 frames by 39 coefficients. And then, I was away when the student did this, and he actually believed what I said, so he thought adding lots and lots of hidden units was a good idea. He added lots of hidden units, all unsupervised, so all these green connections are learned without any use of the labels. He used a bottleneck there, so the number of red connections would be relatively small; those have to be learned using discriminative information. And then you backpropagate the correct answers through this whole net, for about a day on a GPU board or a month on a CPU core, and it does very well. The best phone error rate we got was 23%. But the important thing is that whatever configuration you use, however many hidden layers (as long as there are plenty), whatever widths, and whether you use this bottleneck or not, it gets between 23% and 24%. So it's very robust to the exact details of how many layers there are and how wide they are. The best previous result on TIMIT, for things that didn't use speaker adaptation, was 24.4%, and that was from averaging together lots of models, so this is good. >> So each of these layers, that's four million weights? >> HINTON: Yup, four million weights. So we're training, one, two, three, you know, about 20 million weights. Twenty million weights is about 2% of a cubic millimeter of cortex. So this is a tiny brain, probably all you need for [INDISTINCT] recognition. >> Why did they start <