
  • >> So it's my pleasure to introduce to you Geoff Hinton, who is a pioneer in machine learning

  • and neural nets, and more recently [INDISTINCT] architectures. And that, I think, is going

  • to be the topic of today. So take it over. >> HINTON: Okay. So I gave a talk here a couple

  • of years ago. And the first 10 minutes or so will be an overview of what I said there,

  • and then I'll talk about the new stuff. The new stuff consists of a better learning module.

  • It allows you to learn better all sorts of different things, like learning how images

  • transform, learning how people walk, and learning object recognition. So the basic learning

  • module consists of some variables that represent things like pixels, and these will be binary

  • variables for now. Some other variables that represent features--these are latent variables, and they're also going to

  • be binary. And there's bipartite connectivity, so units within a layer aren't connected to each other.

  • And that makes it very easy if I give you the states of the visible variables to infer

  • the states of the hidden variables. They're all independent given the visible variables

  • because it's an undirected graph with no connections between them. And the inference procedure just says: the probability

  • of turning on hidden unit "hj" given this visible vector "v" is the logistic function

  • of the total input it gets from the visible units, so it's very simple for the hidden variables. Given

  • the hidden variables, we can also infer the visible variables very simply. And if we want

  • some--if we put some weights on the connections and we want to know what this model believes,

  • we can just go backward and then forward, inferring all the hidden variables in parallel, then

  • all the visible ones. Do that for a long time, and then you'll see examples of the

  • kinds of things it likes to believe. And the aim of learning is going to be to get it to

  • like to believe the kinds of things that actually happen. So this thing is governed by an energy

  • function. Given the weights on the connections, the energy of a visible plus

  • a hidden vector is the sum over all connections of the weight if both the visible and hidden

  • units are active. So for each connection where both units are active, you add in

  • the weight; and if it's a big positive weight, that's low energy, which is good. So, it's

  • a happy network. This has nice derivatives. If you differentiate it with respect to the

  • weights, you get this product of the visible and hidden activities. And so that derivative

  • is going to show up a lot in the learning because that derivative is how you change

  • the energy of a combined configuration of visible and hidden units. The probability

  • of a combined configuration, given the energy function, is E to the minus the energy of

  • that combined configuration normalized by the partition function. And if you want to

  • know the probability of a particular visible vector, you have to sum over all the hidden vectors

  • that might go with it, and that gives you the probability of the visible vector. If you want to change the

  • weights to make this probability higher, you always need to lower the energies of combinations

  • of the visible vector and the hidden vectors that would like to go with it, and raise the energies

  • of all other combinations, so you decrease the competition. The correct maximum likelihood

  • learning rule--that is, if I want to change the weights so as to increase the log probability

  • that this network would generate the vector "v" when I let it just sort of fantasize the

  • things it likes to believe in--has a nice simple form. It's just the difference of two correlations.
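For a small enough network, the energy function and the resulting probabilities can be computed exactly by enumerating every joint configuration. A minimal NumPy sketch (layer sizes and random weights are made up for illustration; biases are omitted for brevity):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_vis, n_hid = 3, 2
W = rng.normal(0, 0.5, size=(n_vis, n_hid))  # weights w_ij

def energy(v, h):
    # E(v, h) = -sum_ij v_i h_j w_ij: add in the weight when both units are on
    return -v @ W @ h

# Partition function: sum of e^{-E} over every joint (v, h) configuration
configs = [np.array(c) for c in product([0, 1], repeat=n_vis + n_hid)]
Z = sum(np.exp(-energy(c[:n_vis], c[n_vis:])) for c in configs)

def p_visible(v):
    # p(v) = sum_h e^{-E(v, h)} / Z: sum over all hidden vectors that might go with v
    return sum(np.exp(-energy(v, np.array(h)))
               for h in product([0, 1], repeat=n_hid)) / Z

# Sanity check: probabilities over all visible vectors sum to 1
total = sum(p_visible(np.array(v)) for v in product([0, 1], repeat=n_vis))
```

This exact enumeration is only feasible for toy sizes; the whole point of the learning rule above is to avoid computing Z for realistic networks.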

  • So even though it depends on all the other weights, it shows up as this difference of

  • correlations. And what you do is you take your data, you activate the hidden units--

  • that is, stochastically activate the units--and then you reconstruct, activate, reconstruct, activate.

  • So this is a Markov chain. You run it for a long time, so you forget where you started.

  • And then you measure the correlation there, and also the correlation here. And what

  • you're really doing is saying, "By changing the weights in proportion to that, I'm lowering

  • the energy of this visible vector with whatever hidden vector it chose. By doing the opposite

  • here, I'm raising the energy of the things I fantasize." And so, what I'm trying to do

  • is believe in the data and not believe in what the model believes in. Eventually, this

  • correlation will be the same as that one, in which case nothing will happen because

  • it will believe in the data. It turns out that you can get a much quicker learning algorithm

  • where you just go up and down and up again, and you take this difference of correlations.
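That quick version, contrastive divergence with a single reconstruction step, can be sketched as follows. The layer sizes, learning rate, and the random training batch are made up for illustration, and biases are again omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 6, 4, 0.1
W = rng.normal(0, 0.01, size=(n_vis, n_hid))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W):
    # Up: infer hidden probabilities from the data, then sample binary states
    h0_prob = sigmoid(v0 @ W)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Down and up again: reconstruct the visibles, re-infer the hiddens
    v1_prob = sigmoid(h0 @ W.T)
    h1_prob = sigmoid(v1_prob @ W)
    # Difference of correlations: <v h>_data - <v h>_reconstruction
    pos = v0.T @ h0_prob
    neg = v1_prob.T @ h1_prob
    return W + lr * (pos - neg) / v0.shape[0]

# Train on a small random binary batch
batch = (rng.random((20, n_vis)) < 0.5).astype(float)
for _ in range(10):
    W = cd1_step(batch, W)
```

The positive term lowers the energy of the data with whatever hidden vector it chose; the negative term raises the energy of the one-step reconstructions.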

  • Justifying that is hard but the main justification is it works and it's quick. The reason this

  • module is interesting--the main reason it's interesting--is you can stack them up. That

  • is, for complicated reasons I'm not going to go into, it works very well to train

  • one module, then take the activities of the feature detectors, treat them as if they were

  • data, and train another module on top of that. So the first module is trying to model what's

  • going on in the pixels by using these feature detectors. And the feature detectors would

  • tend to be highly correlated. The second module is trying to model the correlations among feature

  • detectors. And you can guarantee that if you do that right, every time you go up a level,

  • you get a better model of the data. Actually, you can guarantee that the first time you

  • go up a level. For further levels, all you can guarantee is that there's a bound on how

  • good your model of the data is. And every time we add another level, that bound improves

  • if we do it right. Having got this guarantee that something good is happening as we add

  • more levels, we then violate all the conditions of the mathematics and just add more levels in a

  • sort of [INDISTINCT] way because we know good things are going to happen, and then we justify it

  • by the fact that good things do happen. This allows us to learn many layers of feature detectors

  • entirely unsupervised, just to model the structure of the data. Once we've done that, you can't

  • get that accepted in a machine learning conference, because you have to do discrimination to be

  • accepted in a machine learning conference. So once you've done that, you add some decision

  • units to the top and you learn the connections discriminatively between the top-layer features

  • and the decision units, and then if you want you can go back and fine-tune all of the connections

  • using backpropagation. That overcomes the limitation of backpropagation, which is that there's

  • not much information in the label and it can only learn on labeled data. These things can

  • learn on large amounts of unlabeled data. After they've learned, you add these

  • units at the top and backpropagate from this small amount of labeled data, and backpropagation is not

  • designing the feature detectors anymore. As you probably know at Google, designing feature

  • detectors is the art of the thing, and you'd like to design feature detectors based on what's

  • in the data, not based on having to produce labeled data. So the idea of backpropagation

  • was: design your feature detectors so you're good at getting the right answer. The idea

  • here is: design your feature detectors to be good at modeling whatever is going on in the

  • data. Once you've done that, just fine-tune them slightly so you're better at getting the right

  • answer. But don't try and use the answer to design feature detectors. And Yoshua Bengio's

  • lab has done lots of work showing that this gives you better minima than just doing backpropagation.
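The greedy stacking procedure described above can be sketched like this. `train_rbm` is a stand-in single-module learner (a CD-1 step as before, biases omitted), and the layer widths and toy data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hid, lr=0.1, steps=50):
    # Stand-in single-module trainer: CD-1 without biases
    W = rng.normal(0, 0.01, size=(data.shape[1], n_hid))
    for _ in range(steps):
        h_prob = sigmoid(data @ W)
        h = (rng.random(h_prob.shape) < h_prob).astype(float)
        v1 = sigmoid(h @ W.T)
        h1 = sigmoid(v1 @ W)
        W += lr * (data.T @ h_prob - v1.T @ h1) / data.shape[0]
    return W

# Greedy layer-wise training: each layer's feature activities
# become the "data" for the module above it.
data = (rng.random((100, 20)) < 0.5).astype(float)
layer_sizes = [16, 12, 8]   # hypothetical widths
weights = []
for n_hid in layer_sizes:
    W = train_rbm(data, n_hid)
    weights.append(W)
    data = sigmoid(data @ W)  # treat the feature activities as if they were data
```

A discriminative layer would then be attached on top of the final `data` activities and the whole stack fine-tuned with backpropagation.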

  • And what's more, minima in a completely different part of the space. So just to summarize this

  • section, I think this is the most important slide in the talk because it says, "What's

  • wrong with machine learning up to a few years ago." What people in machine learning

  • would try to do is learn the mapping from an image to a label. Now, that would be

  • a fine thing to do if you believed that images and labels arose in the following way:

  • there's stuff in the world, and it gives rise to images, and then the images give rise to the labels. Given

  • the image, the labels don't depend on the stuff. But you don't really believe that.

  • You only believe that if a label is something like the parity of the pixels in the image.

  • What you really believe is there's stuff that gives rise to images, and then the label

  • goes with the image because of the stuff, not because of the image. So there's a cow in a field

  • and you say cow. Now, if I just say cow to you, you don't know whether the cow is brown

  • or black, or upright or dead, or far away. If I show you an image of the cow, you know all

  • those things. So this is a very high bandwidth path, and this is a very low bandwidth path. So

  • the right way to associate labels with images is to first learn to invert this high bandwidth

  • path. And we can clearly do that, because vision basically works. The first time you

  • look at something, you see things. And it's not like it might be a cow, it might be an elephant,

  • it might be an electric heater. Basically, you get it right nearly all the time. And so we

  • can invert that pathway. Having learned to do that, we can then learn what things are

  • called. But you get the concept of a cow not from the name, but from seeing what's going

  • on in the world. And that's what we're doing here; the name comes later, as I said, from the label. Now,

  • I need to do one slight modification to the basic module which is I had binary units as

  • the observables. Now we want to have linear units with Gaussian noise. So we just change

  • the energy function a bit. And the energy now says, "I've got a kind of parabolic containment

  • here." Each of these linear visible units has a bias, which is like its mean. And it

  • would like to sit there, and moving away from that costs energy. The parabola is

  • the negative log of the Gaussian. And then the input that comes from the hidden

  • units is just vi hj wij, but the v's have to be scaled by the standard deviation of

  • the Gaussian there. If I differentiate that with respect to a visible activity, then

  • what I get is hj wij divided by sigma i. And that's like an energy gradient. And

  • what the visible unit does when you reconstruct is it tries to compromise between wanting

  • to sit around here and wanting to satisfy this energy gradient, so it goes to the place

  • where the two gradients are equal and opposite--that's the most likely value--

  • and then you add noise around there. So with that small modification we can now deal with real-

  • valued data with binary latent variables, and we have an efficient learning algorithm that's

  • an approximation to maximum likelihood. And so we can apply it to something. So there's a nice

  • speech recognition task that's been well organized by the speech people where there's an old

  • database called TIMIT. It's got a very well-defined task for phone recognition, where what you

  • have to do is: you're given a short window of speech, and you have to predict the distribution,

  • the probability for the central frame, over the various different phones. Actually, each phone

  • is modeled by a 3-state HMM--sort of beginning, middle, and end--so you have to predict for

  • each frame: is it the beginning, middle, or end of each of the possible phones? There are

  • 183 of those things. If you give it a good distribution there, to sort of focus on the

  • right thing, then all the post-processing will give you back where the phone boundaries

  • should be and what your phone error rate is, and that's all very standard. Some people

  • use tri-phone models. We're using bi-phone models which aren't quite as powerful. So

  • now we can test how good it is by taking 11 frames of speech. It's 10 milliseconds per

  • frame, but each frame is looking at about 25 milliseconds of speech, and predicting the

  • phone at the middle frame. We use the standard speech representation, which is mel-cepstral

  • coefficients. There are 13 of those, plus their differences and differences of differences,

  • 39 in total; and we feed them into one of these deep nets. So here's your input,

  • 11 frames and 39 coefficients. And then--I was away when the student did this and he

  • actually believed what I said. So he thought adding lots and lots of hidden units was a

  • good idea. I'd said that too. But he added lots of hidden units, all unsupervised, so

  • all these green connections are learned without any use of the labels. He used a bottleneck

  • there, so the number of red connections would be relatively small. These are not--these

  • have to be learned using discriminative information. And then you backpropagate the correct

  • answers through this whole net for about a day on a GPU board, or a month on a CPU core, and

  • it does very well. That is, the best phone error rate we got was 23%. But the important

  • thing is, whatever configuration you use--however many hidden layers, as long as there are plenty,

  • and whatever widths, and whether you use this bottleneck or not--it gets between 23% and

  • 24%. So it's very robust to the exact details of how many layers and how wide they are.
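The input construction described above (11 frames of 39 coefficients each, so a 429-dimensional input vector per prediction, labeled by the phone state of the central frame) can be sketched like this; the utterance length and the fake feature values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_coeffs, context = 200, 39, 11
utterance = rng.normal(size=(n_frames, n_coeffs))  # stand-in MFCC features

def make_windows(frames, context):
    # Stack `context` consecutive frames into one flat input vector,
    # one window per frame that has enough context on both sides.
    half = context // 2
    return np.stack([frames[i - half:i + half + 1].ravel()
                     for i in range(half, len(frames) - half)])

X = make_windows(utterance, context)
# Each row is an 11-frame window: 11 * 39 = 429 inputs to the deep net
```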

  • Now, the best previous result on TIMIT for things that didn't use speaker adaptation was 24.4%,

  • and that was from averaging together lots of models, so this is good.

  • >> So each of these layers that's four million weights?

  • >> HINTON: Yup, four million weights. So we're only training one, two, three, one, two, three,

  • we're training, you know, about 20 million weights. Twenty million weights is about 2%

  • of a cubic millimeter of cortex, I think. So, this is a tiny brain. That's probably

  • all you need for [INDISTINCT] recognition. >> Why did they start