[ MUSIC ] [ APPLAUSE ] BENGIO: Thank you. All right. Thank you for being here and participating in this colloquium. So, I'll tell you about some of the things that are happening in deep learning, but I only have 30 minutes, so I'll be going fairly quickly through some subjects and some challenges for scaling up deep learning towards AI. Hopefully you'll have a chance to ask me some questions during the panel that follows. One thing I want to mention is that I'm writing a book. It's called Deep Learning, and you can already download most of the chapters; these are draft versions of the chapters, available from my web page. It's going to be an MIT Press book, hopefully next year.

So, what is deep learning and why is everybody excited about it? First of all, deep learning is just an approach to machine learning. What's particular about it, as Terry was saying, is that it's inspired by brains. Inspired: we're trying to understand some of the principles, the computational and mathematical principles, that could explain the kind of intelligence based on learning that we see in brains. But from a computer science perspective, the idea is that these algorithms learn representations. Representations are a central concept in deep learning, and, of course, the idea of learning representations is not new. It was part of the deal with the original neural nets, like the Boltzmann machine and back prop from the '80s. But what's new here, and what happened about ten years ago, is a breakthrough that allowed us to train deeper neural networks, meaning networks that have multiple levels of representation.

And why is that interesting? I already mentioned that there are theoretical results showing that you can represent some complicated functions, functions that are the result of many levels of composition, efficiently with these deep networks, whereas in general you won't be able to represent these kinds of functions with a shallow network that doesn't have enough levels. What does it mean to have more depth? It means that you're able to represent more abstract concepts, and these more abstract concepts allow these machines to generalize better. So, that's the essence of what's going on here.

All right. So, the breakthrough happened in 2006, when for the first time we were able to train these deeper networks, and we used unsupervised learning for that. But it took a few years before these advances made their way to industry and to large-scale applications. It started around 2010 with speech recognition. By 2012, if you had an Android phone, like this one, you had neural nets doing speech recognition in it. And now, of course, it's everywhere. It has changed the field of speech recognition; essentially everything uses it. Then, about two years later, in 2012, there was another breakthrough using convolutional networks, which are a particular kind of deep network that had been around for a long time but that had been improved using some of the techniques we discovered in recent years. That really allowed us to make a big impact in the field of computer vision, and in object recognition in particular. I'm sure Fei-Fei will say a few words later about that event and the role of the ImageNet dataset in it. But what's going on now is that neural nets are going beyond their traditional realm of perception, and people are exploring how to use them for understanding language. Of course, we haven't yet solved that problem.
This is where a lot of the action is now, and of course a lot of research and R&D continues in computer vision, now expanding, for example, to video and many other areas. But I'm particularly interested in the extension of this field to natural language. There are other areas too. You've heard about reinforcement learning; there is a lot of action there, in robotics and control. So, many areas of AI are now more and more seeing the potential gains coming from using these more abstract systems.

So, today I'm going to go through three of the main challenges that I see for bringing deep learning, as we know it today, closer to AI. One of them is computational. Of course, for a company like IBM and other companies that build machines, this is an important challenge. It's an important challenge because what we've observed is that the bigger the models we are able to train, given the amount of data we currently have, the better they are. So we just keep building bigger models, and hopefully we're going to continue improving. Now, that being said, I think it's not going to be enough, so there are other challenges. One of them, as I mentioned, has to do with understanding language. But understanding language actually requires something more: it requires a form of reasoning. So, people are starting to use these recurrent networks you heard about, which can be very deep, in some sense, when you consider time, in order to combine different pieces of evidence, to provide answers to questions, and essentially to display different forms of reasoning. I'll say a few words about that challenge. And finally, maybe one of the most important challenges, one that is maybe even more fundamental, is the unsupervised learning challenge. Up to now, all of the industrial applications of deep learning have exploited supervised learning, where we have labeled the data: we've said, in that image there's a cat; in that image there's a desk; and so on. But there's a lot more data we could take advantage of that's unlabeled, and that's going to be important, because all of the information we need to build these AIs has to come from somewhere, we need enough data, and most of it is not going to be labeled.

Right. So, as I mentioned, and as my colleague Ilya Sutskever from Google keeps saying, bigger is better. At least up to now, we haven't seen the limitations. I do believe that there are obstacles and that bigger is not going to be enough. But clearly, there's an easy path forward with the current algorithms, just by making our neural nets a hundred times faster and bigger. Why is that? Basically, what I see in many experiments with neural nets right now is that they underfit (I'm going to use some jargon here), meaning that they're not big enough, or we don't train them long enough, for them to exploit all of the information that there is in the data. They're not even able to learn the data by heart, which is the thing we usually want to avoid in machine learning, but which comes almost for free with these networks. So we just have to press on the pedal of more capacity, and we're almost sure to get an improvement here.

All right. Just to illustrate graphically that we have some room to approach the size of human brains, this picture was made by my former student Ian Goodfellow, and it shows the sizes of different organisms and of neural nets over the years. The DBN here was from 2006.
The AlexNet is the breakthrough network of 2012 for computer vision, and the AdamNet is maybe a couple of years old. So, we see that the current technology is maybe between a bee and a frog in terms of the size of the networks, for about the same number of synapses per neuron. We've almost reached the kind of average number of synapses per neuron that you see in natural brains, between a thousand and ten thousand. In terms of the number of neurons, we're several orders of magnitude away.

So, I'm going to tell you a little bit about a stream of research we've been pushing in my lab, which is more connected to the computing challenge and potentially to hardware implementations: can we train neural nets that have very low precision? We had a first paper on this at ICLR. By the way, ICLR is the deep learning conference, and it happens every year now. Yann LeCun and I started it in 2013, and it's been an amazing success that year and every year since then; we're going to have a third version next May. So, we wanted to know how many bits you actually require. Of course, people have been asking these kinds of questions for decades. But using the current state-of-the-art neural nets, we found 12, and I can show you some pictures of how we got these numbers on different data sets, comparing different ways of representing numbers, with fixed point or dynamic fixed point. It also depends on where you use those bits: you actually need fewer bits for the activations than for the weights. So, you need more precision in the weights.

So, that was the first investigation. That's the number of bits you actually need in the weights to keep the information that you are accumulating from many examples. But when you actually run your system, during training especially, maybe you don't need all those bits. Maybe you can get the same effect by introducing noise and randomly discretizing those weights to plus one or minus one. So, that's exactly what we did. The cute idea here is that we can replace a real number by a binary value that has the same expected value, by sampling the two values with probabilities such that the expected value is the correct one. And now, instead of having a real number to multiply, we have a bit to multiply, which is easy: it amounts to just an addition. And why would we do that? Because we want to get rid of multiplications. Multiplications are what take up most of the surface area on chips for doing neural nets.

So, we had a first try at this, and it's going to be presented at the next NIPS, in the next few weeks in Montreal. It allows us to get rid of the multiplications in the feed-forward computation and in the backward computation where we compute gradients. But we were left with one multiplication: even if you discretize the weights, there is another multiplication at the end of back prop, where you multiply not weights but activations and gradients. If those two things are real-valued, you still need a regular multiplication. So, yes, that's going to be in the NIPS paper. But the new thing we did is to get rid of that last multiplication, the one we need for the update of the weights: the change in the weights, delta W, is proportional to dC/da, the gradient that's propagated back, times h, the activations. That's some jargon.
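To make the binarization trick concrete, here is a minimal NumPy sketch of the idea described above: each real-valued weight, clipped to [-1, 1], is replaced by -1 or +1 with a probability chosen so that the expected value of the binary weight equals the original weight; multiplying by such a weight then amounts to adding or subtracting activations. This is an illustrative sketch of the principle, not the exact training procedure from the paper; the clipping range and sampling details are assumptions, and the same expectation-preserving idea carries over to the power-of-two quantization described next.

```python
import numpy as np

def stochastic_binarize(w, rng):
    """Replace each real-valued weight by -1 or +1 such that the expected
    value of the binary weight equals the original weight.

    Weights are first clipped to [-1, 1]; then P(b = +1) = (w + 1) / 2,
    which gives E[b] = w. Multiplying an activation by b is then just an
    addition or a subtraction, with no real multiplication needed.
    """
    w = np.clip(w, -1.0, 1.0)
    p_plus = (w + 1.0) / 2.0
    return np.where(rng.random_sample(w.shape) < p_plus, 1.0, -1.0)

rng = np.random.RandomState(0)
W = rng.uniform(-1, 1, size=(4, 3))   # real-valued weights, kept for the updates
x = rng.randn(3)                      # an activation vector
Wb = stochastic_binarize(W, rng)      # binary weights used in the forward pass
y = Wb @ x                            # in effect, only sign flips and additions
print(Wb)
print(y)
```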
But anyway, we still have to do that last multiplication for the weight update, and so the only thing we need to do is take one of those two numbers and replace it, again, by a stochastic quantity that does not require a multiplication. So, instead of binarizing it, we quantize it stochastically to its exponent; in other words, we get rid of the mantissa. In other words, we represent it on a log scale. If you do that, again, you can map the activations to values that are just powers of two, and now multiplication is just addition. The trick of using powers of two is an old trick. The new trick is to do this stochastically, so that you get the right thing on average, and stochastic gradient descent works perfectly fine. We're running experiments on a few data sets showing that you get a bit of a slowdown because of the extra noise. The green and yellow curves here are with binarized weights and stochastically quantized activations. And the good news is that it actually learns even better, because this noise acts as a regularizer. So, yes, this is pretty good news.

Now, why is this interesting? It's interesting for two reasons. One is that it could be useful for hardware implementations. The other reason is that it connects with what the brain does with spikes. If I go back here: when you replace activations by stochastic binary values that have the right expected value, you're introducing noise, but you're actually not changing the computation of the gradient that much. So it would be reasonable for brains to use the same trick if they could save on the hardware side.

Okay. So now let me move on to my second challenge, which has to do with language and, in particular, language understanding. There's a lot of work to do in this direction, but the progress in the last few years is pretty impressive. Actually, I was part of the beginning of that process of extending the realm of application of neural networks to language. In 2000, we had a NIPS paper where we introduced the idea of learning to represent probability distributions over sequences of words, in other words, being able to generate sequences of words that look like English, by decomposing the problem into two parts. That's a kind of central element that you find in neural nets, and especially in deep learning: think of the problem not as going directly from inputs to outputs, but break it into two parts. One is the representation part: learning to represent words, here by mapping each word to a fixed-size, real-valued vector. The other is taking those representations and mapping them to the answers you care about; here, that's predicting the next word. It turned out that those representations of words that we learned have incredibly nice properties, and they capture a lot of the semantic aspects of words. There have been tons and tons of papers analyzing these representations and using them in applications. They are called word vectors, or word embeddings, and they're used all over the place; they have become commonplace in natural language processing.
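To make that two-part decomposition concrete, here is a minimal NumPy sketch of a neural language model in that spirit: an embedding matrix maps each context word to a real-valued vector, and a small network maps the concatenated context vectors to a probability distribution over the next word. The toy vocabulary, the dimensions, and the single hidden layer are all illustrative assumptions, and no training loop is shown; it is a sketch of the architecture, not the original model.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab = ["the", "cat", "sat", "on", "mat"]       # toy vocabulary (assumption)
V, d, context, hidden = len(vocab), 8, 2, 16     # sizes chosen for illustration

# Part 1: the representation. Each word gets a learned, fixed-size,
# real-valued vector (one row of the embedding matrix).
E = 0.1 * rng.randn(V, d)

# Part 2: map the context representations to a distribution over the next word.
W1 = 0.1 * rng.randn(context * d, hidden)
W2 = 0.1 * rng.randn(hidden, V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_word_probs(context_words):
    """P(next word | context): look up the context word vectors,
    concatenate them, and pass them through a small network."""
    idx = [vocab.index(w) for w in context_words]
    x = E[idx].reshape(-1)                       # concatenated embeddings
    h = np.tanh(x @ W1)
    return softmax(h @ W2)

p = next_word_probs(["the", "cat"])
print(dict(zip(vocab, p.round(3))))              # untrained, so roughly uniform
```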
In the last couple of years, there has been a kind of exciting observation about these word embeddings, which is that they capture analogies, even though they were not programmed for that. So, what do I mean? What I mean is that if you take the vectors for words and do operations on them, like subtracting and adding them, you can get interesting things coming out. For example, if you take the vector for queen and you subtract the vector for king, you get a new vector, and that vector is pretty much aligned with the vector you get by subtracting the representation for man from the representation for woman. That means you could do something like woman minus man plus king and get queen. So, it can answer the question: what is to king what woman is to man? And it would find queen. That's interesting, and there are some nice explanations; we're starting to understand why this is happening. Basically, directions in that space of representations correspond to attributes that have been discovered by the machine. Here, man and woman have essentially all the same attributes, in some semantic space, except for gender. The same is true for queen and king: they have lots of attributes, but they are essentially all the same except for gender. So, when you subtract them, the only thing you are left with is the direction for gender.

Okay. So the progress with representing the meaning of words has been really amazing. But, of course, this is by no means sufficient to understand language. So, the next stage has been: can we represent the meaning of sentences or phrases? In my group, we worked on machine translation as a case study, to see if we could bring the power of representation that we've seen in those language models to a task that was a bit more challenging from a semantic point of view. And I guess the thing we're doing now, which many other groups are also doing, is pushing that to an even harder semantic task, which is question answering: read a sentence, or a paragraph, or a document, then read a question, and then generate an answer in natural language. It's a bit more challenging, but you can see that it's a kind of translation as well: you have a sequence as input and you produce a sequence as output. In fact, we used very similar techniques.

So, now let me tell you about that machine translation approach that we created about a year and a half ago. It uses these recurrent networks that you've heard about, because as soon as you start dealing with sequences, that's kind of the natural thing to do. And it uses something fairly new that has been incredibly successful in the field in the last year, which is the idea of introducing attention mechanisms within the computation. Sometimes we think of attention as visual attention, deciding where to look. But here we're talking about a different kind of attention, a kind of internal attention: choosing which parts of your neural network you are going to pay attention to. Let me go through this architecture a little bit. What's going on is -- do I have a pointer? All right. You have an input sentence, in English, say, and there's a recurrent net that reads it, meaning that it sees one word at a time. As it goes through the sentence, it builds a representation of the words that it has seen. Actually, there are two recurrent nets, one reading from left to right and the other from right to left. Then, at each position, you have a representation of what's going on around that word. So, that's the reading network. Then there is a writing network, an output network, which is going to produce a sequence of words.
More precisely, it's going to produce, at each stage, a probability distribution over the words in the vocabulary, and then we're going to pick the next word according to that distribution. The choice of that word is going to condition the computation at the next stage: the state of the network is going to be different depending on what words you've said before. And that whole output sequence is going to be influenced by what we have read, of course, because we want to translate the input sequence.

Now, the way that the input sequence and the output sequence are related is important. That's where the attention mechanism comes in. Because when you're doing translation, for example, the input sequence has a different length from the output sequence. So, which word, or which part of the input sequence, corresponds to which part of the output sequence? That's the question the attention mechanism is helping us figure out. And we found a way to do that using a mechanism that allows us to train with normal techniques, with back prop; we can compute exact gradients through this process. The idea is that for each position in the output sequence, the network looks at all possible positions in the input sequence and computes a weight. It then multiplies the representation it gets at each position by that weight to form a linear combination, which becomes a context that drives the update at the next stage. So, in a sense, you're choosing where to look at each stage in order to decide what the next word is going to be.

This has actually worked incredibly well. In the space of one year, we went from dismal performance to state-of-the-art performance. At the last WMT, in 2015, we got first place on two of the language pairs, English to German and English to Czech. And now there's a bunch of groups around the world pushing these kinds of systems. So, this is a new way of doing machine translation, which is very, very different in nature from the state of the art that's been around for 20 years.

So, the next thing we did was to use almost the same code for translating not from French to English, but from image to English. The idea is that it's almost the same architecture, except that instead of having a recurrent network that reads the French sentence, we have what's called a convolutional net, which we've heard about, that looks at the image and computes, for each location or each block of pixels, a feature vector, again a representation. Just as we had representations for words, now we have representations for parts of the image. And then the attention mechanism, as it generates the words in the sentence it's producing, chooses at each stage where to look in the image. So, Terry showed you some pictures from my lab; you've seen this. What we see with each pair of images is, on the left, the image that the system sees as input, and on the right, where it's putting its attention for a particular word, the word that's underlined. So, when it says "little girl", when it says "girl", we see that it's putting attention around the face of the girl. In the other one, on top, for example, "a woman is throwing a frisbee in the park", the underlined word is "frisbee", and the second image in the pair shows where it's putting its attention in the image. So, these are cases where it works quite well.
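Since the attention mechanism is the common ingredient in both the translation and the captioning systems, here is a minimal NumPy sketch of the soft-attention step described above: a score per input position, a softmax to turn the scores into weights, and a weighted sum of the position-wise representations that serves as the context for the next output step. The dot-product scoring used here is a simplifying assumption; the actual model computes the scores with a small learned network.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(annotations, decoder_state):
    """Soft attention over the input positions.

    annotations   : (T, d) array, one representation per input position
                    (e.g. per source word, or per block of image pixels).
    decoder_state : (d,) current state of the output recurrent net.

    Returns the attention weights over positions and the context vector,
    the weighted combination that conditions the next output step.
    """
    scores = annotations @ decoder_state   # one score per position (assumed dot-product scoring)
    weights = softmax(scores)              # normalized: non-negative, sums to 1
    context = weights @ annotations        # convex combination of the representations
    return weights, context

rng = np.random.RandomState(0)
H = rng.randn(6, 8)                        # 6 input positions, 8-dimensional representations
s = rng.randn(8)                           # current decoder state
w, c = attention_context(H, s)
print(w.round(3), w.sum())                 # the weights say "where to look"
print(c.shape)                             # the context has the representation size
```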
But it wouldn't be fair if I only showed you those cases; I need to show you the ones where it fails. So, here are examples where it fails. That's where we learn the most. First of all, you realize immediately that we haven't solved AI, and that it's making mistakes both on the visual side and on the language side. On the visual side, you see things like, on the top left, it thinks that it's a bird when it's actually two giraffes. Maybe if you squint you can think it's a bird. On the second one, it thinks that the round shape on the shirt is a clock, which, again, if you squint, you might think is a clock. Now, the third one is totally crazy: "a man wearing a hat and a hat on a skateboard." So, it's wrong visually, and it's wrong linguistically; you wouldn't say a hat on a hat, and so on. So, it's fun and instructive to use these attention mechanisms to understand what's going on inside the machine, to see, at each step of the computation, what it was paying attention to. It's pretty interesting.

Now, it turns out that this attention mechanism is part of another revolution that's going on right now in deep learning, which has to do with the notion of memory that Terry also mentioned during the panel. Until recently, neural nets have been considered purely as pattern recognition devices that go from input to output. As soon as you start thinking about reasoning and sequential processing, the idea comes up that it would be nice to have a short-term memory, or even a long-term memory, that is different from the straight representation-building computation that we have in those feed-forward neural nets. So, the idea is that in addition to the recurrent net that does the usual computation, we have a memory. Here, think of each of these cells as a memory cell. A memory involves simple concepts like where you are going to read and write, and what you are going to read and write. We can generalize these concepts to neural nets that you can train by back prop by saying that at each time step, you basically have probabilities for choosing where to read and where to write, and then you're going to put something there with a weight proportional to that probability.

These kinds of systems started less than a year ago, at about the same time, from a group at Facebook and a group at DeepMind, using the same kind of attention mechanism that we had proposed just a few months earlier. And they're able to do things like this: read sentences like these and answer questions. So, "Joe went to the garden and Fred picked up the milk. Joe moved to the bathroom and Fred dropped the milk and then Dan moved to the living room. Where is Dan?" You're not supposed to read the answer. Or other things, like the other examples down there: "Sam walks into the kitchen. Sam picks up an apple. Sam walks to the bedroom. Sam drops the apple. Where is the apple?" So, these are the kinds of things we're able to do now. Of course, these are toy problems. But it's not something we would have imagined just a few years ago that neural nets would be able to do. So, by using recurrence, and by using new architectures that allow these recurrent nets to keep information for a longer time, dealing with the challenge that's called long-term dependencies, we're able to push the scope of applications of deep learning well beyond what was thought possible just a few years ago.
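Here is a minimal NumPy sketch of that kind of soft, differentiable memory addressing: attention-style weights over the memory slots decide where to read and where to write, the read is a weighted sum, and the write moves each slot toward the new content in proportion to its weight. This is a simplified illustration in the spirit of those memory-augmented models, not the exact mechanism of any particular paper; the similarity function and the interpolation-style write are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def addressing_weights(memory, key):
    """One weight per memory slot, based on similarity between the slot
    and the key; soft, so gradients can flow through with back prop."""
    return softmax(memory @ key)

def soft_read(memory, key):
    """Differentiable read: a weighted sum of all slots."""
    w = addressing_weights(memory, key)
    return w, w @ memory

def soft_write(memory, key, value):
    """Differentiable write: each slot moves toward the new value in
    proportion to its addressing weight (interpolation-style write)."""
    w = addressing_weights(memory, key)[:, None]
    return (1.0 - w) * memory + w * value

rng = np.random.RandomState(0)
M = rng.randn(5, 4)                 # 5 memory slots of dimension 4
k = rng.randn(4)                    # addressing key (e.g. produced by the controller net)
weights, r = soft_read(M, k)
M2 = soft_write(M, k, rng.randn(4)) # store a new piece of content
print(weights.round(3), r.shape, M2.shape)
```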
So, in my lab, we're working on using these ideas for knowledge extraction. The idea is to be able to read pages of Wikipedia and fill that memory with representations, semantic representations of nuggets of fact, which can then be used to answer questions. Of course, if we can do that, it would be extremely useful.

Yes. I'm going to skip that and just use a little bit of time for the last challenge, which is maybe the most difficult one and has to do with how computers could form these abstractions without being told ahead of time a lot of the details of what they should be in the first place. That's what unsupervised learning is about. And I mentioned that unsupervised learning is important because we can take advantage of all the knowledge implicitly stored in the lots and lots of data that hasn't been tagged and labeled by humans. But there are also other reasons why it could be interesting, for other applications in machine learning. For example, in the case of structured outputs, where you want the machine to produce something that is not a yes or a no, and not a category, but something more complicated, like an image. Maybe you want to transform an image, or you want to produce a sentence, like you've seen before. It's also interesting because, if you start thinking about how machines could eventually reach the kind of level of performance of humans, we have to admit that in terms of learning ability we're very, very far from humans. Humans are able to learn new tasks from very few examples. Right now, if you take a machine learning system out of the box, it's going to need, depending on the task, maybe tens of thousands or hundreds of thousands or millions of examples before you get decent performance. Humans can learn a new task with just a handful of examples, sometimes even a single example, or even zero examples: you don't even give them an example, you give them a linguistic description of the task.

So, we're thinking about what plausible ways there are to address this, and it all has to do with the notion of representation that's been central to what I've been telling you about. Now we're thinking about how those representations become meaningful as explanations for the data. In other words, what are the explanatory factors that explain the variations we see in the data? That's what unsupervised learning is after. It's trying to discover representations where you can think of each element of the representation as a factor, or a cause, that could explain the things we're seeing. So, in 2011, we participated in a couple of scientific challenges on transfer learning, where the idea is that you're seeing examples from some tasks, maybe labeled, but the end goal is to use the representation that you've learned to do a good job on new tasks for which you have very few labeled examples. And basically, what we found is that when you use these unsupervised learning methods, you're able to generalize much faster with very few labeled examples. So, all these curves have on the X axis the log of the number of labeled examples, and on the Y axis, accuracy. As you build deeper systems that learn in an unsupervised way from all the other tasks, just by looking at the input distribution, you're able, on the new tasks, to extract information from the very few examples you have much faster. Faster meaning you need fewer examples to get high accuracy. That's what these curves tell us.
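As a toy illustration of that transfer setup (not the actual challenge data or models), the sketch below learns a representation from unlabeled inputs only, with a simple PCA projection standing in for a deep unsupervised learner, and then fits a nearest-centroid classifier on just a handful of labeled examples of a new task in that learned space. The synthetic data, dimensions, and choice of classifier are all assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic setup: observations are noisy 20-dimensional mixtures of 2 latent factors.
mixing = rng.randn(2, 20)
make_X = lambda Z: Z @ mixing + 0.1 * rng.randn(len(Z), 20)

# Unsupervised stage: learn a representation from unlabeled inputs only.
# A projection onto the top principal directions stands in for a learned
# deep representation of the input distribution.
X_unlabeled = make_X(rng.randn(2000, 2))
mean = X_unlabeled.mean(axis=0)
_, _, Vt = np.linalg.svd(X_unlabeled - mean, full_matrices=False)
encode = lambda X: (X - mean) @ Vt[:2].T

# Supervised stage on a new task with very few labels: 5 examples per class.
Z_few = rng.randn(10, 2)
y_few = (Z_few[:, 0] > 0).astype(int)          # the new task depends on one latent factor
H_few = encode(make_X(Z_few))
centroids = np.stack([H_few[y_few == c].mean(axis=0) for c in (0, 1)])

# Evaluate on fresh examples of the new task.
Z_test = rng.randn(500, 2)
y_test = (Z_test[:, 0] > 0).astype(int)
H_test = encode(make_X(Z_test))
pred = np.argmin(((H_test[:, None, :] - centroids) ** 2).sum(-1), axis=1)
print("accuracy with only 10 labeled examples:", (pred == y_test).mean())
```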
Now, there are really big challenges here. Why is it that unsupervised learning hasn't been as successful as supervised learning, at least as far as the current industrial applications of deep learning go? I think it's because there are really hard fundamental obstacles: you're trying to model something that's much higher dimensional. When you're doing supervised learning, the output is usually a small object; it's one category, or something like that. In unsupervised learning, you're trying to characterize a number of configurations of these variables that is exponentially large. And for a number of mathematical reasons, that makes the more natural approaches, the ones based on probabilities, automatically intractable, for reasons that I won't have time to explain in detail. But there has been a lot of research recently to try to bypass these limitations, these intractabilities. And what's amazing about the current research in unsupervised learning is that there are like ten different ways of doing unsupervised learning. There's not one way. It's not like supervised learning, where we basically have back prop with small variations. Here we have totally different learning principles that try to bypass, in different ways, the problems with maximum likelihood and probabilistic modeling.

So, it's moving pretty fast. Just a few years ago, we were not able to generate, for example, images of anything but digits, black-and-white images of digits. Just last year we were able to move to somewhat more realistic digits: these are images of street-view house numbers that were generated by some of these recent algorithms. And these are more natural images that were generated, in a paper presented just a few months ago, where the scientists who did this, at Facebook and NYU, asked humans whether the images were natural or not. So, is this coming from the machine, or is this coming from the real world? And it turned out that 40 percent of the images generated by the computer were fooling the humans. So, you're kind of almost passing the Turing test here. Now, these are a particular class of images, but still, there's a lot of progress, so it's very encouraging. One thing I'm interested in, as a last bit here, is that as we're exploring all these different approaches to unsupervised learning, some of them look like they might also explain how brains do it, and I think that is a very interesting source of inspiration for this research.

All right. So, why is it interesting to do unsupervised learning? As I mentioned, because it goes to the heart of what deep learning is about, which is to allow the computer to discover good representations, more abstract representations. So, what does it mean to be more abstract? It means that we essentially get to the heart of the explanations of what's going on behind the data. Of course, that's the dream, right? And we can measure that. We can do experiments where we see that the computer automatically discovers, in its hidden units, features that we haven't explicitly programmed in but that perfectly capture some of the factors that are present, as we know them. So, yes, I'm going to close there and show you pictures of the current state of my lab, which is growing too fast. Thank you. [ APPLAUSE ]