ROSSI LUO: Good afternoon. Welcome to the Brown Biostatistics Seminar. I'm Rossi Luo, the faculty host for today's event. For those of you new to our departmental seminar, the format is usually a presentation followed by a question and answer session. And because of the size of the crowd today, we're also going to use this red box thing to capture your questions, both for the videotaping and to make sure your questions are heard. Today I'm very pleased to introduce Professor Yann LeCun. Professor LeCun is the director of Facebook AI Research, also known as FAIR. He is also Silver Professor of computer science, neural science, and electrical and computer engineering at New York University, and the founding director of the NYU Center for Data Science. Before joining NYU, he led research departments in industry, including at AT&T and NEC. Professor LeCun has made extraordinary research contributions in machine learning, computer vision, mobile robotics, and computational neuroscience. Among these, he is a pioneer of convolutional neural networks and a founding father of convolutional nets. This work contributed to the creation of an exploding new field in machine learning called deep learning, which is now the artificial intelligence tool of choice for a wide range of applications, from images to natural language processing. His research contributions have earned him many honors and awards, including election to the US National Academy of Engineering. Today he will give a seminar titled "How Can Machines Learn as Efficiently as Animals and Humans?" I understand some of you told me you drove from Boston or other places quite far away. So without further ado, let's welcome Professor Yann LeCun for his talk. [APPLAUSE]

YANN LECUN: Thank you very much. It's a pleasure to be here. A game I play occasionally when I give a talk is to count how many former colleagues from AT&T are in the room. I count at least two: Chris Rose here, and Michael Littman. Maybe that's it. That's pretty good, two. Right. So, how can machines learn as efficiently as animals and humans? I have a terrible confession to make. AI systems today suck. [LAUGHTER] Here it is in a slightly less vernacular form. Recently, I gave a talk at a conference at Columbia called the Cognitive Computational Neuroscience conference. It was the first edition. And before me, Josh Tenenbaum gave a keynote where he said this: all of these AI systems that we see now, none of them are real AI. What he means by this is that none of them learn things that are as complicated as what humans can learn, nor learn things as efficiently as animals seem to learn them. We don't have robots that are nearly as agile as a cat, for example. We have machines that can play Go better than any human, but that's not quite the same thing. So that tells us there are major pieces of learning that we haven't figured out, things that animals are able to do that we can't do with our machines. And so, jumping ahead and telling you the punch line in advance: we need a new paradigm for learning, or a new way of formulating the old paradigms, that will allow machines to learn how the world works the way animals and humans do. The current paradigm of learning is basically supervised learning.
So all the applications of machine learning, AI, deep learning, all the real-world applications you actually see, most of them use supervised learning. There's a tiny number of them that use reinforcement learning, but most use some form of supervised learning. And supervised learning, I'm sure most of you in the room know what it is. You want to build a machine that classifies cars from airplanes. You show it an image of a car. If the machine says car, you do nothing. If it says airplane, you adjust the knobs on the machine so that the output gets closer to what you want. Then you show an example of an airplane and do the same. And you keep showing images of airplanes and cars, thousands of them, millions of them, adjusting the knobs a little bit every time. Eventually, if you're lucky, the knobs settle on a configuration that will distinguish every car from every airplane, including ones the machine has never seen before. That's called generalization ability. And what deep learning has brought to the table there, to supervised learning, is the ability to build those machines more or less automatically, with very little human input into how the machine needs to be built, except in very general terms.

The limitation of this is that you have to have lots of data that has been labeled by people. To get a machine to distinguish cars from airplanes, you need to show it thousands of examples. And it's not the case that babies or animals need thousands of examples of each category to be able to recognize them. Now, I should say that even with supervised learning, you can do something called transfer learning, where you train a machine to recognize lots of different objects, and then if you want to add a new object category, you can retrain with very few samples. And generally it works. What that tells you is that when you train a machine, it somehow figures out a way to represent the world that is independent of the task, even though you trained it for a particular task.

So what did deep learning bring to the table? Deep learning brought the ability to train those machines without having to handcraft too many of their modules. The traditional way of doing pattern recognition is you take an image, and you design a feature extractor that turns the image into a list of numbers that can be digested by a learning algorithm, whatever your favorite learning algorithm is: linear classifiers, [INAUDIBLE] machines, kernel machines, trees, or neural nets. But you have to preprocess the input into a digestible form. What deep learning has allowed us to do is design a learning machine as a cascade of parametrized modules, each of which computes a nonlinear function parametrized by a set of coefficients, and train the whole machine end to end to do a particular task. And this is kind of an old idea. Even in the 60s, people had the idea that it would be great to come up with learning algorithms that could train multilayer systems of this type. They didn't quite have the right framework, if you want, nor the right computers for it. And then in the 80s, something called backpropagation came along with neural nets that allowed us to do this. I'm going to come back to this in a minute. So the next question you can ask, of course, is what do you put in those boxes?
And the simplest thing you can imagine as a nonlinear function, and it has to be nonlinear, because otherwise there's no point in stacking boxes, is this: take an image and think of it as a vector, essentially. Multiply it by a matrix. The coefficients of this matrix are going to be learned. You can think of every row of this matrix being used to compute a dot product with the input vector, which produces a weighted sum of the inputs multiplied by those coefficients. That gives you another vector. And you pass each component of this vector through a nonlinearity like this one, for example, just half-wave rectification. So you have two different steps: linear, then pointwise nonlinear. Very simple. And you can show that by stacking two layers of this, you can approximate any function you want, as closely as you want, as long as you have sufficiently many of these guys in the middle, by tweaking the parameters of the two layers. But in fact, most functions we're interested in are more economically represented by many layers. And that's what is new about deep learning, if you want, compared to the neural nets of 30 years ago, which typically had only two or three layers. The deep learning systems of today have anywhere between 20 and 50 or 100 layers.

OK. So we have linear operators that are parametrized by coefficients. And in supervised learning, we're basically going to train the machine with some sort of objective function that measures the discrepancy between the output the machine produces and the output we want. This objective function is going to be differentiable. What we're going to do is compute the gradient of the objective function with respect to all the parameters in the machine, averaged over a number of training samples. Or, if we use stochastic gradient descent, averaged over a small batch of training samples, or even a single sample. And then take one step [INAUDIBLE] the gradient using the stochastic gradient update rule. Basically, the parameters are going to go down toward a minimum in a stochastic fashion as you train more and more.

So the next step is to compute the gradient of the objective function with respect to the parameters. And the way you do this is through backpropagation. I'm not going to go through this. The mathematical concept on which it's based is incredibly sophisticated. It's called the chain rule. [LAUGHTER] And some people learn this in high school. It basically comes down to the fact that if you have arranged parametrized functions in a graph of computation, which in this case is a very simple one, just a linear stack of modules, but it doesn't need to be such a simple graph, it could be any graph, then you can compute gradients by propagating signals backwards through this graph. Basically, you take the gradient of some cost function you want to minimize with respect to this red variable, and this gradient is represented by this green variable. Multiplying it by the Jacobian of this box, you get the gradient with respect to the input of that box. This is the chain rule. So it's this guy here: the gradient with respect to the input equals the gradient with respect to the output multiplied by the Jacobian. Very easy. And so you propagate this backwards through the graph. And the cool thing about this is that you can do it automatically by having a bunch of modules of this type that have been predefined. And you assemble them in a graph.
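To make this concrete, here is a minimal sketch in PyTorch (the framework mentioned a little later in the talk) of exactly this setup: a stack of linear and pointwise nonlinear modules, a differentiable objective, and stochastic gradient descent. The two-layer architecture, the sizes, and the synthetic data are illustrative choices, not anything from the talk.

```python
import torch
import torch.nn as nn

# A cascade of parametrized modules: linear, pointwise nonlinear, linear.
model = nn.Sequential(
    nn.Linear(20, 64),   # weighted sums of the inputs (learned matrix)
    nn.ReLU(),           # half-wave rectification applied pointwise
    nn.Linear(64, 2),    # scores for two categories, e.g. car vs. airplane
)

objective = nn.CrossEntropyLoss()                          # discrepancy between output and target
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One stochastic gradient step on a small batch of (input, label) pairs.
x = torch.randn(8, 20)                 # 8 toy input vectors (stand-ins for images)
y = torch.randint(0, 2, (8,))          # 8 toy labels

loss = objective(model(x), y)          # forward pass through the graph of modules
optimizer.zero_grad()
loss.backward()                        # backpropagation: the chain rule through every module
optimizer.step()                       # adjust the knobs a little bit
```

Repeating the last few lines over many batches is the "adjust the knobs a little bit every time" loop described above.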
And then automatically you get the gradients back. You don't have to figure out how to compute them. That's what all of those deep learning frameworks allow you to do. They're very simple to use. Our favorite one is called PyTorch. And there are several Jacobians for each of those boxes: one that propagates gradients back to the input, and others that propagate them to the parameters. And that allows you to compute all the gradients of the objective function, or whatever you want to minimize, with respect to all the parameters.

So, OK, backprop. That's an old idea. The basic idea of it actually goes back to Leibniz and Newton, obviously. But more recently, people in optimal control have used things like this, called adjoint state methods or adjoint system methods for optimal control, invented in the 60s. That's what NASA used to compute rocket trajectories and things of that type. It wasn't used for learning; it was used for optimal control. But it's a very similar idea. You think of those variables as being the control variables of a rocket, and this as being the trajectory of the rocket, if you want. Then people realized you could use this for learning in the late 70s and early 80s, but never quite made it work. It started being used in the late 80s, essentially, and that's when the second wave of neural nets took off, around 1986, 1987, when people realized you could train multilayer neural nets with this. And then it died in the mid 90s.

OK. So the next question you can ask is: those linear operators are nice, but if my image is a long vector with millions of pixels, I'm not going to multiply it by a matrix that's several million by several million. So you have to organize those linear operators in ways that make them practical for things like images or other high dimensional inputs. That's where the idea of convolutional nets comes in. It doesn't come from theoretical hypotheses; it was actually inspired by biology. I know there are neuroscientists in the room. This is inspired by Hubel and Wiesel, 1962, very classical work in neuroscience, Nobel Prize winning work. There were computational models of these basic ideas, following Hubel and Wiesel, by Fukushima with his neocognitron model, which was the inspiration for convolutional nets. And the basic idea is that in the visual cortex, and this is something you can derive from first principles, it's probably a good idea when processing images to be able to detect local features, by basically having a template that you match against the input. You get a score for how well this thing matches with this one, basically a dot product, the weighted sum of those pixels by those coefficients. And then you swipe this over the image everywhere, and the results are recorded in something we call a feature map here. That operation is a discrete convolution. And it's very similar to the kind of operation that what are called simple cells in the visual cortex do on images, where a particular neuron in the visual cortex is connected to a local neighborhood in the visual field and detects local features. So that's what this first layer is doing. These are multiple filters, the convolution kernels, [INAUDIBLE] filters applied to this image to produce those maps.
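As a small aside, here is a minimal sketch, in the same hedged PyTorch style, of the operation just described: a handful of learned templates swept over an image to produce feature maps. The number of filters and the kernel size are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

# One convolutional layer: 6 learned 5x5 templates swept over a grayscale image.
conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5)
relu = nn.ReLU()

image = torch.randn(1, 1, 32, 32)      # a toy 32x32 single-channel "image"

# Each output channel is a feature map: at every location, a dot product
# between a 5x5 patch of the input and one of the learned kernels.
feature_maps = relu(conv(image))
print(feature_maps.shape)              # torch.Size([1, 6, 28, 28])
```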
And then you do what's called a pooling operation, where you take a local patch of those filtering results, after the nonlinearity, and you compute an average or a max or an L2 norm, or something like this. And you subsample the result, so that the windows over which you compute this aggregation are stepped by more than one pixel. Here they're stepped by two pixels, so you get a map that's half the resolution of this one. And then you repeat the process. So you get convolutions again: this guy is the result of applying convolution kernels to each of those maps, adding up the results, and passing them through a nonlinearity. And then again there is pooling and subsampling. So as you go up the layers, you get representations that are more global and more abstract, etc. And this is really the idea of simple cells and complex cells, the pooling layers being sort of a realization of the complex cells. That's the drawing from Fukushima's paper on the neocognitron, where you had those kinds of simple cells and complex cells.

So this is a convolutional net. This is meant to be an animation. I'm not sure why it's not animating. But it's not animating. And not only that, it actually crashed my computer. All right. I'm going to have to do something very brief for just a minute. OK. Now it works. So this is an old convolutional net trained in the early 90s to recognize handwriting. What you can see here is the input, and this is the first layer, six feature maps. Then pooling and subsampling, second layer. Pooling and subsampling, third layer. And by the time you get here, each unit, each pixel represents the activation of a unit that basically sees the entire input, or at least a square on the input. So a slice through this represents an entire character, essentially, in an abstract form. And the good thing we realized pretty quickly is that we could use it not just to recognize single objects, but also multiple objects. And that's very important. So here you basically have multiple copies of the same convolutional net applied to a sliding window over the input. And it's actually very cheap to do this. You can apply the convolutional net convolutionally; it's convolutions all the way. People sometimes call this a fully convolutional net now. At the output, you get a score for every window and every category. Here I'm just showing the winning score, with a gray scale to indicate the score of the category. And then a very simple post-processing pulls out the correct interpretation. So here, the cool thing is that the system can recognize objects without prior segmentation. You don't have to separate the digits before being able to recognize them. And that's really important if you want to apply those things to natural images, where objects appear against a background and you can't actually figure out how to separate them from the background. So that was an important thing.

And then going forward a number of years, almost 10 years, to 2003, someone at DARPA came to us and said, can you use machine learning, neural nets, let's say, to drive robots? And so we built this little truck robot here. It's just a radio-controlled truck with two analog cameras. And we had this truck driven by someone for about 20 minutes, or a total of maybe two hours.
And that person would be instructed to drive straight and veer off whenever there was an obstacle. After some training, you feed the network the two images from the two cameras, and you train the network to emulate the steering angle of the human driver. And then you let the robot loose, and it gets through this kind of horrible, busy Jersey backyard here, driving itself through these obstacles. So we showed this to DARPA. And they said, oh, that's great, we're going to start a program called LAGR and have six different teams compete. That would be nice if this slide actually showed. Here we go. Six different teams competed. They would all get the same robot, and you would train this robot, using machine learning, to figure out whether it can drive over a particular area or not. And so we used this convolutional net that would look at bands in the image and then label every pixel as to whether it's traversable or not. Something like this. And the cool thing is that you can actually get ground truth, more or less, through stereo vision. Using a stereo vision system, because this robot has multiple cameras, you can figure out if something sticks out of the ground. But that only works up to about 10 meters; beyond that it doesn't work. So you train a neural net with the labels collected from stereo, and then you run the neural net on the whole image. And it does this. It figures out where a path is, essentially. It figures out that here in the back there is this row of obstacles and the little passageway in between. And this thing kind of worked pretty well. There were, again, six different teams competing on this. We were the only ones to use convolutional nets. The project started in 2005 and ended in 2008. So there is a fast vision system that uses stereo, and then a slower vision system that uses this neural net. And then you combine all the results in a map, and you can do some planning to figure out how to get to a particular goal. The map here is centered on the robot, so it's relatively easy to plan. And the system actually trains itself as it goes. It adapts, collecting labels from the stereo vision. It learns how to navigate new environments it's never seen before, even with the pesky grad students who try to annoy this poor robot. [LAUGHTER] The robot weighs about 100 kilos. It could probably break their legs. But they're pretty sure it's not going to do that, because they actually wrote the code and they trained it. This was Raia Hadsell, who at that time was a PhD student with me, and who now leads the Robotics Research Group at DeepMind. And Pierre Sermanet, who is at Google Brain, also working on robotics.

So a couple of years later, we realized we could use the same kind of technology not just for labeling pixels in an image as to whether they're traversable or not, but for labeling them with object categories. And some datasets started to appear, with maybe a couple thousand images, that allowed us to train a convolutional net to do this. So again, this is a convolutional net applied to the whole image. Each output of the convolutional net is influenced by a window on the input, which is something like 40 by 40 pixels at high resolution, 90 by 90 pixels at half resolution, and 180 by 180 pixels at quarter resolution. So it sees a big context to make a decision for a single pixel. But it makes a decision for every pixel.
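Here is a rough sketch of that multi-scale idea, with a toy network and made-up sizes: the same convolutional net is applied to the image at full, half, and quarter resolution, and the coarse outputs are upsampled and combined, so each pixel's label is informed by progressively larger contexts. This only illustrates the general scheme described above, not the actual system from the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes = 33   # the talk mentions 33 categories; here it is just a constant

# A tiny shared convolutional net that outputs per-pixel class scores.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=7, padding=3), nn.ReLU(),
    nn.Conv2d(16, num_classes, kernel_size=7, padding=3),
)

image = torch.randn(1, 3, 240, 320)   # toy RGB frame

scores = 0
for scale in (1.0, 0.5, 0.25):        # full, half, and quarter resolution
    small = F.interpolate(image, scale_factor=scale, mode='bilinear',
                          align_corners=False)
    out = net(small)                  # same net, larger effective context at coarse scales
    # Upsample the coarse predictions back to full resolution and accumulate.
    scores = scores + F.interpolate(out, size=image.shape[-2:], mode='bilinear',
                                    align_corners=False)

labels = scores.argmax(dim=1)         # one category per pixel
print(labels.shape)                   # torch.Size([1, 240, 320])
```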
And the cool thing about this is that we can run it in real time. So this was implemented on what's called an FPGA, which is a sort of programmable hardware. And it could run at about 20 frames per second, classifying into 33 categories. And it was far from perfect. You know, it classified those areas here as sand or desert, and this is the middle of Manhattan, so there's no sand that I'm aware of. But it worked pretty well. So we submitted a paper to CVPR in 2011. And it was soundly rejected. The reviewer comments were either "what the hell is a convolutional net?" or "how is it possible that you get such good results with a technique we've never heard of?" So it's kind of funny. We afterwards submitted it to ICML, where it was accepted. And the funny thing is, back in 2011, you couldn't get a paper accepted at a computer vision conference if you used neural nets. Now you cannot get a paper accepted at CVPR unless you actually use convolutional nets. So there was a complete revolution over the next few years.

That gave some ideas to a few people working on self-driving cars around that time, around 2013-14, who realized they could use those convolutional net based semantic segmentation techniques to label every pixel in an image as to whether it's traversable or not, or whether it's a pedestrian or a road or something like this. So this is some work at Nvidia. This is work at Mobileye, which now belongs to Intel. Mobileye produced the systems that were used in Tesla cars for autonomous driving until mid 2016; then the two companies divorced, they weren't agreeing with each other somehow. So now Tesla is developing its own system. Nvidia has a big project on this, which I may come back to.

And then around 2012, the big revolution occurred. And what that was is the use of very large convolutional nets, implemented on GPUs to run really efficiently, and trained on large datasets like the ImageNet dataset, which has a million training samples and 1,000 categories. It turns out those things work really, really well when you have lots of categories and lots of training samples, and when you make them big. The first to really make an efficient implementation of those networks on GPUs were Geoff Hinton and his students, Alex Krizhevsky and Ilya Sutskever. They presented the result at an ImageNet workshop at ECCV in fall 2012, and then had a paper at NIPS in winter 2012. And that basically made the computer vision field completely change, and basically jump-started the deep learning revolution. That revolution had started in speech recognition a couple of years earlier. And the interesting thing is that we ended up seeing an inflation in the number of layers used by those convolutional nets. This is the VGG network, which was one of the top performers in 2013. Then GoogLeNet in 2014, which had even more layers. And then ResNet. Kaiming He and his collaborators from Microsoft Research Asia had this idea of having skip connections, which basically solved the problem that sometimes, when you train a very deep neural net, some of the layers die. The weights don't go anywhere, and that kills the entire thing. So they use those skip connections to prevent catastrophically bad things from happening if some layers die. And that turned out to be a very, very good idea. It seems almost too simple, but in fact it works really, really well.
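Here is a minimal sketch of the skip connection idea just described, with arbitrary channel sizes: each block learns a residual that is added to its input, so even if the convolutional path contributes little, the signal and the gradient can still pass straight through the sum.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block: output = input + learned residual."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + residual)   # the skip connection: add the input back

# Stacking many such blocks still trains, because gradients flow through the sums.
net = nn.Sequential(*[ResidualBlock(16) for _ in range(50)])
out = net(torch.randn(1, 16, 32, 32))
print(out.shape)                         # torch.Size([1, 16, 32, 32])
```

Real ResNets add normalization layers and downsampling stages, which are omitted from this sketch.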
And so you can train neural nets with 50 layers, 100 layers, 150 layers, and they work really well. There are more modern versions of this. One, called DenseNet, which is a collaboration between people at FAIR and people at Cornell, is a version of this designed to run efficiently, etc. And so one question you might ask is, why do we need all those layers? Right? Theoretically, you can approximate any function with only two layers. Why do you need many layers? One possibility is the fact that the world is compositional. Images are basically composed of pixels, and pixels are arranged together to form things like edges and colored blobs and so on. By detecting combinations of those, you can detect things like circles and corners and gratings. Combinations of those form parts of objects, and combinations of those form objects, et cetera. So there is this kind of hierarchical nature of the perceptual world, which is captured by those layered architectures. We used to take weeks to train those networks. Now we can train one of those networks to basically state-of-the-art performance in about an hour, on a very large machine with 250 GPU cards in it. It's actually multiple machines; each machine has 8 GPUs, and you stack them up. So you can do these kinds of things if you are at Facebook or at Google. It's a little more difficult in a university environment.

But here are some more recent results on computer vision. This is a bit of a snapshot of the state of the art. This is a model called Mask R-CNN, which is a system that does not just semantic segmentation, but instance segmentation. I'm not going to bore you with all the details. I'm just going to tell you that it beats all the records on standard datasets like COCO. And here's an example of the results you can get. So again, it's conceptually very simple: a convolutional net with some sort of system that detects regions of interest, and then a slightly more complex convolutional net applied to those regions of interest. The output of the network is not just a category; it's a category, the coordinates of a bounding box, and an image of a mask of the object at the same resolution as the input. So for every object, you get the category, you get the mask of the person or the object, and you get a bounding box. And it detects the baseball, the dog, the individual people, even though they all overlap. So this is instance segmentation, not just semantic segmentation. With semantic segmentation, you would have just one big blob here labeled "people". You can detect wine glasses and wine bottles, very important for French people, computers, et cetera. Backpacks, umbrellas, sheep, you can count sheep. Overlapping cars, things like that. It works amazingly well. It's also trained to detect key points on human bodies, so you can infer the body pose of people in photos and videos. There's actually more of this which I can't show you. A scaled-down version of this runs at 5 frames per second on a smartphone.

And then there are new applications of convolutional nets to 3D data. This is a recent competition called ShapeNet, where the dataset consists of 3D objects represented as point clouds from a depth sensor. They have been manually segmented into regions or parts, and the goal is essentially to label every region with the correct label.
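For readers who want to try this kind of instance segmentation, here is a hedged sketch using the pretrained Mask R-CNN model that ships with torchvision. This is an off-the-shelf reimplementation, not the exact system from the talk, and the loading API may vary with the torchvision version. As described above, it returns, for each detected object, a category label, a bounding box, and a mask.

```python
import torch
import torchvision

# Pretrained Mask R-CNN from torchvision (an off-the-shelf reimplementation).
# Depending on your torchvision version, you may need weights=... instead of pretrained=True.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

# A toy input; in practice this would be a real RGB image with values in [0, 1].
image = torch.rand(3, 480, 640)

with torch.no_grad():
    predictions = model([image])[0]    # one dict per input image

# For every detected object: a category, a bounding box, and a per-pixel mask.
for box, label, score, mask in zip(predictions['boxes'], predictions['labels'],
                                   predictions['scores'], predictions['masks']):
    if score > 0.5:
        print(label.item(), box.tolist(), mask.shape)   # mask is 1 x H x W
```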
And what turned out to win this recent competition was a 3D convolutional net produced by Ben Graham and Laurens van der Maaten. This is the original paper that describes the idea of a sparse 3D convolutional net, and there are some other contributors to the system. It's a library you can download. The basic idea is to only compute convolutions in areas where you have populated voxels, because in a 3D environment, most of the voxels are empty. You don't want to be computing convolutions everywhere where there is nothing, so you just follow the areas where there is something. It turns out to be much faster and easier to train, and they actually won the competition with this technique.

Another application of convolutional nets, a more recent one, is a system that's actually deployed at Facebook and uses convolutional nets for language translation. You feed in a sentence in English, and it goes through a bunch of convolutions. It's actually a gated convolutional network, so those are gated linear units, which I'm not going to go into the details of; there is pointwise multiplication going on here. And then it goes into this kind of weird alignment system that produces German words, word by word, and then lines them up in an appropriate way. It's very fast, it's very efficient, and it works really well. And this is what is used for translating some pairs of languages on Facebook. Facebook can translate 2000 pairs of languages. A number of them are translated using old-style phrase-based statistical methods, a number of them are translated using recurrent neural nets, and then a small number of them are translated using this system, which is now being trained on more and more language pairs. A lot of the research that we do at FAIR, in fact all of it, is open. We publish everything we do, generally very quickly, on arXiv. And we also release most of our code as open source. These are a few examples of some of the stuff we've deployed and distributed as open source. I would single out PyTorch. This is a deep learning framework with a Python front end. It is very simple to use, it's very good for research, and it's more transparent than TensorFlow. And there are of course a lot of applications of those things to medical imaging and things like that, which I'm not personally working on, but a lot of my colleagues are.

But what's missing from all this is two things. One is, how do we learn reasoning and memory and things like that? And the second one is, how do we learn general things that animals and humans can learn without being told the name of everything, without being given labeled data? So this is work by a bunch of people from Facebook AI Research in Menlo Park in California. Justin Johnson was an intern at Facebook from Stanford, and Fei-Fei Li is his advisor. And the idea here is, can we use deep learning to do things like visual reasoning? Could we answer questions like this one: is there a matte cube that has the same size as the red metal object? You have to read this a few times to figure out what operation you really have to do here. And the idea they came up with is very cool. You take the question, "Are there more cubes than yellow things?", and you feed it through a recurrent neural net that represents it as essentially a single vector of fixed size.
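Here is a minimal sketch of that first step, with a made-up toy vocabulary: the words of the question are embedded and fed through a recurrent net (a GRU here, as an arbitrary choice), and the final hidden state serves as the fixed-size vector representing the question.

```python
import torch
import torch.nn as nn

# Toy vocabulary and question, purely for illustration.
vocab = {'are': 0, 'there': 1, 'more': 2, 'cubes': 3, 'than': 4, 'yellow': 5, 'things': 6}
question = ['are', 'there', 'more', 'cubes', 'than', 'yellow', 'things']
tokens = torch.tensor([[vocab[w] for w in question]])      # shape: 1 x 7

embed = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32)
encoder = nn.GRU(input_size=32, hidden_size=128, batch_first=True)

_, hidden = encoder(embed(tokens))     # run the recurrent net over the word sequence
question_vector = hidden[-1]           # fixed-size representation of the whole question
print(question_vector.shape)           # torch.Size([1, 128])
```

A second network then maps this vector to the modules to execute, as described next.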
And then you run this through another recurrent net that spits out a kind of representation of a computation graph. Think of it as a visual program, which gets instantiated in this graph. Those are actually trainable blocks, all with the same architecture. So there is one block that is supposed to filter out all the objects that are yellow, and another one that filters out the cubes. One block counts how many yellow things there are, this one counts how many cubes there are, and then one compares the two and figures out the answer. Right. And you don't predefine what those blocks should do. You initialize it a little bit with heavy supervision, by specifying what the program here should be and which blocks should be assembled, even though the blocks are not trained initially. And then you backpropagate the gradients needed to get the right answer through this whole thing, including the convolutional net. And eventually this thing figures out what those blocks should do. Of course, it needs to learn all those keywords and learn how to do reasoning. But the interesting thing about it is that it's completely dynamic. If you change the question, it's going to change the graph. So the graph that you propagate gradients through changes every time. And that's why dynamic graphs are so important in deep learning nowadays, and why people are so excited about them for things like natural language understanding. Dynamic graphs is the situation where the computational graph that you use to compute your answer changes when the data changes. There's actually more recent work along those lines by Aaron Courville at the University of Montreal, where they don't actually have to specify a program like this. You just stack multiple blocks, and it just works. It's pretty cool.

OK. So for the statisticians in the room, since I've been invited by biostatisticians: deep learning breaks all the basic rules of statistics. I mean, not all of them, but some of them, right. The models are enormous, often with many, many more parameters than there are training samples. Take one of those convolutional nets for ImageNet: there are 1 million training samples, and some of those models have 100 million parameters. And they still work quite well. They can often nail the training set perfectly. And often there is no explicit regularization, but it still works. How is that possible? The loss function is highly non-convex. It's got a ridiculously large combinatorial number of saddle points. But still, you pretty much get the same result every time you train. What that tells you is that maybe there are local minima, but they're all pretty much equivalent. In fact, there are experiments that seem to suggest they're all connected; there is basically only one local minimum. I mean, not one, but essentially one. Little attention is paid to managing uncertainty, beyond using very simple things like a softmax on the output when you do classification. But there's a lot of effort spent on computational issues, like efficiently implementing all those things. So it's very unusual. It breaks the rules you see in statistics textbooks. And that might be a reason why some people who are more theoretically oriented initially had a lot of skepticism towards neural nets.

OK. But let me switch to the point I really want to make with this talk, which is, where do we go from here? So deep learning works very well.
There are a lot of applications we can use it for. Even if we didn't do any research anymore, just with the techniques we've developed so far, there are probably a lot of different industries that are going to be affected, that we can apply this to. In fact, Andrew Ng said something like this recently: stop doing research, just apply the stuff that we already know. I don't think it's a good idea, and I don't think he believes it completely either. But it's interesting that he would say this. So what are the real obstacles to making significant progress? Because as I said before, all the stuff you see, that's not real AI. Our machines do not learn with the same kind of efficiency that we observe animals and humans learning with. So how do we get machines to learn how the world works, to learn common sense or something like that?

That raises questions going back to the inspiration from biology. Does the brain use one learning algorithm? Or does it use 50 learning algorithms? Or maybe 200? Or maybe it's a complete [INAUDIBLE], the result of evolution, with no underlying principle behind it, just the result of millions of years of evolution. How much prior structure does animal or human learning require for intelligence to emerge in a reasonable amount of time? All the learning algorithms that people in machine learning and statistics have come up with minimize, or I should say optimize, some sort of objective function. Does the brain optimize an objective function? What would that function be? If it optimizes a function, does it do it by evaluating a gradient? If it evaluates a gradient, how does it do it? It probably doesn't do backprop in the way that we understand it today. And how does it handle uncertainty in prediction, which I think is a crucial issue? So there are all kinds of questions like this that really connect AI and machine learning with neuroscience.

And one big missing ingredient in AI, or maybe a holy grail, is common sense. There's a subarea of AI called commonsense reasoning. It's not actually a solution to a problem; it's more of a problem. And the question is how we get machines to acquire common sense. Common sense is the common sense of everyday things: that unsupported objects fall, that some objects are stable and some are not. If I let this guy go, it's going to fall, even if I briefly hold it vertically. If I take this object and hide it behind my computer, you still know it's there; it hasn't disappeared. That's object permanence. So those are things we learn. How do we learn the structure of the world? And one hypothesis, perhaps, is that our brains are prediction machines. They learn to predict all the missing information from whatever is available at this time. And then time passes, or you move your head, or whatever, and new information becomes available, and that allows you to train your world model with the new information. So if I want to learn that the world is three dimensional, I'm going to learn it because it's the best explanation for how the world changes when I move my head. My view of the world changes when I move my head from side to side, and the best explanation for how it changes is the notion of depth. So necessarily, if my brain is trained to predict what the world is going to look like when I move my head, it's going to have to somehow represent the notion of depth.
In the same way, if I let this go and I stop the movie right there, and then I ask the machine, or ask my brain, what's going to happen next, it's going to predict that this guy is going to fall down, of course, because of gravity. So it just needs to wait for time to pass to train itself, to see if its prediction was correct. That would be predictive learning. But learning to predict is not just predicting the future from the present and the past. It might also be predicting what the blind spot of your retina contains without even looking. If you fixate on a particular place, there is a particular spot in your visual field where you're essentially blind, because that's where your optic nerve goes through your retina. You don't see anything there. But you don't realize it, because your brain fills it in, essentially. So things like filling in the visual field at the retinal blind spot, filling in occluded images, filling in missing segments in speech, predicting the state of the world from a partial textual description, predicting the consequences of your actions, predicting sequences of actions leading to a result: all of those are filling in the blanks, if you want. And common sense, I would surmise, is the ability to fill in the blanks through the construction of world models.

Object permanence is something babies learn around the age of two or three months, which is why peekaboo is so funny for little babies, because you actually disappear when you hide your face. So here's a baby orangutan. It's being shown a magic trick. The guy puts an object in the cup, and then he shakes the cup and takes the object out without showing the orangutan, and then shows the inside of the cup. The cup is empty. And the orangutan rolls on the floor laughing. That obviously broke his world model: there's object permanence, objects don't disappear like that. And one of three things can happen when your world model is broken. You laugh, because it's really funny. Or it's really interesting and you pay attention, because your world model is wrong, so you need to learn a new world model based on this new data that you predicted wrongly. Or something really dangerous might happen that you didn't predict, and so you're scared. So that's what happens when your world model is wrong.

So how do we do this in a machine? How do we get machines to learn all those things about the world? Learn gravity? These are slides I borrowed from Emmanuel Dupoux, who is a developmental cognitive scientist in Paris at the Ecole Normale Superieure. If you do an experiment like this, you take this little car, you put it on this support, and you push it. And it goes off the edge and doesn't fall. Of course, it's held from the back, but the baby doesn't see that. Before six months, the baby says, yeah, sure, that's the way the world works, fine, no problem. After eight months, they go like this: they open their eyes and they fixate and they say, what's going on? And they don't actually say "what's going on", obviously, because they can't talk, but they look like they're saying it. And so with this kind of technique, by basically measuring how long babies fixate and observe and open their eyes like crazy, you can figure out at what stage babies learn things. And again, this is from Emmanuel Dupoux. Things like object permanence you learn pretty quickly.
Biological motion, the fact that there are objects that move by themselves and others that are inanimate, you learn that by three months. Whether objects are rigid or not. Different types of natural categories: chairs, tables, cars, etc. Stability and support. And basic intuitive physics, gravity, inertia, conservation of momentum, that arrives around eight months, roughly between six and eight months. And there's a bunch of other things like that, which happen at various stages. And this is not learned in supervised mode. It's not like babies are told the name of objects; it's not like they are directed in any way for any of this. They basically learn this by observation. They're really not well developed in motor control either, so they don't get to do a huge amount of interaction with the world. So there's no way this can be learned through interaction, by some sort of direct reinforcement learning. There's another mechanism going on, where you learn how the world works by observation. And that's the piece we're missing in our current machine learning and AI systems.

So, in fact, I need to apologize in advance to Michael, but he knows what I'm going to show. There are three sort of paradigms of learning. There is reinforcement learning, where basically the machine at each trial is given a scalar value to tell it whether it did well or not. That's great for games. The machine does an action, and it either gets a reward or not. Or sometimes it has to make a whole sequence of actions before it gets a reward. And it works great when it's combined with deep learning. The problem is that it requires a huge amount of training samples, an enormous amount, because the amount of information you give to the machine at every trial is extremely small. It's very weak, a small amount of information, so you need to do this many, many times for the machine to learn anything complicated. With supervised learning, you need somewhat fewer samples, because you give more information every time: you give it the correct answer. If there are a dozen categories, that's more than just a single scalar value. So you need fewer samples to learn similarly complex tasks. And then with predictive learning, or unsupervised learning, you ask the machine to predict basically every future variable from every present or past variable, or every unseen variable from every seen variable. So there is a lot more information you ask the machine to predict, and that's probably why you can learn a lot more about the structure of the world this way.

So that led me to this completely obnoxious slide, which I have to show in every talk now: the analogy between intelligence and a chocolate cake, where the [INAUDIBLE] of the cake is basically unsupervised or predictive learning, because that's where the bulk of the information goes. The bulk of the information given to the machine is really in that mode of learning. The icing on the cake is supervised learning; there is considerably less information provided to the machine per trial in supervised mode. And in reinforcement mode there is very little information given to the machine, so that's going to be equivalent to the cherry on the cake. The first time I showed this slide was actually while giving a talk at DeepMind, and DeepMind is the temple of reinforcement learning. So it was sort of obnoxious on purpose, a little bit.
But now I've kind of fallen into the obsession of showing it in every talk. So the problem with pure reinforcement learning, and Michael will correct me if I'm wrong, is that if you use it in its purest form, you need so many trials to learn any kind of complex behavior that if you were to train a self-driving car to drive, and to learn not to run off a cliff, it would have to run off a cliff about 50,000 times before it figures out it's a bad idea, and then another 50,000 times before it figures out how not to run off a cliff. And you know, it's half a joke. That's the reason it works really well for games: you can run games very quickly on many computers at the same time, at many thousands of frames per second. But it doesn't really work in the real world, because you cannot run the real world faster than real time. That's a thing that sucks about the world. And anything you do in the real world can kill you, like running off cliffs. Maybe it's a good thing that we can't run the real world faster than real time. So perhaps what we need is to build models of the world that we can run faster than real time, and that we can run without the risk of killing ourselves. Those would be predictive models. If we can predict, before we run off a cliff, that we're going to run off the cliff, we won't run off the cliff. And perhaps that's the way we learn to drive. We know not to get off the road, because we know bad things will happen if we do.

Reinforcement learning works really well for games. There was a smashing demonstration of how well this works for Atari games and Go and Doom, though not yet StarCraft; that's very much work in progress at FAIR and DeepMind and various other places. It's very complicated. But it works really well, and the latest AlphaGo Zero is pretty amazing in that way. But again, it's a particularly simple situation, where the number of actions is discrete, the world is completely observable, and the reward is fairly clear. And you can run the environment, which is a Go board, at tens of thousands of frames per second, essentially. It works pretty well, even for games like Doom. This is a Doom competition that was won by a team from Facebook. Actually, teams with Facebook people won two years in a row, in '16 and '17, using basically deep reinforcement learning techniques. So we do work on reinforcement learning at Facebook. The cake I showed, you have to notice that it is a black forest chocolate cake. The cherry is not optional on this cake. In fact, it's got little bits of cherries all around inside. [LAUGHTER]

OK, as I said, we also work on StarCraft. StarCraft is an extremely challenging situation, because there are multiple time scales, there are continuous actions, and it's not fully observable. You can't tell what your opponent is doing unless you send scouts to look. So it's very complicated in that sense. We've done a little bit of reinforcement learning for local micro-management of tactics. There's actually an open source platform from Facebook called ELF, with MiniRTS, which is basically a StarCraft-like real-time strategy game. But here is a suggestion. I said we need our machines to be able to learn predictive models of the world. This idea is very old. It goes back a long time, but in particular to one of Rich Sutton's papers, where he was proposing what he called the Dyna architecture.
And he said the main idea of Dyna is the old, commonsense idea that planning is trying things in your head, using an internal model of the world, and that this suggests the existence of a more primitive process for trying things not in your head, but through direct interaction with the world. And he said: reinforcement learning is the name we use for this more primitive and direct kind of trying, and Dyna is the extension of reinforcement learning to include a [INAUDIBLE] world model. In fact, this terminology doesn't really exist today; all of this is called reinforcement learning. It's just that the version that has a model is called model-based reinforcement learning, and the other one is called model-free reinforcement learning. But it's basically the same thing. And this idea that you should have a world model, which in optimal control is called a plant simulator or a plant model, but it's the same thing, this idea of using a predictive world model to be able to reason about what to do, what action to take, is really an old idea in the context of optimal control. So a typical situation in optimal control, and you can look at classical textbooks going back to the 60s, is that you have a model of the world that gives you the state of the world at time t plus 1 as a function of the state at time t and the action you take. And then the state of the world is sent to an objective function that measures how good the state of the world is. So you can run this model of the world, and through backprop through time and gradient descent, figure out a sequence of commands that will optimize this objective function over time. If your world simulator is differentiable, you can do this through backprop and gradient descent. If it's not, you have to do things like [INAUDIBLE] programming or something like that. So the main problem we're going to have is, how do we learn this world model? How do we learn a model that will allow our machine to predict what the state of the world at time t plus 1 is going to be, as a function of the state at time t and our action, and perhaps the actions of others in the environment? That's the problem of predictive or unsupervised learning.

And that led me to state that... oops. I'm not sure how that happened. Apologies. Wow, it went forward by like 10 slides. So that led me to this statement: the next revolution in AI will not be supervised. I stole the concept of this slide from Alyosha Efros at Berkeley. And so we have to think about what would be the architecture of a real intelligent system, a sort of autonomous intelligent system. It would be something like this: an agent that produces actions on the world, and the world responds with percepts. And of course, the world might not care about your actions at all, or it might care only vaguely. The agent has an internal state which is sent to an objective function, and the objective function produces a value that basically tells the agent whether it's happy or not. So the objective function is a measure of unhappiness of that agent: you get a small value if you're happy, a large value if you're unhappy. What the agent is trying to do is bring the world into a state that will bring itself into a mental state that this red function identifies as happy. And there are models of how animal brains are built that are basically this way, where this is your entire brain, except the basal ganglia.
And that's the basal ganglia. The basal ganglia is the thing at the bottom of your brain that basically determines your level of happiness or comfort or discomfort or pain, things like that. So inside this agent, if we believe the argument I made previously, the system should have some sort of world simulator that allows it to predict what the state of the world is going to be as a consequence of a sequence of actions. And then two other modules; this is sort of standard nomenclature in RL. An actor, which produces action proposals that can be simulated in the world. And then a critic, whose role is to predict the long-term expected value of this objective. So this guy basically computes emotions. If it predicts that your objective function is going to rise, making you very unhappy or in pain, that creates fear, essentially: you don't want to get anywhere near that state. And this guy predicts what happens. So this guy predicts this, this guy doesn't quite predict that, but this guy actually predicts that as well. And so now the problem becomes, how do we train this world simulator? Because the rest, we more or less know how to do. We don't know how to build this. But if we knew, we could do something like this: get the state of the world through your perception module, initialize your world simulator, propose a sequence of actions, and then refine the sequence of actions so as to minimize the expected cost computed by the critic. And then train the actor to produce this optimal sequence of actions, take the first action, and shift everything by one time step.

So how do we learn forward models of the world? This is an experiment that was done at Facebook a couple of years ago by Adam Lerer, Sam Gross, and Rob Fergus, where they put up a stack of cubes, in a simulator, not the real world, and then they observe what actually occurs. And they train a convolutional net to predict what's going to happen, by learning the masks of the objects. What you get is a pretty accurate prediction that this tower is going to fall this way, but fairly fuzzy predictions for tall towers, where it's kind of ambiguous which way things are going to fall. So you get those kinds of fuzzy predictions here, because you can't exactly predict where things are going to fall. So how do we solve that problem? I'm going to skip this. This is why predictive models are good for question answering systems and natural language processing, but I'm going to skip it in the interest of time.

So here's the problem we have to deal with. Those towers can fall in a number of different directions, and we can't really predict, just from the look of them, which direction they're going to fall in. So, I don't know if we can find a pen here or any kind of vertical thing; I'm going to do it with a piece of paper. If I put this piece of paper here on the table and I let it go, you can be pretty sure it's going to fall, but you can't really tell which direction it's going to fall in. Every time I do it, it's probably going to fall in a different direction. So you can't really use supervised learning to train something like this. Because if I give the machine the initial segment and then ask it to predict, the machine predicts one outcome. If that happens, that's fine. If this other thing happens, then the machine now has to predict this. But the next time around, it's going to predict that.
And so the best thing the machine can predict is kind of an average of the outcomes, which is not a good answer. So consider something like this, where let's say you observe two variables which have a dependency between them. This is pretty elementary for anybody who works on probabilistic models. Let's say these are the data points you observe: your world consists of two variables, and these are your observations. If I give you a particular value of Y2, you can infer basically two values for Y1. But if you try to learn this with an L2, least-squares criterion, you're going to predict something right in the middle, which is not a good answer. So you have to somehow be able to predict one or the other, but not an average of the two. Or predict a distribution. But how do you represent distributions in high dimensional spaces? So the unsupervised learning problem is: how do you capture the dependency between things like this? And one possible way is to learn a contrast function. Think of it as an energy function, or a negative log probability if you are a probabilist. These are your data points, and you want those to have low energy, which means high probability, and you want everything else to have higher energy, or lower probability. So the blue points are the data that you observe. The green points are not data. And you want the energy of the green points to be higher than the energy of the blue points. If you have a parametrized function that computes this energy over the space of Ys, it's easy enough to tweak its parameters so that when you see a blue point, you make the output go down. But how do you make sure that the value of your function is higher outside of those points? How do you generate those green points? There are basically seven or eight different methods for doing this, but I'm only going to talk about a couple.

The first one is adversarial training. The basic idea of adversarial training is basically the scenario I was talking about. You have a predictor here, and this predictor looks at the past, let's say, if you want to do video prediction. So it looks at the past, and it has access to a source of random vectors, and it's going to produce a prediction. The precise prediction is going to depend on the value of this vector, and as the value of this vector changes, the prediction goes through a set of plausible outputs, represented by this red ribbon here. So let's say we show the machine a small segment of video, and we ask it, what is the world going to look like half a second from now? And the machine predicts this: it predicts that the pen is going to fall to the back and to the left. And in fact, we let time pass, and what happens is this: the pen falls to the back and slightly to the right. We don't want to punish the machine for making the wrong decision here, because it's qualitatively correct. So what we'd like is an objective function that gives low cost if you are on this red ribbon and high cost if you are outside. And that's exactly what I was talking about earlier: you want a function like this one that gives low cost to things that look reasonable and high cost to things that don't. The thing is, we don't know how to characterize this function, so we're going to have to learn it. So in adversarial training you have two functions you learn: one that predicts, and one that tells the system whether the predictions are good or not.
And basically it works like this. So you have an initial segment of a video. For example, if you do video prediction, the data tells you here is how the video ends. And you train this contrast function, called the discriminator, or sometimes actually the critic, to produce a low output for things that actually occur in the world. So those are the blue points. So we make the function take a low value for things that actually occur. And then you feed this past to the generator. You have it generate a prediction, which initially sucks. And so you feed it to the discriminator. And you tell the discriminator to produce a large output for it. So those are the green points. Make that value large. And so next time around, the value the discriminator produces for those predictions is going to be higher. But here is what you do simultaneously. Simultaneously, you backpropagate gradients through the discriminator to train the generator to produce Ys that make the discriminator produce low outputs. OK. So basically, the generator gets information about how to change its parameters so as to change its output so that the green points get closer to the blue points, essentially, to a region that the discriminator gives low energy to. So eventually it looks like this, where the green points match the blue points more or less in distribution if you're lucky, because those things are kind of finicky. And it works. So you can train those things with past frames. Or you can just train it on images to generate images from random vectors. So this thing has access to a source of random vectors. If you train this thing on images of bedrooms, you get-- those are non-existent, generated bedrooms. And they all look kind of reasonable, except maybe for this guy. It looks like an Austin Powers kind of bedroom, or whatever. But you know, they all have a bed and windows and dressers and lights, and stuff like that. And those are basically a bunch of random numbers going into a convolutional net that has been trained to produce bedroom images. And they don't look like anything in the training set. They're different from any training set image. So there are various versions of those GANs. There's a whole menagerie of different types of GANs nowadays. There are CycleGANs and InfoGANs and WGANs and IWGANs, and an infinite number of GANs. There is another family of generative models of this type called variational autoencoders. This is when trained on ImageNet. So this is something called an Energy-Based GAN trained on ImageNet. And it doesn't actually produce objects. But it produces things that from far away kind of look like objects, [INAUDIBLE] abstract. This is trained on dogs. It's kind of funny. I mean, people do much better than this now. But it's still funny. OK. So here is an example of video prediction. So here it's a convolutional net that looks at four frames and predicts two future frames. And it looks at the images at multiple scales. And it's a pretty complicated architecture. And this is the prediction you get if you train with least squares. So if you train this video predictor with least squares, you get blurry predictions. If you train it with this adversarial training criterion combined with some others, you get this kind of prediction, considerably sharper. So the first four frames are observed. The last two frames, indicated in red here, are predicted. And so the motions basically continue. And they seem fairly reasonable.
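Here is a minimal sketch of that adversarial training loop, on a toy 2-D data distribution rather than images or video; the network sizes, the ring-shaped data, and the hyperparameters are illustrative assumptions.

```python
# A minimal sketch of the adversarial training loop described above, on a toy 2-D
# distribution. All architecture and hyperparameter choices are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
noise_dim = 8

# Generator: random vector -> fake sample. Discriminator: sample -> "realness" score.
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, 2))
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def real_batch(n=128):
    # Toy "real" data: points on a ring of radius 2.
    angle = 2 * torch.pi * torch.rand(n)
    return torch.stack([2 * torch.cos(angle), 2 * torch.sin(angle)], dim=1)

for step in range(2000):
    # 1) Train the discriminator: high score (low "energy") for real points,
    #    the opposite for generated ones.
    real = real_batch()
    fake = G(torch.randn(len(real), noise_dim)).detach()
    d_loss = bce(D(real), torch.ones(len(real), 1)) + \
             bce(D(fake), torch.zeros(len(fake), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Train the generator by backpropagating through the discriminator,
    #    pulling generated points toward the region the discriminator favors.
    fake = G(torch.randn(128, noise_dim))
    g_loss = bce(D(fake), torch.ones(128, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

The two alternating steps mirror the description above: the discriminator is trained to separate real samples from generated ones, while the generator receives gradients through the discriminator that move its outputs toward the data region.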
There's a little bit of blurriness. But it's not too bad. This is when trained on video segments from apartments in New York. So the camera rotates. And the system has to basically invent what the room looks like as the camera rotates. So here is a bookcase. And this part of the bookcase-- so this is observed. Now it's predicted. This part of the bookcase is invented. So it figures out that a bookcase has to continue. It figures out that a couch has to continue. So it captures some regularity of what an apartment in New York is supposed to look like. Something that maybe is more interesting for people interested in self-driving cars. This is a dataset called Cityscapes. And-- oops. And this is a system where you take a video sequence, and you run a semantic segmentation system on the video sequence. So what you get is a bunch of maps which give you a category label for every pixel. Much like this: blue is car, sidewalk is pink, pedestrian is red, and things like that. And what this thing predicts is-- so it predicts, in this case here, half a second into the future. It predicts that pedestrians keep crossing the street. The car that is turning left keeps turning left. The scenery keeps moving. So it's useful, if you want to work on self-driving cars, to have the ability to predict what's going to happen ahead before it happens. It might allow you to train, for example, a reinforcement learning system without actually crashing, but just by predicting a crash. Here's a new model, a more recent one, just submitted actually, called the error encoding network. So this one-- in fact, the one that actually works is slightly different from this one. But this is a simpler version to explain. So this one basically trains a model. So it looks at the past. It runs through a few layers of a neural net. It produces an internal state. And ignore the top for the time being. Then it runs through a generator essentially, another part of the neural net that produces a prediction, say another frame in the video. And you train this using least squares, or something like this, against what is actually observed. And then you play a trick. What you do is you take the difference between those two. So this is a vector, the vector of the difference between the target and the prediction. You feed this to a parametrised trainable function. And then you feed the output of that function to the hidden layer. You add it to the hidden layer. And you train this guy so that this variable is going to take a value that minimizes the prediction error. But this variable only depends on the prediction error. And so basically, this part of the network, when this value is set to zero, predicts whatever is predictable. And this guy basically parametrises whatever is not predictable, which is the residual error, and figures out how to represent the latent variable that will actually correct that mistake. So that might represent the-- for example, you observe someone playing a game and moving something on the screen. The physics of how things move on the screen is essentially predictable. That's Newtonian physics. But the action that the player takes maybe isn't. And so that would essentially represent the action that the player played. That would be very useful for things like imitation learning, for example. Here's an example of how this can be used. And I'm probably going to end here. So you have to wait a little bit.
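Here is a highly simplified sketch of that error-encoding idea, not the architecture from the actual paper: a deterministic predictor, the residual between the target and its prediction, and a small network that maps that residual through a low-dimensional code back into the hidden state to correct the prediction. The layer sizes and toy data dimensions are assumptions.

```python
# A simplified sketch of the error-encoding idea described above (not the paper's
# architecture). Key structure: a deterministic prediction, a residual, and a latent
# correction computed only from that residual and added back to the hidden state.
import torch
import torch.nn as nn

x_dim, y_dim, h_dim, z_dim = 16, 16, 64, 4

encoder = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())   # past -> hidden state
decoder = nn.Sequential(nn.Linear(h_dim, y_dim))              # hidden -> prediction
phi     = nn.Sequential(nn.Linear(y_dim, z_dim), nn.Tanh(),   # residual -> low-dim code
                        nn.Linear(z_dim, h_dim))              # code -> hidden-sized correction

params = list(encoder.parameters()) + list(decoder.parameters()) + list(phi.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
mse = nn.MSELoss()

def training_step(x, y):
    h = encoder(x)
    y_det = decoder(h)                 # deterministic prediction (correction set to zero)
    residual = y - y_det               # what the deterministic path failed to predict
    z = phi(residual.detach())         # latent correction, computed only from the residual
    y_corrected = decoder(h + z)       # corrected prediction using the latent
    loss = mse(y_det, y) + mse(y_corrected, y)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Toy usage with random data, just to show the shapes and the training call.
x, y = torch.randn(32, x_dim), torch.randn(32, y_dim)
print(training_step(x, y))
```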
So this is a dataset that was produced by Sergey Levine, [INAUDIBLE] and a few other people at Berkeley. So there is an object. There is a robot arm. And the robot randomly pokes the object. So the result is that after being poked, the object has moved a little bit. And these are predictions for how the object could have been moved by the poke. This is pure pixel prediction, pixel-space prediction. So the system has no notion of object or anything. These are the predictions it makes. And each different prediction is generated by a different sample of the Z variable, the latent variable, or the action variable. You can think of this as basically an encoding of what the robot arm did without actually having to observe what it did. So it's action inference, if you want. OK. I've spoken for long enough, so I'm going to stop here and take your questions. Thank you very much. [APPLAUSE] AUDIENCE: Hey. [INAUDIBLE] Real quick question. So can you break-- so, let's just think about images. Are you trying to-- we use essentially biology and things we know about the world to segment the image. What if you took a camera and did a combinatorial scramble of the pixels, which is a huge potential scramble. Does it break everything? YANN LECUN: It scrambles the pixels? AUDIENCE: It scrambles the pixels. YANN LECUN: Yeah. AUDIENCE: You know, it's combinatorially huge. YANN LECUN: Yeah, that's right. So if you do a fixed scramble and you use a convolutional net, the convolutional net will have a hard time figuring out the thing, because it's based on the idea that neighboring pixels are correlated, and that a local patch of pixels can be represented efficiently by just a few features. So it probably would have a very hard time. Now it turns out there's a paper by Pascal [INAUDIBLE] on [INAUDIBLE] from way back where they show that if you take a collection of images that you've perturbed through a fixed permutation of the pixels, you can actually recover the topology by figuring out the local correlations between pixels. So in principle, it would be possible to make this work if you [? hardwired ?] this. AUDIENCE: Thank you for giving a talk today. I'm a big fan of yours, actually. [INAUDIBLE] talk to me. And recently D-Wave Systems' quantum computer is actually being deployed in practice right now. How would you envision quantum computing affecting deep neural networks in general? YANN LECUN: Yeah, if you didn't hear the question, it's about whether quantum computing will affect deep learning in some way. It's not entirely clear to me. So D-Wave is not actually deployed in practice. It's experimented with by people. And there are a few attempts. But it's not actually used in practice for commercial deployment, if that's the question. And the D-Wave system is not a full quantum computer, in the sense that it uses quantum tunneling for more efficient function optimization. It's not entirely clear that you need this at all for any of the tasks that I talked about. So I think it's still up in the air whether quantum computing will have any effect. It's possible you could do nearest neighbor much faster with quantum computing. It's not even clear to me that you can, but it's possible. So, it's unclear. AUDIENCE: So I actually have two questions. The first question is that [INAUDIBLE] if the dataset is very small, like in the area of a [INAUDIBLE], but only [INAUDIBLE] maybe X-ray imaging or even less. [INAUDIBLE] So I read something about zero-shot, one-shot, and two-shot
[INAUDIBLE]. So what do you think of [INAUDIBLE]. And the second question is, are any of the AI [INAUDIBLE] developed by Facebook or developed [INAUDIBLE], [INAUDIBLE]. YANN LECUN: All right. Yeah, OK. Let me answer the first question first. So, the small data regime. There are basically currently two ways to handle it. One is transfer learning. So for example, you want to do image recognition. And you want to do, I don't know, medical imaging or something like this. And you don't have enough data. So one approach is you train your neural net on a big dataset that you actually have, either with the same type of images, or even completely different types of images, as long as the statistics are similar, like ImageNet for example. You know, it's not the same type of image. But it's OK. [INAUDIBLE] And then you can do transfer learning. So you take that pre-trained machine, and then you retrain it on your data -- it helps to just retrain the top two or three layers, to limit the number of parameters. That works really well. So there is actually a service within Facebook that does this for the product divisions within Facebook. So to give you an idea, there are 2.1 billion users on Facebook. And the users upload on the order of 1.5 billion photos every day. So there's 1.5 billion a day. Every single one of those photos goes through four convolutional nets that we know about -- it goes through way more, but these four pre-trained convolutional nets in particular. So one basically recognizes tags of various types on the image. It recognizes objects. It recognizes the type of image: is this a birthday or a wedding or a landscape or an indoor scene or a [? macrophoto ?] or whatever. And this is used for feed ranking basically, to decide whether to show particular images to particular people who have particular interests. The second one filters objectionable content -- so basically violence, pornography, things like that. The third one generates captions for images, for the visually impaired, so that if you're blind and you're on Facebook, you can get an idea of what's in the picture from this text description. And then the last one, which is turned on in the US, but not in many other countries -- not turned on in Europe -- does face detection. So it tags your friends automatically. So that was for the first question. Now there's a second answer to the first question. And the second answer to the first question is you can use unsupervised training or pre-training. So basically, you don't just train the system to classify your medical images into cancer or non-cancer. You also train it to reconstruct its input. And that has a regularization effect. So there are situations, certain types of architectures -- things called ladder networks, or stacked [INAUDIBLE], or U-Nets -- where this type of learning actually helps supervised learning and reduces the need for labeled data. OK. So that was-- ultimately, I think that unsupervised learning is going to solve all of these problems. Now, your second question was about those bots. There was a big story in the press a few months ago that said that researchers at Facebook had created two bots that were supposed to talk to each other in English. And they were supposed to cooperate to solve a task -- it's a kind of reinforcement learning task. And they ended up using the English language in ways that were not really initially predicted. They would use words in a funny way to communicate with each other.
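To make the transfer-learning recipe from the first answer above concrete, here is a minimal sketch using a pre-trained ResNet-18 from torchvision as a stand-in: freeze the pre-trained layers and retrain only a new top layer on the small target task. The model choice, the two-class head, and the hyperparameters are illustrative assumptions, not the specific setup used at Facebook.

```python
# A minimal sketch of the transfer-learning recipe described above: take a network
# pre-trained on a large dataset, freeze most of it, and retrain only the top layer
# on the small target dataset. ResNet-18 and the two-class head are illustrative choices.
import torch
import torch.nn as nn
from torchvision import models

# Load a network pre-trained on ImageNet.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all pre-trained parameters so only the new head gets trained.
for p in model.parameters():
    p.requires_grad = False

# Replace the final classification layer with one for the new task
# (say, two classes of medical images).
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head's parameters are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 2, (8,))
loss = loss_fn(model(images), labels)
optimizer.zero_grad(); loss.backward(); optimizer.step()
```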
And so some of the newspapers right after basically said AI is going to kill us all. Some tabloid published an article saying, oh my god, Facebook researchers had this project where two bots invented their own language. And they had to unplug the computer in panic mode, because they were going to take over the world or something. And it's completely insane, because there was a blog post about it and a paper that was published. And basically, these people are interested in natural language understanding. And they trained those systems to use English. And they ended up not using English in the way you would normally use it. So they said, the experiment failed. Let's try something else. It's not like the Hollywood sci-fi movie where you see these guys grabbing the electrical cables, and there are sparks flying and all that stuff, right. Nothing like that. But it's really funny -- funny in a way, kind of depressing a little bit -- how some of the press describes those things. There were a lot of articles in the more serious press afterward that said that's complete bunk, which is good. AUDIENCE: Thank you. AUDIENCE: Hi. I have a comment here. I have a comment and a question. The first comment is that earlier you said there are many systems that have many more parameters than the number of pixels, or whatever you're talking-- YANN LECUN: Samples. AUDIENCE: Samples. [INAUDIBLE] I think from a statistics point of view, it's the central limit theorem [? doing its ?] [? job. ?] That's my comment. YANN LECUN: Which theorem? AUDIENCE: Central limit. YANN LECUN: Oh, the central limit theorem. AUDIENCE: [INAUDIBLE] I think. But, OK. My second question is actually related to this. All your examples kind of work. Are there any theoretical scientists, computer scientists working on the foundations of these kinds of things? What makes it converge, and what doesn't? YANN LECUN: Yeah. I mean, there are a lot of different types of people working on those questions, some of them computer scientists, but many of them either physicists or mathematicians. So I've been involved in an effort for many years to try to get the applied math and pure math communities interested in those questions. And I've only been successful in the last year or two. Same for the physicists. So basically, there are results in random matrix theory that can be applied to understanding the landscape of the objective functions of those networks. And they would seem to show that the number of saddle points in those loss functions is combinatorially large. But on the other hand, although there might be a lot of local minima, they're all pretty much at the same energy level. So it doesn't matter which one you find. And then there is empirical evidence for the fact that the local minima are extremely degenerate. So if you move in a large number of dimensions around those local minima, the objective function is essentially flat. And there's a small number of directions where it's not flat. That depends on the complexity of the problem. And there's also empirical evidence that [INAUDIBLE] showed in a paper, which is that if you take two solutions -- so you start from two random initial conditions, you train your neural net, you get two different solutions -- then you go in a straight line between the two, and you barely go up. And if you bend the path just a little bit, then you can go from one minimum to the other without going up.
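Here is a minimal sketch of that interpolation experiment: train the same small network from two random initializations, then evaluate the loss at points along the straight line between the two solutions in weight space. The tiny MLP and the synthetic task are illustrative assumptions; the actual experiments use much larger networks.

```python
# Train the same small network from two random initializations, then evaluate the loss
# along the straight line between the two weight vectors. Tiny MLP and synthetic data
# are illustrative assumptions.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 10)
y = (X.sum(dim=1, keepdim=True) > 0).float()     # a simple synthetic binary task
loss_fn = nn.BCEWithLogitsLoss()

def train_from_scratch(seed):
    torch.manual_seed(seed)
    net = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-2)
    for _ in range(500):
        loss = loss_fn(net(X), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return net

net_a, net_b = train_from_scratch(1), train_from_scratch(2)

# Evaluate the loss at points (1 - t) * theta_a + t * theta_b along the line.
for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    net_t = copy.deepcopy(net_a)
    with torch.no_grad():
        for p_t, p_a, p_b in zip(net_t.parameters(), net_a.parameters(), net_b.parameters()):
            p_t.copy_((1 - t) * p_a + t * p_b)
        print(f"t={t:.2f}  loss={loss_fn(net_t(X), y).item():.4f}")
```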
So that tends to show that there's basically only one minimum. It's very degenerate. And it's connected everywhere. The usual intuition we have of a local minimum in one dimension is completely wrong. Building a box in a hundred million dimensions is very hard, because you need a lot of walls. So there are always going to be directions where you can escape. And that creates saddle points. So that's one thing. And then there is work on generalization ability. Like, why do those things generalize the way they do, even though they are way overparameterized? There's an interesting recent paper -- one of the co-authors is Ben Recht from Berkeley -- where they showed that you can take an ImageNet-style network, a convolutional net, and set the labels to completely random labels. And those neural nets can still learn the training set completely without errors. One million training samples, and they will just nail it, 100% correct. Of course, the test error is chance. But what that means is that there is a huge amount of capacity in those networks that they are able to recruit if they need to. But when you train them on things that make sense, they don't overfit that much. They do overfit, but not ridiculously. AUDIENCE: Hi. So it seems like it's very clear that it's important to have a strong predictive model of the world to achieve intelligence. But it also seems like there may be other components to it, things such as creativity or metacognition. So do you have any thoughts on how we might achieve those other parts of intelligence? YANN LECUN: So metacognition is probably number 562 in the list of problems we have to solve, a list that maybe has 1,000 items. So I'm not sure about that. But creativity -- I think those GANs actually exhibit some level of creativity. So there are people, for example, at Rutgers, one of them actually now at Facebook, who used GANs to generate paintings, abstract paintings in particular styles. And they look really nice. So that raises the question of what creativity really means. We have a couple of projects at Facebook that I can't talk about yet, but soon, that also involve creating kind of artistic artifacts using those generative models. And they look interesting. People who actually are in the business of creating such artifacts are actually impressed. AUDIENCE: Hi. I do some particle physics here. I'm an undergrad. And one of the big problems that we're facing in implementing technologies like this is that the data we have is collected almost from a third-person perspective, where you have access to all the available information in three dimensions. And so it's very hard to take a first-person, camera-view perspective of an event and try to pick apart what's going on. What are the major computational challenges-- what's the difference between taking like a camera view of these scenes and dissecting them with a convolutional neural net versus somehow finding an effective way of analyzing three-dimensional information? YANN LECUN: OK. So a number of different answers there. So first of all, there is quite a lot of interest in the use of convolutional nets in the context of high energy physics, basically for trajectory filtering, so filtering events that are interesting. I'm sure that's the kind of stuff you were thinking of. I actually gave a talk at CERN maybe a couple of years ago, or a year and a half ago, and met a bunch of people working on this. And it's really expanding.
There's a colleague of mine at NYU called Kyle Cranmer who has been working on this kind of stuff, actually using those GANs. He's come up with good ideas on characterizing trajectories, or generative models of trajectories. That said, very often those trajectories are in 3D, and you'd like to be able to analyze them in 3D. So you could use those 3D convolutional nets that I was talking about earlier, in the middle of the talk. They are sort of efficient for this, because most of the voxels in a high energy physics experiment are empty, so you would like to be able to concentrate the computation where things are relevant. That's one thing. The second thing is that there is a new set of ideas I didn't talk about called graph convolutional nets, or spectral networks. So the basic idea is that a normal image -- you can think of an image as a function on a grid graph, on a regular grid. The pixels form a grid. You can think of it as a graph where each pixel is connected to its nearest neighbors. And that's just a reflection of the fact that neighboring pixels are correlated. Now imagine that you have data that comes to you not on a flat grid graph, but on a weird graph, like a cylinder or something, like the calorimeter in a high energy physics experiment, or with some other set of sensors that is non-Euclidean. You can actually define convolutions on those spaces. And they're basically operators that are diagonal in the eigenbasis of the graph Laplacian, where the graph represents the neighborhood relationships. And so people have actually come up with ways to apply convolutional nets to those non-Euclidean domains. In fact, there is going to be a tutorial at NIPS on precisely that topic in exactly one week, Monday next week, which I'm a co-speaker on. And I'm actually going to speak. There's going to be [INAUDIBLE]. AUDIENCE: You talked about-- sorry. You talked about systems that both learn and reason. And it seems to me like you argued that to get a strong AI, you would need to do both of these things. Now it seems to me like obviously humans do this. But humans in a lot of ways are very dumb. They make a lot of mistakes. And they're very plastic. And they need to learn to reason. Whereas a lot of AI systems and reinforcement learning systems do something very smart that takes a lot of computational power. And it's very much hard-coded. Do you think we'll see a trend towards dumber and more plastic reasoning systems? YANN LECUN: So I think most reinforcement-- Michael, correct me if I'm wrong. But I think most reinforcement learning systems that people are training today are actually completely reactive. They are very simple in terms-- I mean, there's very little actual reasoning, other than things like AlphaGo and AlphaGo Zero, where there is tree exploration in the set of possible futures, which is used for training. Once it's trained, it actually just plays without much tree exploration. So there's not a huge amount of reasoning there. And that's a limitation not of reinforcement learning per se, but of the architectures we use for all of our AI systems. So I think what we consider intelligent behavior involves this ability to predict. In fact, I think the essence of intelligence really is the ability to predict. And so if you have a good model of the world that is accurate for prediction, then you can use it to plan a sequence of actions ahead, and perhaps model uncertainties about it. And things like this.
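Returning to the spectral-network idea mentioned a moment ago, here is a minimal sketch of a graph convolution defined as an operator that is diagonal in the eigenbasis of the graph Laplacian. The ring graph stands in for a non-Euclidean sensor layout, and the particular filter is an arbitrary illustrative choice.

```python
# A minimal sketch of a spectral graph convolution: a filter that is diagonal in the
# eigenbasis of the graph Laplacian. The ring graph and the smoothing filter are
# illustrative choices standing in for a non-Euclidean sensor layout.
import numpy as np

n = 16
# Adjacency of a ring graph: each node connected to its two neighbors.
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

D = np.diag(A.sum(axis=1))        # degree matrix
L = D - A                         # combinatorial graph Laplacian

# Eigendecomposition of the Laplacian: the eigenvectors play the role of a Fourier basis.
eigvals, U = np.linalg.eigh(L)

# A spectral filter: some function of the Laplacian eigenvalues (here, a smoothing filter).
filter_coeffs = np.exp(-0.5 * eigvals)

def graph_conv(signal):
    # Transform to the spectral domain, multiply by the diagonal filter, transform back.
    spectrum = U.T @ signal
    return U @ (filter_coeffs * spectrum)

x = np.random.default_rng(0).normal(size=n)   # a signal living on the graph's nodes
print(np.round(graph_conv(x), 3))
```

In a learnable spectral network, the filter values (or a parametrisation of them, such as polynomial coefficients in the eigenvalues) would be trained rather than fixed.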
So this is what reasoning really is about: predicting ahead what's going to happen, not necessarily in time, but also sort of simulating, manipulating models. Like when you think in your head about mathematics or various other things, very often you have mental models that you manipulate. They are simulators in a way. You give them inputs, and they change. And things like that. That, I think, is really the essence of reasoning and intelligence. ROSSI LUO: Looking at the clock, it's 5:30. I'm going to take one last question. And if you have additional questions, you can probably just bring them [? briefly ?] to the [? floor ?] discussions afterwards. AUDIENCE: What's-- I'm not that familiar with deep learning neural nets. But I'm curious. If I wanted to learn an object up to something like affine transformations, can I do transfer learning to do that? Can you learn a whole group of transformations, and then learn an object, and then have the object under those transformations? YANN LECUN: So yes and no. So if you take a convolutional net, for example, and you train it on datasets like ImageNet that have lots of different instances of the same objects under various viewpoints and things like this, it learns the notion of object relatively independently of the viewpoint, but not completely. So it can recognize a dog whether it's a profile view or a frontal view. But if you turn the head of the dog upside down, it probably won't be able to recognize it. The same way we have a hard time recognizing people when their faces are upside down. AUDIENCE: Not exclu-- little rotations, shears, things like that. YANN LECUN: Right, right. So small rotations, shears, and scaling -- that's handled by the pooling operation in convolutional nets. AUDIENCE: Right. But there's nothing, no explicit geometric-- YANN LECUN: No. There's no explicit 3D geometry. There's no real explicit geometry at all, except for the fact that whenever a feature is detected in one location, it can also be detected in other locations, and the fact that there is this pooling operation that basically builds in a little bit of resistance -- smoothness to variations in the location of particular features. So small variations in the position of elementary features due to rotation, shear, and things like this will actually-- AUDIENCE: You're pooling them. And that's why you're getting them. But you're not explicitly modeling it. Same thing with Newtonian physics. There's no built-in physics yet, right? YANN LECUN: Right. There's-- no. No built-in physics. AUDIENCE: Thank you. ROSSI LUO: The main event, I think, is over. And if you have additional questions, you're welcome to briefly discuss them with Professor LeCun afterwards. And thanks [INAUDIBLE]. And let's give Professor Yann LeCun a round of applause. [APPLAUSE]