Alright. Hello everybody. Hopefully you can hear me well. Yes? Yes. Great! So, welcome to Course 6.S094, Deep Learning for Self-Driving Cars. We will introduce to you the methods of deep learning, of deep neural networks, using the guiding case study of building self-driving cars. My name is Lex Fridman. You get to listen to me for a majority of these lectures, and I am part of an amazing team with some brilliant TAs. Would you say brilliant? (CHUCKLES) Dan Brown. You guys want to stand up? They're in the front row. Spencer, William Angell. Spencer Dodd and, all the way in the back, the smartest and the tallest person I know, Benedict Jenik. What you see there on the left of the slide is a visualization of one of the two projects, one of the two simulations, games, that we'll get to go through. We use it as a way to teach you about deep reinforcement learning but also as a way to excite you, by challenging you to compete against others, if you wish to, for a special prize yet to be announced. Super secret prize. So you can reach me and the TAs at deepcars@MIT.edu if you have any questions about the tutorials, about the lecture, about anything at all. The website cars.mit.edu has the lecture content and code tutorials. Like today, the lecture slides for today are already up in PDF form. The slides themselves, if you want to see them, just e-mail me, but they're over a gigabyte in size because they're very heavy in videos, so I'm just posting the PDFs. And there will be lecture videos available a few days after the lectures are given. So, speaking of which, there is a camera in the back. This is being videotaped and recorded, but for the most part the camera is just on the speaker, so you shouldn't have to worry. If that kind of thing worries you, then you could sit on the periphery of the classroom, or maybe I suggest sunglasses and a mustache, a fake mustache, would be a good idea. There is a competition for the game that you see on the left.
I'll describe exactly what's involved. In order to get credit for the course, you have to design a neural network that drives the car just above the speed limit, sixty-five miles an hour. But if you want to win, you need to go a little faster than that. So who is this class for? You may be new to programming, new to machine learning, new to robotics, or you're an expert in those fields but want to go back to the basics. So what you will learn is an overview of deep reinforcement learning, of convolutional neural networks, recurrent neural networks, and how these methods can help improve each of the components of autonomous driving: perception, visual perception, localization, mapping, control, planning, and the detection of driver state. Okay, two projects. Code named "DeepTraffic" is the first one. In this particular formulation of it, there are seven lanes. It's a top view. It looks like a game, but I assure you it's very serious. The agent in red, the car in red, is being controlled by a neural network, and we'll explain how you can control and design the various aspects, the various parameters, of this neural network, and it learns in the browser. So we're using ConvNetJS, which is a library that is programmed by Andrej Karpathy in JavaScript. So, amazingly, we live in a world where you can train a neural network in your browser in a matter of minutes. And we'll talk about how to do that. The reason we did this is so that there are very few requirements to get you up and started with neural networks. So in order to complete this project for the course, you don't need any requirements except to have a Chrome browser. And to win the competition you don't need anything except the Chrome browser.
The second project, code named "DeepTesla" or "Tesla", is using data from a Tesla vehicle of the forward roadway and using end-to-end learning: taking the image and putting it into a convolutional neural network that directly maps, a regressor that maps, to a steering angle. So all it takes is a single image and it predicts a steering angle for the car. We have data for the car itself, and you get to build a neural network that tries to do better, tries to steer better than, or at least as well as, the car. Okay. Let's get started with the question, with the thing that we understand so poorly at this time because it's so shrouded in mystery, but it fascinates many of us. And that is the question of: "What is intelligence?" This is from a March 1996 Time magazine. And the question "Can machines think?" is answered below with "they already do." So what, if anything, is special about the human mind? It's a good question for 1996, a good question for 2016, 2017 now, and the future. And there are two ways to ask that question. One is the special purpose version. Can an artificial intelligence system achieve a well-defined, specifically, formally defined finite set of goals? And this little diagram is from a book that got me into artificial intelligence as a bright-eyed high school student, "Artificial Intelligence: A Modern Approach". This is a beautifully simple diagram of a system. It exists in an environment. It has a set of sensors that do the perception. It takes those sensors in. It does something magical. There's a question mark there. And with a set of effectors it acts in the world, manipulates objects in that world. So that's special purpose. Under this formulation, as long as the environment is formally defined, well defined, as long as the set of goals is well defined, as long as the set of actions, sensors, and the ways that the perception carries itself out are well defined, we have good algorithms, which we'll talk about, that can optimize for those goals.
The question is, if we inch along this path, will we get closer to the general formulation, to the general purpose version of what artificial intelligence is? Can it achieve a poorly defined, unconstrained set of goals with an unconstrained, poorly defined set of actions and unconstrained, poorly defined utility functions, rewards? This is what human life is about. This is what we do pretty well most days: exist in an undefined world, full of uncertainty. So, okay. We can separate tasks into three different categories. Formal tasks: this is the easiest. It doesn't seem so, it didn't seem so at the birth of artificial intelligence, but that's in fact true if you think about it. The easiest is the formal tasks: playing board games, theorem proving, all the kinds of mathematical logic problems that can be formally defined. Then there are the expert tasks. So this is where a lot of the exciting breakthroughs have been happening, where machine learning methods, data-driven methods, can help aid or improve on the performance of our human experts. This means medical diagnosis, hardware design, scheduling. And then there is the thing that we take for granted. The trivial thing. The thing that we do so easily every day when we wake up in the morning. The mundane tasks of everyday speech, of written language, of visual perception, of walking, which, as we'll talk about in today's lecture, is a fascinatingly difficult task, and of object manipulation. So the question that we're asking here, before we talk about deep learning, before we talk about the specific methods, is this: we really want to dig in and try to see what it is about driving, how difficult driving is. Is it more like chess, which you see on the left there, where we can formally define a set of lanes, a set of actions, and formulate it as a set of five actions: you can change your lane, you can avoid obstacles. You can formally define an obstacle. You can formally define the rules of the road.
Or is there something about natural language, something similar to everyday conversation, about driving that requires a much higher degree of reasoning, of communication, of learning, of existing in this under-actuated space? Is it a lot more than just left lane, right lane, speed up, slow down? So let's look at it as a chess game. Here are the chess pieces. What are the sensors we get to work with on an autonomous vehicle? And we'll get a lot more in-depth on this, especially with the guest speakers who built many of these. There are the range sensors: radar, lidar. They give you information about the obstacles in the environment. They help localize the obstacles in the environment. There's the visible light camera and stereo vision, which gives you texture information, which helps you figure out not just where the obstacles are but what they are, helps to classify those, helps to understand their subtle movements. Then there is the information about the vehicle itself, about the trajectory and the movement of the vehicle, that comes from the GPS and IMU sensors. And there is the rich state of the vehicle itself. What is it doing? What are all the individual systems doing? That comes from the CAN network. And one of the less studied sensors, but fascinating to us on the research side, is audio. The sounds of the road that provide the rich context of a wet road. The sound that a road makes when it has stopped raining but it's still wet. The screeching tires and the honking. These are all fascinating signals as well. And the focus of the research in our group, the thing that's very much under-investigated, is the internal-facing sensors. The driver, sensing the state of the driver. Where are they looking? Are they sleepy? The emotional state. Are they in the seat at all? And the same with audio. That comes from the visual information and the audio information. More than that. Here are the tasks.
If you were to break into modules the tasks of what it means to build a self-driving vehicle: first, you want to know where you are. Where am I? Localization and mapping. You want to map the external environment, figure out where all the different obstacles are, where all the entities are, and use that estimate of the environment to then figure out where I am, where the robot is. Then there is scene understanding. It's understanding not just the positional aspects of the external environment and the dynamics of it, but also what those entities are. Is it a car? Is it a pedestrian? Is it a bird? There is movement planning. Once you have figured out, to the best of your abilities, your position and the position of other entities in this world, it's figuring out a trajectory through that world. And finally, once you've figured out how to move about safely and effectively through the world, it's figuring out what the human that's on board is doing, because, as I will talk about, the path to a self-driving vehicle (and hence our focus on Tesla) may go through semi-autonomous vehicles, where the vehicle must not only drive itself but effectively hand over control from the car to the human and back. Okay, quick history. Well, there's a lot of fun stuff from the eighties and nineties, but the big breakthroughs came in the second DARPA Grand Challenge with Stanford's "Stanley", when they won the competition. One of five cars that finished. This was an incredible accomplishment in a desert race. A fully autonomous vehicle was able to complete the race in record time. Then came the DARPA Urban Challenge in 2007, where the task was no longer a race through the desert but through an urban environment, and CMU's "Boss" with GM won that race. A lot of that work went directly into the acceptance of this technology and into large, major industry players taking on the challenge of building these vehicles. Google, now "Waymo", with its self-driving car. Tesla with its "Autopilot" system and now "Autopilot 2" system.
Uber with its testing in Pittsburgh. And there are many other companies, including nuTonomy, one of the speakers for this course, that are driving the wonderful streets of Boston. Okay. So let's take a step back. Think about the accomplishments in the DARPA Challenge, and look at the accomplishments of the Google self-driving car, which essentially boils the world down into a chess game. It uses incredibly accurate sensors to build a three-dimensional map of the world, localize itself effectively in that world, and move about that world in a very well-defined way. Now, the open question is: if driving is more like a conversation, like a natural language conversation, how hard is it to pass the Turing Test? The Turing Test, in its popular current formulation, asks: can a computer be mistaken for a human being more than thirty percent of the time? When a human is talking behind a veil, having a conversation with either a computer or a human, can they mistake the other side of that conversation for being a human when it's in fact a computer? And the way you would build a natural language system that successfully passes the Turing Test is as follows. There is the natural language processing part, to enable it to communicate successfully, so generate language and interpret language. Then you represent knowledge, the state of the conversation transferred over time. And the last piece, and this is the hard piece, is the automated reasoning. Can we teach machine learning methods to reason? That is something that will propagate through our discussion because, as I will talk about with the various methods, the various deep learning methods, neural networks are good at learning from data, but there is not yet a good mechanism for reasoning. Now, reasoning could be just something that we tell ourselves we do to feel special, to feel like we're better than machines. Reasoning may be something as simple as learning from data.
We just need a larger network. Or there could be a totally different mechanism required, and we'll talk about the possibilities there. Yes. (Inaudible question from one of the attendees) No, it's very difficult to find these kinds of situations in the United States. So the question was, for this video, is it in the United States or not? I believe it's in Tokyo. So India, like a few European countries, is much more towards the direction of natural language versus chess. In the United States, generally speaking, we follow rules more concretely. The quality of roads is better. The marking on the roads is better. So there are fewer requirements there. (Inaudible question from one of the attendees) These cars are driving on one side? I see. I just- Okay, you're right. It is because, yeah- So, but it's certainly not the United States. I spent quite a bit of time googling, trying to find this in the United States, and it is difficult. So let's talk about the recent breakthroughs in machine learning. What is at the core of those breakthroughs is neural networks, which have been around for a long time, and I will talk about what has changed, what the cool new things are, what hasn't changed, and what its possibilities are. But first: a neuron, crudely, is a computational building block of the brain. I know there are a few folks here, neuroscience folks; this is hardly a model. It is mostly an inspiration, and so the human neuron has inspired the artificial neuron, the computational building block of a neural network, of an artificial neural network. I have to give you some context. These neurons, for both artificial and human brains, are interconnected. In the human brain there are, I believe, about 10,000 outgoing connections from every neuron on average, and they're interconnected with each other. The largest current artificial neural network, as far as I'm aware, has 10 billion of those connections, synapses. Our human brain, to the best estimate that I'm aware of, has 10,000X that.
So one hundred to one thousand trillion synapses. Now, what is an artificial neuron? That is the building block of a neural network. It takes a set of inputs. It puts a weight on each of those inputs, sums them together, and applies a bias value on each neuron. Then an activation function takes as its input that sum plus the bias and squishes it together to produce a zero-to-one signal. And this allows a single neuron to take a few inputs and produce an output, a classification for example, a zero or a one. And, as we'll talk about, it can simply serve as a linear classifier, so it can draw a line. It can learn to draw a line, like what you see here, between the blue dots and the yellow dots. And that's exactly what we'll do in the IPython Notebook that I'll talk about, but the basic algorithm is: you initialize the weights on the inputs and you compute the output. You perform this previous operation I talked about: sum up and compute the output. And if the output does not match the ground truth, the expected output, the output it should produce, the weights are punished accordingly, and we'll talk through a little bit of the math of that. And this process is repeated until the perceptron does not make any more mistakes. Now here's the amazing thing about neural networks. There are several, and I'll talk about them. One, on the mathematical side, is the universality of neural networks. With just a single layer, if you stack neurons together, a single hidden layer, with the inputs on the left, the outputs on the right, and in the middle a single hidden layer, it can closely approximate any function. Any function. So this is an incredible property: with a single layer, any function you could think of. And you could think of driving as a function. It takes as its input the world outside and produces as output the control of the vehicle. There exists a neural network out there that can drive perfectly. It's a fascinating mathematical fact.
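The perceptron loop described above (initialize the weights, compute the output, adjust the weights on a mismatch, repeat until there are no more mistakes) can be sketched roughly as follows. This is a minimal illustration, not the course notebook: the toy "blue" and "yellow" points, the function names, the learning rate, and the hard step activation (used here in place of a smooth zero-to-one squashing function) are all assumptions made for the example.

```python
# Minimal perceptron sketch. All names and data here are illustrative assumptions.

def step(z):
    """Hard threshold activation: 1 if the weighted sum is non-negative, else 0."""
    return 1 if z >= 0 else 0

def predict(weights, bias, x):
    """Weight each input, sum them, add the bias, and apply the activation."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return step(z)

def train_perceptron(samples, labels, lr=1.0, max_epochs=1000):
    """Repeat over the data until an entire pass produces no mistakes."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)  # 0 when output matches ground truth
            if error != 0:
                # Nudge the weights and bias toward the expected output.
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias

# Hypothetical, linearly separable toy data: "blue" dots (0) and "yellow" dots (1).
blue = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)]
yellow = [(4.0, 4.5), (5.0, 4.0), (4.5, 5.0)]
samples = blue + yellow
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(weights, bias, predictions)
```

Because the two clusters can be separated by a straight line, the loop stops once a full pass makes no mistakes; on data that is not linearly separable it would simply run until `max_epochs`.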
So we can think of these functions then as special purpose functions, special purpose intelligence. You can take, say, as input the number of bedrooms, the square feet, the type of neighborhood. Those are the three inputs. It passes those values through to the hidden layer. And then, one more step, it produces the final price estimate for the house or for the residence. And we can teach a network to do this pretty well in a supervised way. This is supervised learning. You provide a lot of examples where you know the number of bedrooms, the square feet, the type of neighborhood, and then you also know the final price of the house or the residence. And then you can, as I'll talk about, through a process of backpropagation, teach these networks to make this prediction pretty well. Now, some of the exciting breakthroughs recently have been in general purpose intelligence. This is from Andrej Karpathy, who is now at OpenAI. I would like to take a moment here to try to explain how amazing this is. This is a game of "pong". If you're not familiar with "pong", there are two paddles and you're trying to bounce the ball back in such a way that prevents the other guy from bouncing the ball back at you. The artificial intelligence agent is on the right in green, and up top is the score, 8-1. Now, this network takes about three days to train on a regular computer. What is this network doing? It's called the Policy Network. The input is the raw pixels. They're slightly preprocessed, and you also take the difference between two frames, but it's basically the raw pixel information. That's the input. There are a few hidden layers, and the output is a single probability of moving up. That's it. That's the whole system, and what it's doing is, it learns. You don't know at any one moment what the right thing to do is. Is it to move up? Is it to move down? You only know what the right thing to do is by the fact that eventually you win or lose the game.
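As a rough sketch of the policy network just described (difference of two frames in, hidden units in the middle, a single probability of moving up out), the forward pass might look like the code below. Everything concrete here is an assumption for illustration: the 16-pixel "frames", the single 8-unit hidden layer, the ReLU and sigmoid choices, the random untrained weights, and the use of sampling to pick the action. It shows only the shape of the computation, not the trained system from the demo.

```python
import math
import random

def relu(values):
    """Zero out negative activations in the hidden layer."""
    return [max(0.0, v) for v in values]

def sigmoid(z):
    """Squash a real number into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def policy_forward(diff_frame, w_hidden, w_out):
    """Map a difference frame to the probability of moving the paddle up."""
    hidden = relu([sum(w * p for w, p in zip(row, diff_frame)) for row in w_hidden])
    logit = sum(w * h for w, h in zip(w_out, hidden))
    return sigmoid(logit)

rng = random.Random(0)          # fixed seed so the sketch is repeatable
n_pixels, n_hidden = 16, 8      # hypothetical sizes; real frames are far larger

# Stand-ins for two consecutive, already preprocessed frames.
previous_frame = [rng.random() for _ in range(n_pixels)]
current_frame = [rng.random() for _ in range(n_pixels)]
diff_frame = [c - p for c, p in zip(current_frame, previous_frame)]

# Untrained random weights; training would adjust these from wins and losses.
w_hidden = [[rng.gauss(0.0, 0.1) for _ in range(n_pixels)] for _ in range(n_hidden)]
w_out = [rng.gauss(0.0, 0.1) for _ in range(n_hidden)]

p_up = policy_forward(diff_frame, w_hidden, w_out)
action = "up" if rng.random() < p_up else "down"   # sample an action from the policy
print(p_up, action)
```

Nothing in this forward pass says whether "up" was the right move; that signal only arrives at the end of a game, which is what makes the learning problem hard.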
So this is the amazing thing here: there's no supervised learning. There's no universal fact about any one state being good or bad, or about any one action being good or bad in a state. Instead, you punish or reward every single action you took, every single action you took, for an entire game, based on the result. So no matter what you did, if you won the game, the end justifies the means. If you won the game, every action you took, every state-action pair, gets rewarded. If you lost the game, it gets punished. And with this process, with only two hundred thousand games where the system just simulates the games, it can learn to beat the computer. This system knows nothing about "pong", nothing about games. This is general intelligence, except for the fact that it's just the game of "pong". And I will talk about how this can be extended further, why this is so promising, and why we should proceed with caution. So again, there's a set of actions you take, up, down, up, down, based on the output of the network. There's a threshold: given the probability of moving up, you move up or down based on the output of the network. And you have a set of states, and every single state-action pair is rewarded if there's a win and punished if there's a loss. When you go home, think about how amazing that is, and if you don't understand why that's amazing, spend some time on it. It's incredible. (Inaudible question from one of the attendees) Sure, sure thing. The question was: "What is supervised learning? What is unsupervised learning? What's the difference?" So when people talk about machine learning, they mean supervised learning most of the time. Supervised learning is learning from data, learning from example, when you have a set of inputs and a set of outputs that you know are correct, which is called ground truth.
So you need those examples, a large amount of them, to train any of the machine learning algorithms to learn and then generalize to future examples. Actually, there's a third one, called reinforcement learning, where the ground truth is sparse. The information about when something is good or not, the ground truth, only happens every once in a while, at the end of the game, not every single frame. And unsupervised learning is when you have no information about the outputs, about whether they are correct or incorrect. And the excitement of the deep learning community is unsupervised learning, but it has achieved no major breakthroughs at this point. I'll talk about what the future of deep learning is, and a lot of the people that are working in the field are excited by it. But right now, any interesting accomplishment has to do with supervised learning. (Partially inaudible question from one of the attendees) And the wrong one is just has the [00:33:29] (Inaudible) solution like looking at the philosophy. So basically, the reinforcement learning here is learning from somebody who has certain hopes and how can that be guaranteed that it would generalize to somebody else? So the question was this: the green paddle learns to play this game successfully against this one specific brown paddle operating under specific kinds of rules. How do we know it can generalize to other games, other things? It can't. But the mechanism by which it learns generalizes. So as long as you let it play long enough in whatever world you want it to succeed in, it will use the same approach to learn to succeed in that world. The problem is that this works for worlds you can simulate well. Unfortunately, one of the big challenges of neural networks is that they're not currently efficient learners. They need a lot of data to learn anything. Human beings oftentimes need one example, and they learn very efficiently from that one example.
And again, I'll talk about that as well; it's a good question. So, the drawbacks of neural networks. If you think about the way a human being would approach this game, this game of "pong", they would only need a simple set of instructions. You're in control of a paddle and you can move it up and down. And your task is to bounce the ball past the other player, controlled by AI. Now, the human being may not win the game, but they would immediately understand the game and would be able to play it well enough to pretty quickly learn to beat the game. But they would need to have a concept of control, of what it means to control a paddle. They need to have a concept of a paddle, a concept of moving up and down, of a ball and bouncing. They have to have at least a loose concept of real-world physics that they can then project onto the two-dimensional world. All of these concepts are concepts that you come to the table with. That's knowledge. And the way you transfer that knowledge from your previous experience, from childhood to now, when you come to this game, that is something called reasoning. Whatever reasoning means. And the question is whether, through this same kind of process, you can see the entire world as a game of "pong", and reasoning is simply the ability to simulate that game in your mind and learn very efficiently, much more efficiently than in 200,000 iterations. The other challenge of deep neural networks, and machine learning broadly, is that you need big data; they're inefficient learners, as I said. And that data also needs to be supervised data. You need to have ground truth, which is very costly to annotate. A human being looking at a particular image, for example, and labeling it as a cat or a dog, whatever object is in the image, that's very costly. And particularly for neural networks there are a lot of parameters to tune. There are a lot of hyper-parameters.
You need to figure out the network structure first. How does this network look? How many layers? How many hidden nodes? What type of activation function for each node? There are a lot of hyper-parameters there, and then once you've built your network, there are parameters for how you teach that network. There's learning rate, loss function, mini-batch size, number of training iterations, gradient update smoothing, and even selecting the optimizer with which you solve the various differential equations involved. It's a topic of many research papers, certainly it's rich enough for research papers, but it's also really challenging. It means you can't just plop a network down and expect it to solve the problem generally. And defining a good loss function, or in the case of "pong" or games, a good reward function, is difficult. So here's a game. This is a recent result from OpenAI, teaching a network to play the game of Coast Runners. And the goal of Coast Runners is this: you're in a boat, and the task is to go around the track and successfully complete a race against the other people you're racing against. Now, this network is an optimal one. And what it has figured out is that, in the game, it actually gets a lot of points for collecting certain objects along the path. So you see it's figured out to go in a circle and collect those green turbo things. And what it has figured out is that you don't need to complete the game to earn the reward. And despite being on fire and hitting the wall and going through this whole process, it has actually achieved at least a local optimum, given the reward function of maximizing the number of points. And so it's figured out a way to earn a higher reward while ignoring the implied bigger-picture goal of finishing the race, which we as humans understand much better. This raises, for self-driving cars, ethical questions. Besides other quick questions.
(CHUCKLING) We could watch this for hours, and it will do that for hours, and that's the point: it's hard to teach, it's hard to encode, the formally defined utility function under which an intelligent system needs to operate. And that's made obvious even in a simple game. And so what is - Yup, question. (Inaudible question from one of the attendees) So the question was: "What's an example of a local optimum for an autonomous car, similar to the Coast Runners boat? What would be the example in the real world for an autonomous vehicle?" And it's a touchy subject. But it would certainly have to involve the choices we make under near crashes and crashes. The choices a car makes about what to avoid. For example, if there's a crash imminent and there's no way you can stop to prevent the crash, do you keep the driver safe or do you keep the other people safe? And there has to be some reward function. Even if you don't choose to acknowledge it, even if it's only in the data and the learning that you do, there's an implied reward function there. And we need to be aware of what that reward function is, because the system may find something. Until we actually see it, we won't know it. Once we see it, we realize that, oh, that was a bad design, and that's the scary thing. It's hard to know ahead of time what that is. So the recent breakthroughs from deep learning came from several factors. First is the compute, Moore's Law. CPUs are getting faster, a hundred times faster, every decade. Then there are GPUs. The ability to train neural networks on GPUs, and now ASICs, has created a lot of capabilities in terms of energy efficiency and being able to train larger networks more efficiently. Then, in the 21st century, there's digitized data. There are larger data sets of digital data, and now that data is becoming more organized. It's not just vaguely available data out there on the internet; it's actual organized data sets like ImageNet. Certainly for natural languages there are large data sets.
There are the algorithm innovations: backprop (backpropagation), convolutional neural networks, LSTMs. All these different architectures for dealing with specific types of domains and tasks. And the huge one is infrastructure, on the software and the hardware side. There's Git, the ability to share software in an open-source way. There are pieces of software that make robotics and machine learning easier: ROS, TensorFlow. There is Amazon Mechanical Turk, which allows for efficient, cheap annotation of large-scale data sets. There's AWS and cloud hosting, machine learning hosting of the data and the compute. And then there's the financial backing of large companies: Google, Facebook, Amazon. But really nothing has changed. There really have not been any significant breakthroughs. Convolutional networks have been around since the 90s; neural networks have been around since the 60s. There have been a few improvements in terms of methodology, but the compute has really been the workhorse. The ability to do the hundredfold improvement every decade holds promise, and the question is whether, for that reasoning thing I talked about, all you need is a larger network. That is the open question. Some terms for deep learning. First of all, deep learning is a PR term for neural networks. It is a term for utilizing deep neural networks, for neural networks that have many layers. It is a symbolic term for the newly gained capabilities that compute has brought us, that training on GPUs has brought us. So deep learning is a subset of machine learning. There are many other methods that are still effective. The terms that will come up in this class are, first of all, multilayer perceptron (MLP), deep neural networks (DNN), recurrent neural networks (RNN), long short-term memory (LSTM) networks, convolutional neural networks (CNN or ConvNet), and deep belief networks. And the operations that will come up are convolution, pooling, activation functions, and backpropagation.
Yes, you've got a question? (Inaudible question from one of the attendees) So the question was: what is the purpose of the different layers in a neural network? What is the need for one configuration versus another? So with a neural network having several layers, the only thing you have an understanding of is the inputs and the outputs. You don't have a good understanding of what each of these layers does. They are mysterious things, neural networks. So I'll talk about how every layer forms a higher-level, a higher-order, representation of the input. So it's not like the first layer does localization, the second layer does path planning, the third layer does navigation, how you get from here to Florida. Or maybe it does, but we don't know. So now we're beginning to visualize neural networks for simple tasks, like, for ImageNet, classifying cats versus dogs. We can tell what the first layer does, the second layer, the third layer, and we look at that. But for driving, where as the input you provide just the images and as the output the steering, it's still unclear what the network learns, partially because we don't have neural networks that drive successfully yet. (Points to a member of the class) (Inaudible question) So the question was: does a neural network generate layers over time, does it grow? That's one of the challenges: a neural network is pre-defined. The architecture, the number of nodes, the number of layers. That's all fixed. Unlike the human brain, where the neurons die and are born all the time, a neural network is pre-specified, and that's it. That's all you get, and if you want to change that, you have to change it and then retrain everything. So it's fixed. So what I encourage you to do is proceed with caution, because there's this feeling when you first teach a network, with very little effort, how to do some amazing task, like classify a face versus a non-face, or your face versus other faces, or cats versus dogs. It's an incredible feeling.
And then there's definitely this feeling that "I'm an expert", but what you realize is we don't actually understand how it works. And getting it to perform well on more general tasks, on larger scale data sets, on more useful applications, requires a lot of hyper-parameter tuning. Figuring out how to tweak little things here and there, and still, in the end, you don't understand why it works so damn well. So deep learning, these deep neural network architectures, is representation learning. This is the difference from traditional machine learning methods. For example, take the task where an image is the input. The input to the network here is on the bottom, the output is up on top, and the input is a single image of a person in this case. And so the input, specifically, is all the pixels in that image. RGB, the different colors of the pixels in the image. And what a network does is build a multi-resolutional representation of this data. The first layer learns the concept of edges, for example. The second layer starts to learn compositions of those edges: corners, contours. Then it starts to learn about object parts. And finally, it actually provides a label for the entities that are in the input. And this is the difference from traditional machine learning methods, where concepts like edges and corners and contours are manually pre-specified by human beings, human experts, for that particular domain. And representation matters, because finding a line in the Cartesian coordinates of this particular data set, where you want to design a machine learning system that tells the difference between green triangles and blue circles, is difficult. There is no line that separates them cleanly. And if you were to ask a human being, a human expert in the field, to try to draw that line, they would probably do a Ph.D. on it and still not succeed.
But a neural network can automatically figure out how to remap that input into polar coordinates, where the representation is such that it's an easily, linearly separable data set. And so, deep learning is a subset of representation learning, which is a subset of machine learning, which is a key subset of artificial intelligence. Now, this matters because of its ability to compute an arbitrary number of features that are at the core of the representation. So if you are trying to detect a cat in an image, you're not specifying 215 specific features of cat ears and whiskers and so on, as a human expert would. You allow a neural network to discover tens of thousands of such features. Maybe for cats you are an expert, but for a lot of objects you may never be able to provide sufficient features for successfully identifying the object. And so, this kind of representation learning is, one, easy, in the sense that all you have to provide is inputs and outputs. All you need to provide is a data set that you care about without [00:53:39] features. And two, because of its ability to construct arbitrarily sized representations, deep neural networks are hungry for data. The more data we give them, the more they are able to learn about this particular data set. So let's look at some applications. First, some cool things that deep neural networks have been able to accomplish up to this point. Let me go through them. First, the basic one. AlexNet is for ImageNet. ImageNet is a famous data set and a competition of classification and localization, where the task is, given an image, to identify the five most likely things in that image and the single most likely one, and you have to do so correctly. So on the right, there's an image of a leopard and you have to correctly classify that it is in fact a leopard. So they're able to do this pretty well: given a specific image, determine that it's a leopard.
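The polar-coordinate remapping described above can be sketched in a few lines. This is only an illustration under stated assumptions: the data set is synthetic (an inner disc and an outer ring, standing in for the two classes on the slide), and the remapping is written by hand, whereas the point of the lecture is that a network learns such a representation on its own. The names `make_ring_data`, `to_polar` and `classify_by_radius` are made up for this sketch.

```python
import math
import random


def make_ring_data(n=200, seed=0):
    """Synthetic data (an assumption): class 0 lies inside radius 1,
    class 1 lies in a ring between radius 2 and 3. No straight line in
    (x, y) separates the two classes."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        label = rng.randint(0, 1)
        radius = rng.uniform(0.0, 1.0) if label == 0 else rng.uniform(2.0, 3.0)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        points.append((radius * math.cos(angle), radius * math.sin(angle), label))
    return points


def to_polar(x, y):
    """Remap Cartesian (x, y) to polar (radius, angle)."""
    return math.hypot(x, y), math.atan2(y, x)


def classify_by_radius(x, y, threshold=1.5):
    """In polar coordinates a single threshold on the radius, which is a
    straight line in the (radius, angle) plane, separates the classes."""
    radius, _ = to_polar(x, y)
    return 1 if radius > threshold else 0


data = make_ring_data()
accuracy = sum(classify_by_radius(x, y) == label for x, y, label in data) / len(data)
print(accuracy)
```

In the original coordinates the boundary between the classes is a circle; after the remap it is the line `radius = 1.5`, which is what "linearly separable" means here.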
What's shown here on the x-axis is years, and on the y-axis is error in classification. So starting from 2012 on the left with AlexNet, the error has decreased from 16%, and 40% before then with traditional methods, to under 4% today. So human level performance: if I were to give you pictures of leopards, for about 4% of those pictures you would not say it's a leopard. That's human level performance. So for the first time, in 2015, convolutional neural networks outperformed human beings. That in itself is incredible. That is something that seemed impossible. And now, because it's done, it's not as impressive. But I just want to get to why this is so impressive, because computer vision is hard. Now we as human beings have evolved visual perception over millions of years, hundreds of millions of years. So we take it for granted, but computer vision is really hard, visual perception is really hard. There's illumination variability. So it's the same object. The only way we can tell anything is from the shade, the reflection of light from that surface. It could be the same object with, in terms of pixels, drastically different looking shapes, and we still know it's the same object. There is pose variability and occlusion. Probably my favorite caption for a figure in an academic paper is "deformable and truncated cat". These are pictures, you know, cats are famously deformable. They can take a lot of different shapes. (LAUGHTER) Arbitrary poses are possible, so you need computer vision to know it's still the same object, still the same class of objects, given all the variability in the pose. And occlusion is a huge problem. We still know it's an object. We still know it's a cat even when parts of it are not visible. And sometimes large parts of it are not visible. And then there's all the intra-class variability. Intra-class: all of these on the top two rows are cats. Many of them look drastically different.
And the bottom two rows are dogs, which also look drastically different. And yet some of the dogs look like cats, some of the cats look like dogs. And we as human beings are pretty good at telling the difference, and we want computer vision to do better than that. It's hard. So how is this done? This is done with convolutional neural networks, the input to which is a raw image. Here's an input on the left, the number three, and, as I'll talk about, that image is processed, passed through convolutional layers that maintain spatial information. The output, in this case, predicts which number is shown in the image: 0, 1, 2 through 9. And so, these networks, everybody's using the same kind of network to determine exactly that. Input is an image, output is a number. And in the leopard case, the output is the probability that it is a leopard. What is that number? Then there is segmentation built on top of these convolutional neural networks, where you chop off the end and convolutionalize the network. You chop off the end so that the output is a heat map. So instead of a detector for a cat, you can do a cat heat map, where the neurons in that output get excited, spatially, in the parts of the image that contain a tabby cat. And this kind of process can be used to segment the image into different objects, a horse. So the original input on the left is a woman on a horse, and the output is a fully segmented image, knowing where the woman is and where the horse is. And this kind of process can be used for object detection, which is the task of detecting an object in an image. Now the traditional method with convolutional neural networks, and in computer vision in general, is the sliding window approach. You have a detector, like the leopard detector, and you slide it through the image to find where in that image the leopard is.
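The sliding window approach just described can be sketched as follows. This is a toy under stated assumptions: the "image" is a small grid of numbers, and the detector is a stand-in `brightness_score` function rather than a trained network such as the leopard detector; both names, and `sliding_window_detect`, are invented for this sketch.

```python
def sliding_window_detect(image, window, score_fn):
    """Slide a window x window patch over every position of the image and
    return (best_score, top, left) for the highest-scoring position.
    score_fn stands in for a real detector."""
    rows, cols = len(image), len(image[0])
    best = None
    for top in range(rows - window + 1):
        for left in range(cols - window + 1):
            patch = [row[left:left + window] for row in image[top:top + window]]
            score = score_fn(patch)
            if best is None or score > best[0]:
                best = (score, top, left)
    return best


def brightness_score(patch):
    """Hypothetical detector: scores a patch by its total brightness."""
    return sum(sum(row) for row in patch)


# A 5x5 toy image with one bright 2x2 "object".
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 9, 9, 0],
    [0, 0, 9, 9, 0],
    [0, 0, 0, 0, 0],
]

best_score, best_top, best_left = sliding_window_detect(image, 2, brightness_score)
print(best_score, best_top, best_left)
```

The detector runs once per window position, here 16 times for a 5x5 image and a 2x2 window, and that count grows quickly with image size and with the number of window sizes tried. That cost is the motivation for approaches that propose only a few candidate regions.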
The segmenting approach, the R-CNN approach, efficiently segments the image in such a way that it can propose different parts of the image that are likely to have a leopard, or in this case a cowboy, and that drastically reduces the computational requirements of the object detection task. And so these networks, currently one of the best networks for the ImageNet localization task is deep residual networks. They're deep. So VGG-19 is one of the famous ones. You start to get above twenty layers in many cases; thirty-four layers is the ResNet one. So the lesson there is, the deeper you go, the more representation power you have, the higher the accuracy, but you need more data. Other applications: colorization of images. So this again, the input is a single image and the output is a single image. So you can take a black and white video from an old film and recolor it. And all you need to do to train that network in a supervised way is take modern films and convert them to grayscale. So now you have arbitrarily sized data sets of grayscale to color. And you're able, with very little effort on top of that, to successfully, well, somewhat successfully, recolor images. Again, Google Translate does image translation in this way, image to image. It first perceives, here in German I believe, correct me if I'm wrong, "dark chocolate" written in German on a box. So it can take this image, detect the different letters, convert them to text, translate the text, and then, using the image-to-image mapping, map the translated letters back onto the box, and you can do this in real time on video. So what we've talked about up to this point, on the left, are "vanilla" neural networks, convolutional neural networks, that map a single input to a single output, a single image to a number, a single image to another image. Then there are recurrent neural networks.
This is the more general formulation: they map a sequence of images, or a sequence of words, or a sequence of any kind, to another sequence. And these networks are able to do incredible things with natural language, with video, and with any type of series of data. For example, you can convert text to handwritten text. Here, and you can do this online, you type in "deep learning for self-driving cars" and it will use an arbitrary handwriting style to generate the words "deep learning for self-driving cars". This is done using recurrent neural networks. We can also take char-RNNs, as they're called, character-level recurrent neural networks, that train on an arbitrary text data set and learn to generate text one character at a time. So there is no preconceived syntactic or semantic structure that's provided to the network. It learns that structure. So for example, you can train it on Wikipedia articles, like in this case. And it's able to successfully generate not only text that makes some kind of grammatical sense, at least, but also keep perfect syntactic structure for Wikipedia, for Markdown editing, for LaTeX editing and so on. This text reads "naturalism and decision for the majority of Arab countries capitalide." Whatever that means. "Was grounded by the Irish language by John Clare," and so on. These are sentences. If you didn't know better, that might sound correct. And it does so, let me pause on this, one character at a time, so these aren't words being generated. This is one character at a time: you start with the beginning three letters "nat" and you generate "u", completely without knowing the word naturalism. This is incredible. You can do this to start a sentence and let the neural network complete that sentence. So for example, if you start the sentence with "life is", or "life is about" actually, it will complete it with a lot of fun things. "The weather." "Life is about kids."
"Life is about the true love of Mr Mom", "is about the truth now." And this is from [01:05:59], the last two: if you start with "the meaning of life," it can complete that with "the meaning of life is literary recognition", which may be true for some of us here. Publish or perish. And "the meaning of life is the tradition of ancient human reproduction." (LAUGHTER) Also true for some of us here, I'm sure. Okay, so what else can you do? What has been very exciting recently is image caption recognition. No, generation, I'm sorry. Image caption generation is important for large data sets of images, where we want to be able to determine what's going on inside those images. Especially for search: if you want to find a man sitting on a couch with a dog, you type it into Google and it's able to find that. So shown here in black text, "a man sitting on a couch with a dog" is generated by the system. "A man sitting in a chair with a dog in his lap" is generated by a human observer. And again, these annotations are done by detecting the different obstacles, the different objects in the scene. So segmenting the scene, detecting: on the right there's a woman, a crowd, a cat, a camera, holding, purple. All of these words are being detected, then a syntactically correct sentence is generated, a lot of them, and then you rank which sentence is the most likely. And in this way you can generate very accurate labeling of the images, captions for the images. And you can do the same kind of process for image question answering. You can ask about quantity: how many chairs are there? You can ask about location: where are the ripe bananas? You can ask about the type of object: what is the object in the chair? It's a pillow. And these are, again, using recurrent neural networks. You can do the same thing with video caption generation, video description generation. So looking at a sequence of images as opposed to just a single image.
What is the action going on in this situation? This is a difficult task. There's a lot of work in this area. On the left are correct descriptions: "a man is doing stunts on his bike" or "a herd of zebras are walking in the field". And on the right, there's "a small bus running into a building". You know, it's talking about relevant entities but just giving an incorrect description. "A man is cutting a piece of a pair of a paper." So the words are perhaps correct, so it's close, but the description is still wrong. One of the interesting things you can do with recurrent neural networks: if you think about the way we human beings look at images, we only have a small fovea with which we focus in a scene. So right now your periphery is very distorted. If you're looking at the slides, or you're looking at me, that's the only thing that's in focus. The majority of everything else is out of focus. So we can use the same kind of concept to try to teach a neural network to steer around the image, both for perception and for generation of those images. This is important, first, from the general artificial intelligence point of view, it being just fascinating that we can selectively steer our attention, but it's also important for things like drones. They have to fly at high speeds in an environment where, at three hundred plus frames a second, you have to make decisions. So you can't possibly localize yourself or perceive the world around you successfully if you have to interpret the entire scene. So what you can do is steer. For example, shown here is reading a house number by steering around an image. You can do the same task for reading and for writing. So for this data set on the left, it's reading numbers. We can also selectively steer a network around an image to generate that image, starting with a blurred image first and then getting to higher and higher resolution as the steering goes on. Work here at MIT is able to map video to audio.
So you hit stuff with a drumstick in a silent video, and it's able to generate the sound that a drumstick hitting that particular object would make. So you can get texture information from that impact. So here is a video of a human soccer player playing soccer and a state-of-the-art machine playing soccer. And, well, let me give it some time to build up. (LAUGHTER) Okay. So soccer, we take this for granted, but walking is hard. Object manipulation is hard. Soccer is harder than chess, much harder. On your phone now, you can have a chess engine that beats the best players in the world. And you have to internalize that, because the question is, this is a painful video, the question is: where does driving fall? Is it closer to chess or is it closer to soccer? For those incredible, brilliant engineers that worked on the most recent DARPA challenge, this would be a very painful video to watch, I apologize. This is a video from the DARPA Challenge (LAUGHTER) of robots struggling with basic object manipulation and walking tasks. So it's mostly a fully autonomous navigation task. (LAUGHTER) Maybe I'll just let this play for a few moments to let you internalize how difficult this task is, of balancing, of planning in an underactuated way. We don't have full control of everything. There is a delta between your perception of what you think the world is and what reality is. So there, a robot was trying to turn an object that wasn't there. And this is an MIT entry that actually, I believe, got points for this, because it got into that area. (LAUGHTER) One of the things the robot had to do is get into a car, drive it, and get out of the car. And there were a few other manipulation tasks, like walking on unsteady ground, and it had to drill a hole through a wall. All these tasks, and what a lot of teams said is that the hardest part, the hardest task of all of them, is getting out of the car.
So it's not getting into the car; it's this very task you saw just now, the robot getting out of the car. These are things we take for granted. So in our evaluation of what is difficult about driving, we have to remember that some of those things we may take for granted, in the same kind of way that we take walking for granted. This is Moravec's paradox. Hans Moravec from CMU, let me just quickly read that quote: "Encoded in the large highly evolved sensory motor portions of the human brain is billions of years of experience about the nature of the world and how to survive in it." So this is data. This is big data. Billions of years. And abstract thought, which is reasoning, the stuff we think is intelligence, is perhaps less than one hundred thousand years old. We haven't yet mastered it, and so, I'm sorry, I'm inserting my own statements in the middle of a quote, but it's only very recently that we've learned how to think. And so we respect it perhaps more than the things we take for granted, like walking, visual perception and so on, but those may be strictly a matter of data, of data and training time and network size. So walking is hard. The question is, how hard is driving? And that's an important question because the margin of error is small. There's 1 fatality per 100 million miles. That's the rate at which people die in car crashes, 1 fatality per 100 million miles. That's a 0.000001% margin of error. Through all the time you spend on the road, that is the error you get. We're impressed with ImageNet being able to classify a leopard, a cat or a dog at above human level performance, but this is the margin of error we get with driving. And we have to be able to deal with snow, with heavy rain, with big open parking lots, with parking garages, with pedestrians that behave irresponsibly, as rarely as that happens, or just unpredictably, again especially in Boston, and with reflections.
And especially the things you don't think about: the lighting variations that blind the cameras. (Inaudible question from one of the attendees) The question was whether that number changes if you look at just crashes, the fatalities per crash. So one of the big things is that cars have gotten really good at crashing and not hurting anybody. So the number of crashes is much, much larger than the number of fatalities, which is a great thing, we've built safer cars. But still, you know, even one fatality is too many. So the Google self-driving car team is quite open about their performance since hitting public roads. This is from a report that shows the number of disengagements, the times the car gives up control, where it asks the driver to take control back, or the driver takes control back by force, meaning that they're unhappy with the decision that the car was making, or it was putting the car or pedestrians or other cars in unsafe situations. And so, over time, from 2014 to 2015, there's been a total of 341 times on beautiful San Francisco roads, and I say that seriously because the weather conditions are great there, 341 times that the driver had to elect to take control back. So it's a work in progress. And let me give you something to think about here. This, with neural networks, is a big open question: the question of robustness. So this is an amazing paper, I encourage people to read it. There are a couple of papers around this topic. "Deep neural networks are easily fooled." So here are 8 images where, if given as input to a convolutional neural network, the network says with higher than 99.6% confidence that the image, for example the top left, is a robin. Next to it is a cheetah, then an armadillo, a panda, an electric guitar, a baseball, a starfish, a king penguin. All of these things are obviously not in the images. So the networks can be fooled with noise.
More importantly, practically, for the real world, adding just a little bit of distortion, a little bit of noise, to the image can force the network to produce a totally wrong prediction. So here's an example. There are 3 columns: the correctly classified image, the slight distortion that's added, and the resulting prediction, which is an ostrich for all three images on the left and an ostrich for all three images on the right. This ability to fool networks easily brings up an important point. And that point is that there has been a lot of excitement about neural networks throughout their history. There's been a lot of excitement about artificial intelligence throughout its history, and not grounding that excitement in the reality, in the real challenges around it, has resulted in crashes, in A.I. winters, when funding dried up and people became hopeless about the possibilities of artificial intelligence. So here is the 1958 New York Times article about the perceptron. This is when the first perceptron that I talked about was implemented in hardware by Frank Rosenblatt. It took a 400 pixel image as input and it provided a single output. Weights were encoded in hardware potentiometers and were updated with electric motors. Now, the New York Times wrote: the Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence. Dr. Frank Rosenblatt, a research psychologist at the Cornell Aeronautical Laboratory in Buffalo, said perceptrons might be fired to the planets as mechanical space explorers. This might seem ridiculous, but this was the general opinion of the time. And as we know now, perceptrons cannot even separate data that is not linearly separable. They're just linear classifiers. And so this led to 2 major A.I. winters, in the 70s and in the late 80s and early 90s.
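The remark that perceptrons are just linear classifiers can be checked with a small experiment. This is a minimal sketch, assuming the textbook perceptron learning rule with a step activation; the function names and the AND/XOR truth tables are illustrative choices for this sketch and are not taken from the lecture or from Rosenblatt's hardware.

```python
def predict(weights, bias, inputs):
    """Step activation on a weighted sum: a linear decision boundary."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=100, learning_rate=1.0):
    """Classic perceptron rule: nudge the weights toward each misclassified
    sample. Stops early once an epoch makes no mistakes."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        mistakes = 0
        for inputs, target in samples:
            delta = target - predict(weights, bias, inputs)
            if delta != 0:
                weights = [w + learning_rate * delta * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * delta
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def accuracy(samples, weights, bias):
    hits = sum(predict(weights, bias, inputs) == target for inputs, target in samples)
    return hits / len(samples)


# AND is linearly separable; XOR is not.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_accuracy = accuracy(AND, *train_perceptron(AND))
xor_accuracy = accuracy(XOR, *train_perceptron(XOR))
print(and_accuracy, xor_accuracy)
```

The single unit learns AND perfectly, but no setting of two weights and a bias gets all four XOR cases right, so its XOR accuracy can never exceed 3 out of 4 however long it trains.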
The Lighthill Report, in 1973, by the UK government, said that in no part of the field have the discoveries made so far produced the major impact that was promised. So if the hype builds beyond the capabilities of our research, reports like this will come, and they have the possibility of creating another A.I. winter. So I want to pair the optimism, some of the cool things we'll talk about in this class, with the reality of the challenges ahead of us. The focus of the research community, and this is from some of the key players in deep learning: what are the things that are next for deep learning, the five year vision? We want to run on smaller, cheaper mobile devices. We want to explore more in the space of unsupervised learning, as I mentioned, and reinforcement learning. We want to do things that explore the space of videos more, with recurrent neural networks, like being able to summarize videos or generate short videos. One of the big efforts, especially in the companies that deal in large data, is multi-modal learning. Learning from multiple data sets, with multiple sources of data. And lastly, making money from these technologies. Despite the excitement, there has been an inability, for the most part, to make serious money from some of the more interesting parts of deep learning. And I got made fun of by the TAs for including this slide, because it's shown in so many sort of business type lectures, but it is true that we're at the peak of a hype cycle, and we have to make sure, given the large amount of hype and excitement there is, that we proceed with caution. One example of that, let me mention, is that we already talked about spoofing the cameras. Spoofing the cameras with a little bit of noise. So if you think about it, self-driving vehicles operate with a set of sensors and they rely on those sensors to accurately capture that information.
And what happens not only when the world itself produces noisy visual information, but when somebody actually tries to spoof that data? One of the fascinating things that has been done recently is spoofing of LIDAR. So LIDAR is a range sensor that gives a 3D point cloud of the objects in the external environment. And you're able to successfully do a replay attack, where you have the car see people and other cars around it when there's actually nothing around it. In the same way that you can spoof a camera, a neural network, to see things that are not there. So let me run through some of the libraries that we'll work with, and that are out there, that you may work with if you proceed with deep learning. TensorFlow, that is the most popular one these days. It's heavily backed and developed by Google. It's primarily a Python interface and is very good at operating on multiple GPUs. There's Keras, and also TF Learn and TF Slim, which are libraries that operate on top of TensorFlow and offer slightly easier, slightly more user friendly interfaces to get up and running. Torch, if you're interested in getting in at the lower level, tweaking the different parameters of neural networks, creating your own architectures. Torch is excellent for that, with its own Lua interface. Lua's a programming language. And it's heavily backed by Facebook. There is the old school Theano, which is what I started on, and what a lot of people early on in deep learning started on, as one of the first libraries that came with GPU support. It definitely encourages lower level tinkering and has a Python interface. And many of these, if not all, rely on Nvidia's library for doing some of the low level computations involved with training these neural networks on Nvidia GPUs. MXNet is heavily supported by Amazon, and they have recently officially announced that AWS is going to be all in on MXNet.
Neon, from Nervana, which was recently bought by Intel and started out as a manufacturer of neural network chips, which is really exciting, and it performs exceptionally well. I hear good things. Caffe, started at Berkeley, was also very popular at Google before TensorFlow came out. It's primarily designed for computer vision with ConvNets but has now expanded to all other domains. There is CNTK, as it used to be known, now called the Microsoft Cognitive Toolkit. Nobody calls it that yet, as far as I'm aware. It has multi-GPU support and has its own BrainScript custom language as well as other interfaces. And what we'll get to play around with in this class is, amazingly, deep learning in the browser, right. Our favorite is ConvNetJS, which is what you'll use, built by Andrej Karpathy from Stanford, now at OpenAI. It's good for explaining the basic concepts of neural networks. It's fun to play around with. All you need is a browser, so very few requirements. It can't leverage GPUs, unfortunately. But for a lot of the things that we're doing, you don't need GPUs. You'll be able to train a network with very little and relatively efficiently without the [01:30:15] GPUs. It has full support for CNNs, RNNs and even deep reinforcement learning. Keras.js, which seems incredible, we tried to use for this class. It has GPU support, so it runs in the browser with GPU support, with OpenGL or however it works magically, but we're able to accomplish a lot of the things we need without the use of GPUs. It's incredible to live in a day and age when it literally, as I'll show in the tutorials, takes just a few minutes to get started with building your own neural network that classifies images, and a lot of these libraries are friendly in that way. So all the references mentioned in this presentation are available at this link, and the slides are available there as well. So I think, in the interest of time, let me wrap up.
Thank you so much for coming in today and tomorrow I'll explain the deep reinforcement learning game and the actual competition and how you can win. Thanks very much guys.
MIT 6.S094: Introduction to Deep Learning and Self-Driving Cars. Posted by alex on January 14, 2021.