RUSS ALTMAN: It is now my pleasure to introduce my colleague and tireless co-organizer, Dr. Fei-Fei Li. She's an associate professor of Computer Science and Psychology. She's director of the AI Lab here at Stanford. And she's done breakthrough research in human vision, high-level visual recognition, and computational neuroscience, some of which I think we might hear about now-- Fei-Fei.

[APPLAUSE]

FEI-FEI LI: Good evening, everyone. It's quite an honor to be here. And I'm going to share with you some of my recent work in visual intelligence.

So we're going to begin 543 million years ago. Simple organisms lived in the vast ocean of this Precambrian age. And they floated around, waiting for food to come by or becoming someone else's food. So life was very simple, and so was the animal kingdom. There were only a few animal species around. And then something happened. The first trilobites started to develop eyes, and life just changed forever after that. Suddenly animals could go seek food. Prey had to run from predators. And the number of animal species just exploded in an exceedingly short period of time. Evolutionary biologists call this period the Cambrian Explosion, or the big bang of evolution, and attribute the development of vision as the main factor that caused this animal speciation.

So ever since then, vision has played a very important role for animals to survive, to seek food, to navigate, to manipulate, and so on. And the same is true for humans. We use vision to live, to work, to communicate, and to understand this world. In fact, after 540 million years of evolution, the visual system is the biggest sensory system in our brain. And more than half of the brain's neurons are involved in vision.

So while animals have seen the light of the world for 540 million years, our machines and computers are still very much in the dark age. We have security cameras everywhere, but they don't alert us when a child is drowning in a swimming pool.
Hundreds of hours of videos are uploaded every minute to the YouTube servers, yet we do not have the technology to tag and recognize their contents. We have drones flying over massive lands taking an enormous amount of imagery, but we do not have a method or algorithm to understand the landscape of the earth. So in short, as a society, we're still pretty much collectively blind, because our smartest machines and computers are blind.

So as computer vision scientists, we seek to develop artificial intelligence algorithms that can learn about the visual world and recognize the contents of images and videos. We've got a daunting task to shine light on our digital world, and we don't have 540 million years to do this.

So the first step towards this goal is to recognize objects, because they are the building blocks of our visual world. In the simplest terms, imagine the process of teaching computers to recognize objects by first showing them a few training images of a particular object-- let's say a cat. And then we design a mathematical model that can learn from these training images. How hard could this be? Humans do this effortlessly.

So that's what we tried at the beginning. In a straightforward way, we tried to express objects by designing their parts and the configurations of those parts, such as using simple geometric shapes to define a cat model. Well, there are lots of different cats. So for this one, we cannot use our original model. We have to make another model. Well, what about these cats? So now you get the idea. Even something as simple as a household pet can pose an infinite number of variations for us to model. And that's just one object. But this is what many of us were doing at that time. We kept designing and tuning our algorithms, waiting for that magical algorithm that could model all the variations of an object using just a few training images.
But about nine years ago, a very profound but simple observation changed my thinking. This is not how children learn. We don't tell kids how to see. They do it by experiencing the real world and real-life examples. If you consider a child's eyes as a pair of biological cameras, they take a picture every 200 milliseconds. So by age three, a child would have seen hundreds of millions of images. And that's the amount of data we're talking about to develop a vision system. So before we come up with a better algorithm, we should provide our computer algorithms the kind of data that children were experiencing in their developmental years.

And once we realized this, I knew what we needed to do. We needed to collect a data set that had far more images than we had ever used before in machine learning and computer vision-- thousands of times larger than the standard datasets being used at the time. So together with my colleague Professor Kai Li and student Jia Deng, we started the ImageNet project back in 2007. After three years of very hard work, by 2009 the ImageNet project delivered a database of 15 million images organized into 22,000 categories of objects and things, sorted by everyday English words. In quality and quantity, this was an unprecedented scale for the field of computer vision and machine learning. So more than ever, we were now poised to tackle the problem of object recognition using ImageNet. This is the first takeaway message I'm going to deliver today: much of learning is about big data. This is a child's perspective.

As it turned out, the wealth of information provided by ImageNet was a perfect match for a particular class of machine learning algorithms called the Convolutional Neural Network, pioneered by computer scientists Kunihiko Fukushima, Geoffrey Hinton, and Yann LeCun back in the 1970s and '80s.
Just like the brain consists of billions of neurons, a basic operating unit of the Convolutional Neural Network is a neuron-like node that gets input from other nodes and sends output to others. Moreover, hundreds of thousands of these neuron-like nodes are layered together in a hierarchical fashion, also similar to the brain. This is a typical Convolutional Neural Network model we use in our lab to train our object recognition algorithm. It consists of 24 million nodes, 140 million parameters, and 15 billion connections. With the massive data provided by ImageNet and modern computing hardware like CPUs and GPUs to train this humongous model, the Convolutional Neural Network algorithm blossomed in a way that no one had expected. It became the winning architecture for object recognition.

Here is what the computer tells us: the image contains a cat, and where the cat is. Here is a boy and his teddy bear. A dog on the beach with a person and a kite.

So far, what we have seen is teaching computers to recognize objects. This is like a young child learning to utter the first few nouns. It's a very impressive achievement, but it's only the beginning. Children soon hit another developmental milestone, and they begin to communicate in sentences and tell stories. So instead of saying--

CHILD 1: That's a cat sitting in a bed.

FEI-FEI LI: Right, this is a three-year-old telling us the story of the scene instead of just labeling it as a cat. Here's one more.
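[Editor's note: the neuron-like nodes and hierarchical layering described in the talk can be sketched in a few lines of code. This is a minimal toy illustration, not the 24-million-node model from the lab; the image size, kernel sizes, and random weights are illustrative assumptions.]

```python
import numpy as np

def conv2d(image, kernel):
    """Each output value is one neuron-like node: a weighted sum
    over a small patch of the nodes in the layer below."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]
            out[i, j] = np.sum(patch * kernel)
    return out

def relu(x):
    # Simple nonlinearity applied at each node's output
    return np.maximum(x, 0)

# Two stacked layers: nodes in the second layer take their input
# from the outputs of the first, forming a hierarchy, as in a
# Convolutional Neural Network (weights here are random, untrained).
image = np.random.rand(8, 8)      # a tiny 8x8 "image"
k1 = np.random.randn(3, 3)        # first-layer filter
k2 = np.random.randn(3, 3)        # second-layer filter
layer1 = relu(conv2d(image, k1))  # shape (6, 6)
layer2 = relu(conv2d(layer1, k2)) # shape (4, 4)
```

In a real model, each layer has many filters and the weights are learned from training data (such as ImageNet) rather than drawn at random.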