
  • RUSS ALTMAN: It is now my pleasure

  • to introduce my colleague and tireless co-organizer,

  • Dr. Fei-Fei Li.

  • She's an associate professor of Computer Science

  • and Psychology.

  • She's director of the AI Lab here at Stanford.

  • And she's done breakthrough research in human vision,

  • high level visual recognition, and computational neuroscience,

  • some of which I think we might hear about now-- Fei Fei.

  • [APPLAUSE]

  • FEI-FEI LI: Good evening, everyone.

  • It's quite an honor to be here.

  • And I'm going to share with you some

  • of my recent work in visual intelligence.

  • So we're going to begin 543 million years ago.

  • Simple organisms lived in the vast ocean of this Precambrian

  • age.

  • And they floated around waiting for food to come by

  • or becoming someone else's food.

  • So life was very simple and so was the animal kingdom.

  • There were only a few animal species around.

  • And then something happened.

  • The first trilobites started to develop eyes and life

  • just changed forever after that.

  • Suddenly, animals could go out and seek food.

  • Prey had to run from predators.

  • And the number of animal species just

  • exploded in an exceedingly short period of time.

  • Evolutionary biologists call this period the Cambrian

  • Explosion, or the big bang of evolution,

  • and attribute this burst of animal speciation

  • mainly to the development of vision.

  • So ever since then, vision has played a very important role

  • in animals' lives: to survive, to seek food, to navigate,

  • to manipulate, and so on.

  • And the same is true for humans.

  • We use vision to live, to work, to communicate,

  • and to understand this world.

  • In fact, after 540 million years of evolution,

  • the visual system is the biggest sensory system in our brain.

  • And more than half of the brain's neurons are involved in vision.

  • So while animals first saw the light of the world

  • 540 million years ago, our machines and computers

  • are still very much in the dark age.

  • We have security cameras everywhere

  • but they don't alert us when a child is

  • drowning in a swimming pool.

  • Hundreds of hours of video are uploaded every minute

  • to the YouTube servers, yet we do not

  • have the technology to tag and recognize their contents.

  • We have drones flying over massive areas of land,

  • taking an enormous amount of imagery,

  • but we do not have a method or algorithm

  • to understand the landscape of the earth.

  • So in short as a society, we're still pretty much

  • collectively blind because our smartest machines and computers

  • are blind.

  • So as computer vision scientists,

  • we seek to develop artificial intelligence algorithms

  • that can learn about the visual world and recognize

  • the contents of images and videos.

  • We've got a daunting task to shine light

  • on our digital world, and we don't have

  • 540 million years to do this.

  • So the first step towards this goal

  • is to recognize objects because they are the building

  • blocks of our visual world.

  • In the simplest terms, imagine the process

  • of teaching computers to recognize objects

  • by first showing them a few training

  • images of a particular object-- let's say a cat.

  • And then we design a mathematical model

  • that can learn from these training images.
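
In today's terms, "a mathematical model that can learn from these training images" usually means supervised learning: adjust the model's parameters until its predictions fit the labeled examples. Here is a minimal sketch in PyTorch; the tiny linear model and the random stand-in data are illustrative assumptions, not the approach the talk goes on to describe:

```python
import torch
import torch.nn as nn

# A toy "cat vs. not-cat" classifier; the linear model and random
# data are stand-ins purely for illustration.
model = nn.Linear(3 * 32 * 32, 2)       # flattened 32x32 RGB image -> 2 classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(8, 3 * 32 * 32)    # a few "training images"
labels = torch.randint(0, 2, (8,))      # 1 = cat, 0 = not-cat

for _ in range(100):                    # fit the parameters to the examples
    loss = loss_fn(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```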

  • How hard could this be?

  • Humans do this effortlessly.

  • So that's what we tried at the beginning.

  • In a straightforward way, we tried

  • to express objects by designing their parts

  • and the configurations of those parts--

  • such as using simple geometric shapes to define a cat model.

  • Well, there are lots of different cats.

  • So for this one, we cannot use our original model.

  • We have to design another one.

  • Well, what about these cats?

  • So now you get the idea.

  • Even something as simple as a household pet

  • can pose an infinite number of variations for us to model.

  • And that's just one object.

  • But this is what many of us were doing at that time.

  • We kept designing and tuning our algorithms,

  • waiting for that magical algorithm

  • to be able to model all the variations of an object using

  • just a few training images.

  • But about nine years ago, a very profound but simple

  • observation changed my thinking.

  • This is not how children learn.

  • We don't tell kids how to see.

  • They do it by experiencing the real world

  • and by experiencing real life examples.

  • If you consider a child's eyes as a pair

  • of biological cameras, they take a picture

  • every 200 milliseconds.

  • So by age three, a child would have seen

  • hundreds of millions of images.

  • And that's the amount of data we're talking about

  • to develop a vision system.
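
A quick back-of-the-envelope check of that figure (the 200-millisecond interval comes from the talk; the waking-hours assumption is mine):

```python
# One picture every 200 ms = 5 images per second.
images_per_second = 1 / 0.2

seconds_per_year = 365 * 24 * 60 * 60
waking_fraction = 0.5          # assume a child is awake about half of each day

images_by_age_three = images_per_second * 3 * seconds_per_year * waking_fraction
print(f"{images_by_age_three:,.0f}")   # ~236,520,000 -- hundreds of millions
```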

  • So before we come up with a better algorithm,

  • we should provide our computer algorithms with the kind

  • of data that children experience in their developmental years.

  • And once we realized this, I knew what we needed to do.

  • We need to collect a dataset that has far more images

  • than we have ever used before in machine learning and computer

  • vision-- thousands of times larger

  • than the standard dataset that was being used at the time.

  • So together with my colleague Professor Kai

  • Li and student Jia Deng, we started this ImageNet project

  • back in 2007.

  • After three years of very hard work,

  • by 2009 the ImageNet project delivered a database

  • of 15 million images across 22,000 categories

  • of objects and things, organized by everyday English words.

  • In quality and quantity, this was an unprecedented scale

  • for the field of computer vision and machine learning.
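
ImageNet-style data is conventionally laid out as one directory per category, which makes it straightforward to load with off-the-shelf tools. A minimal sketch using torchvision; the directory path and preprocessing choices are illustrative assumptions, not the project's actual tooling:

```python
import torch
from torchvision import datasets, transforms

# One sub-directory per category, e.g. data/train/cat/001.jpg
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# "data/train" is a hypothetical path for this sketch.
train_set = datasets.ImageFolder("data/train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

images, labels = next(iter(loader))
print(images.shape, labels.shape)  # torch.Size([64, 3, 224, 224]) torch.Size([64])
```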

  • So more than ever, we're now poised

  • to tackle the problem of object recognition using ImageNet.

  • This is the first take-home message

  • I'm going to deliver today-- much

  • of learning is about big data.

  • This is a child's perspective.

  • As it turned out, the wealth of information provided

  • by ImageNet was a perfect match for a particular class

  • of machine learning algorithms called the Convolutional Neural

  • Network pioneered by computer scientists

  • Kunihiko Fukushima, Geoffrey Hinton, and Yann LeCun

  • back in the 1970s and '80s.

  • Just as the brain consists of billions of neurons,

  • the basic operating unit of the Convolutional Neural Network

  • is a neuron-like node that gets input from other nodes

  • and sends output to others.

  • Moreover, hundreds of thousands

  • of these neuron-like nodes are layered

  • together in a hierarchical fashion,

  • also similar to the brain.

  • This is a typical Convolutional Neural Network

  • model we use in our lab to train our object recognition

  • algorithm.

  • It consists of 24 million nodes, 140 million parameters,

  • and 15 billion connections.
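
To make the layered, neuron-like structure concrete, here is a minimal convolutional network sketch in PyTorch. The layer sizes are illustrative assumptions, not the 24-million-node lab model described above:

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    """A toy convolutional network: stacked conv layers feeding a classifier.
    Layer sizes are illustrative, not the lab model from the talk."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            # Each conv layer is a bank of neuron-like nodes, each taking
            # input from a local patch of the layer below.
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 56 * 56, num_classes)

    def forward(self, x):
        x = self.features(x)                   # hierarchical feature extraction
        return self.classifier(x.flatten(1))   # map features to class scores

model = TinyConvNet()
scores = model(torch.randn(1, 3, 224, 224))    # one fake 224x224 RGB image
print(scores.shape)                            # torch.Size([1, 1000])
```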

  • With the massive data provided by ImageNet

  • and the modern computing hardware like CPUs and GPUs

  • to train this humongous model, the Convolutional Neural

  • Network algorithm blossomed in a way that no one had expected.

  • It became the winning architecture

  • for object recognition.

  • Here is what the computer tells us: the image contains a cat

  • and where the cat is.

  • Here is a boy and his teddy bear.

  • A dog on the beach with a person and a kite.

  • So far, what we have seen is teaching

  • computers to recognize objects.
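
Output like "a cat, and where the cat is" is what modern object detectors produce: a class label plus a bounding box. A minimal sketch using torchvision's off-the-shelf detector (not the model from the talk; the image path is a placeholder):

```python
import torch
from torchvision.io import read_image
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import convert_image_dtype

# Off-the-shelf pretrained detector; not the model described in the talk.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

img = convert_image_dtype(read_image("cat.jpg"), torch.float)  # placeholder path
with torch.no_grad():
    pred = model([img])[0]

# Each detection: a class label plus "where it is" (a bounding box).
for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score > 0.8:
        print(label.item(), box.tolist(), f"{score:.2f}")
```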

  • This is like a young child learning

  • to utter the first few nouns.

  • It's a very impressive achievement,

  • but it's only the beginning.

  • Children soon hit another developmental milestone.

  • And they begin to communicate in sentences and tell stories.

  • So instead of saying--

  • CHILD 1: That's a cat sitting in a bed.

  • FEI-FEI LI: Right, this is a three-year-old

  • telling us the story of the scene

  • instead of just labeling it as a cat.

  • Here's one more.