Fei-Fei Li: Our smartest computers are still blind

RUSS ALTMAN: It is now my pleasure
to introduce my colleague and tireless co-organizer, Dr. Fei-Fei
Li.
She's an associate professor of Computer Science
and Psychology.
She's director of the AI Lab here at Stanford.
And she's done breakthrough research in human vision,
high level visual recognition, and computational neuroscience,
some of which I think we might hear about now-- Fei-Fei.
[APPLAUSE]
FEI-FEI LI: Good evening, everyone.
It's quite an honor to be here.
And I'm going to share with you some
of my recent work in visual intelligence.
So we're going to begin 543 million years ago.
Simple organisms lived in the vast ocean of this Precambrian
age.
And they floated around waiting for food to come by
or becoming someone else's food.
So life was very simple and so was the animal kingdom.
There were only a few animal species around.
And then something happened.
The first trilobites started to develop eyes and life
just changed forever after that.
Suddenly animals could go seek food.
Prey had to run from predators.
And the number of animal species just
exploded in an exceedingly short period of time.
Evolutionary biologists call this period the Cambrian
Explosion, or the big bang of evolution,
and attribute this animal speciation
mainly to the development of vision.
So ever since then, vision played a very important role
in animals for them to survive, to seek food, to navigate,
to manipulate, and so on.
And the same is true for humans.
We use vision to live, to work, to communicate,
and to understand this world.
In fact, after 540 million years of evolution,
the visual system is the biggest sensory system in our brain.
And more than half of the brain neurons are involved in vision.
So while animals first saw the light of the world
540 million years ago, our machines and computers
are still very much in the dark ages.
We have security cameras everywhere
but they don't alert us when a child is
drowning in a swimming pool.
Hundreds of hours of videos are uploaded every minute
to the YouTube servers, yet we do not
have the technology to tag and recognize the contents.
We have drones flying over massive lands
taking an enormous amount of imagery,
but we do not have a method or algorithm
to understand the landscape of the earth.
So in short as a society, we're still pretty much
collectively blind because our smartest machines and computers
are blind.
So as computer vision scientists,
we seek to develop artificial intelligence algorithms that
can learn about the visual world and recognize the contents
in the images and the videos.
We've got a daunting task to shine light
on our digital world and we don't have
540 million years to do this.
So the first step towards this goal
is to recognize objects because they are the building
blocks of our visual world.
In the simplest terms, imagine this process
of teaching computers to recognize objects
by first showing them a few training
images of a particular object-- let's say a cat.
And then we design a mathematical model
that can learn from these training images.
How hard could this be?
Humans do this effortlessly.
So that's what we tried at the beginning.
In a straightforward way, we tried
to express objects by designing parts and the configurations
of their parts, such as using
simple geometric shapes to define a cat model.
Well, there are lots of different cats.
So for this one, we cannot use our original models.
We have to do another model.
Well, what about these cats?
So now you get the idea.
Even something as simple as a household pet
can pose an infinite number of variations for us to model.
And that's just one object.
But this is what many of us were doing at that time.
We kept designing and tuning our algorithms
and waiting for that magical algorithm
to be able to model all the variations of an object using
just a few training images.
But about nine years ago, a very profound but simple
observation changed my thinking.
This is not how children learn.
We don't tell kids how to see.
They do it by experiencing the real world
and by experiencing real life examples.
If you consider a child's eyes as a pair
of biological cameras, they take a picture
every 200 milliseconds.
So by age three, a child would have seen
hundreds of millions of images.
And that's the amount of data we're talking about
to develop a vision system.
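
The "hundreds of millions" figure can be checked with rough arithmetic. The sketch below is only illustrative: the 200-millisecond interval and the age of three come from the talk, while the 12 waking hours per day and the function name `images_seen` are assumptions added here.

```python
def images_seen(years, waking_hours_per_day=12.0, interval_ms=200.0):
    """Estimate how many 'pictures' a child's eyes take by a given age.

    interval_ms is the figure quoted in the talk (one picture every
    200 milliseconds). waking_hours_per_day is an assumption made for
    this sketch, not a number from the talk.
    """
    pictures_per_second = 1000.0 / interval_ms
    waking_seconds = years * 365 * waking_hours_per_day * 3600
    return int(pictures_per_second * waking_seconds)


# Roughly 2.4e8 pictures by age three under these assumptions.
print(images_seen(3))
```

Even if the child were awake around the clock, the estimate would stay below one billion, so "hundreds of millions" holds across a wide range of assumptions.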
So before we come up with a better algorithm,
we should provide our computer algorithms the kind of data
that children were experiencing in their developmental years.
And once we realized this, we knew what we needed to do.
We needed to collect a dataset that had far more images
than we had ever used before in machine learning and computer
vision-- thousands of times larger
than the standard dataset that was being used at the time.
So together with my colleague Professor Kai
Li and student Jia Deng, we started this ImageNet project
back in 2007.
After three years of very hard work,
by 2009 the ImageNet project delivered a database
of 15 million images across 22,000 categories
of objects and things, organized by everyday English words.
In quality and quantity, this was an unprecedented scale
for the field of computer vision and machine learning.
So more than ever we're now poised
to tackle the problem of object recognition using ImageNet.
This is the first take of the message
I'm going to deliver today-- much
of learning is about big data.
This is a child's perspective.
As it turned out, the wealth of information provided
by ImageNet was a perfect match for a particular class
of machine learning algorithms called the Convolutional Neural
Network pioneered by computer scientists
Kunihiko Fukushima, Geoffrey Hinton, and Yann LeCun
back in the 1970s and 80s.
Just like the brain consists of billions of neurons,
the basic operating unit of the Convolutional Neural Network
is a neuron-like node that gets input from other nodes
and sends output to others.
Moreover, hundreds and thousands
of these neuron-like nodes are layered
together in a hierarchical fashion,
also similar to the brain.
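
The idea of neuron-like nodes stacked in layers can be sketched in a few lines. This is a minimal illustration, not the model used in the lab: it shows plain fully connected layers rather than convolution, and the names `node`, `layer`, and `network`, the ReLU nonlinearity, and the weights are all assumptions chosen for the example.

```python
def node(inputs, weights, bias):
    """One neuron-like node: weighted sum of its inputs, then a nonlinearity."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, total)  # ReLU, assumed here for simplicity


def layer(inputs, weight_rows, biases):
    """A layer is a group of nodes that all read the same inputs."""
    return [node(inputs, w, b) for w, b in zip(weight_rows, biases)]


def network(inputs, layers):
    """Stack layers so each layer's outputs feed the next layer's inputs."""
    activations = inputs
    for weight_rows, biases in layers:
        activations = layer(activations, weight_rows, biases)
    return activations


# Two tiny layers with hand-picked (hypothetical) weights.
tiny = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # layer 1: two nodes
    ([[1.0, 1.0]], [-1.0]),                   # layer 2: one node
]
print(network([1.0, 2.0], tiny))
```

In a real network the weights are learned from training images rather than written by hand, and a convolutional layer reuses the same small set of weights across every position in the image.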
This is a typical Convolutional Neural Network
model we use in our lab to train our object recognition
algorithm.
It consists of 24 million nodes, 140 million parameters,
and 15 billion connections.
With the massive data provided by ImageNet
and the modern computing hardware like CPUs and GPUs
to train this humongous model, the Convolutional Neural
Network algorithm blossomed in a way that no one had expected.
It became the winning architecture
for object recognition.
Here is what the computer tells us, the image contains a cat
and where the cat is.
Here is a boy and his teddy bear.
A dog on the beach with a person and a kite.
So far, what we have seen is teaching
computers to recognize objects.
This is like a young child learning
to utter the first few nouns.
It's a very impressive achievement,
but it's only the beginning.
Children soon hit another developmental milestone.
And they begin to communicate in sentences and tell stories.
So instead of saying--
CHILD 1: That's a cat sitting in a bed.
FEI-FEI LI: Right, this is a three-year-old
telling us the story of the scene
instead of just labeling it as a cat.
Here's one more.
CHILD 2: Those are people.
They're going on a airplane.
That's a big airplane!
FEI-FEI LI: Very cute-- so to train
a computer to see a picture and generate a story,
the marriage between big data and machine learning algorithms
has to take another step, just like our brain integrates
vision and language.
We use a deep learning algorithm to learn
to connect the visual snippets with the words and phrases
to generate sentences.
Now I'm going to show you what a computer would
say for the first time when it sees a picture.
COMPUTER VOICE: A large airplane sitting
on top of an airport runway.
FEI-FEI LI: Not as cute, but still good.
COMPUTER VOICE: A man is standing next to an elephant.
FEI-FEI LI: So this is an algorithm we
built to generate one sentence.
Recently, we've taken the storytelling algorithm a step
further and created a deep learning
model that can generate multiple sentences
and phrases for a picture.
Our algorithm is computationally very efficient,
so it can process almost in real time.
Here I'm showing you the algorithm generating
regions and region descriptions for every frame of this video.
So we have successfully used neural network algorithms
to train computer vision models to begin telling
the story of the visual world.
This is a brain-inspired perspective.
With the availability of data and the blossoming
of the powerful neural network models,
we begin to see unprecedented advances
in all areas of computer
vision, both in my own lab and across our field.
Now let me show you a few more examples
and their potential applications.
Collaborating with Google's YouTube team,
we developed a deep learning algorithm
that can classify hundreds of sports types.
We hope one day this technology can help us to manage,
index, and search massive amounts of photos and videos
in big data repositories.
Working with a European train station,
we used hundreds of computer vision sensors
to help observe and track the behaviors of millions
of travelers and customers.
This provided invaluable information
for the train station to collect data analytics
of their customers and to optimize the use of space.
Furthermore, we developed a reinforcement learning
algorithm and deep learning model to process human activity
understanding in an extremely efficient manner,
achieving the same results as a state-of-the-art algorithm
in action detection using only 2% of the video frames.
In a different project, we used depth sensors
to learn about human movements in great detail.
We collaborate with the Stanford hospitals
to deploy this technology to help the hospitals
improve health hygiene and workflow practices.
And in this work, we trained a computer vision algorithm
that can do better object recognition than humans--
at least some of us-- by recognizing 3,000 types of cars
by make, model, and year.
We applied this to 50 million Google Street View
images across 200 American cities and learned
very interesting social statistics,
like a visual census.
We learned that the average car price
can correlate very well with average household
incomes in cities.
Or it can correlate very well with crime rates in cities.
Or even voting patterns-- let's wait till later this year.
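
The kind of correlation described here is commonly measured with the Pearson coefficient. The sketch below assumes that measure; the talk does not say which statistic was used, and the function name `pearson` and the sample numbers are made up purely for illustration, not results from the study.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-city values, invented for this example only.
avg_car_price = [12000, 15000, 18000, 22000, 30000]
avg_household_income = [41000, 48000, 52000, 61000, 83000]

# A value near +1 means the two quantities rise together.
print(pearson(avg_car_price, avg_household_income))
```

A coefficient close to +1 or -1 indicates a strong linear relationship; it does not by itself show that one quantity causes the other.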
So in short, as a technologist, nothing
excites me more than seeing the potential of computer vision
algorithms to solve real-world problems.
This is a technologist's perspective.
So 500 million years ago the challenge
of animal vision and intelligence
was for the survival of individual organisms.
Today the challenge of machine vision
and artificial intelligence is for the thriving
of the human race.
An inevitable question for us technologists
is whether AI will become a force of destruction
or the hope we have for a better tomorrow.
So I've been thinking about this problem for a long time.
But recently I think I've had an epiphany moment.
So the future of AI-- just like what Megan just
said-- it's in the hands of those of us who create,
develop, and use AI.
So AI will change the world.
But the real question is, who will change AI?
So it is well-known that we have a crisis of lack
of diversity in STEM in our country,
including here in Silicon Valley.
So in academia only 25% of computer science majors
in colleges are women.
Less than 15% of faculty in our nation's top engineering
schools are women.
And the number is even lower for under-represented minorities.
And in industry, the picture is not pretty either.
So in my opinion this is not just
a problem of workplace culture.
This is an economic and social problem.
It impacts the collective creativity of our technology
and it will impact the values we build
into our technology for this and for future generations.
So recall the work I showed you about using computer vision
algorithms to help improve health care.
This was led by a woman PhD student
in my lab who's passionate about using technology
to improve health care.
And the work on using deep learning
algorithms to create a visual census of our American cities--
it was done by an African American PhD student, who
is not only a talented engineer, but very passionate
about social justice.
Each of these students could have done work
in many other applications of AI,
but they chose these dissertation topics
because they have found a humanistic mission
statement in their technology.
In fact, I dare say it is the humanistic mission that
kept them working in AI.
So 500 million years ago the development
of vision and intelligence resulted
in the Cambrian Explosion of the diversity of animal species.
Today, by including a humanistic message
in AI education and research, we can
encourage a more diverse group of technologists
to work in AI and STEM.
Through them, we'll see a technological Cambrian
explosion of discoveries and innovations
that can make our world a better place for tomorrow.
This is the perspective of an educator and a mother.
Thank you.
[APPLAUSE]