FEMALE SPEAKER: Please join me in
welcoming Mr. Kenneth Cukier.
[APPLAUSE]
KENNETH CUKIER: Thank you very much.
You can probably appreciate the fact that I've got a lot
of trepidation coming here to talk to you folks for the
obvious reason that I'm wearing a suit.
And the truth is I had a breakfast this morning at the
Council on Foreign Relations to talk to them about the
international implications and the foreign-policy
implications of big data.
That leads to the second trepidation and the context of
my remarks.
So the second trepidation is that this is a sort of
homecoming for the book.
Because my journey, so to speak, in the world of big
data started at Google and started at the
Googleplex in 2009.
It was you folks who opened up the kimono to what you were
doing, in small slivers.
I never got the full picture.
But I was able to cobble it all together, see something,
and then give it a label.
Luckily, there were a couple of labels that we
were thinking of.
And I reached for one that wasn't a popular term at the
time, and the term was big data.
And that was really helpful.
It was the cover story of "The Economist"
in February of 2010.
It was called "The Data Deluge", because they thought
they would sell it better than saying "big
data." But big data--
it was basically all about that and about what
you guys are doing.
And so it brings me great fear to walk into a room, because
you guys have been doing it for so long.
And that brings me into the context of my
conversation today.
I want it to be a conversation.
I was obviously just at the Council on Foreign Relations
thinking about this in ways that I am sure your engineers
never thought about it 10 years ago.
I may have heard a snort.
But here's the thing.
Many of you were thinking of it as a technological issue
when people around the world think of it in terms of the
competitiveness of nations.
Our book, which is being released today in America, has
already been available in China, where it's been a
best-seller.
And when we hear questions from Chinese journalists to
us, they're all talking about the national project that
they're on.
Is this the way for us to leapfrog the West?
Is this one area of technology, unlike the
internet and computing, where we can lead?
So the implications of this are vast.
And the implications are more than just technological.
I'm at a technology company-- in fact, the pioneer, in many
respects, of big data.
But I want to explain that I'm here as a journalist, as
someone who's looked in at your world and now can serve
as a sort of a filter.
And what I'd like to do is show you that world from a
non-engineer's perspective, from someone who just is
curious about the world and society and thinks deeply
about these issues.
Now there's a second disclosure I have to make, and
that is not only am I talking about big data, but my
presentation is big data.
Because there's 70 slides.
On top of it, I haven't really seen the slides except once
or twice, because they just arrived in my inbox this
morning from someone who was putting them
together for me.
This is actually the recipe for disaster, so please have
forbearance.
I'm going to go really quickly, and I'm probably
going to skip through a couple of these slides.
So let me start with a story, and the story is the story of
a company called Farecast.
And it begins in the year 2003.
A guy named Oren Etzioni at the University of Washington
is on an airplane.
And he asks people how much they paid for the seats.
And it turns out, of course, that one person paid one fare,
and another person paid a different fare.
But this made Oren Etzioni really, really upset.
And the reason why is that he took the time to book his air
ticket long in advance, figuring he was going to pay
the least amount of money.
Because that's the way the system worked.
And then he realized actually that that wasn't the case.
When he figured that out, he was really upset.
And he figured, if only I knew the meaning
behind this airfare madness.
How would I know if a price I'm being presented with at an
online travel site is a good one or a bad one?
And then he came up with the insight.
Because he's like you-- he's a computer scientist--
he realized actually--
that's actually just an information problem.
And I bet I can get the information.
All I would need is one simple thing--
the flight price record of every single flight in
commercial aviation in the United States for every single
route, every flight, and to identify every seat, and to
identify how long in advance the ticket was bought for the
departure, and what price was paid, and just run it through
a couple computers, and then make a prediction on whether
the price is likely to rise or fall, and score my degree of
confidence in the prediction.
Pretty simple.
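The prediction Etzioni describes can be sketched in a few lines. This is a toy illustration with invented fare numbers, not his actual model: given historical (days-before-departure, price) records for a route, guess whether the fare is likely to rise, and score confidence as the fraction of history that agrees.

```python
def predict_fare(history, days_out, current_price):
    """history: list of (days_before_departure, price) records for this route.
    Returns ('buy' or 'wait', confidence in [0, 1])."""
    # Fares observed closer to departure than we are now.
    later = [p for d, p in history if d < days_out]
    if not later:
        return "buy", 0.0
    rises = sum(1 for p in later if p > current_price)
    # Confidence = share of historical cases agreeing with the call.
    confidence = max(rises, len(later) - rises) / len(later)
    direction = "wait" if rises < len(later) / 2 else "buy"
    return direction, confidence

# Invented fares for one route: prices tended to climb toward departure.
history = [(30, 200), (20, 210), (10, 260), (5, 300), (2, 320)]
print(predict_fare(history, days_out=25, current_price=210))  # -> ('buy', 0.75)
```

The real system crunched 20 billion such records per route and seat class; the principle is the same.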
So he scraped some data.
And it works pretty well.
And he runs a system.
It's great.
The academic paper that he writes is called "Hamlet--
To Buy or Not To Buy, That Is the Question." It works well,
but then he realizes, hey, this works so well, I'm going
to get more data.
And he gets more data, until he has 20 billion flight-price
records that he's crunching to make his prediction.
And now it works really well.
Now it's saving customers a lot of money.
It gets a little bit of traction, and Microsoft comes
knocking on the door.
He's in Washington.
He sells it for about $100 million--
not bad for a couple years' work by him and the couple of
computer-science PhDs who were working with him.
But behind this, the key thing is this.
He took data that was generated for one purpose and
reused it for another.
The Sabre database--
at the time probably the biggest airline reservation
system, and actually the biggest civilian computer
project of its era when it was created in
the '50s and '60s--
was built by American Airlines and IBM.
They never imagined for a million years that the data of
the passenger manifest was going to become the raw
material for a new business, and a new source of value, and
a new form of economic activity.
And we're going to be creating markets with this data.
And if you want to understand what big data is, at least
from a person looking into it--
because Google's been doing big data for a long time.
What we're seeing across society is what you folks have
been doing for years.
We're seeing that data is becoming a new
raw material of business.
It is the oil, if you will, of the information economy.
There's a lot of data around in the world today.
You know this.
The arresting statistics are obvious.
Whenever a big new sky survey--
a telescope, for you and me--
goes online, it usually ends up collecting as much data in
the first night or two as in the entire history of
astronomy prior to it going online.
And obviously, the human genome, et cetera.
You all know the data about big data, so I won't spend too
much time there.
But what we see behind big data are three features of
society, or shifts in the way that we think about
information in the world--
more, messy, and correlations.
So more.
We're going from an environment where we've always
been information-starved--
we've never had enough information--
to one where that's no longer the operative
constraint.
It's still a constraint.
Of course, we never have all the information.
What is information?
Is it really the real thing?
But what's clear is that instead of having to optimize
our tools to presume that we can only have a small sliver
of information, when that changes, we
can get a lot more.
And so what does more mean?
Well, think of it as 23andMe.
What they do is they actually take a sample of your DNA, and
they look for very specific traits.
Now that works well, but it's imperfect as well.
That's one reason why it's only $100--
a couple hundred dollars.
When Steve Jobs had cancer, he was one of the first
individuals in the world to have his entire genome
sequenced and his tumor sequenced as well.
So he had personalized medicine, and it was
individually tailored to the state of his
health at that time.
When one drug would work, they'd continue.
When the cells mutated and blocked the drug from working,
they routed around it and tried something else.
They were able to do that because they had all of the
data, not just some of the data.
And that's one of the shifts that we're seeing
from some to more.
And in some cases, n equals all the data.
We also have messy data.
That's another feature as well.
In the past, we had highly curated databases--
information that we optimized our tools to get in the most
pristine way as possible.
And this was sensible.
When there's only a small amount of information that you
can bother collecting and processing, because the cost
is so high and it's so cumbersome, you have to make
sure the information you get is the best
possible thing you can.
But when you can avail yourself of orders and orders
of magnitude more information, that constraint goes away.
And suddenly, you can allow for a little bit of messiness.
Now, it can't be completely wrong.
But messiness is good.
You folks are pioneers of this in machine translation.
And you know the famous paper by Peter Norvig, Alon
Halevy, and others on the unreasonable
effectiveness of data.
The idea here is that machine translation actually
worked-- it was a real step up.
When IBM tried it in around '56 with 20 Russian phrases
and English phrases that they programmed the computer to
translate, it looked impressive.
It was ridiculous, of course.
We now know.
It's like a punch card.
Then when IBM's project Candide came around in the
'90s, actually that was not machine translation.
That was statistical machine translation.
That was really good, relatively speaking.
What they did is they took the Canadian Hansard--
the parliamentary transcripts that were translated into both
English and into French--
and they just let the computer infer when a word in French
would be a useful substitute for the
one in English.
They didn't try to presume what was
right or what was wrong.
They let the computer infer that itself and score the
probability that one would be the right word or not in that
particular context, and go forward.
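The Candide-style inference can be sketched crudely: count which English words show up unusually often in sentences aligned with a given French word, with no hand-written rules at all. The four-sentence corpus below is invented, and this is nowhere near real statistical MT, but it has the same let-the-data-decide spirit.

```python
from collections import Counter

def translation_scores(pairs, french_word):
    """pairs: list of (english_sentence, french_sentence) aligned translations.
    Scores each English word by its 'lift': how much more often it appears
    in sentences whose French side contains french_word than overall."""
    with_f, overall, n_with = Counter(), Counter(), 0
    for en, fr in pairs:
        words = set(en.split())
        overall.update(words)
        if french_word in fr.split():
            with_f.update(words)
            n_with += 1
    n = len(pairs)
    return {w: (c / n_with) / (overall[w] / n) for w, c in with_f.items()}

corpus = [
    ("the house is red", "la maison est rouge"),
    ("the house is big", "la maison est grande"),
    ("the car is red", "la voiture est rouge"),
    ("the big car is blue", "la grande voiture est bleue"),
]
scores = translation_scores(corpus, "maison")
print(max(scores, key=scores.get))  # -> house
```

With the Canadian Hansard, and later the whole web, the same counting idea scales up into usable translation probabilities.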
They tried to optimize it and make it better.
They couldn't--
not in any reasonable way.
It was just a hard problem.
Then Google came along.
And you guys didn't avail yourself of just the
parliamentary transcripts in French and English in Canada.
You availed yourself of the World Wide Web.
It wasn't 1994.
It was 2006.
You poured in.
You got all of the European Union
translations of all 21 languages.
Your Google Books project became a signal for what was
good and what was not, because of the translations you
could find in the libraries.
Now in many instances, the data was far less clean than
in the past when we tried to do it with just a
small amount of data.
But the fact is more data beat clean data.
Messiness was good.
And the final point, which is obvious, is correlations.
We have had a society in which we've always looked for causes
behind things, and that made sense in a
small-data world as well.
In fact, causality is still very useful to know.
But for a lot of the problems that we're dealing with these
days, just knowing the correlation is good enough.
And in fact, what we're finding is that often we think
we see causality when we don't.
And it's hard to do.
So there's going to be cases where we actually still want
to know the reasons why.
But often, just knowing what is good enough, because we can
learn the correlation and go with that.
So a similar company like Farecast is Decide.com.
This is Oren Etzioni's company again that basically looks
across the web at all of the prices online, not just of
airlines, but of anything that has lots of price data and
high variability and just ranks it to say, is this a
good price or not?
And it leads to new markets.
It leads to transparency, which is good for customers.
More interestingly is what this means for human health.
Premature babies, known as preemies.
In the past, when we thought about health care, we would
take the vital signs of someone maybe once or twice a
day, couple more times if it was important enough.
And a doctor would look at the clipboard at the edge of the
bed and make a decision on what to do.
Feedback loop was really, really long.
Very, very imperfect.
What we're now able to do-- some researchers in Canada are
doing this-- is look at the real-time flow of 16
different streams of vital signs of premature babies.
And they're able to score it and look for
correlations with it.
And when they do that, they find that they can spot the
onset of infection 24 hours before overt symptoms appear.
By doing that, you can intervene sooner, see if the
intervention is working, react to it, and save lives.
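The detection step can be sketched very simply. This is a hypothetical illustration with invented readings and an invented threshold, not the Canadian researchers' actual system: flag the moments when a rolling window of heart-rate readings becomes unusually stable, since, as described below, stability itself turned out to be a warning sign.

```python
from statistics import stdev

def stability_alert(heart_rates, window=5, threshold=1.0):
    """Return indices where the rolling standard deviation of the last
    `window` heart-rate readings drops below `threshold`."""
    alerts = []
    for i in range(window, len(heart_rates) + 1):
        if stdev(heart_rates[i - window:i]) < threshold:
            alerts.append(i - 1)
    return alerts

# Invented readings: normal variability, then suspicious flatness.
readings = [120, 126, 118, 124, 121, 122, 122, 123, 122, 122]
print(stability_alert(readings))  # -> [8, 9]
```

The real work streams 16 vital signs at once and correlates across them; one signal and a fixed threshold is only the shape of the idea.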
But you learn something else as well.
You would have thought-- and you can imagine generations of
doctors looking at the clipboard, seeing the kid's
vitals stabilizing, and thinking it was safe to go
home to supper, that things were OK, and we'll treat the
patient tomorrow.
Just nurse, call me if there's a problem.
And then to get a frantic call at midnight saying something
had gone horribly wrong.
The fact is, what we're finding is that one of the
best predictors that there is going to be an onset of an
infection is that the baby's vital signs stabilize.
Weird, right?
Why?
We don't actually really know why, what's happening
biologically.
It kind of seems like the kid's little organs are just
battening down the hatches for a rough night ahead.
We don't know why, but we know that with that correlation, we
can do something better.
We can save the baby's life.
And we didn't know that before big data.
Behind this is we have data.
Why do we have data?
Well, we're collecting more data for things that we've
always collected data on.
Weather--
great.
That's fantastic.
But it also is because we're collecting things that were
always informational but we never treated as data
before, like you.
So you're all sitting.
You're sitting down.
And you are sitting differently than you, and you,
and you, and you.
You weigh differently.
Your legs are different.
Your posture is different.
The distribution of weight.
And you know that if I have 20 sensors on your seat and on
your seat back that I can probably score with a high
degree of accuracy who you are based on the way you sit.
Why is that useful?
Well, for one purpose, you can imagine that this would be a
great anti-theft tool in cars.
Put this in, and suddenly you would know that the authorized
driver of your Lamborghini is you, and
it's not someone else.
Or if you have children, likewise, hey, I told you you
can't take the car out after 10:00 PM.
And so the engine didn't work when you tried to sit down and
turn the keys in the ignition.
That's great.
But what else can you do with it?
Well, think about it.
If everybody has their car seat instrumentized and you
actually datafy posture, suddenly you would be able,
perhaps, to identify the telltale signs of a shift in
posture 30 seconds prior to an accident--
the probability of you getting into an accident from a
shift in your posture.
Maybe what we've datafied is driver fatigue.
And the service here would be the car would
send an internal alarm.
Maybe the steering wheel would vibrate, or there'd be a chime
saying, hey, wake up.
You have a high likelihood of getting into an
accident right now.
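The seat-sensor identification step reduces to a nearest-neighbour match. A minimal sketch, with invented pressure values for four hypothetical sensors:

```python
def identify_driver(profiles, reading):
    """profiles: {name: stored sensor vector}; reading: current sensor vector.
    Returns the name whose profile is closest (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(profiles, key=lambda name: dist(profiles[name], reading))

profiles = {
    "alice": [80, 75, 60, 55],  # invented readings from four seat sensors
    "bob":   [95, 90, 70, 72],
}
print(identify_driver(profiles, [82, 74, 61, 56]))  # -> alice
```

With 20 sensors instead of four, and a threshold for "no match", the same comparison becomes the anti-theft check described above.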
That's the sort of thing that is left to play for as we
datafy society in a world of big data.
So what we're seeing is lots of things
being datafied as well.
Facebook datafies our friendships; Twitter, our
stray thoughts, our whispers; LinkedIn, our
professional contacts.
Google datafies our intentions.
So obviously, Google Flu Trends is a wonderful way to
have a predictor of what the likelihood of
outbreaks of flu are.
Now that's great.
It's just you don't want to have to know causality.
You don't know why.
It just is what it is.
Now you may recall that there was a little bit of a grumble
in the scientific community recently, when the CDC--
the Centers for Disease Control--
said that flu was going to be right here, and
Google Flu Trends was up here.
It didn't work this year.
Bullshit.
How do we know?
Because the CDC uses reported data.
The person came in.
Maybe because of the economic crisis, people decided they
had to show up at work and didn't go see a doctor.
Maybe the Google Flu Trends is accurate, and
that's what's real.
And the CDC's reported data-- not observed
behavior-- isn't as good.
No one thought of that.
Big data has been with us for a while.
It turns out that there was an American commodore who had
datafied all of the old log books inside of
the dusty Navy trunks.
And with that, he was able to create a whole new form of
nautical map that told sailors not just where they were but
the patterns of the winds.
No one realized that the world, and the winds, and the
waves conformed to natural patterns.
If you will, the sea had its own physical geography,
and if you aligned yourself with those patterns,
you could have a safer voyage.
And we can do that now.
But the problem is it took him a decade and dozens of people
to do it, and we do the same sort of thing in about one
sixth of a second every day.
So it's a democratization play of techniques that we have
tried to do in the past.
We've done sometimes.
Obviously, censuses have been around since Jesus was born.
But now we're actually doing it in a widespread way.
Predictive maintenance is a good example of taking the
same idea about premature babies and
applying it to machines.
When your car is about to break down, it doesn't go
kaput all at once.
Usually, you can feel it.
There's a grumble, or it just doesn't drive right.
Well, now what we can do is instrumentize it, see what the
data signature of the heat and the vibration is, find out how
it correlates with previous incidences of a break down,
and know, perhaps, two days in advance that your fan belt is
going to break.
And that's happening today in UPS's vehicle fleets, and
it's going to be in your car tomorrow if
it's not already there.
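A crude sketch of the correlation step: compare a component's recent vibration readings against the average signature recorded in the days before past breakdowns. All numbers here are invented for illustration.

```python
def failure_risk(recent, failure_signatures):
    """Mean absolute gap between recent readings and the averaged
    pre-failure profile. A smaller gap means the component looks more
    like past failures, i.e. higher risk."""
    profile = [sum(vals) / len(vals) for vals in zip(*failure_signatures)]
    return sum(abs(r - p) for r, p in zip(recent, profile)) / len(recent)

# Invented vibration levels over the three days before past breakdowns.
past_failures = [[0.2, 0.5, 0.9], [0.3, 0.6, 1.0]]
healthy = failure_risk([0.1, 0.1, 0.1], past_failures)
failing = failure_risk([0.25, 0.5, 0.95], past_failures)
print(failing < healthy)  # the failing component matches the profile better
```

Real predictive maintenance uses far richer signatures (heat, acoustics, load), but the "correlate with previous incidences of a breakdown" idea is exactly this comparison.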
The value of data is hidden.
It's hidden not in the primary purpose for which it
was collected.
But now with big-data techniques, it's often
uncovered in its multiple secondary uses that are just
limited by our imagination.
So INRIX is a traffic-data company that takes sat-nav data
and makes a prediction on how long it's going to take to
get from one place to another.
Sounds great.
It's a good service.
Use it.
It's also used by economists to understand
the health of economies.
Because they see how cars drive, and the frequency and
propensity of cars, and the travel times as a proxy, an
indicator, for the health of a local municipal economy.
Hedge funds use this information to look at the car
circulation in the areas near a retailer on the weekends.
And so prior to the quarterly announcements, they have a
good indicator whether the sales are going to increase or
decrease, and they can short or go long on its shares.
No one would have thought of that in the past, that we
could do that sort of thing with information.
Obviously, everything that we do-- all of our interactions--
give off lots of data exhaust.
You folks are experienced with that, because you treat all of
the interactions of an individual who goes to your
website as a signal for something else.
You've built your systems and optimized them based on that
form of data exhaust, by treating information as a new
raw material that you can recycle back into the system
to improve it or to create a whole new system altogether.
There are going to be winners and losers in this new world.
There are three features that seem to distinguish
who's going to do well.
And that's the skills, the mindset, and the data.
The skills are kind of obvious.
It's the people who have technical knowledge, or it's
the vendors who sell you stuff.
That's great.
The mindset, in some ways, seems to be more important.
Because what you need is not just the skills.
That gets commoditized first, obviously.
The history of computing suggests that.
The first computer scientists in the '50s and '60s--
actually, not the scientists.
But these were the doers, the software programmers.
They looked like they were sitting pretty wearing the
white lab coats.
But by the '70s and '80s, man, just an ordinary rinky-dink
software developer had been largely commoditized.
And we're going to see that with big data as well.
Today, some companies and people are at the high end.
It's going to filter through as the PhD
programs move forward.
The mindset is going to be critical and the creativity.
But ultimately, both of those things are going
to go to the wayside.
Because if you remember, Jeff Bezos had a great dot-com
mindset really early in '94 or so, and he executed
brilliantly as well.
But by 2000, every executive was
thinking about the web.
So we're going to have the same thing with big data.
That advantage doesn't hold that long.
The data, however-- who has access to the data is going to
be critical.
That's the resource.
So weirdly, and ironically, what seems to be abundant
today is actually the source of scarcity, and vice versa.
Now in New York, we have a problem
with overcrowded buildings.
But before I tell you that story, let me see how much time
we have, because I want to get questions as well.
I think we're doing OK.
So tenements, overcrowded buildings, and the problem of
just stuffing 10 times as many occupants into a single
dwelling as it was designed for.
This is a bad thing, and it's a bad thing
because it leads to crime.
It leads to drugs.
It leads to violence.
And it leads to fires, and not just any kind of fires.
Basically, these kinds of fires kill the occupants.
And they also end up injuring and killing, at much greater
rates, the firefighters who go in to help.
So this is a serious problem for the city.
The city gets something like 63,000 complaint calls a year
about illegal,
overcrowded buildings.
And there's only 200 inspectors at City Hall.
So there you see a problem.
But the problem is actually a big-data problem.
How can we solve it?
Well, the first thing that they do is they take a list, a
database of every single building across the
five boroughs, and that's 900,000, give or take.
And then they look at everything as a signal,
whether it's going to be a predictor that the thing is
going to burst into flames, or that it's going to actually
improve the model by predicting that it's not going
to burst into flames.
So they look at things like ambulance calls, utility cuts.
Has there been a lien against the property?
Are there complaints of rodent infestation?
So the number of rats in the building is not datafied, but
complaints to the city's 311 line is.
So you find out the number of rodent complaints.
And all together, you look at it.
And you can score with a high degree of confidence whether
the building's going to burst into flames.
They looked at weird stuff, like data from the buildings
department on whether exterior brickwork had
recently been done on the building.
That improved the model too.
Because if brickwork was done to the building, even if you
had all these other problems that were high indicators of a
fire, it went down.
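The scoring idea can be sketched as a weighted sum of signals per building. The features, weights, and addresses below are all invented; New York's actual model was far richer, but note how recent brickwork gets a negative weight, pulling the score down just as described.

```python
# Invented weights: positive signals raise fire risk, negative ones lower it.
WEIGHTS = {
    "rodent_complaints": 0.3,
    "utility_cut": 0.4,
    "ambulance_calls": 0.2,
    "lien": 0.3,
    "recent_brickwork": -0.5,  # recent exterior work lowered predicted risk
}

def risk_score(building):
    """Weighted sum over whichever known signals this building has."""
    return sum(WEIGHTS[k] * v for k, v in building.items() if k in WEIGHTS)

def rank_for_inspection(buildings):
    """Order building names from highest to lowest risk."""
    return sorted(buildings, key=lambda b: risk_score(buildings[b]), reverse=True)

buildings = {
    "12 Main St": {"rodent_complaints": 4, "utility_cut": 1, "recent_brickwork": 0},
    "99 Oak Ave": {"rodent_complaints": 5, "utility_cut": 1, "recent_brickwork": 1},
}
print(rank_for_inspection(buildings))  # -> ['12 Main St', '99 Oak Ave']
```

Ranking 900,000 buildings this way is how 200 inspectors get pointed at the right doors first.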
When they pressed go on the system, here's what happens.
In the past, when an inspector made a visit, about 15% of
the time they would issue a vacate order, the stiffest
sanction, which basically says, everyone out in 24 hours.
That was 15% of the time-- a high rate.
So it tells you there's a big problem.
Now it's 70% of the time.
And so what that means is that they like it, the inspectors,
because it's more effective.
The fire department likes it, because their firemen aren't
dying as much.
And it's just good for all of us in our communities to see
that people have good housing and that these buildings don't
catch on fire.
And that was because they turned the problem into a big
data problem and solved it successfully with information.
And they gave up trying to figure out the causality and
just went with the correlation.
There, of course, are serious issues of big data.
One is going to be privacy.
It's a problem now in the small-data era.
It's going to be a bigger problem with big data.
But there's going to be something on top of that.
And that is not privacy, but propensity--
but prediction.
The idea is that we're going to have algorithms predicting
our likelihood to do things, our behavior.
And it's going to be obvious that we're going to have
governments, or businesses, sanction us on the
basis of that prediction.
It's going to look a little bit like "Minority Report" and
the idea of pre-crime.
We're going to be denied a loan, because we're deemed
unlikely to repay.
But instead of this profiling universe, in which we are
taken as a big clump--
a small-data problem, with 13 explicit predictors and
rules by which we can tell you the formula--
imagine if we have 1,000 variables.
It's a machine learning algorithm, and when we try to
knock on the door in front of a court and say, I was denied
surgery because you said I had a 90% mortality rate after
five years with my individual data, I want you to disclose
to me how you came to that decision
because that seems unfair.
They're going to say I don't know.
I can show you the formula.
I've frozen every instantiation of the data at
every moment, because regulators required me to.
If I printed out the formula, it would be on 600 pages.
You need a PhD to understand it.
It's true you have only 40 strong signals, but you've got
a long tail of 600 weak signals that all went into it.
I can't tell you why.
And then the person's going to say, OK, I don't even care
about that.
What makes you think I'm not going to be part of the 10%
that's going to live past five years?
Why are you denying me this operation because you think
I've got a high probability of not surviving longer after it?
I want to take the test.
And you can imagine with criminal-justice systems, it's
going to be the exact same thing.
This is the issue that we have.
And so what do we have to preserve in this instance?
Well, if it's the case of whether we think this fellow's
going to shoplift in the next 12 months with a 99%
probability, he can rightly say, I don't even care
about the mumbo-jumbo of big data.
I'm part of the 1% that's going to exercise free will,
moral choice, and do the right thing.
Now of course, all 100 of those individuals will make the
same argument.
But it does mean that it seems like we're going to have to
create a new value in our world.
Just as the printing press gave us the consciousness of
free speech--
prior to the printing press, we didn't have a guarantee or
the consciousness of free speech.
When Socrates drank the hemlock for corrupting the
youth of Athens, in his apologia did he make a
free-speech claim?
No.
It didn't exist.
It took the printing press to give us this idea that
expression was something that needed to be protected.
What will need to be protected in a world of big data?
Well, maybe human volition, free will, responsibility.
We have always had the risks of data.
We're going to have to deal with this even more as we
become more respectful of data and live with it in more parts
of our lives.
We've looked at data in lots of ways.
America went to war over a statistic, a data point, and
we saw the problems there.
We're going to need regulators to think about how we can
adopt this and get the most benefits of it.
Probably one of the biggest is going to be giving us a degree
of transparency.
When there was an information explosion at the beginning of
the 19th century and it was financial information, we
created accountants to do the bookkeeping and auditors to do
the surveillance function on top of the accountants.
And I think that in the future-- and we mention this
in the book--
that we're going to have to create a new
professional class.
And we might as well call them algorithmists: people
trained in big-data techniques who serve internally at
companies, as well as externally, as expert witnesses and
masters to a court, so that they can actually understand
what's going on and serve as a translation function
between the public interest and what's happening in
the mathematics.
For privacy, the shift is probably going to have to go
from regulating the collection of data-- as we do now,
with these preposterous screens of 60 lines of all-caps
letters where you just click, I agree with the terms of
service, without reading them-- to something where we
actually regulate use.
And luckily, that seems to be an idea that is actually
gathering steam.
It's a lot harder for regulators, but it's
definitely better for businesses.
And it's definitely better for consumers.
And of course, we're going to have to
sanctify human volition.
There is a role for antitrust regulators, as well.
Antitrust turns out to be an extremely fertile and fungible
public policy, because it's technology-neutral.
It doesn't really make many presumptions about what it's
regulating.
What it does is it just looks at market concentration.
So it looks like it's going to be a very useful tool with
which we try to understand what to do and to create an
open market.
Now there's a problem in this that I'm laying out, which is
regulators can understand what scale looks like when it's
something tangible, like a widget.
Actually, antitrust came out of the railroads, so we'll
say a railcar, a carriage.
And then they applied it to telephones, and they called it
common carriage.
There's still a Common Carrier Bureau at the FCC.
The carrier, if you will, was from the Interstate Commerce
Act where the language was taken from.
They then applied it to software markets--
Microsoft.
What does scale mean in data?
What does it mean when the data is doubling every three
years or so?
What does it mean when the market is changing form, that
it's not the same market in five years as it was five
years earlier?
The businesses look different.
It is going to be really difficult to do, but we're
going to have to try to do that.
Because we're going to need the assurances that we can
have challengers as well as incumbents.
We need both.
This is about the way that we live.
We're going to need to act with our
humility and our humanity.
Thank you very much.
There's time for questions, so shout it out.
There's microphones as well.
Yeah, please.
Go to the mike.
Thanks.
AUDIENCE: Hi, my name is Cynthia Elliott.
I have a question.
So I can see this data being very useful, like in let's say
drafting of an athletic player.
Have you ever encountered anything where colleges would
want you to use this data to determine who could have the
athletic ability to be drafted?
KENNETH CUKIER: Yeah, well colleges and professional
sports are using the data already right now quite a lot.
The whole book and movie "Moneyball"
was just about that.
Partially about new statistics and new ways of examining it,
but then partially just applying data to it.
And Nate Silver, of course talks a lot about just
trusting the data and just doing--
Nate Silver is not doing big data.
He's just doing data.
He's doing statistics, but the small difference is he's just
listening to it.
He's just doing it seriously and trusting it.
So this is going to actually change lots of the ways that
we evaluate people.
So when we think about students and education, right now
what a teacher does is score every person in the class and
tell everyone, this person got a 95, and this person got
an 85, and this person got a 75.
The teacher rarely looks at the content of what was
correct and what was wrong.
What if that teacher was to find out that the students in
a math exam-- not all, but let's say 80% of them-- got
the exact same question wrong, with the same answer?
Suddenly, he or she might say, hmm, I mistaught it.
They inverted the algebraic equation, thinking it could
be a then b or b then a.
But in fact the sequence matters, and I've got to go
back and teach that.
So not only does the student learn more, but the teacher
learns as well.
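The common-wrong-answer check described here can be sketched in a few lines. This is a purely illustrative toy, not anything from the talk: the function name, the 80% threshold, and the exam data are all invented for the example.

```python
from collections import Counter

def flag_misconceptions(responses, answer_key, threshold=0.8):
    """Flag questions where a large share of students gave the SAME
    wrong answer -- a signal the concept may have been mistaught."""
    flagged = {}
    for question, correct in answer_key.items():
        wrong = [r[question] for r in responses if r[question] != correct]
        if not wrong:
            continue  # everyone got it right
        answer, count = Counter(wrong).most_common(1)[0]
        if count / len(responses) >= threshold:
            flagged[question] = answer
    return flagged

# Toy exam: on question 2, four of five students invert the
# sequence the same way ("b*a" instead of "a*b").
key = {1: "x=3", 2: "a*b"}
students = [
    {1: "x=3", 2: "b*a"},
    {1: "x=3", 2: "b*a"},
    {1: "x=2", 2: "b*a"},
    {1: "x=3", 2: "b*a"},
    {1: "x=3", 2: "a*b"},
]
print(flag_misconceptions(students, key))  # {2: 'b*a'}
```

Question 1 is not flagged: only one student missed it, and not in a shared way. Question 2 crosses the threshold, pointing the teacher back at that lesson.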
So in terms of drafting, sports is one of the first
fields to adopt these techniques.
And it's actually changing how they think about their game.
Certain players--
why would you have one defense for the whole opposing team?
You want a defense tailored to who the player is, because
of his propensity to score-- if it's basketball--
from one part of the court versus another.
If that player on the left side always misses the basket
there, let him take the shot.
I'll take it on the rebound and then pass it up.
Versus, oh, don't let this guy get into the key--
then we're really in trouble.
AUDIENCE: With regulation, I think it's a
very different situation.
Because in the previous antitrust things, it's been
actually pro-consumer to limit the amount
of data people have.
There's a reason people go to Amazon, because they've got
more data and better data than anybody else.
So in fact, if you use antitrust against that, you
may make life worse for the
consumers rather than better.
KENNETH CUKIER: That is absolutely true.
Now the question where this is going to take us-- and we need
to have a societal conversation about it-- is
whose data is it?
Who should have the rights to the data?
Does the individual own the data because it's his or her
click stream?
That sounds logical.
But of course, they decided to go to that website, and that
website invested in collecting the data and analyzing it.
Should they be required to hand it over, particularly if
they're going to optimize--
let's take Amazon--
their own algorithms so that they can make great
recommendations?
Why would they want to give that to the customer
so the customer can hand it off to Barnes & Noble?
You're enriching your competitor.
That sounds almost like eminent domain.
That sounds like a governmental taking.
We don't know how to answer these questions.
But the point here is that what rights does the
individual have to his or her data?
Should it be transparent?
Should there be data portability?
For telephone numbers, we had to create number portability.
And that seems to be a very good way to get carriers to
actually love us rather than to lock us in.
Do we need the same thing in the world of big data?
I think that your public-policy people should be
thinking about it.
They probably already are.
And we need to--
and not coming up with answers, but starting the
discussion and having the debate.
Please.
AUDIENCE: One of the three things which you mentioned are
important is mindset.
So do you have some framework or principles that one can
follow to improve on that?
Because that's one thing which is not as commoditized as
[INAUDIBLE].
KENNETH CUKIER: Yeah, no.
I don't have any--
there's no simple list or there's no recommendations I
would put forward.
Because this is sort of about the spark of creativity.
It's a little bit da Vinci-like.
So I think what is required is just for very creative
individuals to look at what's going on around them and
breathe the hurly-burly of humanity and see
the filament of it.
The whole point of Google was PageRank.
But that was just the algorithm.
The genius was to understand that every single interaction
with the content gave you another signal to improve the
search result.
And that was, if you will, all you need is one
good idea in life.
And if you really go full-throttle with it-- and
it's a good idea--
it's limitless.
So like Oren Etzioni just had this idea that I can take
something that nobody knows the answer to.
But the answer exists.
It's hidden in plain sight.
I can get the data.
And if I do the right thing with the data, I can actually
transform it and get the insight that we need and
create new forms of value.
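Farecast's core move was turning historical fare data into a buy-or-wait signal. As a purely illustrative sketch-- Farecast's actual models were far more sophisticated, and this heuristic, the function name, and the price data are all invented for the example:

```python
def fare_advice(prices, window=7):
    """Toy buy/wait signal in the spirit of Farecast: compare the
    latest observed fare to the recent average. If fares are below
    trend, buy now; otherwise wait for a dip."""
    recent = prices[-window:]
    average = sum(recent) / len(recent)
    return "buy" if prices[-1] < average else "wait"

# Daily fares for one route; the price has been drifting down,
# so the latest fare sits below the recent average.
history = [320, 315, 330, 340, 310, 305, 300, 290]
print(fare_advice(history))  # "buy"
```

The value in the real product came less from the rule than from the insight Cukier describes: the answer was hidden in plain sight in data that nobody had thought to collect and analyze this way.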
There is no simple way to develop that sort of mindset.
A lot of it is luck.
There are lots of people that I know of prior to Oren
Etzioni who had that idea.
There was a company called Strong Numbers, founded in
Boston around 1998 by Jeff Hyatt.
He wanted to be the Blue Book for everything.
He thought about all the things that you
could do with data.
He was a little bit too early.
The dot-com bubble burst, and so did his dreams.
He went on to build other companies and do
very well for himself.
But it just shows that there's a lot of factors involved.
AUDIENCE: Hi.
So my first question was actually already answered by
you in terms of who owns the data.
But more specifically with the American Express example.
So do you have to purchase that data from--
not American Express.
American Airlines.
Did he have to purchase the data, or was he able to gain
access to the data?
KENNETH CUKIER: Great question.
So the point about American Airlines was that was the
airline carrier who built the original airline computerized
reservation network called Sabre.
So when Oren Etzioni wanted to get the data, he wasn't going
to go to Sabre.
And the reason is that Sabre is probably the biggest airline
reservation network, and they have no incentive to sell it.
Because that's just not what they do.
So he had to go to one of the start-ups, one of sort of the
hungrier people that were the challengers in
there to get the data.
And he found one called ITA Software.
OK, you see where this story's going to end.
This is great.
So he goes to ITA Software and says, will you do it?
And man, they've got a problem on their hands.
Because on one hand, they need the data from the airlines,
and this is going to really screw over the airlines.
On the other hand, Oren's going to pay them a little bit
of money and give them a commission.
So they don't know what to do.
So actually, I interviewed them--
and they're just a bunch of--
are these airline executives?
No.
These are a bunch of MIT PhDs in stats who did it because
they thought it was a really complex problem
and a lot of fun.
So I interviewed one of the co-founders.
And I said, well, what did you do?
And he said, well, the truth is we actually kind of came up
with that idea ourselves independently
before Oren did it.
And we did it internally just for fun, but we could never
release it as a product.
Because we just knew it was going to really harm us.
We'd never be able to get the data from people.
But this was a way that we could license the data at an
arm's-length way and still get a couple guineas for it.
And he had the data.
So it shows you that there's these competitive interests.
So what happened to ITA Software?
They were acquired.
For how much?
Between $700 million and $800 million.
By whom?
By Google.
So why?
Well, you guys can answer that.
I know that when I do my searches, I see the airline
listings in the results, and that's a great feature.
But I'm sure you're playing a stronger hand and a lot more
long-term one.
The regulators walked in.
It was one of the first really substantial
antitrust remedies against Google.
What it was was essentially a must-license provision with
a reporting requirement going back to the companies, and
essentially to the FTC.
For a period of a couple years-- maybe two or three,
maybe five years--
they couldn't actually cut off the license that they had with
people like Kayak, et cetera, because they were afraid you
were going to become the world's biggest travel agent
and dominate everyone.
But the point here is this.
Think about the sums.
Oren Etzioni's Farecast, $100 million.
ITA Software, $700 million.
The difference here is that the algorithm and the skills
are really good, and the service is really good.
But he who has the data--
that's the gold.
AUDIENCE: Do you see these societal shifts, particularly
in America?
I think big data and individualism is a really
interesting area to think about and whether this changes
the way that we think about an individual's role in society.
The examples you were giving of a criminal who is 99%
likely to recommit versus 1% likely to rebuild his life and
be a productive member of society--
hedging for the benefit of society would be to put all 100
back into jail and leave them there.
How does this impact how we think about individual benefit
versus societal benefit?
And does this also play into future governmental shifts?
KENNETH CUKIER: OK.
So in the case of a criminal-justice system, let's
have that debate.
Because I think it's not an easy one, and anyone who
thinks it is doesn't understand the problem.
If I can tell you with 99% accuracy that that
man is going to commit a violent crime, I would be
remiss not to intervene.
And it would look like I'm almost anti-science if I said,
you know, I just think that 1%--
we got to give them the benefit of the doubt.
You just never know.
It just doesn't look right.
So on the other hand, this is one of the most heinous
affronts to the dignity of the individual that
we could ever conceive.
And we don't have any experience thinking through
this issue.
It is so essential that we figure this out, that we have
the debate, that the debate starts now.
Now what about data for individuals?
What does that mean?
Well, in Athens if you were a male, you
served in the military.
If you didn't want to serve in the military, leave.
Right now, we think that one of the most precious things we
have is data about our bodies, our health care, our privacy.
Let's change the debate.
Let's change the argument entirely.
Let's invert the burden of proof.
Let's just say that if you're a citizen of a country, you
have to share your data on your health care in a global
commons so that researchers can learn from it and treat
everyone's health better.
You don't have to do that.
Leave.
It might sound draconian, but the fact is do you have a
property right, or some sort of moral right to your data?
Well, I don't have one to my image if I'm walking down the street.
And we do know that I can learn a lot from the data.
And we also know from statistics that if we allow some
people to opt out for whatever reason, the data really
becomes very imperfect.
So suddenly, I think that we should change the debate.
And I think the most obvious one would be health care.
But often, when you look at these issues-- as you've
pointed out, too, with big data--
these are new issues for us.
We've had a bit of sloppy thinking about it lately,
because we haven't had to deal with it.
And because the whole issue of big data has been absconded
with by the technology vendors as the latest flavor of
chocolate ice cream.
But now, let's calm down, and let's think about it.
The debate should start now.
AUDIENCE: My question is kind of a follow-up.
So let's assume that all data is public.
You don't have any data barons.
How does that influence the game?
And specifically, humans change their behavior.
What happens when data that's correct as of yesterday
lets me infer--
I'm sorry--
lets all of us in [INAUDIBLE] infer and predict what all
of us are going to do tomorrow, if we are
reacting to this data?
KENNETH CUKIER: Yeah.
So yeah, there's a great circularity: the data is
going to make predictions, we're going to learn about
those predictions, and we're going to change our behavior
based on those predictions, ever thus.
Right?
This is just going to be the reality that we're
going to live with.
So in a way, the data will always be fallible.
You'll never have the perfect prediction,
because you can always--
this individual will learn that the algorithm nailed me
for shoplifting before I've even gone into the store.
Now I'm not going to go into the store, and so the
prediction was wrong.
Now if we arrest him, the burden of proof is gone.
Because we can never actually validate the fact the
prediction was going to be accurate, because we never
allowed him to commit the crime.
This recursivity--
this weird pernicious circle in which we're constantly
reacting to the algorithm, and thereby changing the
prediction that we're making, is going to be
a feature of life.
And you can imagine that this is another conversation we
need to have, another thing we have to think through.
Yeah.
Good.
Thank you very much.
It's been a delight.