Viktor Mayer-Schonberger and Kenneth Cukier, "BIG DATA: A Revolution That Will Transform..."

  • FEMALE SPEAKER: Please join me in

  • welcoming Mr. Kenneth Cukier.

  • [APPLAUSE]

  • KENNETH CUKIER: Thank you very much.

  • You can probably appreciate the fact that I've got a lot

  • of trepidation coming here to talk to you folks for the

  • obvious reason that I'm wearing a suit.

  • And the truth is I had a breakfast this morning at the

  • Council on Foreign Relations to talk to them about the

  • international implications and the foreign-policy

  • implications of big data.

  • That leads to the second trepidation and the context of

  • my remarks.

  • So the second trepidation is that this is a sort of

  • homecoming for the book.

  • Because my journey, so to speak, in the world of big

  • data started at Google and started at the

  • Googleplex in 2009.

  • It was you folks who opened up the kimono to what you were

  • doing in very small little slivers.

  • I never got the full picture.

  • But I was able to cobble it all together and see something

  • and then give it a label.

  • Luckily, there were a couple of labels that we

  • were thinking of.

  • And I reached for one that wasn't a popular term at the

  • time, and the term was big data.

  • And that was really helpful.

  • It was the cover story of "The Economist"

  • in February of 2010.

  • It was called "The Data Deluge", because they thought

  • they would sell it better than saying "big

  • data." But big data--

  • it was basically all about that and about what

  • you guys are doing.

  • And so it brings me great fear to walk into a room, because

  • you guys have been doing it for so long.

  • And that brings me into the context of my

  • conversation today.

  • I want it to be a conversation.

  • I was obviously just at the Council on Foreign Relations

  • thinking about this in ways that I am sure your engineers

  • never thought about it 10 years ago.

  • I may have heard a snort.

  • But here's the thing.

  • Many of you were thinking of it as a technological issue

  • when people around the world think of it in terms of the

  • competitiveness of nations.

  • Our book, which is being released today in America, has

  • already been available in China, where it's been a

  • best-seller.

  • And when we hear questions from Chinese journalists to

  • us, they're all talking about the national project that

  • they're on.

  • Is this the way for us to leapfrog the West?

  • Is this one area of technology, unlike the

  • internet and computing, where we can lead?

  • So the implications of this are vast.

  • And the implications are more than just technological.

  • I'm at a technology company-- in fact, the pioneer, in many

  • respects, of big data.

  • But I want to explain that I'm here as a journalist, as

  • someone who's looked in at your world and now can serve

  • as a sort of a filter.

  • And what I'd like to do is show you that world from a

  • non-engineer's perspective, from someone who just is

  • curious about the world and society and thinks deeply

  • about these issues.

  • Now there's a second disclosure I have to make, and

  • that is not only am I talking about big data, but my

  • presentation is big data.

  • Because there's 70 slides.

  • On top of it, I haven't actually really seen the

  • slides except for once or twice, because they just

  • arrived in my inbox this morning from someone who was

  • putting it together for me.

  • This is actually the recipe for disaster, so please have

  • forbearance.

  • I'm going to go really quickly, and I'm probably

  • going to skip through a couple of these slides.

  • So let me start with a story, and the story is the story of

  • a company called Farecast.

  • And it begins in the year 2003.

  • A guy named Oren Etzioni at the University of Washington

  • is on an airplane.

  • And he asks people how much they paid for the seats.

  • And it turns out, of course, that one person paid one

  • fare, and another person paid a different fare.

  • But this made Oren Etzioni really, really upset.

  • And the reason why is that he took the time to book his air

  • ticket long in advance, figuring he was going to pay

  • the least amount of money.

  • Because that's the way the system worked.

  • And then he realized actually that that wasn't the case.

  • When he figured that out, he was really upset.

  • And he figured, if only I knew the meaning

  • behind this airfare madness.

  • How would I know if a price I'm being presented with at an

  • online travel site is a good one or a bad one?

  • And then he came up with the insight.

  • Because he's like you-- he's a computer scientist--

  • he realized actually--

  • that's actually just an information problem.

  • And I bet I can get the information.

  • All I would need is one simple thing--

  • the flight price record of every single flight in

  • commercial aviation in the United States for every single

  • route, every flight, and to identify every seat, and to

  • identify how long in advance the ticket was bought for the

  • departure, and what price was paid, and just run it through

  • a couple computers, and then make a prediction on whether

  • the price is likely to rise or fall, and score my degree of

  • confidence in the prediction.

  • Pretty simple.
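
A minimal sketch, in Python, of the kind of prediction described here: train a classifier on historical fare records and report a confidence score. The synthetic data, the feature names, and the scikit-learn gradient-boosted model are illustrative assumptions, not Farecast's disclosed method.

```python
# Toy version of the Farecast idea, NOT the company's actual model.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 10_000

# Stand-in for historical flight-price records:
# [days until departure, quoted price, route id]
X = np.column_stack([
    rng.integers(1, 120, n),
    rng.uniform(80, 900, n),
    rng.integers(0, 50, n),
])
# Label: 1 if the fare subsequently rose, 0 if it fell. Synthetic rule
# plus noise: last-minute fares tend to rise.
y = ((X[:, 0] < 14) ^ (rng.random(n) < 0.1)).astype(int)

model = GradientBoostingClassifier().fit(X, y)

# For a new quote, predict rise vs. fall and score the confidence --
# the "buy now or wait" signal Farecast surfaced to travelers.
quote = np.array([[21, 340.0, 7]])
p_rise = model.predict_proba(quote)[0, 1]
print(f"P(price rises) = {p_rise:.2f} -> {'buy now' if p_rise > 0.5 else 'wait'}")
```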

  • So he scraped some data.

  • And it works pretty well.

  • And he runs a system.

  • It's great.

  • The academic paper that he writes is called "Hamlet--

  • To Buy or Not To Buy, That Is the Question." It works well,

  • but then he realizes, hey, this works so well, I'm going

  • to get more data.

  • And he gets more data, until he has 20 billion flight-price

  • records that he's crunching to make his prediction.

  • And now it works really well.

  • Now it's saving customers a lot of money.

  • It gets a little bit of traction, and Microsoft comes

  • knocking on the door.

  • He's in Washington.

  • He sells it for about $100 million--

  • not bad for a couple of years' work by him and the

  • couple of computer-science PhDs who were working with him.

  • But behind this, the key thing is this.

  • He took data that was generated for one purpose and

  • reused it for another.

  • When American Airlines and IBM created the Sabre

  • database in the '50s and '60s-- at the time the biggest

  • airline reservation system, and the biggest civilian

  • computer project of its day--

  • they never imagined for a million years that the data of

  • the passenger manifest was going to become the raw

  • material for a new business, and a new source of value, and

  • a new form of economic activity.

  • And we're going to be creating markets with this data.

  • And if you want to understand what big data is, at least

  • from a person looking into it--

  • because Google's been doing big data for a long time.

  • What we're seeing across society is what you folks have

  • been doing for years.

  • We're seeing that data is becoming a new

  • raw material of business.

  • It is the oil, if you will, of the information economy.

  • There's a lot of data around in the world today.

  • You know this.

  • The arresting statistics are obvious.

  • Whenever a big new sky survey-- a telescope, for you and

  • me-- goes online, it usually ends up collecting as

  • much data in its first night or two as was collected in the

  • history of astronomy prior to it going online.

  • And obviously, the human genome, et cetera.

  • You all know the data about big data, so I won't spend too

  • much time there.

  • But what we see behind big data are three features of

  • society, or shifts in the way that we think about

  • information in the world--

  • more, messy, and correlations.

  • So more.

  • We're going from an environment where we've always

  • been information-starved--

  • we've never had enough information--

  • to one where we-- that's no longer the operative

  • constraint.

  • It's still a constraint.

  • Of course, we never have all the information.

  • What is information?

  • Is it really the real thing?

  • But what's clear is that instead of having to optimize

  • our tools to presume that we can only have a small sliver

  • of information, when that changes, we

  • can get a lot more.

  • And so what does more mean?

  • Well, think of it as 23andMe.

  • What they do is they actually take a sample of your DNA, and

  • they look for very specific traits.

  • Now that works well, but it's imperfect as well.

  • That's one reason why it's only $100--

  • a couple hundred dollars.

  • When Steve Jobs had cancer, he was one of the first

  • individuals in the world to have his entire genome

  • sequenced and his tumor sequenced as well.

  • So he had personalized medicine, and it was

  • individually tailored to the state of his

  • health at that time.

  • When one drug would work, they'd continue.

  • When the cells mutated and blocked the drug from working,

  • they routed around it and tried something else.

  • They were able to do that because they had all of the

  • data, not just some of the data.

  • And that's one of the shifts that we're seeing

  • from some to more.

  • And in some cases, n equals all the data.

  • We also have messy data.

  • That's another feature as well.

  • In the past, we had highly curated databases--

  • information that we optimized our tools to get in the most

  • pristine way as possible.

  • And this was sensible.

  • When there's only a small amount of information that you

  • can bother collecting and processing, because the cost

  • is so high and it's so cumbersome, you have to make

  • sure the information you get is the best

  • possible thing you can.

  • But when you can avail yourself of orders and orders

  • of magnitude more information, that constraint goes away.

  • And suddenly, you can allow for a little bit of messiness.

  • Now, it can't be completely wrong.

  • But messiness is good.

  • You folks are pioneers of this in machine translation.

  • And you know the famous Peter Norvig, Alon Halevy, and

  • others' paper on the unreasonable

  • effectiveness of data.

  • The idea here is that machine translation worked actually--

  • was a real step up.

  • When IBM tried it back around 1954 with 20 Russian phrases

  • and English phrases that they programmed the computer to

  • translate, it looked impressive.

  • It was ridiculous, of course.

  • We now know.

  • It's like a punch card.

  • Then when IBM's project Candide came around in the

  • '90s, actually that was not machine translation.

  • That was statistical machine translation.

  • That was really good, relatively speaking.

  • What they did is they took the Canadian Hansard--

  • the parliamentary transcripts that were translated into both

  • English and into French--

  • and they just let the computer infer when a

  • word in French would be a useful substitute for the

  • one in English.

  • They didn't try to presume what was

  • right or what was wrong.

  • They let the computer infer that itself and score the

  • probability that one would be the right word or not in that

  • particular context, and go forward.
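
A toy sketch of that inference step, assuming an invented four-pair "Hansard" and simple co-occurrence scoring; Candide's actual alignment models were far richer.

```python
# Infer which French word substitutes for an English one purely from
# co-occurrence in aligned sentence pairs. Corpus invented; real
# systems used millions of pairs.
from collections import Counter, defaultdict

parallel = [
    ("the house is big",   "la maison est grande"),
    ("the house is small", "la maison est petite"),
    ("the car is big",     "la voiture est grande"),
    ("the car is small",   "la voiture est petite"),
]

cooc = defaultdict(Counter)   # cooc[english][french] = co-occurrence count
fr_count = Counter()          # how often each French word appears at all
for en, fr in parallel:
    for f in fr.split():
        fr_count[f] += 1
    for e in en.split():
        for f in fr.split():
            cooc[e][f] += 1

def score(e, f):
    """Fraction of f's appearances that co-occur with e -- no grammar,
    no dictionary, just letting the data speak."""
    return cooc[e][f] / fr_count[f]

best = max(cooc["house"], key=lambda f: score("house", f))
print(best)  # -> "maison", inferred without being told any French
```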

  • C-Change--

  • they tried to optimize it and make it better.

  • Couldn't.

  • Couldn't, at least not in a reasonable way.

  • It just was--

  • it was a hard problem.

  • Then Google came along.

  • And you guys didn't avail yourself of just the

  • parliamentary transcripts in French and English in Canada.

  • You availed yourself of the World Wide Web.

  • It wasn't 1994.

  • It was 2006.

  • You poured in.

  • You got all of the European Union

  • translations of all 21 languages.

  • Your Google Books project became a signal for what was

  • good and what was not, because of the translations you could

  • find in the libraries.

  • Now in many instances, the data was far less clean than

  • in the past when we tried to do it with just a

  • small amount of data.

  • But the fact is more data beat clean data.

  • Messiness was good.

  • And the final point, which is obvious, is correlations.

  • We have had a society in which we've always looked for causes

  • behind things, and that made sense in a

  • small-data world as well.

  • In fact, causality is still very useful to know.

  • But for a lot of the problems that we're dealing with these

  • days, just knowing the correlation is good enough.

  • And in fact, what we're finding is that often we think

  • we see causality when we don't.

  • And it's hard to do.

  • So there's going to be cases where we actually still want

  • to know the reasons why.

  • But often, just knowing what, not why, is good enough, because we can

  • learn the correlation and go with that.

  • So a company similar to Farecast is Decide.com.

  • This is Oren Etzioni's company again that basically looks

  • across the web at all of the prices online, not just of

  • airlines, but of anything that has lots of price data and

  • high variability and just ranks it to say, is this a

  • good price or not?

  • And it leads to new markets.

  • It leads to transparency, which is good for customers.

  • More interesting is what this means for human health.

  • Premature babies, known as preemies.

  • In the past, when we thought about health care, we would

  • take the vital signs of someone maybe once or twice a

  • day, couple more times if it was important enough.

  • And a doctor would look at the clipboard at the edge of the

  • bed and make a decision on what to do.

  • Feedback loop was really, really long.

  • Very, very imperfect.

  • What we're now able to do-- some researchers in Canada are

  • doing this-- is look at the real-time flow of 16

  • different streams of vital signs from premature babies.

  • And they're able to score it and look for

  • correlations with it.

  • And when they do that, they find that they can spot the

  • onset of infection 24 hours before overt symptoms appear.

  • By doing that, you can have an intervention

  • sooner, see if the intervention is working,

  • react to it, and save lives.

  • But you learn something else as well.

  • You would have thought-- and you can imagine generations of

  • doctors looking at the clipboard, seeing the kid's

  • vitals stabilizing, and thinking it was safe to go

  • home to supper, that things were OK, and we'll treat the

  • patient tomorrow.

  • Just nurse, call me if there's a problem.

  • And then to get a frantic call at midnight saying something

  • had gone horribly wrong.

  • The fact is, what we're finding is that one of the

  • best predictors that there is going to be an onset of an

  • infection is that the baby's vital signs stabilize.

  • Weird, right?

  • Why?

  • We don't actually really know why, what's happening

  • biologically.

  • It kind of seems like the kid's little organs are just

  • battening down the hatches for a rough night ahead.

  • We don't know why, but we know that with that correlation, we

  • can do something better.

  • We can save its life.

  • And we didn't know that before big data.
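
For the engineers, the counterintuitive signal lends itself to a simple sketch: watch a vital-sign stream and alarm when its variability collapses. The window size, the threshold, and the simulated stream below are invented; the actual Canadian system fuses 16 real-time streams with much richer models.

```python
# Flag when a vital-sign stream's variability collapses -- the
# "ominous stabilizing" signal described above.
import random
import statistics
from collections import deque

def stabilization_alarm(samples, window=60, threshold=0.5):
    """Yield True when the rolling standard deviation drops below
    `threshold` -- i.e., the signs look suspiciously stable."""
    buf = deque(maxlen=window)
    for x in samples:
        buf.append(x)
        yield len(buf) == window and statistics.stdev(buf) < threshold

random.seed(1)
stream = [140 + random.gauss(0, 3) for _ in range(300)]     # healthy jitter
stream += [140 + random.gauss(0, 0.2) for _ in range(120)]  # ominous calm

for i, alarm in enumerate(stabilization_alarm(stream)):
    if alarm:
        print(f"possible infection onset flagged at sample {i}")
        break
```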

  • Behind this is we have data.

  • Why do we have data?

  • Well, we're collecting more data for things that we've

  • always collected data on.

  • Weather--

  • great.

  • That's fantastic.

  • But it's also because we're collecting things that were

  • always informational but that we never treated as data

  • before, like you.

  • So you're all sitting.

  • You're sitting down.

  • And you are sitting differently than you, and you,

  • and you, and you.

  • You weigh different amounts.

  • Your legs are different.

  • Your posture is different.

  • The distribution of weight.

  • And you know that if I have 20 sensors on your seat and on

  • your seat back that I can probably score with a high

  • degree of accuracy who you are based on the way you sit.
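
A sketch of what that scoring might look like: enroll each driver's pressure signature, then classify new readings. The 20-sensor figure comes from the talk; the nearest-neighbor classifier and the synthetic readings are illustrative assumptions.

```python
# Identify who is sitting from seat-pressure readings.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
N_SENSORS = 20  # sensors in the seat and seat back, per the talk

# Each known driver has a characteristic pressure signature (synthetic).
signatures = {name: rng.uniform(0, 100, N_SENSORS) for name in ("alice", "bob")}

X, y = [], []
for name, sig in signatures.items():
    for _ in range(50):  # fifty noisy sit-downs per driver
        X.append(sig + rng.normal(0, 5, N_SENSORS))
        y.append(name)

clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Someone sits down: read the sensors, decide whether this is the
# authorized driver before letting the engine start.
reading = signatures["alice"] + rng.normal(0, 5, N_SENSORS)
print("identified driver:", clf.predict([reading])[0])
```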

  • Why is that useful?

  • Well, for one purpose, you can imagine that this would be a

  • great anti-theft tool in cars.

  • Put this in, and suddenly you would know that the authorized

  • driver of your Lamborghini is you, and

  • it's not someone else.

  • Or if you have children, likewise, hey, I told you you

  • can't take the car out after 10:00 PM.

  • And so the engine didn't work when you tried to sit down and

  • turn the key in the ignition.

  • That's great.

  • But what else can you do with it?

  • Well, think about it.

  • If everybody has their car seat instrumentized and you

  • actually datafy posture, suddenly you would be able to,

  • perhaps, identify the telltale signs of a shift in posture 30

  • seconds prior to an accident.

  • That is, the probability of you getting into an accident,

  • signaled by a shift in your posture.

  • Maybe what we've datafied is driver fatigue.

  • And the service here would be the car would

  • send an internal alarm.

  • Maybe the steering wheel would vibrate, or there'd be a chime

  • saying, hey, wake up.

  • You have a high likelihood of getting into an

  • accident right now.

  • That's the sort of thing that is still to play for as we

  • datafy society in a world of big data.

  • So what we're seeing is lots of things

  • being datafied as well.

  • Facebook datafies our friendships; Twitter, our

  • stray thoughts, our whispers; LinkedIn, our

  • professional contacts.

  • Google datafies our intentions.

  • So obviously, Google Flu Trends is a wonderful way to

  • have a predictor of what the likelihood of

  • outbreaks of flu is.

  • Now that's great.

  • It's just you don't want to have to know causality.

  • You don't know why.

  • It just is what it is.

  • Now you may recall that there was a little bit of a grumble

  • in the scientific community recently, when the

  • CDC-- the Centers for Disease Control-- said that flu

  • this year was going to be right here, and Google

  • Flu Trends was up here, like this-- so Google Flu Trends

  • didn't work this year, they said.

  • Bullshit.

  • How do we know?

  • Because the CDC's numbers are reported data.

  • The person has to come in to be counted.

  • Maybe because of the economic crisis, people decided they

  • had to show up at work and didn't go see a doctor.

  • Maybe the Google Flu Trends is accurate, and

  • that's what's real.

  • And the CDC's reported data, not being observed

  • behavior, isn't as good.

  • No one thought of that.

  • Big data has been with us for a while.

  • It turns out that there was an American commodore who had

  • datafied all of the old log books inside of

  • the dusty Navy trunks.

  • And with that, he was able to create a whole new form of

  • nautical map that told sailors not just where they were but

  • the patterns of the winds.

  • No one realized that the world, and the winds, and the

  • waves conformed to natural patterns.

  • If you will, that the sea had its own physical geography,

  • and that if you aligned yourself with those things,

  • that you could have a safer voyage.

  • And we can do that now.

  • But the problem is it took him a decade and dozens of people

  • to do it, and we do the same sort of thing in about one

  • sixth of a second every day.

  • So it's a democratization play of techniques that we have

  • tried to do in the past.

  • We've done it sometimes.

  • Obviously, censuses have been around since Jesus was born.

  • But now we're actually doing it in a widespread way.

  • Predictive maintenance is a good example of taking the

  • same idea about premature babies and

  • applying it to machines.

  • When your car is about to break down, it doesn't go

  • kaput all at once.

  • Usually, you can feel it.

  • There's a grumble, or it just doesn't drive right.

  • Well, now what we can do is instrumentize it, see what the

  • data signature of the heat and the vibration is, find out how

  • it correlates with previous incidents of a breakdown,

  • and know, perhaps, two days in advance that your fan belt is

  • going to break.

  • And that's happening today in fleets of vehicles at UPS, and

  • it's going to be in your car tomorrow if

  • it's not already there.
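
A hedged sketch of that correlation step, assuming invented per-trip features (engine temperature, vibration) and a logistic model; the talk doesn't describe UPS's actual system.

```python
# Learn the heat/vibration signature that precedes breakdowns, then
# alert ahead of the next one. Features and data are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 2_000

# Per-trip sensor summaries: [mean engine temp (C), vibration RMS]
X = np.column_stack([rng.normal(90, 5, n), rng.normal(1.0, 0.2, n)])

# Synthetic ground truth: failure risk rises with heat and vibration
# (a convenient logistic form so the sketch trains cleanly).
z = 0.4 * (X[:, 0] - 95) + 5.0 * (X[:, 1] - 1.2)
y = (rng.random(n) < 1 / (1 + np.exp(-z))).astype(int)

model = LogisticRegression().fit(X, y)

trip = np.array([[101.5, 1.45]])   # today's hot, shaky reading
risk = model.predict_proba(trip)[0, 1]
if risk > 0.3:                     # the alert threshold is a tunable choice
    print(f"schedule service: P(breakdown within 2 days) = {risk:.2f}")
```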

  • The value of data is hidden.

  • It's hidden not in the primary purpose for which it was

  • collected.

  • But now with big-data techniques, it's often

  • uncovered in its multiple secondary uses that are just

  • limited by our imagination.

  • So INRIX is a traffic-data company that takes sat-nav data and

  • makes a prediction on how long it's going to take from one

  • place to another.

  • Sounds great.

  • It's a good service.

  • Use it.

  • It's also used by economists to understand

  • the health of economies.

  • Because they see how cars drive, and the frequency and

  • propensity of cars, and the travel times as a proxy, an

  • indicator, for the health of a local municipal economy.

  • Hedge funds use this information to look at the car

  • circulation in the areas near a retailer on the weekends.

  • And so prior to the quarterly announcements, they have a

  • good indicator whether the sales are going to increase or

  • decrease, and they can short or go long on its shares.

  • No one would have thought of that in the past, that we

  • could do that sort of thing with information.

  • Obviously, everything that we do-- all of our interactions--

  • give off lots of data exhaust.

  • You folks are experienced with that, because you treat all of

  • the interactions of an individual who goes to your

  • website as a signal for something else.

  • You've built your systems and optimized them based on that

  • form of data exhaust, by treating information as a new

  • raw material that you can recycle back into the system

  • to improve it or to create a whole new system altogether.

  • There are going to be winners and losers in this new world.

  • There are three features that seem to distinguish

  • who's going to do well.

  • And that's the skills, the mindset, and the data.

  • The skills are kind of obvious.

  • It's the people who have technical knowledge, or it's

  • the vendors who sell you stuff.

  • That's great.

  • The mindset, in some ways, seems to be more important.

  • Because what you need is not just the skills.

  • That gets commoditized first, obviously.

  • The history of computing suggests that.

  • The first computer scientists in the '50s and '60s--

  • actually, not the scientists.

  • But these were the doers, the software programmers.

  • They looked like they were sitting pretty wearing the

  • white lab coats.

  • But by the '70s and '80s, man, just an ordinary rinky-dink

  • software developer had been largely commoditized.

  • And we're going to see that with big data as well.

  • Today, some companies and people are at the high end.

  • It's going to filter through as the PhD

  • programs move forward.

  • The mindset is going to be critical and the creativity.

  • But ultimately, both of those things are going

  • to go to the wayside.

  • Because if you remember, Jeff Bezos had a great dot-com

  • mindset really early in '94 or so, and he executed

  • brilliantly as well.

  • But by the 2000s, every executive was

  • thinking about the web.

  • So we're going to have the same thing with big data.

  • That advantage doesn't hold that long.

  • The data, however-- who has access to the data is going to

  • be critical.

  • That's the resource.

  • So weirdly, and ironically, what seems to be abundant

  • today is actually the source of scarcity, and vice versa.

  • Now in New York, we have a problem

  • with overcrowded buildings.

  • But before I tell you that story, let me see how much time we

  • have, because I want to get questions as well.

  • I think we're doing OK.

  • So tenements, overcrowded buildings, and the problem of

  • just stuffing 10 times as many occupants into a single

  • dwelling as it was designed for.

  • This is a bad thing, and it's a bad thing

  • because it leads to crime.

  • It leads to drugs.

  • It leads to violence.

  • And it leads to fires, and not just any kind of fires.

  • Basically, these kind of fires kill the occupants.

  • And they also end up injuring and killing at much greater

  • rates the firefighters who go to help it out.

  • So this is a serious problem for the city.

  • The city gets something like 63,000 calls a year for

  • complaints of illegally

  • overcrowded buildings.

  • And there are only 200 inspectors at City Hall.

  • So there you see a problem.

  • But the problem is actually a big-data problem.

  • How can we solve it?

  • Well, the first thing that they do is they take a list, a

  • database of every single building across the

  • five boroughs, and that's 900,000, give or take.

  • And then they look at everything as a signal,

  • whether it's going to be a predictor that the thing is

  • going to burst into flames, or that it's going to actually

  • improve the model by predicting that it's not going

  • to burst into flames.

  • So they look at things like ambulance calls, utility cuts.

  • Has there been a lien against the property?

  • Are there complaints of rodent infestation?

  • So the number of rats in the building is not datafied, but

  • complaints to the city's 311 line are.

  • So you find out the number of rodent complaints.

  • And all together, you look at it.

  • And you can score, with a high degree of confidence, whether

  • the building's going to burst into flames.

  • They looked at weird stuff too, like data from the Department of

  • Building Works on whether exterior brickwork had been

  • recently done to the building.

  • That improved the model too.

  • Because if brickwork was done to the building, even if you

  • had all these other problems that were high indicators of a

  • fire, the risk score went down.
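
A toy version of that scoring in Python, with invented weights and example buildings; the city's actual model was fit against historical fire outcomes.

```python
# Rank buildings by fire risk so a small inspection force visits the
# worst candidates first. Weights and addresses are invented.
from dataclasses import dataclass

@dataclass
class Building:
    address: str
    rodent_complaints_311: int   # datafied complaints, not actual rats
    utility_cut: bool
    tax_lien: bool
    ambulance_calls: int
    recent_brickwork: bool       # a sign the owner maintains the place

def fire_risk(b: Building) -> float:
    score = (0.8 * b.rodent_complaints_311
             + 3.0 * b.utility_cut
             + 2.5 * b.tax_lien
             + 0.5 * b.ambulance_calls)
    if b.recent_brickwork:       # recent exterior work lowers the risk
        score *= 0.5
    return score

buildings = [
    Building("12 Example St", 9, True, True, 4, False),
    Building("88 Sample Ave", 1, False, False, 0, True),
]
for b in sorted(buildings, key=fire_risk, reverse=True):
    print(f"{b.address}: risk score {fire_risk(b):.1f}")
```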

  • When they pressed go on the system, here's what happened.

  • In the past, when an inspector made a visit, about 15% of the

  • time they would issue a vacate order-- the stiffest

  • sanction, which basically says, everyone out in 24 hours.

  • That 15% was already a high rate,

  • so it tells you there's a big problem.

  • Now it's 70% of the time.

  • And so what that means is that they like it, the inspectors,

  • because it's more effective.

  • The fire department likes it, because their firemen aren't

  • dying as much.

  • And it's just good for all of us in our communities to see

  • that people have good housing and that these buildings don't

  • catch on fire.

  • And that was because they turned the problem into a big

  • data problem and solved it successfully with information.

  • And they gave up trying to figure out the causality and

  • just went with the correlation.

  • There, of course, are serious issues of big data.

  • One is going to be privacy.

  • It's a problem now in the small-data era.

  • It's going to be a bigger problem with big data.

  • But there's going to be something on top of that.

  • And that is not privacy, but propensity--

  • but prediction.

  • The idea is that we're going to have algorithms predicting

  • our likelihood to do things, our behavior.

  • And it's going to be obvious that we're going to have

  • governments-- or businesses-- try to sanction us on the

  • basis of that prediction.

  • It's going to look a little bit like "Minority Report" and

  • the idea of pre-crime.

  • We're going to be denied a loan, because we're going to

  • be judged unlikely to repay.

  • But this goes beyond the profiling universe in which we treat

  • people as a big clump--

  • that's a small-data problem:

  • here are the 13 predictors, and these are the explicit rules;

  • we can tell you the formula.

  • Imagine if we have 1,000 variables.

  • It's a machine learning algorithm, and when we try to

  • knock on the door in front of a court and say, I was denied

  • surgery because you said I had a 90% mortality rate after

  • five years with my individual data, I want you to disclose

  • to me how you came to that decision

  • because that seems unfair.

  • They're going to say I don't know.

  • I can show you the formula.

  • I've frozen every instantiation of the data at

  • every moment, because regulators required me to.

  • If I printed out the formula, it would be on 600 pages.

  • You need a PhD to understand it.

  • It's true you have only 40 strong signals, but you've got

  • a long tail of 600 weak signals that all went into it.

  • I can't tell you why.

  • And then the person's going to say, OK, I don't even care

  • about that.

  • What makes you think I'm not going to be part of the 10%

  • that's going to live past five years?

  • Why are you denying me this operation because you think

  • I've got a high probability of not surviving longer after it?

  • I want to take the test.

  • And you can imagine with criminal-justice systems, it's

  • going to be the exact same thing.

  • This is the issue that we have.

  • And so what do we have to preserve in this instance?

  • Well, if it's the case of whether we think this fellow's

  • going to shoplift in the next 12 months with a 99

  • percent probability, he can rightly say, I don't even care

  • about the mumbo-jumbo of big data.

  • I'm part of the 1% that's going to exercise free will,

  • moral choice, and do the right thing.

  • Now of course, all 100 of those individuals will make the

  • same argument.

  • But it does mean that it seems like we're going to have to

  • create a new value in our world.

  • Just as the printing press gave us the consciousness of

  • free speech--

  • prior to the printing press, we didn't have a guarantee or

  • the consciousness of free speech.

  • When Socrates drank the hemlock for corrupting the

  • youth of Athens, in his apologia did he make a

  • free-speech claim?

  • No.

  • It didn't exist.

  • It took the printing press to give us this idea that

  • expression was something that needed to be protected.

  • What will need to be protected in a world of big data?

  • Well, maybe human volition, free will, responsibility.

  • We have always had the risks of data.

  • We're going to have to deal with this even more as we

  • become more respectful of data and live with it in more parts

  • of our lives.

  • We've looked at data in lots of ways.

  • America went to war over a statistic, a data point, and

  • we saw the problems there.

  • We're going to need regulators to think about how we can

  • adopt this and get the most benefits of it.

  • Probably one of the biggest is going to be giving us a degree

  • of transparency.

  • When there was an information explosion at the beginning of

  • the 19th century and it was financial information, we

  • created accountants to do the bookkeeping and auditors to do

  • the surveillance function on top of the accountants.

  • And I think that in the future-- and we mention this

  • in the book--

  • that we're going to have to create a new

  • professional class.

  • And we might as well call them algorithmists, who are going

  • to be trained in big-data techniques and actually serve

  • internally at companies, as well as externally-- as an

  • expert witness or a master to a court-- so that they can

  • actually understand what's going on and serve as a

  • translation function between the public interest and what's

  • happening in the mathematics.

  • For privacy, the shift is probably going to have to go

  • from regulating the collection of data, as we do now in these

  • preposterous screens of 60 lines of all-capped letters in

  • which you just say, I agree with the terms of service,

  • without reading it, to something where we

  • actually regulate use.

  • And luckily, that seems to be an idea that is actually

  • gathering steam.

  • It's a lot harder for regulators, but it's

  • definitely better for businesses.

  • And it's definitely better for consumers.

  • And of course, we're going to have to

  • sanctify human volition.

  • There is a role for antitrust regulators, as well.

  • Antitrust turns out to be an extremely fertile and fungible

  • public policy, because it's technology-neutral.

  • It doesn't really make many presumptions about what it's

  • regulating.

  • What it does is it just looks at market concentration.

  • So it looks like it's going to be a very useful tool with

  • which we try to understand what to do and to create an

  • open market.

  • Now there's a problem in this that I'm laying out, which is

  • regulators can understand what scale looks like when it's

  • something tangible, like a widget.

  • Actually, antitrust came out of the railroads, so we'll

  • say a car, a carriage.

  • And then they applied it to telephones, and they called it

  • common carriage.

  • There's still a Common Carrier Bureau at the FCC.

  • The "carrier," if you will, came from the Interstate Commerce

  • Act, where the language was taken from.

  • They then applied it to software markets--

  • Microsoft.

  • What does scale mean in data?

  • What does it mean when the data is doubling every three

  • years or so?

  • What does it mean when the market is changing form, that

  • it's not the same market in five years as it was five

  • years earlier?

  • The businesses look different.

  • It is going to be really difficult to do, but we're

  • going to have to try to do that.

  • Because we're going to need the assurances that we can

  • have challengers as well as incumbents.

  • We need both.

  • This is about the way that we live.

  • We're going to need to act with our

  • humility and our humanity.

  • Thank you very much.

  • There's time for questions, so shout it out.

  • There's microphones as well.

  • Yeah, please.

  • Go to the mike.

  • Thanks.

  • AUDIENCE: Hi, my name is Cynthia Elliott.

  • I have a question.

  • So I can see this data being very useful, like in let's say

  • drafting of an athletic player.

  • Have you ever encountered anything where colleges would

  • want you to use this data to determine who could have the

  • athletic ability to be drafted?

  • KENNETH CUKIER: Yeah, well colleges and professional

  • sports are using the data already right now quite a lot.

  • The whole book and movie "Moneyball"

  • was just about that.

  • Partially about new statistics and new ways of examining it,

  • but then partially just applying data to it.

  • And Nate Silver, of course talks a lot about just

  • trusting the data and just doing--

  • Nate Silver is not doing big data.

  • He's just doing data.

  • He's doing statistics, but the small difference is he's just

  • listening to it.

  • He's just doing it seriously and trusting it.

  • So this is going to actually change lots of the ways that

  • we evaluate people.

  • So when we think about students and education, right

  • now what a teacher does is score every person in

  • the class and tell everyone, this person got a

  • 95, and this person got an 85, and this person got a 75.

  • The teacher doesn't actually look at what is the content of

  • the-- or only rarely looks at the content of what was correct

  • and what was wrong.

  • What if that teacher were to find out that all of the

  • students in a math exam got the exact same-- not all.

  • But let's say 80% of the students got the exact same

  • question wrong, with the same wrong answer.

  • Suddenly, he or she might say, hmm, I mistaught it.

  • They inverted the algebraic equation thinking that it

  • could be a or b and b or a.

  • But in fact, the sequence matters, and I've got to go

  • back and teach that.

  • So not only does the student learn more, but the teacher

  • learns as well.
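
That kind of check is easy to sketch once answers are datafied; the gradebook below is invented for illustration.

```python
# Look across all students' answers for a shared wrong answer,
# which suggests the concept was mistaught.
from collections import Counter

answers = {"q3": ["b,a", "a,b", "b,a", "b,a", "b,a"]}  # invented gradebook
correct = {"q3": "a,b"}  # the sequence matters, per the example above

for q, given in answers.items():
    wrong = [a for a in given if a != correct[q]]
    if not wrong:
        continue
    answer, count = Counter(wrong).most_common(1)[0]
    if count / len(given) >= 0.8:  # e.g. 80% converged on one mistake
        print(f"{q}: {count}/{len(given)} students answered '{answer}' -- reteach it")
```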

  • So in terms of drafting, sports was one of the first

  • fields to adopt these techniques.

  • And it's actually changing how they think about their game.

  • Certain players--

  • why would you have a defense for the opposing team?

  • You want a defense for who the player is because of his

  • propensity to score a shot-- if it's basketball--

  • from one part of the court versus another.

  • If that player on the left side always misses the basket

  • there, let him take the shot.

  • I'll take it on the rebound and then pass it up.

  • Versus oh, don't let this guy get here into the key.

  • Then we're really in trouble.

  • AUDIENCE: With regulation, I think it's a

  • very different situation.

  • Because in the previous antitrust things, it's been

  • actually pro-consumer to limit the amount

  • of data people have.

  • There's a reason people go to Amazon, because they've got

  • more data and better data than anybody else.

  • So in fact if you use antitrust there, you

  • may in fact make life worse for the

  • consumers rather than better.

  • KENNETH CUKIER: That is absolutely true.

  • Now the question where this is going to take us-- and we need

  • to have a societal conversation about it-- is

  • whose data is it?

  • Whose rights to the data should it be?

  • Does the individual own the data because it's his or her

  • click stream?

  • That sounds logical.

  • But of course, they decided to go to that website, and that

  • website invested in collecting the data and analyzing it.

  • Should they be required to hand it over, particularly if

  • they're going to optimize--

  • let's take Amazon--

  • their own algorithms so that they can make great

  • recommendations.

  • Why would they want to give that to the customer

  • so the customer can hand it off to Barnes and Noble?

  • You're enriching your competitor.

  • That sounds almost like eminent domain.

  • That sounds like a governmental takings.

  • We don't know how to answer these questions.

  • But the point here is that what rights does the

  • individual have to his or her data?

  • Should it be transparent?

  • Should there be data portability?

  • For telephone numbers, we had to create number portability.

  • And that seems to be a very good way to get carriers to

  • actually love us rather than to lock us in.

  • Do we need the same thing in the world of big data?

  • I think that your public-policy people should be

  • thinking about it.

  • They probably already are.

  • And we need to do that--

  • not coming up with answers, but starting the

  • discussion and having the debate.

  • Please.

  • AUDIENCE: One of the three things which you mentioned are

  • important is mindset.

  • So do you have some framework or principles that one can follow

  • and improve on that?

  • Because that's one thing which is not as commoditized as with

  • [INAUDIBLE].

  • KENNETH CUKIER: Yeah, no.

  • I don't have any--

  • there's no simple list or there's no recommendations I

  • would put forward.

  • Because this is sort of about the spark of creativity.

  • It's a little bit da Vinci-like.

  • So I think what is required is just for very creative

  • individuals to look at what's going on around them and

  • breathe the hurly-burly of humanity and see

  • the filament of it.

  • The whole point of Google was PageRank.

  • But that was just the algorithm.

  • The genius was to understand that every single interaction

  • with the content gave you another signal to improve the

  • search result.

  • And that was, if you will, all you need is one

  • good idea in life.

  • And if you really go full-throttle with it-- and

  • it's a good idea--

  • it's limitless.

  • So like Oren Etzioni just had this idea that I can take

  • something that nobody knows the answer to.

  • But the answer exists.

  • It's hidden in plain sight.

  • I can get the data.

  • And if I do the right thing with the data, I can actually

  • transform it and get the insight that we need and

  • create new forms of value.

  • There is no simple way to develop that sort of mindset.

  • A lot of it is luck.

  • There are lots of people that I know of prior to Oren

  • Etzioni who had that idea.

  • There was a company called Strong Numbers, founded in Boston around

  • 1998 by Jeff Hyatt.

  • He wanted to be the Blue Book for everything.

  • He thought about all the things that you

  • could do with data.

  • He was a little bit too early.

  • The dot-com bubble burst, and so did his dreams.

  • He went on to build other companies and do

  • very well for himself.

  • But it just shows that there's a lot of factors involved.

  • AUDIENCE: Hi.

  • So my first question was actually already answered by

  • you in terms of who owns the data.

  • But more specifically with the American Express example.

  • So do you have to purchase that data from--

  • not American Express.

  • American Airlines.

  • Did he have to purchase the data, or was he able to gain

  • access to the data?

  • KENNETH CUKIER: Great question.

  • So the point about American Airlines was that was the

  • airline carrier who built the original airline computerized

  • reservation network called Sabre.

  • So when Oren Etzioni wanted to get the data, he wasn't going

  • to go to Sabre.

  • And the reason why is Sabre is probably the biggest airline

  • reservation network, and they have no incentive to sell it.

  • Because that's just not what they do.

  • So he had to go to one of the start-ups, one of sort of the

  • hungrier people that were the challengers in

  • there to get the data.

  • And he found one called ITA Software.

  • OK, you see where this story's going to end.

  • This is great.

  • So he goes to ITA Software and says, will you do it?

  • And man, they've got a problem on their hands.

  • Because on one hand, they need the data from the airlines,

  • and this is going to really screw over the airlines.

  • On the other hand.

  • Oren's going to pay them a little bit of money and give

  • them a commission.

  • So they don't know what to do.

  • So actually, I interviewed--

  • and they're just a bunch.

  • Are these airline executives?

  • No.

  • These are a bunch of MIT PhDs and stats people who did it because

  • they thought it was a really complex problem

  • and a lot of fun.

  • So I interviewed one of the co-founders.

  • And I said, well, what did you do?

  • And he said, well, the truth is we actually kind of came up

  • with that idea ourselves independently

  • before Oren did it.

  • And we did it internally just for fun, but we could never

  • release it as a product.

  • Because we just knew it was going to really harm us.

  • We'd never be able to get the data from people.

  • But this was a way that we could license the data in an

  • arm's-length way and still get a couple of guineas for it.

  • And he had the data.

  • So it shows you that there's these competitive interests.

  • So what happened to ITA Software?

  • They were acquired.

  • For how much?

  • Between $700 million and $800 million?

  • By whom?

  • By Google.

  • So why?

  • Well, you guys can answer that.

  • I know that when I do my searches, I see the airline

  • listings in the results, and that's a great feature.

  • But I'm sure you're playing a stronger hand and a much more

  • long-term one.

  • The regulators walked in.

  • It was essentially one of the first really substantial

  • antitrust remedies against Google on this.

  • And what it was was sort of a must-license provision with

  • a reporting requirement going back to the companies, and

  • essentially to the FTC.

  • For a period of a couple years-- maybe two or three,

  • maybe five years--

  • they couldn't actually cut off the license that they had with

  • people like Kayak, et cetera, because they were afraid you

  • were going to become the world's biggest travel agent

  • and dominate everyone.

  • But the point here is this.

  • Think about the sums.

  • Oren Etzioni's Farecast, $100 million.

  • ITA Software, $700 million.

  • The difference here is that the algorithm and the skills

  • are really good, and the service is really good.

  • But he who has the data--

  • that's the gold.

  • AUDIENCE: Do you see these societal shifts, particularly

  • in America?

  • I think big data and individualism is a really

  • interesting area to think about and whether this changes

  • the way that we think about an individual's role in society.

  • The examples you were giving of a criminal who is 99%

  • likely to recommit versus 1% likely to rebuild his life and

  • be a productive member of society.

  • Hedging for the benefit of society is to put all 100 back

  • into jail and leave them there.

  • How does this impact how we think about individual benefit

  • versus societal benefit?

  • And does this also play into future governmental shifts?

  • KENNETH CUKIER: OK.

  • So in the case of a criminal-justice system, let's

  • have that debate.

  • Because I think it's not an easy one, and anyone who

  • thinks it is doesn't understand the problem.

  • If I can tell you with a 99% degree of accuracy that that

  • man is going to commit a violent crime, I would be

  • remiss not to intervene.

  • And it would look like I'm almost anti-science if I said,

  • you know, I just think that 1%--

  • we got to give them the benefit of the doubt.

  • You just never know.

  • It just doesn't look right.

  • So on the other hand, this is one of the most heinous

  • affronts to the dignity of the individual that

  • we could ever conceive.

  • And we don't have any experience thinking through

  • this issue.

  • It is so essential that we figure it out, that we have

  • the debate, that the debate starts now.

  • Now what about data for individuals?

  • What does that mean?

  • Well, in Athens if you were a male, you

  • served in the military.

  • If you didn't want to serve in the military, leave.

  • Right now, we think that one of the most precious things we

  • have is data about our bodies, our health care, our privacy.

  • Let's change the debate.

  • Let's change the argument entirely.

  • Let's invert the burden of proof.

  • Let's just say that if you're a citizen of a country, you

  • have to share your data on your health care in a global

  • commons so that researchers can learn from it and treat

  • everyone's health better.

  • You don't have to do that.

  • Leave.

  • It might sound draconian, but the fact is do you have a

  • property right, or some sort of moral right to your data?

  • Well, I don't have one to my image if I'm walking down the street.

  • And we do know that I can learn a lot from the data.

  • And we also know from stats that if we allow some people

  • to back out for whatever reason, it really

  • becomes very imperfect.

  • So suddenly, I think that we should change the debate.

  • And I think the most obvious one would be health care.

  • But often, when you look at these issues-- as you've

  • pointed out, too, in big data--

  • these are new issues for us.

  • We've had a bit of sloppy thinking about it lately,

  • because we haven't had to deal with it.

  • And because the whole issue of big data has been absconded

  • with by the technology vendors as the latest flavor of

  • chocolate ice cream.

  • But now, let's calm down, and let's think about it.

  • The debate should start now.

  • AUDIENCE: My question is kind of a follow-up.

  • So let's assume that all data is public.

  • You don't have any data barons.

  • How does that influence the game?

  • And specifically, humans change their behavior, so how

  • does having data that's correct as of yesterday work, based

  • on what I can infer--

  • I'm sorry--

  • on what all of us in [INAUDIBLE] can infer and predict all

  • of us are going to do tomorrow, if we are

  • reacting to this data?

  • KENNETH CUKIER: Yeah.

  • So yeah, there's a great circularity that the data is

  • going to be making predictions.

  • We're going to learn about these predictions, and we're

  • going to change our behavior based on those

  • predictions ever thus.

  • Right?

  • This is just going to be the reality that we're

  • going to live with.

  • So in a way, the data will always be fallible.

  • You'll never have the perfect prediction,

  • because you can always--

  • this individual will learn that the algorithms nailed me

  • for shoplifting before I've even gone into the store.

  • I'm not now going to go into the store, and so the

  • prediction was wrong.

  • Now if we arrest him, the burden of proof is gone.

  • Because we can never actually validate the fact the

  • prediction was going to be accurate, because we never

  • allowed him to commit the crime.

  • This recursivity--

  • this weird pernicious circle in which we're constantly

  • reacting to the algorithm, and thereby changing the

  • prediction that we're making, is going to be

  • a feature of life.

  • And you can imagine that this is another conversation we

  • need to have, another thing we have to think through.

  • Yeah.

  • Good.

  • Thank you very much.

  • It's been a delight.
