LEE FLEMING: Good evening.
I am really pleased to welcome you all to "Leaders in Big
Data" hosted by Google and the Fung Institute of Engineering
Leadership at UC Berkeley.
I'm Lee Fleming.
I'm director of the Institute and this is Ikhlaq Sidhu,
chief scientist and co-founder.
The first and most important thing is to thank Google for
hosting the event.
So thank you very, very much.
There are a couple of people in particular, Irena Coffman and
Gail Hernandez--
thank you-- and also Arnav Anant, our entrepreneur in
residence at the Fung Institute.
So here's Arnav.
AUDIENCE: A lot of work.
LEE FLEMING: Huge amount of work.
The Fung Institute-- we were founded about two years ago.
And the intent is to do research and pedagogical
development in topics of engineering leadership.
We have our degree, the professional Master's of Engineering, or M.Eng., program-- mainly around the Institute.
We also have ties though across the campus, as you'll
see shortly.
This is our intent to have a series of talks on topics of
interest to engineering leaders.
As it turns out, this Wednesday we
have our next talk.
It's sponsored by [? Thai ?] and the Fung Institute.
And the topic is entrepreneurship--
being an entrepreneur within your firm.
And fittingly, we have representatives from Google,
and Cisco, and SAP.
That's Wednesday.
Consult the Fung website or the [? Thai ?] website for
details on that.
So besides enjoying a good discussion tonight, we have an
ulterior motive, as you can probably tell.
We're trying to advertise all of our fantastic programs in
big data at Cal.
Now, whether you're interested in computation, or inference,
or application, or some combination of those things,
we've got the right program for you.
As I mentioned, the professional Master's of Engineering, or M.Eng., across all the different engineering departments-- a one-year degree.
We have another one-year degree in the stats
department-- a professional degree.
There's a two-year degree in the Information School.
And finally, there's the Haas MBA.
Tonight we've got people from all these programs.
You can find their tables, ask them questions, and hopefully
we'll see you at Cal soon.
And we also have an additional executive and other programs
associated with each of those departments
and schools as well.
Ikhlaq will now introduce our speakers.
IKHLAQ SIDHU: OK, thanks.
So let me see.
LEE FLEMING: Just slide this here.
IKHLAQ SIDHU: All right.
Welcome. I want to also thank a couple of people.
One is [? Claus Nickoli ?], who is not here at the moment,
but to you in the ether, he's just not at the meeting.
But he's our host here, and so thank you.
You guys can tell him that I thanked him.
And also, many of you I've seen here are basically
friends, and so thanks for coming.
It's good to see you again.
This is an event on big data.
And so I'm going to give you a little data on
who is speaking today--
who is here.
And the way I think of this is, what we've got is three
perspectives of big data from leading firms--
from people who represent leading firms in the area.
And so let's start with NetApp.
We've got Gustav Horn.
He is a senior consulting engineer with 25 years of
experience.
And he's built some of the largest enterprise-class
Hadoop systems in the world-- on the planet.
And from Google, Theodore Vassilakis, and he's a
principal engineer at Google.
He's the head of the team that works on data analytics.
And he's been responsible for numerous contributions to Google in terms of search, and the visualization and representation of the results.
And from VMware, Charles Fan, who's senior VP of strategic
R&D. He co-founded Rainfinity and was CTO of the company
prior to its acquisition by EMC in 2005.
And our distinguished set of speakers is moderated by our
distinguished moderator, Hal Varian.
He is chief economist here at Google.
He's an emeritus professor at UC Berkeley and the founding
dean of the School of Information.
So with that, there's hardly anything more I
could possibly say.
Come on up, Hal, and take it away.
HAL VARIAN: Thank you.
I'm very impressed with the turnout tonight, seeing as
you're missing both the debate and the baseball game.
But at least it eliminates a difficult
choice for many people.
I will say that I'm going to follow the same rules as the
presidential debates.
So no kicking, biting, scratching, or bean balls are
allowed during this performance.
We're going to talk about foreign policy, wasn't that
the agreement?
No.
All right.
In any event, what I thought we'd do is, we'd have
each person talk for about five minutes, lay out their
theme, where they're coming from, what their perspective
is on big data.
And I will take some notes, and then ask some questions,
get a conversation going.
And I think we'll have a little time at the end for
some questions from the floor.
So, take it away.
THEO VASSILAKIS: Sure.
So, should I start, Hal?
HAL VARIAN: Yes.
THEO VASSILAKIS: All right.
Well, hey it's a real pleasure to be here.
Thank you guys also, and thank you guys for coming.
It's a huge, huge audience.
Just a couple of words.
As you heard, my name is Theo.
I lead some of our analytical systems.
So I'm responsible--
well, actually up until two weeks ago, I was responsible
for a stack that had parallel data warehousing components,
query engines, pieces like Dremel, and Tenzing systems
that let you query this data, and
visualization layers on top.
And that's one of the many, many systems at Google that I think, from the outside, one would think of as big-data systems.
And so I'll try to give you my perspective at least on the
Google view of big data.
And hopefully someone will cut me off when it's time.
I think I'll probably go for five minutes.
This could take a while.
AUDIENCE: [INAUDIBLE]
THEO VASSILAKIS: All right, sounds good.
Thank you.
I think, as you guys know, Google's business is primarily
about taking data and organizing the world's
information, and making it universally
accessible and useful.
So a lot of what the company does is really about sucking
in data-- whether it be the web, whether it be the imagery
from Street View, or satellite imagery, or maps information,
or Android pings, or you name it.
And then transforming it into usable forms.
So really, Google is kind of a big data
machine in some sense.
And I think the term big data came into
currency relatively recently.
And we all said, yeah, OK, that speaks to what we do.
Because we didn't really have a word for it.
We just kind of knew that the data was large.
But just to try to put maybe more structure on to that, I
think the Google view on a lot of "what is big data
processing" kind of splits up into probably what I would
call ingestion type of processes--
things like the crawlers, things like all those Street
View cars running through all the streets of the world.
And then goes into transaction processing systems, where
perhaps we capture data through interactions on a lot
of our web properties, or a lot of the web properties that
we partner with.
This means people clicking on search, or people interacting
with docs, or people interacting with maps.
All generate many, many clicks and many, many interactions
that then become transactional big data.
Of course, that also includes people using let's say Google
Analytics on their sites to measure traffic on their
properties, which then generates huge volumes of
pings into Google--
many tens of thousands of QPS of pings.
So that's kind of the second big component.
And then probably the third component is the processing
side of all of that.
The processing side includes things like MapReduce analysis, generating insights from that data--
maybe in the form of building machine learning models.
Maybe in the form of building, for example, Zeitgeist top
queries that can then be served out to the world to
say, hey here is what people are searching for.
Maybe in the form of n-grams of all the books that Google scanned over many, many years of its ingestion processes.
But it's really baking all of that information and then
presenting it in some usable form, either through a system
such as our ad system that takes models and decides what
ads to show, or in a more direct
form such as the n-grams.
Just to say, OK, here are those three broad classes--
ingestion, transaction processing, and analytical
processing.
To dig a little bit deeper into each of those areas, I
would say the ingestion processes, especially the very
large scale ingestion processes, are
highly custom systems.
If you think about our web crawlers, if you think about
the Street View cars, if you think about maps stitching, or
satellite imagery stitching--
those are very, very custom processes that I think, at
least to this date, don't have a clear analog
in the general industry.
And maybe this is something that you guys might address or
might see differently than how I see it.
They're still highly-specialized systems
that produce very large images.
And they're very high performance, very complex
systems that are run by dedicated engineering teams.
The transaction processing systems or the storage systems
are things like the Google File System.
These are things like Big Table.
These are things like Megastore.
Those are the ones that we've actually published papers about and that are now reasonably well known in the industry. They've evolved a little bit past the purely custom stage, to where they're fairly general purpose.
And there was a time at Google where actually most people did
their own storage in some form or another, until these
GFS-like systems evolved to the point where they were good
enough that more than one team could use them.
And actually, that evolution had many steps in which, for
example, everybody ran their own GFS.
And so maybe the ads team had their own GFS cells, and the
search team maybe had their own GFS cells.
And in time, the systems matured to the point where
actually we could have a centrally-managed file system.
And I think recently you may have seen, we've now talked
about this global file system called Spanner which takes
that to yet another level of transactions and global
availability.
And then the third step, which is I think still in a
relatively immature stage compared to some of the
storage systems, is the analysis.
And I think a lot of people know about MapReduce and some
of the systems that have been built on top of that.
So for example, Flume is the way of chaining MapReduces in
a more programmer-friendly way so that you don't end up with
50 MapReduce stages that are individually managed.
But rather, you end up with one program that can then be
pushed down into many MapReduces that are
automatically managed.
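(To make that concrete: Flume itself is internal to Google, but Apache Beam, its public descendant, exposes the same pipeline model. A minimal sketch in Python-- the data here is invented for illustration-- showing one program that the runner compiles into managed parallel stages:)

```python
import apache_beam as beam

# One logical program; the Beam runner (like Flume at Google) plans and
# manages the underlying parallel stages automatically.
with beam.Pipeline() as p:
    (p
     | beam.Create(["big data", "big table", "big data machine"])
     | beam.FlatMap(str.split)        # stage 1: tokenize each line
     | beam.Map(lambda w: (w, 1))     # stage 2: key each word
     | beam.CombinePerKey(sum)        # stage 3: shuffle + reduce
     | beam.Map(print))               # ('big', 3), ('data', 2), ...
```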
The process there is still very engineering focused and
essentially requires engineering teams to process
this large data.
And so I think what we're seeing in that area is the
same maturation that we saw in the storage and transaction
processing systems.
Where little by little, systems such as Dremel, such
as Tenzing--
such as many others inside of Google that we haven't talked
about externally--
are aggregating a lot of that usage, and saying hey, we
really should do it in a much simpler manner.
And not really require people to have a full engineering
team to get the value out of all that big data.
Because at the end of the day, that's what
Google wants as a whole.
And that's what Google's customers want as a whole.
How do we get the value out of those big pieces of
information?
I would just leave you with those three big pieces.
And also this idea that, this is evolving into a
higher-level service that people can use without
necessarily being very, very
low-level engineering oriented.
And that more and more value is being derived out of that,
and hopefully something that you're seeing in Google's
properties and Google's services.
I don't know if I'm over, but I can hand over here.
Thank you.
GUSTAV HORN: I'm Gus Horn, and thanks again to everybody for
coming tonight.
I know it's a big baseball night and you probably want us
to get done quick.
I come to it from a different approach in a sense and feel,
because Theo has--
Google has-- really been at the forefront of big data, big
data analytics, and in
particular Hadoop and MapReduce.
So I'm not going to go on the premise that everybody in this
room understands what MapReduce is, or what big data
is, and what data scientists are.
These are all buzz words that are really evolving.
I think what I found in my travels globally is that we're
really at the forefront right now of big data analytics.
I have a presentation that really characterizes it more
like a tsunami of data.
It's relentless, and it's coming at us.
It's coming at us from our Android
phones, from our iPhones.
It's coming at us from cameras that are everywhere, from our
TiVo boxes, from our PVR boxes, from everything we do
and touch in our world today.
We're generating data.
And the question is, do we either let the data fall on
the floor--
and we do nothing with it-- or are we going to pick that data
up and actually do intelligent things with it?
And we're finding more and more commercial applications.
Google I look at from a pragmatic perspective.
It's a commercial entity, but they have a much more philanthropic and broad approach to the world as well.
It was great back in 2003 that they defined GFS and gave us
MapReduce, which brought us back to the
mainframe days of old IBM.
But this is basically what it feels like to me, right?
Because it's batch-oriented processing at that time, when
we're talking MapReduce jobs.
But basically that was the genesis or the beginning of
what we call Hadoop as we know it-- the Facebooks, the Yahoos, the LinkedIns--
all of these companies that are embracing this technology.
But now we look at companies like Progressive Insurance,
where they're giving you these dongles to plug into your car.
They're generating data.
They're collecting data on your habits,
your driving habits.
The health care industry is looking at how often you see the doctor, and what your statistics are.
I was at the Mayo Clinic recently, and they have a
human genome initiative where they are looking at all of
their patients.
And they're actually doing a full genetic map of all their cancer patients. And they're following these people for their entire life expectancy.
And they want to keep their data 25 years, post mortem.
They want to build a repository where they can
understand exactly how does that one genetic mutation
affect your propensity to be carrying a disease.
Because they recognize that diseases
aren't just on or off.
There can't just be one mutation that
gives you that problem.
It's your environment plus the mutation.
And that builds a susceptibility.
They're trying to really paint a huge picture, and that's a
big data problem.
So I see big data problems from health care.
I see big data problems in consumer-related industries,
whether they be the Walmarts, the Targets.
And not everybody is trying to be evil about this.
If you think about Target or Walmart, they would much
rather show you an advertisement that you care
about than to bore you to tears with something that
doesn't matter.
Just as Google doesn't want you to see a pop-up ad for
baby diapers if you're 60 years old and you're not going
to have a baby.
It doesn't do them any good, and it doesn't do you any good.
There are a lot of positive things to take away from a lot
of this big data, and there's some negative things, too.
I'll focus on the positive in that I look at what companies
like the auto manufacturers in Europe are doing.
You look at BMW.
All of these cars are data-generating monsters.
And nowadays, you don't even know when you have to go for
an oil change, because they're predictively analyzing the
fluids in that car.
And they're determining when is it time for you to get that
oil change.
It's not like, oh, I have to do it every 4,000 miles.
Your car tells you when you need to get it done because of
viscosity changes and because of analytical testing.
And they're collecting all of this data.
I think we're very lucky that we are at this forefront.
And I think that big data-- big data scientists--
are going to become more and more important.
And I think that, as Theo said, that it's going to get
to the point where, you don't have to become a
MapReduce job expert.
You really need to become a logical thinker and be able to
articulate the questions you're asking against a data
set, where you don't even care where the data came from.
You just know that all the data is in there.
And that's the key-- is to have a repository that's able
to hold all the data, and be able to allow for this kind of
processing to take place on that data, and produce results
in a timely fashion.
And what I've done is, I'm approaching it from more of a
corporate perspective, where people are looking at
enterprise-class systems, versus what we call white box
or dirt cheap.
And there are different kinds of cut-offs for companies.
And I think as you go through your process at UC Berkeley,
and you're learning about where you want to go, you'll
see that you have to pick and choose your battles when it
comes to big data.
And the battle you have to choose is, am I going to be
setting up my data centers and my infrastructure to support
commodity-based platforms, and this-- do I want to own all
the data internally?
Do I want to virtualize the data in the cloud?
At what point do I bring that data internally?
Do I want to use services from Google?
These are all inflection points over which you are going to be making decisions in the next five years.
And this is what I'm dealing with all the time.
I think, hopefully, we all learn a lot from this
experience.
CHARLES FAN: Thank you for coming.
My name is Charles.
And unlike the presidential debate, I agree with what they just said.
Big data is like an elephant.
We were told we are allowed to touch this elephant from
different angles, from different perspectives.
But before that, I'll just try to repeat what Theo and Gus
just mentioned.
First, I think the Internet is pretty big in terms of its impact on our lives. And not only on our lives, but also on enterprise IT.
And I think what we have seen in the last 20 years has been the repeated tidal waves caused by the Internet and the leaders in the Internet space, including Google-- the advances they are making, and how those are hitting the enterprise world.
And I think big data is the latest of such a tidal wave.
Essentially, the scale of data that the Internet providers are dealing with for consumers-- the enterprises are now facing the same.
And now the challenge is, how do we adopt and massage this
technology so it's consumable by the various people inside
the enterprise worlds.
And that's what's behind the big data world we see.
And I think, like what Gus said, enterprises are working in different sectors.
There are people doing retailing--
selling stuff.
There are people doing manufacturing--
building cars.
There are people in health care.
There are people doing financial trading.
In almost every field, they are generating
more and more data.
And almost every field has many questions they need to
ask based on those data.
And they need to make decisions based on those data.
And unlike the DWBI (data warehousing and business intelligence) world, which has also been around for 20 years, the amount of data, the variety of data, and the speed of data coming at you are going beyond what the existing infrastructure can take.
And that's why to answer these different questions in
different verticals, everybody is seeing a need for new
infrastructure, a new database, a new storage to be
created to support the decision making based on all
these data.
What's different in those data, besides just the size or
the volume of it?
When people typically refer to big data, they call it the
"three v," which is volume, velocity and variety of data.
Some of them call it the "four S's": the source-- there are more data sources-- the size, the speed, and the structure of data that are very different.
And I have another name for it, which is probably less
elegant, but also I think it's pretty true.
When we look at the old data, the small data, or the classic
data, they're typically record-based data, especially
those generated by transactional applications.
They usually have people generating them.
And they go through the whole life cycle.
So we typically call them CRUD data that you need to create,
read, update, and delete.
I'm sure all of you Berkeley students know the CRUD acronym. You manage it on the storage front.
You also have database design for it.
But with the new data, more and more of
them are machine generated.
We just have more and more devices that are connected to the Internet.
Not all of them have a warm body sitting behind them.
There are both servers, as well as sensors, RFID, mobile
devices, cameras, and so on.
And Google cars-- they're all generating tons and tons of data, without people sitting behind them.
You still need to create them, but you don't update them that much.
Those are usually write once and read many type of data.
So there's not much update.
And there's not much delete.
You need to retain data 25 years after people die.
And even after 20 years--
25 years-- people don't remember to delete them.
So there's not much delete, not much update.
There is a lot of append.
So instead of CRUD, now it's like
create, replicate, append.
There's more and more append.
All the data in append-only mode.
And process--
there's a constant need to process them in real-time,
during ingestion, or interactive.
So it's just crap data, is what big data is.
It's C-R-A-P--
create, replicate, append, process.
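(As a sketch of what an append-only "CRAP" store looks like in practice-- the file name and record fields below are invented for illustration:)

```python
import json, time

LOG = "events.log"  # hypothetical log file

def append(event):
    # Create + append: each record is written once and never touched again.
    with open(LOG, "a") as f:
        f.write(json.dumps({"ts": time.time(), **event}) + "\n")

def process():
    # Process: analytics scan the log; there is no UPDATE and no DELETE.
    # Replication would simply copy the log byte-for-byte to other nodes.
    with open(LOG) as f:
        return [json.loads(line) for line in f]

append({"sensor": "car-42", "oil_viscosity": 0.91})
print(len(process()), "records retained")
```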
And when we are talking about structured data versus
unstructured data, we say there are more and more data
that are unstructured than structured.
I think it's just because the database technology or the
underlying technology is not scalable enough to put them in
a schema or in some kind of structure.
That's why they are all CRAP.
But you still need to process them in a more efficient way.
And that causes a lot of your challenges.
I think essentially whoever designs the new data
management system for CRAP and makes them consumable by
enterprises, is going to be the winner of
this big data race.
GUSTAV HORN: So Google invented the new crapper?
HAL VARIAN: Yes, OK, thank you for starting us out on such
provocative comments.
I wanted to follow up on your little troika there of ingestion, transaction, and analytical.
I come at the end of that food chain.
So what we get is, the data's been pulled in, the data is
available to us, and we're working on
the analytical side.
I want to say a few words about that.
When we have these analytical systems at Google, one of the
things you can do is just monitor the system and make
sure everything's running the way we expect it to.
And these guys have done a fantastic job, because now you
can take almost anything that's gathering data at
Google and create a dashboard with about 20 minutes of work,
which is a fantastic thing for running the business.
The other thing you can do is, you can build the machine-learning models that he alluded to and engage in this kind of predictive analytics. That's very in vogue these days and it's a great thing to do.
But the thing that a lot of people miss, I think, is you
can use that data to conduct experiments.
And that's really the secret sauce at Google.
Our leader of the search team, Amit Singhal, said that a
couple years ago, we did over 5,000 experiments with the
search algorithm-- made 400 changes.
On the ad side, we're running roughly 500
experiments at any one time.
Any time you're logged into Google-- or any time you're
accessing Google, I should say--
you're probably in a dozen or more experiments.
And it's having the capability to manage that data, not just
for the current incarnation of the system, but all the
variations you might contemplate, is really a
fantastic help in moving the whole system forward.
So that experimentation rule is very, very
important at Google.
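(A common way to run many overlapping experiments at once-- a generic technique, not necessarily Google's actual system-- is to hash a stable ID per experiment, so assignment is deterministic without storing any state. The names below are illustrative:)

```python
import hashlib

def arm(cookie_id: str, experiment: str, arms=("control", "treatment")):
    # Hash (experiment, user) so each experiment buckets users independently.
    h = hashlib.sha256(f"{experiment}:{cookie_id}".encode()).digest()
    return arms[int.from_bytes(h[:8], "big") % len(arms)]

# The same visitor can be in a dozen experiments at the same time.
for exp in ("ranking-v2", "ads-layout-7", "snippet-length"):
    print(exp, arm("cookie-abc123", exp))
```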
I wanted to raise a question of standards and
interoperability.
You mentioned Hadoop.
That's really become an industry standard. Here at Google, we have our own internal stack.
It's a lot easier to enforce these standards for
interoperability internally, than industry wide.
But to make this system work--
of starting with ingestion and transactions,
and then the analysis--
outside of Google, or outside of other big data companies,
you've got to have these kinds of standards to interconnect
the flow of data.
And Charles, why don't you say a few things about what's
going on in that area.
CHARLES FAN: I do think we are at the early
stage of this industry.
And right now there are no standards, per se, to my knowledge, that have emerged.
Hadoop has been a very popular technology that was born out of the open source community's effort-- based on the Google papers-- to create MapReduce and GFS, as well as the other things they built on top of it.
And I think, in lieu of standards, my perspective is,
open source plays a huge role here.
In terms of overall data management, as I mentioned, we are going from a world where everything is relational-- you basically have your relational data model, which is the standard across all, SQL being the standard query language-- to a more chaotic world, where there are many kinds of data stores, many kinds of queries.
Even on Hadoop, there are various ways you can
query on top of it.
And open source really gives people the choice.
In this chaotic period, it is the choice.
It's basically the developers and users who are going to decide which will become the standard. And open source really provides the way to make it happen.
GUSTAV HORN: I just want to make one comment.
I think open source actually is the best way to make sure
that you don't get yourself pigeonholed into anything
that's proprietary.
And I think that with Hadoop and big data, as I look five or ten years down the road, standards aren't going to provide structure. They'll be more of an inhibitor than a benefit in this area.
I think one of the key attributes--
and I think you can maybe talk more about that-- is the fact
that you want to be able to connect or stitch together a
bunch of disparate data sets.
You want to be able to look at things where you don't have to
be rigidly defined from the standard.
You want to be able to look at strange queries where weather
patterns, and people's buying habits, and the cars they
drive have some correlation.
And if you start imposing standards on top of something
that is that robust, I think it's going to probably stifle
development.
So I think the key here is open source.
The key is to have published innovations so that people are
publishing their works.
And I think as we get better and better at natural-language
processing and being able to get away from having to be
hard-core programmers, to glean insight into any of this
data store, it's going to be more beneficial.
I think in the next decade you'll find that you'll
probably be doing less and less Java programming and more
and more just natural language logic, I would think.
HAL VARIAN: Theo, I hope you're going to say a word or
two about protocol buffers.
THEO VASSILAKIS: Protocol buffers, yes, of course.
I'll plug protocol buffers for sure.
HAL VARIAN: Which you made as an open standard, right?
THEO VASSILAKIS: Right.
It's actually an open source system.
But before that, I was actually going to say I really
agree with your point about experimentation.
And I actually remember a time at Google where, if you wanted
to run an experiment--
for example, on search-- there was one engineer, who is now one of our distinguished engineers, Diane.
And you had to go ask her for some cookies on which you
could run your experiment.
It was sort of like, she would allot you some cookies.
Those days are over, but they really do generate a lot of
this CRAP data, because all of those experiments accumulate
over the years.
And yet it's really important to have the historical view of
hey, we tried this.
Here's what happened then.
And I think actually this plugs directly into this
problem of standards, because the way that all of the
engineers years back recorded their results, was very, very
different than the ways that engineers today
record their results.
So maybe, at the time, some of them didn't have protocol buffers-- which is, if you like, a kind of XML-like format for representing data that Google created, but one that is much more efficient.
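(In practice that works by defining a schema, compiling it, and using the generated class. The schema and module names below are hypothetical, but SerializeToString and FromString are the real Python protobuf calls:)

```python
# Hypothetical schema, compiled with: protoc --python_out=. search.proto
#
#   message SearchResult {
#     string url  = 1;
#     int32  rank = 2;
#   }

from search_pb2 import SearchResult  # generated module (hypothetical name)

msg = SearchResult(url="https://example.com", rank=1)
wire = msg.SerializeToString()        # compact binary, unlike verbose XML
same = SearchResult.FromString(wire)  # anyone with the schema can parse it
```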
And so I think the problem comes because we want to
integrate all of this variety of data.
And what I would say is, I agree with Gus that I don't
see a lot of appetite for very generic standards.
But I do see people having a need to bridge all of their
old data and the new data.
And I would basically make two analogies here. I think one of the things that really helped the development of data warehousing was fairly standard SQL.
And it was never a standard standard.
Like, there existed a standard, but no one really
followed the standard very closely.
But if it was close enough, you could get
your systems to work.
And I think the other aspect is file formats.
If you can take a file format and feed it into different
systems, that will really help.
And so until now, CSV was the end-all, be-all file format
for interchange.
I think we'll see more of these as we need to trade data
that's more structured--
that has protocol buffers or XML.
CHARLES FAN: And if I could, let me add a plug for VMware, as well. As we mentioned, I think we are agreeing that we should allow the chaos to continue for a little while.
However, there are certain parts I think we can help
people to make it easier.
Which is, how do you stand things up? Hadoop is a great system, but as Gus can probably tell you,
it's not so easy for enterprises to stand up a
Hadoop cluster.
Often the enterprise needs to stand up many of
those Hadoop clusters.
And some will need to stand up other type of data stores.
And that's where VMware is a leader in the virtualization
software and cloud infrastructure.
And we are building tools, including an open source project called Serengeti, which is helping people to easily stand up their Hadoop clusters, as well as other data stores-- really automating some of that headache or tough work, so they can focus on the work that matters.
HAL VARIAN: Let me put in a good word about standards.
Because when you look at companies, how do they grow?
They grow through acquisition.
When they grow through acquisition, you end up with
data silos everywhere.
And data silos are the enemy of big data.
And the amazing thing about Google, because of the work
that Theo and his team do, is we have no
data silos at Google.
Now that's not 100% true, of course, but when we bring an
acquisition in, we spend a lot of time trying to get their
data infrastructure aligned with our own internal
infrastructure.
And what it means is, you can basically pick an engineer off
of one project and move them on to another project,
completely at the other side of the company.
And they're productive in the first week because of having
that standardized infrastructure that we have.
And that is not something that most companies have the luxury
of dealing with.
The biggest problem that most companies face in data
management is trying to get this interoperation among the
different legacy systems.
You know, there's this old line, how did God create the
world in only six days?
And the answer is, he didn't have a legacy
system to worry about.
So everybody in the business faces, how's
that going to be solved?
That's my question.
How do you solve that?
GUSTAV HORN: I think you're right.
There are a lot of heterogeneous databases and a
lot of things that need to be stitched together.
And I think that with big data-- again from the Hadoop perspective-- there are lots of connectors out there-- from Flume, from Sqoop.
And I think that's key.
You'll find that a lot of these big database companies
are having to embrace open source.
They're having to embrace Hadoop, because if they don't
embrace it, they're going to become roadkill.
So they're looking for ways to monetize it, from consulting
services and things like that.
And also how then can they play in this market and become
leaders in this market, so they retain
their customer base.
Because the bottom line is, the Oracles of the world, the
SAPs, these people make money through selling licenses.
Hadoop is a license killer.
So that's going to directly impact their ability to be
profitable from a stock market perspective.
They need to find ways to innovate that allow them to
keep that trajectory.
And then the other thing I would say is that a lot of times the biggest problem I've found in industry, when I go meet with big customers or potential customers, is that they don't know where to start.
They have a huge data problem, not just a big data problem.
They have data everywhere and silos in different corners of
the organization.
And they don't have one person who is competent enough from a
technical perspective to know how to move forward.
They have individual islands or teams that are looking at
how they can move forward.
And the real strength in big data and big data analytics is
the heterogeneous nature of the data.
That's one of the key strengths
of this entire industry--
is the fact that you want to stitch together all of these
different data sources, and then be able to find those
correlations amongst them.
It doesn't do anybody any good to do a structured database in Hadoop where you're just doing the same old thing.
What's the benefit?
There is no benefit.
The benefit is when you're able to combine all of these
sources into one place and you find that
needle in the haystack.
Or you're able to better understand your customer.
Because fundamentally, all of these things
are customer driven.
I don't care whether it's Google.
I don't care whether it's VMware.
If the customer isn't happy, they're not
going to come back.
They're not going to like your website.
They're not going to like your product.
So the bottom line is, how can you find ways to modify what
you're doing to make it better for the customer.
And if you're able to find those needles because you can
stitch together all of these different sources--
including social media, including global search
engines and global communities--
and find out what people are doing, you'll find out those
subtle differences that really become the real game changer.
And that's really what big data is about.
CHARLES FAN: Yeah, and I think another way to dissect big data is to look at it as four layers of functionality. At the very top are the big data applications. Then the second layer, which is big data analytics--
the various machine learning and other
algorithms you can apply.
The third layer is the big data management--
the query engines and so on, that you can query the data.
And the bottom layer is the data infrastructure--
the storage, and so on where you store the data.
I think, to the question, the lower the layer, the closer it is to standardization. And maybe to Theo's comment, there probably can be a unified big data store, where all the bits, all the CRAP, eventually end up somewhere. There's a sink, a common sink for all the CRAP.
And they come into here.
I think right now we should still allow various different
ways for them to be queried.
Even in our Hadoop system, some people like to use Pig, some people like to use Hive, some people like to just run HBase directly on HDFS. And Dremel is another way you can interact with it.
And I'm sure there are new innovations coming out of
Google, out of everywhere in the ecosystem.
And like in [INAUDIBLE].
When I talk about standardization chaos,
sometimes I'll go back to the history--
for me, it's Chinese history.
Where, for those of you who have read the Chinese book
called "The Romance of Three Kingdoms," where the first
line of the novel is, "After unification it's chaos.
After chaos, it's unification. "
And it's describing how often of all the warlords fighting
chaos, inevitably somebody's struggle will emerge
and unify the land.
And that will be your emperor.
And also inevitably, after he gets old, or after he dies and his kids get weak, it will fall back into chaos. And these are the traditional dynasties, repeating about a dozen times.
That's 4,000 years of Chinese history.
And I think that can apply to the history of anywhere else.
And it can apply as well to the data processing and data management here.
We are in this period, going from a more unified SQL interface, more unified data management and query engines, to a more diversified world. But I would predict that in ten years, there will be leading standards, or ad hoc standards-- de facto standards-- that are going to emerge, where the majority of big data problems are going to be solved in that way.
THEO VASSILAKIS: Yeah, I agree with that.
I don't know if it'll be in the form of a W3C standard or
something like that, but I think that's a little bit the
dynamic that Hal was referring to inside of Google.
That after n years of fighting with all of the different
varieties of things, people kind of said, well we
understand now that it's not the purpose of our team here
over in maps to really build that entire stack.
Because now that we know what all that entire stack entails,
we realize that it's really far too big for us
to do on our own.
And so we're willing to concentrate further up the
stack in the parts that we really care about.
And that then led lots of groups at Google to look around and say, OK, well, what is a piece of technology that exists, is reasonably mature, that a lot of people will use, and that gives us this advantage?
And so that's how some of the components such as Dremel and
others emerged as de facto standards of
how we analyze data.
And I think that those de facto standards will in time,
probably lead into some kind of more formal standards that
can be adopted across companies and across
organizations.
HAL VARIAN: Let me switch gears and turn to the
infrastructure, the hardware infrastructure.
So there are two models out there.
You could buy your infrastructure, and people to
maintain it, and run it in-house.
Or you can lease it on the cloud.
And what do you see as the advantages and disadvantages
of those two approaches?
GUSTAV HORN: I think that there's a place for both, to
be honest with you.
I think that you'll find that the cloud is a great place to
get started.
It's a great place for you to kick the tires.
I think you're always going to have the open source-- what I
call white box, commodity-based approach.
And a lot of groups where you're going to be doing your
sandbox, your proof of concept, you're going to be
testing out your code, from an infrastructure perspective.
And also I think that there's a place even for what's being
done over at VMware, where they're looking at
fundamentally providing an infrastructure and product in
a box, so that people can go to service providers and spin up MapReduces and build their file systems.
At some point in time there is going to be, again, like I said, a decision where companies are either going to embrace the technology, because their internal leadership or their leader within the company has proven the value of it. And that's going to be the tough slog that everybody in this room is going to have to deal with over the next five years-- you're going to be battling internal processes, internal fights, within every organization that I've met.
Where you have the legacy database people-- the people
who said this is how we do it, this is why we do it.
We have these checks and balances.
We have these constraints.
That data has to stay within our walls.
And then you're going to have the leaders, who are more
aware of what's available in technology with
virtualization, with cloud-based technology.
And in some cases, it does make sense.
There're regulations and laws that are going to dictate
where data resides, or where it can reside, or
where it has to be.
And there're going to be places where the cloud is
going to be paramount.
But you're going to find in the next five years, that
you're going to be fighting more political battles than
doing anything else.
THEO VASSILAKIS: I agree with that.
I think there will certainly be lots of ways to run
infrastructure locally, as well as on the cloud.
I think though, that what people will realize over time
is that a lot of the reason why it may sometimes appear
cheaper to run locally than it is to run on the cloud these
days, is because with cloud services, you get a lot of
services by default.
So perhaps you would get back-up by default, perhaps
you would get certain compliance
functionality by default.
Whereas on a reasonably bare machine, in perhaps your own data center, you wouldn't get these automatically.
And I think over time, as more of this computation becomes a
commodity, in that you just expect it to
work and that's it--
you won't be able to live without some of those things
that are today considered value-added services.
And I think there will be a crossover point where it'll
start to be more expensive to actually do all of these
things on your own appliance, than it will be to do it at
scale in somebody's data center.
And I think the fatter and fatter pipes that connect us
to these data centers are going to make that a
possibility.
HAL VARIAN: Go ahead.
CHARLES FAN: Again, in the anti-presidential-debate style, I agree with both Theo and Gus.
And VMware's view is that it's a hybrid cloud, where we want to provide the same benefit to customers whether they are running things in their data centers or out of a cloud service provider.
All that being said, I do think there will be an
increasing amount of infrastructure moving out of the data center, over time, to the cloud services.
Meaning the applications will be more and more delivered as
a service to the enterprise customers, as opposed to as a
packaged software today.
That will take time, but I think that will happen.
But even after that happens, even after the infrastructure is outsourced, so to speak, to the cloud service providers, ownership of the data-- of the big data, medium data, small data-- still is with the enterprises.
And it is still their responsibility to be able to
make their decisions based on the data that they own.
Even if some of the data may be sitting at the service provider, at the cloud provider.
It is still their responsibility to analyze
those data and to make decisions based on those.
THEO VASSILAKIS: And clearly, security is going to be one of
those big items.
And so if anyone's working on cryptography, that's going to
continue to be a pretty hot thing.
HAL VARIAN: It's always good to have a job
where there's an adversary.
Coming back to the elections again--
same model.
Let's come down to the query language.
We've seen SQL mentioned a few times.
What about NoSQL?
Tell me what's the role of that in today's world?
Is SQL going to be obsolete?
Or are we going to continue to rely on that
as our query basis?
CHARLES FAN: OK I'll start.
I'm sure Gus and Theo have more to add.
I think NoSQL is sort of part of this common, chaotic
phenomenon that we are seeing.
It's driven by a few factors.
Still by far, SQL is the most popular query language today.
But NoSQL is born out of the need of people looking for more flexible schemas.
And they're developing applications.
Sometimes they have the data stay the same, but they want to structure it differently. And they want to do that in an easier way. And they want to relax some of the consistency requirements of their databases so they can deal with scale in a much easier and much better way.
And it's basically driven through
various different needs.
So there are different flavors of new query models that have emerged. And there is no better name, so the easiest thing is to call them what they are not. It's just NoSQL.
I do see that there is a strong trend in terms of
developers embracing them.
But again, there are no clear new winners in the query languages.
And I think in different companies there may be
different preferences being set up.
It doesn't mean five, ten years from now,
there won't be one.
I think right now, the model is: let developers decide-- let the developers of the world decide-- whether there is a newer query language that can replace SQL as the new one.
GUSTAV HORN: I would only say that SQL is going to be around
for a long time to come.
I still run into companies that are running
COBOL, of all things.
It's not going anywhere, any time soon.
I think what NoSQL is-- versus SQL, versus any
of these other things--
is yet another way of exposing all these internal politics
and battles that happen in big industry.
And you're going to have legacy databases where that's the only way you can talk to them.
And you're going to have next generation things coming out.
And if it wins the battle, which I think it will, you'll
find NoSQL becoming more and more popular.
And you'll find more and more of these aggregate,
heterogeneous kind of data stores becoming more popular--
provided they provide the answers that
they're supposed to.
Which just means that they have to be faster.
They have to be infinite in volume and size,
and they can grow.
And they have to never forget anything.
That's kind of the key.
When we talk big data, I always get a laugh sometimes
because they say, well we only need a 200-node system.
I said, well, that's today.
What are you going to do in five years?
How are you going to grow that?
I mean, the most important thing in big data-- it's not the computational engines. They're the most volatile thing in your big data system.
You want to get rid of that old crap
anyway, every two years.
You don't want to have to then re-migrate all your data.
The most important thing-- and OK, this is a little plug from me, from NetApp-- is that the data is what's important. The thing that it runs on is the most volatile, or least important, thing.
It's the thing that you want to be able to flush out, and
read, and make faster--
over and over again-- provided that data stays and you don't
have to move it.
Because moving stuff is a waste.
And in Google, you don't want to be moving data either.
That's wasted energy.
THEO VASSILAKIS: Absolutely.
And I agree with that point, that the systems change.
Many, many systems have changed over
the years at Google.
And we've migrated it forward, and the older storage systems have died.
But the data is always there.
I'm pretty sure that Jim Gray, who's a Turing Award winner,
felt like he needed to apologize for SQL in his
Turing Award acceptance speech-- sorry for SQL.
And as a builder of SQL systems, I think SQL will stay.
It's great.
But actually, the only thing I would point out about it is, I think its main and most positive attribute is that it's a declarative language.
Meaning, it doesn't say how to compute what you want to
compute, but it just says what you want as the answer.
And I think that that's the key characteristic that--
whatever the language is, be it SQL, be it something else--
will be important.
Because the bigger the computation is, the more
complex the program is that you would have to write if
you're writing a real procedural program.
And so you're going to need systems to actually turn that
into computation for you.
So whatever the language is-- maybe it's SQL, maybe it's a
variant, maybe it's something else--
if it's declarative, then it gives the maximum ability to
the execution system to actually do
the right thing fast.
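(To see the contrast in miniature-- the table and data here are invented for illustration-- the SQL states only what answer is wanted, while the loop spells out how to compute it, which is exactly the freedom a large execution system needs:)

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clicks (country TEXT, n INTEGER)")
db.executemany("INSERT INTO clicks VALUES (?, ?)",
               [("US", 3), ("DE", 5), ("US", 2)])

# Declarative: what we want; the engine picks the execution strategy.
declarative = dict(db.execute(
    "SELECT country, SUM(n) FROM clicks GROUP BY country"))

# Procedural: we dictate every step ourselves.
procedural = {}
for country, n in db.execute("SELECT country, n FROM clicks"):
    procedural[country] = procedural.get(country, 0) + n

assert declarative == procedural  # same answer, very different contracts
```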
HAL VARIAN: We do have a few minutes for
questions from the audience.
We have a hard stop at seven because of a plane leaving.
But questions?
Back there.
Speak loudly please.
AUDIENCE: [INAUDIBLE]
THEO VASSILAKIS: Sure.
Privacy and what are we going to do.
So the question is, what are we going to
do about data privacy?
How are we going to make these systems protect people's data?
I can give you one view from Google which is, obviously
privacy is one of the critical things that we do here.
In the sense that if people don't trust
Google, none of it works.
And I think I would go back to this point
about declarative languages.
I think in the early stages of the development of analytical
systems, you wrote things down to the metal
because you had to.
There was no other way to do it.
And that gave no safeguards for what people did with the data. So you had to give them a code of conduct and say, hey, you should only apply it like this.
But actually when you go up the stack and up the
abstraction level, and you say, look tell me what you
want to compute.
And the system will actually compute it for you, then you
have a lot of opportunity to actually apply policy--
privacy policy in particular--
in an automatic manner.
So I think that ultimately, that's kind of
the long-term answer--
is that there will be mediation between the people
asking the questions, and systems that are executing the
queries, that then apply the right policies there.
CHARLES FAN: And I think this question mostly applies to the
service provider-- the cloud analytics, big data analytics,
the service provider.
And VMware recently bought a company called Cetas.
And we're looking at the same problem.
There are customers of various online gaming companies uploading their data into our services. And there are various encryption technologies around to protect the privacy.
But I think more important than technology, is really the
business model.
It's like you're operating an information bank, similar to a
bank for money.
So you can argue, why would you give your money to another service provider, rather than keep it in your home?
But the thing is, for Google, for Cetas-- if they breach that trust, they cannot continue to exist as a business. So it's entirely in the interest of the company operating the service to protect the privacy of its customers, so that it can continue to exist as a bank-- as a service provider-- just like banks do with money.
So arguably, in most countries in the world, putting money in the bank is safer-- we have the chief economist here who can tell me if I'm wrong-- than putting your money under the mattress. And similarly, it can be argued that you already get better protection of the privacy of your data if you entrust it to a service provider, in most cases.
GUSTAV HORN: So in short, trust no one.
And I actually mean that.
I mean you shouldn't really trust the banks almost.
And you shouldn't trust service providers
to any great extent.
It has to be earned.
And this is always proven, time and time again.
I think Google has done a very good job.
For better or worse, people have argued about how privacy
statements can be morphed and changed.
I think nothing stays consistent.
Nothing will stay the same.
It will always change.
So the minute you let any of your private information out
there, even if you believe it to be private, and only
amongst a small circle of people, I would never make
that assumption.
If you want to keep something private,
you keep it to yourself.
That's the only way.
HAL VARIAN: Yes.
AUDIENCE: I have a similar question for both of you.
[INAUDIBLE]
Also, the four layers that you discussed, [INAUDIBLE]
GUSTAV HORN: You can have privacy in the data.
You can have control of your data.
For example, you can encrypt the data on the drive.
So you can encrypt the data throughout the path.
But if you hand the key to that data out to anyone, then
you've already basically unlocked the door.
So I think compliance, privacy, protection of data,
that's a very tough problem.
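(Gus's point in a few lines-- a sketch using the third-party Python cryptography package. Encrypting is the easy part; whoever holds the key holds the data:)

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # the key IS the door; guard it, not the file
f = Fernet(key)

token = f.encrypt(b"patient genome record")  # ciphertext: safe to store anywhere
print(f.decrypt(token))                      # trivial for anyone holding the key
```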
HAL VARIAN: I think that Theo's point was really an
excellent one.
That you can build a lot of this
compliance into the system.
So you just can't link this with that, unless you have
some specific override from higher up.
And one of the advantages of having those declarative languages is exactly that.
That the system can enforce compliance in
ways that humans can't.
THEO VASSILAKIS: Right.
I think I would sort of see privacy as a special case of
compliance.
You want the data in your organization, in your cloud,
to be managed according to a set of rules.
Now those rules will sometimes be about protecting
individuals.
But sometimes they'll be about financial regulations, and
whether revenues can be viewed by certain people, or modified
by certain people, or whatever it is.
And so exactly.
I'm actually responsible-- or was up until recently-- for a
lot of our billing-related computation.
And those are a lot of the questions as well.
Who gets to be able to touch any of that
data along that path.
And I think I agree with you, trust no one.
But I would apply that more to people than to systems.
I hope that we can get the systems where the systems are
proven over time to have the right behaviors.
CHARLES FAN: Right.
And to the four layers, I do think compliance cuts across
the entire stack--
the entire environment.
From my experience, I have more experience with the bottom two layers. Compliance is certainly big in both of those layers.
But I can imagine there're probably things on the
application layer that you need to pay
attention to as well.
OK.
HAL VARIAN: Well, on that note, let us thank the group,
and thanks for coming.
Thanks to all of you.