RAYMOND BLUM: Hi, everybody. So I'm Raymond. I'm not television's Raymond Blum, that you may remember from "The Facts of Life." I'm a different Raymond Blum. Private joke. So I work in site reliability at Google in technical infrastructure storage. And we're basically here to make sure that things, well, we can't say that things don't break. That we can recover from them. Because, of course, we all know things break. Specifically, we are in charge of making sure that when you hit Send, it stays in your sent mail. When you get an email, it doesn't go away until you say it should go away. When you save something in Drive, it's really there, as we hope and expect, forever. I'm going to talk about some of the things that we do to make that happen. Because the universe and Murphy and entropy all tell us that that's impossible. So it's a constant, never-ending battle to make sure things actually stick around. I'll let you read the bio. I'm not going to talk about that much.

Common backup strategies that work maybe outside of Google really don't work here, because they typically scale effort with capacity and demand. So if you want twice as much data backed up, you need twice as much stuff to do it. Stuff being some product of time, energy, space, media, et cetera. So that maybe works great when you go from a terabyte to two terabytes. Not when you go from an exabyte to two exabytes. And I'm going to talk about some of the things we've tried. And some of them have failed. You know, that's what the scientific method is for, right? And then eventually, we find the things that work, when our experiments agree with our expectations. We say, yes, this is what we will do. And the other things we discard. So I'll talk about some of the things we've discarded. And, more importantly, some of the things we've learned and actually do.

Oh, and there's a slide. Yes. Solidifying the Cloud. I worked very hard on that title, by the way. Well, let me go over the outline first, I guess. Really, what we consider, and what I personally obsess over, is that it's much more important. You need a much higher bar for availability of data than you do for availability of access. If a system is down for a minute, fine. You hit Submit again on the browser. And it's fine. And you probably blame your ISP anyway. Not a big deal. On the other hand, if 1% of your data goes away, that's a disaster. It's not coming back. So really, durability and integrity of data is our job one. And Google takes this very, very seriously. We have many engineers dedicated to this. Really, every Google engineer understands this. All of our frameworks, things like Bigtable and formerly GFS, now Colossus, are all geared towards ensuring this. And there's lots of systems in place to check and correct any lapses in data availability or integrity.

Another thing we'll talk about is redundancy, which people think makes stuff recoverable, but we'll see why it doesn't in a few slides from now. Another thing is MapReduce. Both a blessing and a curse. A blessing that you can now run jobs on 30,000 machines at once. A curse that now you've got files on 30,000 machines at once. And you know something's going to fail. So we'll talk about how we handle that. I'll talk about some of the things we've done to make the scaling of the backup resources not a linear function of the demand. So if you have 100 times the data, it should not take 100 times the effort to back it up. I'll talk about some of the things we've done to avoid that horrible linear slope.
Restoring versus backing up. That's a long discussion we'll have in a little bit. And finally, we'll wrap up with a case study, where Google dropped some data, but luckily, my team at the time got it back. And we'll talk about that as well.

So the first thing I want to talk about is what I said, my personal obsession. In that you really need to guarantee the data is available 100% of the time. People talk about how many nines of availability they have for a front end. You know, if I have 3 9s, 99.9% of the time, that's good. 4 9s is great. 5 9s is fantastic. 7 9s is absurd. It's femtoseconds of outage a year. It's just ridiculous. But with data, it really can't even be 100 minus epsilon. Right? It has to be there 100%. And why? This pretty much says it all. If I lose 200 k of a 2 gigabyte file, well, that sounds great statistically. But if that's an executable, what's 200 k worth of instructions? Right? I'm sure that the processor will find some other instruction to execute for that span of the executable. Likewise, these are my tax returns that the government's coming to look at tomorrow. Eh, those numbers couldn't have been very important. Some small slice of the file is gone. But really, you need all of your data. That's the lesson we've learned. It's not the same as front end availability, where you can get over it. You really can't get over data loss. A video garbled, that's the least of your problems. But it's still not good to have. Right? So, yeah, we go for 100%. Not even minus epsilon.

So a common thing that people think, and that I thought, to be honest, when I first got to Google, was, well, we'll just make lots of copies. OK? Great. And that actually is really effective against certain kinds of outages. For example, if an asteroid hits a data center, and you've got a copy in another data center far away. Unless that was a really, really great asteroid, you're covered. On the other hand, picture this. You've got a bug in your storage stack. OK? But don't worry, your storage stack guarantees that all writes are copied to all other locations in milliseconds. Great. You now don't have one bad copy of your data. You have five bad copies of your data. So redundancy is a long way from recoverability. Right? It handles certain things. It gives you location isolation, but really, there aren't as many asteroids as there are bugs in code or user errors. So it's not really what you want. Redundancy is good for a lot of things. It gives you locality of reference for I/O. Like if your only copy is in Oregon, but you have a front end server somewhere in Hong Kong. You don't want to have to go across the world to get the data every time. So redundancy is great for that, right? You can say, I want all references to my data to be something fairly local. Great. That's why you make lots of copies. But as I said, to say, I've got lots of copies of my data, so I'm safe? You've got lots of copies of your mistaken deletes or corrupt writes of a bad buffer. So this is not at all what redundancy protects you against.

Let's see, what else can I say? Yes. When you have a massively parallel system, you've got much more opportunity for loss. MapReduce, I think-- rough show of hands, who knows what I'm talking about when I say MapReduce? All right. That's above critical mass for me. It's a distributed processing framework that we use to run jobs on lots of machines at once. It's brilliant. It's my second favorite thing at Google, after the omelette station in the mornings. And it's really fantastic.
It lets you run a task, with trivial effort, on 30,000 machines at once. That's great until you have a bug. Right? Because now you've got a bug that's running on 30,000 machines at once. So when you've got massively parallel systems like this, the redundancy and the replication really just make your problems all the more horrendous. And you've got the same bugs waiting to crop up everywhere at once, and the effect is magnified incredibly.

Local copies don't protect against site outages. So a common thing that people say is, well, I've got RAID. I'm safe. And another show of hands. RAID, anybody? Yeah. Redundant array of inexpensive devices. Right? Why put it on one disk when you can put it on three disks? I once worked for someone who, it was really sad. He had a flood in his server room. So his machines are underwater, and he still said, but we're safe. We've got RAID. I thought, like-- I'm sorry. My mom told me I can't talk to you. Sorry. Yeah. It doesn't help against local problems. Right? Yes, local, if local means one cubic centimeter. Once it's a whole machine or a whole rack, I'm sorry. RAID was great for a lot of things. It's not protecting you from that. And we do things to avoid that. GFS is a great example. So the Google File System, which we used throughout all of Google until almost a year ago now, was fantastic. It takes the basic concept of RAID, using coding to put it on n targets at once, and you only need n minus one of them. So you can lose something and get it back. Instead of disks, you're talking about cities or data centers. Don't write it to one data center. Write it to three data centers. And you only need two of them to reconstruct it. That's the concept. And we've taken RAID up a level, or five levels, maybe. So to speak. So we've got that, but again, redundancy isn't everything.

One thing we found that works really, really well, and people were surprised. I'll jump ahead a little bit. People were surprised in 2011 when we revealed that we use tape to back things up. People said, tape? That's so 1997. Tape is great, because it's not disk. If we could, we'd use punch cards, like that guy in XKCD just said. Why? Because, imagine this. You've got a bug in the prevalent device driver used for SATA disks. Right? OK. Fine, you know what? My tapes are not on a SATA disk. OK? I'm safe. My punch cards are safe from that. OK? You want diversity in everything. If you're worried about site problems, locality, put it in multiple sites. If you're worried about user error, have levels of isolation from user interaction. If you want protection from a software bug, put it on different software. Different media implies different software. So we look for that. And, actually, what my team does is guarantee that all the data is isolated in every combination. Take the Cartesian product, in fact, of those factors. I want location isolation, isolation from application layer problems, isolation from storage layer problems, isolation from media failure. And we make sure that you're covered for all of those. Which is not fun until, of course, you need it. Then it's fantastic. So we look to provide all of those levels of isolation, or I should say, isolation in each of those dimensions. Not just location, which is what redundancy gives you.

Telephone. OK. This is actually my-- it's a plain slide, but it's my favorite slide-- I once worked at someplace, name to be withheld, OK? Because that would be unbecoming.
Where there was some kind of failure, and all the corporate data and a lot of the production data was gone. The people in charge of backups said, don't worry. We take a backup every Friday. They pop a tape in. It's empty. The guy's like, oh, well, I'll just go to last week's backup. It's empty, too. And he actually said, "I always thought the backups ran kind of fast. And I also wondered why we never went to a second tape." A backup is useless. It's a restore you care about.

One of the things we do is we run continuous restores. We take some sample of our backups, some percentage, a random sampling of n percent. It depends on the system. Usually it's like 5. And we just constantly select at random 5% of our backups, and restore them and compare them. Why? We checksum them. We don't actually compare them to the original. Why? Because I want to find out that my backup was empty before I lost all the data next week. OK? This is actually very rare, I found out. So we went to one of our tape drive vendors, who was amazed that our drives come back with all this read time. Usually, when they get the drives back, the drives log how much time they spent reading and writing, and most people don't actually read their tapes. They just write them. So I'll let you project out what a disaster that is waiting to happen. That's the, "I thought the tapes were-- I always wondered why we never went to a second tape." So we run continuous restores over some sliding window, 5% of the data, to make sure we can actually restore it. And that catches a lot of problems, actually.

We also run automatic comparisons. And it would be a bit burdensome to say we read back all the data and compare it to the original. That's silly, because the original has changed in the more-than-microseconds since you backed it up. So we do actually checksum everything, compare it to the checksums. Make sure it makes sense. OK. We're willing to say that checksum algorithms know what they're doing, and this is the original data. We actually get it back onto the source media, to make sure it can make it all the way back to disk or to Flash, or wherever it came from initially. To make sure it can make a round trip. And we do this all the time. And we expect there'll be some rate of failure, but we don't look at, oh, like, a file did not restore on the first attempt. It's more like, the rate of failure on the first attempt is typically m. The rate of failure on the second attempt is typically y. If those rates change, something has gone wrong. Something's different. And we get alerted. But these things are running constantly in the background, and we know if something fails. And, you know, luckily we find out before someone actually needed the data. Right? Which [INAUDIBLE] before, because you only need the restore when you really need it. So we want to find out if it was bad before then, and fix it.

Not the smoothest segue, but I'll take it. One thing we have found that's really interesting is that, of course, things break. Right? Everything in this world breaks. The second law of thermodynamics says that we can't fight it. But we do what we can to safeguard it. Who here thinks that tapes break more often than disks? Who thinks disk breaks more often than tape? Lifetime, it's not a fair comparison. But to be fair, let's say mean time to failure, if they were constantly written and read at the same rates, which would break more often? Who says disk? Media failure. Yeah. Disk. Disk breaks all the time, but you know what? You know when it happens.
That's the thing. So you have RAID, for example. A disk breaks, you know it happened. You're monitoring it. Fine. A tape was locked up in a warehouse somewhere. It's like a light bulb. People say, why do light bulbs break when you hit the switch and turn them on? That's stupid. It broke last week, but you didn't know until you turned it on. OK? It's Schrödinger's breakage. Right? It wasn't there until you saw it. So tapes actually last a long time. They last very well, but you don't find out until you need them. That's the lesson we've learned. So the answer is: read them before you need them. OK?

Even given that, they do break. So we do something that, as far as I know, is fairly unique. We have RAID on tape, in effect. We have RAID 4 on tape. We don't write your data to just one tape. Because if you care about your data, it's too valuable to put on one tape and trust that one single point of failure. Because they're cartridges. The robot might drop them. Some human might kick it across the parking lot. Magnetic fields. A neutrino may finally decide to interact with something. You have no idea. OK? So we don't take a chance. When we write something to tape, we tell you, hold on to your source data until we say it's OK to delete it. Do not alter this. If you do, you have broken the contract. And who knows what will happen. We build up some number of full tapes, typically four. And then we generate a fifth tape, a code tape, by XORing everything together. And we generate a checksum. OK. Now you've got RAID 4 on tape. Once we've got those five tapes, you could lose any one of them, and we can reconstruct the data by XORing it back, in effect. We now say, OK, you can change your source data. These tapes have made it to their final physical destination. And they are redundant. They are protected. And if it wasn't worth that wait, it wasn't worth backing up; it couldn't have been that important, really. So every bit of data that's backed up gets subjected to this. And this gives you a fantastic increase. The chances of a tape failure, I mean, we probably lose hundreds a month. We don't have hundreds of data losses a month, because of this. If you lose one tape, our system detects this through the continuous restores. It immediately recalls the sibling tapes, rebuilds another code tape. All is well. In the rare case, and I won't say it doesn't happen, in the rare case where two tapes in the set are broken, well, now you're kind of hosed. But only if the same spot on the two tapes was lost. So we do reconstruction at the sub-tape level. And we really don't have data loss, because of these techniques. They're expensive, but that's the cost of doing business. I talked about light bulbs already, didn't I? OK.
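To make that parity scheme concrete, here is a minimal sketch, in Python, of the RAID-4-style idea just described: four full data tapes plus a fifth code tape built by XORing them together, so any single lost tape can be rebuilt from the other four. The tape images, sizes, and function names are invented for illustration; this is the shape of the technique, not Google's tape format or tooling.

    # Illustrative sketch only: model "tapes" as equal-length byte strings and
    # build a RAID-4-style parity tape by XORing the data tapes together.

    def build_parity(tapes):
        """XOR the data tapes byte-by-byte to produce the code (parity) tape."""
        assert len({len(t) for t in tapes}) == 1, "tape images must be equal length"
        parity = bytearray(len(tapes[0]))
        for tape in tapes:
            for i, b in enumerate(tape):
                parity[i] ^= b
        return bytes(parity)

    def rebuild_missing(surviving_tapes, parity):
        """Reconstruct the single lost tape by XORing the survivors with the parity."""
        return build_parity(list(surviving_tapes) + [parity])

    if __name__ == "__main__":
        data_tapes = [bytes([i] * 8) for i in (1, 2, 3, 4)]   # four full data tapes
        code_tape = build_parity(data_tapes)                  # the fifth, code tape
        lost = data_tapes[2]
        survivors = data_tapes[:2] + data_tapes[3:]
        assert rebuild_missing(survivors, code_tape) == lost  # any one tape is recoverable

The same XOR that builds the code tape also rebuilds a missing sibling, which is why losing any one of the five is survivable, while losing the same spot on two of them is not.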
So let's switch gears and talk about backing up lots of things. I mentioned MapReduce. Not quite at the level of 30,000, but typically our jobs produce many, many files. The files are sharded. So you might have replicas in Tokyo, replicas in Oregon, and replicas in Brussels. And they don't have exactly the same data. They have data local to that environment. Users in that part of the world, requests that have been made referencing that. Whatever. But the data is not redundant across all of them in the first place. So you have two choices. You can make a backup of each of them, and then say, you know what? I know I've got a copy of every bit. And when I have to restore, I'll worry about how to consolidate it then. OK. Not a great move, because when will that happen? It could happen at 10:00 AM, when you're well rested on a Tuesday and your inbox is empty. It could happen at 2:30 in the morning, when you just got home from a party at midnight. Or it could happen on Memorial Day in the US, which is also, let's say, a bank holiday in Dublin. It's going to happen in the last one. Right? So it's late at night. You're under pressure because you've lost a copy of your serving data, and it's time to restore. Now let's figure out how to restore all this and consolidate it. Not a great idea. You should have done all your thinking back when you were doing the backup. When you had all the time in the world.

And that's the philosophy we follow. We make the backups as complicated as they need to be, and let them take as long as they need. The restores have to be quick and automatic. I want my cat to be able to stumble over my keyboard, bump her head against the Enter key, and start a successful restore. And she's a pretty awesome cat. She can almost do that, actually. But not quite, but we're working on it. It's amazing what some electric shock can do. Early on, we didn't have this balance, to be honest. And then I found this fantastic cookbook, where it said, hey, make your backups quick and your restores complicated. And yes, that's a recipe for disaster. Lesson learned. We put all the stress on restores. Recovery should be stupid, fast, and simple. But the backups take too long? No, they don't. The restore is what I care about. Let the backup take forever. OK? There are, of course, some situations where that just doesn't work. And then you compromise with the world. But a huge percentage of our systems work this way. The backups take as long as they take. The client services that are getting the data backed up know this expectation and deal with it. And our guarantee is that the restores will happen quickly, painlessly, and hopefully without user problems. And I'll show in a little while, in the case that I'm going to talk about, what fast means. Fast doesn't necessarily mean microseconds in all cases. But relatively fast, within reason. As a rule, yes. When the situation calls for data recovery, think about it. You're under stress. Right? Something's wrong. You've probably got somebody who is at a much higher pay grade than you looking at you in some way. Not the time to sit there and formulate a plan. OK? Time to have the cat hit the button. OK.

And then we have an additional problem at Google, which is scale. So I'll confess, I used to lie a lot. I mean, I still may lie, but not in this regard. I used to teach, and I used to tell all of my students, this is eight years ago, nine years ago. I used to tell them there is no such thing as a petabyte. That is like this hypothetical construct, a thought exercise. And then I came to Google. And in my first month, I had copied multiple petabyte files from one place to another. It's like, oh. Who knew? So think about what this means. If you have something measured in gigabytes or terabytes and it takes a few hours to back up, no big deal. If you have ten exabytes, gee, if that scales up linearly, I'm going to spend ten weeks backing up every day's data. OK? In this world, that cannot work. Linear time and all that. So, yeah, we have to learn how to scale these things up. We've got a few choices. We've got dozens of data centers all around the globe. OK. Do you give near infinite backup capacity in every site?
Do you cluster things so that all the backups in Europe happen here, all the ones in North America happen here, Asia and the Pacific Rim happen here? OK, then you've got bandwidth considerations. How do I ship the data? Oh, didn't I need that bandwidth for my serving traffic? Maybe it's more important to make money. So you've got a lot of considerations when you scale this way. And we had to look at the relative costs. And, yeah, there are compromises. We don't have backup facilities in every site. We've got to find the best fit. Right? And it's a big problem in graph theory, right? How do I balance the available capacity on the network versus where it's cost effective to put backups? Where do I get the most bang for the buck, in effect? And sometimes we tell people, no, put your service here, not there. Why? Because you need this much data backed up, and we can't do it from there. Unless you're going to make a magical network for us. We've got the speed of light to worry about. So this kind of optimization, this kind of planning, goes into our backup systems, because it has to. Because when you've got, like I said, exabytes, there are real world constraints at this point that we have to think about. Stupid laws of physics. I hate them.

And then there's another interesting thing that comes into play, which is, what do you scale? You can't just say, I want more network bandwidth and more tape drives. OK, each of those tape drives breaks every once in a while. If I've got 10,000 times the number of drives, do I have to have 10,000 times the number of operators to go replace them? Do I have 10,000 times the amount of loading dock to put the tape drives on until the truck comes and picks them up? None of this can scale linearly. It all has to do better than that. My third favorite thing in the world is this quote. I've never verified how accurate it is, but I love it anyway. Which was some guy in the post World War II era, I think it is, when America's starting to get into telephones, and they become one in every household. This guy makes a prediction that in n years' time, in five years' time, at our current rate of growth, we will employ a third of the US population as telephone operators to handle the phone traffic. Brilliant. OK? What he, of course, didn't see coming was automated switching systems. Right? This is the leap that we've had to take with regards to our backup systems. We can't have 100 times the operators standing by, not phone operators. Computer hardware operators standing by to replace bad tape drives, put tapes into slots. It doesn't scale. So we've automated everything we can.

Scheduling is all automated. If you have a service at Google, you say, here's my data stores. I need a copy every n. I need the restores to happen within m. And systems internally schedule the backups, check on them, run the restores. The restore testing that I mentioned earlier, you don't do this. Because it wouldn't scale. The restore testing, as I mentioned earlier, is happening continuously. Little funny daemons are running it for you. And alert you if there's a problem. Integrity checking, likewise. The checksums are being compared automatically. It's not like you come in every day and look at your backups to make sure they're OK. That would be kind of cute. When tapes break-- I've been involved in backups for three years, I don't know when a tape breaks. When a broken tape is detected, the system automatically looks up who the siblings are in the redundancy set described earlier.
It recalls the siblings, it rebuilds the tape, sends the replacement tape back to its original location, and marks that tape x has been replaced by tape y. And then at some point, you can do a query, I wonder how many tapes broke. And if the rate of breakage changes, like we typically see 100 tapes a day broken, and all of a sudden it's 300, then I would get alerted. But until then, why are you telling me about 100 tapes? It was the same as last week. Fine. That's how it is. But if the rate changes, you're interested. Right? Because maybe you've just got a bunch of bad tape drives. Maybe, like I said, a neutrino acted up. We don't know what happened. But something is different. You probably want to know about it. OK? But until then, it's automated. Steady state operations, humans should really not be involved. This is boring.

Logistics. Packing the drives up and shipping them off. Obviously, at this point in time, humans still have to go and actually remove it and put it in a box. But as far as printing labels and getting RMA numbers, I'm not going to ask some person to do that. That's silly. We have automated interfaces that get RMA numbers, that prepare shipping labels, that look to make sure that drives that should have gone out have, in fact, gone out. Getting acknowledgement of receipt. And if that breaks down, a person has to get involved. But if things are running normally, why are you telling me? Honestly, this is not my concern. I have better things to think about. Library software maintenance, likewise. If we get firmware updates, I'm not going to go swap an SD card because the library's-- that's crazy. OK? Download it. Let it get pushed to a canary library. Let it be tested. Let the results be verified as accurate. Then schedule upgrades in all the other libraries. I really don't want to be involved. This is normal operations. Please don't bother me. And this kind of automation is what lets us-- in the time I've been here, our number of tape libraries and backup systems has gone up at least a full order of magnitude. And we don't have 10 or 100 times the people involved. Obviously, we have some number more, but it's far from a linear increase in resources.

So how do you do this? Right. We have to do some things, as I say, that are highly parallelizable. Yes. We collect the source data from a lot of places. OK? This is where we have our Swiss Army knife: MapReduce. OK? MapReduce is, like I said, it's my favorite. One of my favorites. When we say we've got files on all these machines, fine. Let a MapReduce go collect them all. OK? Put them into a big funnel. Spit out one cohesive copy. OK? And I will take that thing and work on it. If a machine breaks, because when you've got 30,000 machines, each one has a power supply, four hard disks, a network card, something's breaking every 28 seconds or so. OK? So let MapReduce and its sister systems handle that. Don't tell me a machine died. One's dying twice a minute. That's crazy. OK? Let MapReduce and its cohorts find another machine, move the work over there, and keep hopping around until everything gets done. If there's a dependency, like, oh, this file can't be written until this file is written. Again, don't bother me. Schedule a wait. If something waits too long, by all means, let me know. But really, you handle your scheduling. This is an algorithm's job, not a human's. OK? And we found clever ways to adapt MapReduce to do all of this. It really is a Swiss Army knife. It handles everything.
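A rough illustration of the "alert on a change in the rate, not on individual failures" idea that keeps coming up here: the sketch below tracks broken tapes per day against a recent baseline and only raises an alert when the rate drifts sharply in either direction. The window size, tolerance, and alerting hook are invented for the example; real monitoring at this scale would obviously be far more involved.

    # Illustrative sketch: alert when the failure rate changes, not on each failure.
    from collections import deque

    class FailureRateWatcher:
        def __init__(self, window_days=28, tolerance=0.5):
            self.history = deque(maxlen=window_days)   # broken tapes per day, recent window
            self.tolerance = tolerance                 # allowed relative drift from baseline

        def record_day(self, failures_today):
            if self.history:
                baseline = sum(self.history) / len(self.history)
                # Any big change, up *or* down, is suspicious. A sudden drop may just
                # mean half the nodes stopped reporting anything.
                if baseline and abs(failures_today - baseline) / baseline > self.tolerance:
                    self.alert(failures_today, baseline)
            self.history.append(failures_today)

        def alert(self, today, baseline):
            print(f"ALERT: failure rate changed ({today}/day vs. baseline {baseline:.0f}/day)")

    watcher = FailureRateWatcher()
    for day in [100, 95, 102, 98, 300]:   # ~100 broken tapes a day is normal; 300 is not
        watcher.record_day(day)

Note that a drop toward zero is treated as just as suspicious as a spike, which matches the point made later in the Q&A that a falling failure rate may simply mean the failures are no longer being reported.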
Then this went on for a number of years at Google. I was on the team for, I guess, about a year when this happened. And then we actually sort of had to put our money where our mouth was. OK? In early 2011, you may or may not recall, Gmail had an outage, the worst kind. Not, we'll be back soon. Like, oops, where's my account? OK. Not a great thing to find. I remember at 10:31 PM on a Sunday, the pager app on my phone went off with the words, "Holy crap," and a phone number. And I turned to my wife and said, this can't be good news. I doubt it. She was like, maybe someone's calling to say hi. I'm going to bet not. Sure enough, not. So there was a whole series of bugs and mishaps that you can read about, what was made public and what wasn't. But I mean, this was a software bug, plain and simple. We have unit tests, we have system tests, we have integration tests. Nonetheless, 1 in 8 billion bugs gets through. And of course, this is the one. And the best part is it was in the layer where replication happens. So as I said, I've got two other copies, yes. And you have three identical empty files. You work with that. So we finally had to go to tape and, luckily for me, reveal to the world that we use tape. Because until then, I couldn't really tell people what I did for a living. So, what do you do? I eat lunch. I could finally say, yes, yes, I do that.

So we had to restore from tape. And it's a massive job. And this is where I mentioned that the meaning of a short time or immediately is relative to the scale. Right? Like if you were to say, get me back that gigabyte instantly, instantly means milliseconds or seconds. If you say, get me back those 200,000 inboxes of several gig each, probably you're looking at more than a few hundred milliseconds at this point. And we'll go into the details of it in a little bit. But we decided to restore from tape. I woke up a couple of my colleagues in Europe, because it was daytime for them, and it was nighttime for me. And again, I know my limits. Like you know what? I'm probably stupider than they are all the time, but especially now, when it's midnight, and they're getting up anyway. OK? So the teams are sharded for this reason. We got on it. It took some number of days that we'll look at in more detail in a little bit. We recovered it. We restored the user data. And done. OK. And it didn't take on the order of weeks or months. It took on the order of single digit days. Which I was really happy about. I mean, there have been, and I'm not going to go into unbecoming specifics about other companies, but you can look. There have been cases where providers of email systems have lost data. One in particular I'm thinking of took a month to realize they couldn't get it back. And 28 days later, said, oh, you know, for the last month, we've been saying wait another day. It turns out, nah. And then a month after that, they actually got it back. But nobody cared. Because everybody had found a new email provider by then. So we don't want that. OK. We don't want to be those guys and gals. And we got it back on the order of single digit days. Which is not great, and we've actually taken steps to make sure it would happen a lot faster next time. But again, we got it back, and we had our expectations in line.

How do we handle this kind of thing, like Gmail, where the data is everywhere? OK. We don't have backups tied to one place. Like, for example, let's say we had a New York data center. Oh, my backups are in New York. That's really bad.
Because then if one data center grows and shrinks, the backup capacity has to grow and shrink with it. And that just doesn't work well. We view our backup system as this enormous global thing. Right? This huge meta-system or this huge organism. OK? It's worldwide. And we can move things around. OK. When you back up, you might back up to some other place entirely. The only thing, obviously, is once something is on a tape, the restore has to happen there. Because tapes are not magic, right? They're not intangible. But until it makes it to a tape, you might say, my data is in New York. Oh, but your backup's in Oregon. Why? Because that's where we had available capacity, had location isolation, et cetera, et cetera. And it's really one big happy backup system globally. And the end users don't really know that. We never tell any client service, unless there's some sort of regulatory requirement, we don't tell them, your backups will be in New York. We tell them, you said you needed location isolation. With all due respect, please shut up and leave us to our jobs. And where is it? I couldn't tell you, to be honest. That's the system's job. That's a job for robots to do. Not me. And this works really well, because it lets us move capacity around. And not worry about, if I move physical disks around, I don't have to move tape drives around with it. As long as the global capacity is good and the network can support it, we're OK. And we can view this one huge pool of backup resources as just that.

So now, the details I mentioned earlier. The Gmail restore. Let me see. Who wants to give me a complete SWAG? Right? A crazy guess as to how much data is involved if you lose Gmail? You lose Gmail. How much data? What units are we talking about? Nobody? What? Not quite yottabytes. On The Price Is Right, you just lost. Because you can't go over. I'm flattered, but no. It's not yottabytes. It's on the order of many, many petabytes. Or approaching low exabytes of data. Right? You've got to get this back somehow. OK? Tapes are finite in capacity. So it's a lot of tapes. So we're faced with this challenge. Restore all that as quickly as possible. OK. I'm not going to tell you the real numbers, because if I did, like, the microwave lasers on the roof would probably take care of me. But let's talk about what we can say. So I read this at the time, and I'm not being facetious, it was fantastic. There was a fantastic analysis of what Google must be doing right now during the Gmail restore, which they publicized. And it wasn't perfectly accurate, but it had some reasonable premises. It had a logical methodology. And it wasn't insane. So I'm going to go with the numbers they said. OK? They said we had 200,000 tapes to restore. OK? So at the time, the industry standard was LTO 4. OK? And LTO 4 tapes hold about 0.8 terabytes of data, and read at about 128 megabytes per second. They take roughly two hours to read. OK. So if you take the amount of data we're saying Gmail must have, and you figure out how many tapes that is at the capacity of a tape, and 2 hours per tape, that's this many drive hours. You want it back in an hour? Well, that's impossible, because the whole tape takes two hours. OK, let's say I want it back in two hours. So I've got to have 200,000 tape drives at work at once. All right. Show of hands. Who thinks we actually have 200,000 tape drives? Even worldwide. Who thinks we have 200,000 tape drives? Really? You're right. OK. Thank you for being sane. Yes, we do not have-- I'll say this-- we do not have 200,000 tape drives. We have some number. It's not that.
So let's say I had one drive. No problem. In 16.67 thousand days, I'll have it back. Probably you've moved on to some other email system. Probably the human race has moved on to some other planet by then. Not happening. All right? So there's a balance. Right? Now, we restored the data from tape in less than two days, which I was really proud of. So if you do some arithmetic, this tells you that it would've taken 8,000 drives, non-stop, doing nothing but this, to get it back. 8,000 drives are required to do this, with the numbers I've given earlier. OK. Typical tape libraries that are out there from Oracle and IBM and Spectra Logic and Quantum, these things hold several dozen drives. Like between 15 and 100 drives. So that means if you divide the number of drives we're talking about by the capacity of a library, you must have had 100 libraries. 100 large tape libraries doing nothing else but restoring Gmail for over a day. If you have them in one location, if you look at how many kilowatts or megawatts of power each library takes, I don't know. I'm sorry. You, with the glasses, what's your question?
AUDIENCE: How much power is that?
RAYMOND BLUM: 1.21 gigawatts, I believe.
AUDIENCE: Great Scott!
RAYMOND BLUM: Yes. We do not have enough power to charge a flux capacitor in one room. If we did, that kid who beat me up in high school would never have happened. I promise you, we did not have that much power. So how did we handle this? Right? OK. I can tell you this. We did not have 1.21 gigawatts worth of power in one place. The tapes were all over. Also, it's a mistake to think that we actually had to restore 200,000 tapes worth of data to get that much content back. Right? There's compression. There's checksums. There's also prioritized restores. Do you actually need all of your archived folders back as quickly as you need your current inbox and your sent mail? OK. If I tell you there's accounts that have not been touched in a month, you know what? I'm going to give them the extra day. On the other hand, I read my mail every two hours. You get your stuff back now, Miss. That's easy. All right? So there's prioritization of the restore effort. OK. There's different data compression and checksumming. It wasn't as much data as they thought, in the end, to get that much content. And it was not 1.21 gigawatts in a room.

And, yeah, so that's a really rough segue into this slide. But one of the things that we learned from this was that we had to pay more attention to the restores. Until then, we had thought backups are super important, and they are. But they're really only a tax you pay for the luxury of a restore. So we started thinking, OK, how can we optimize the restores? And I'll tell you, although I can't give you exact numbers, can't and won't give you exact numbers, it would not take us nearly that long to do it again today. OK. And it wouldn't be fraught with that much human effort. It really was a learning exercise. And we got through it. But we learned a lot, and we've improved things. And now we really only worry about the restore. We view the backups as some known, steady state thing. If I tell you you've got to hold onto your source data for two days, because that's how long it takes us to back it up, OK. You can plan for that. As long as I can promise you, when you need the restore, it's there, my friend. OK? So the backup, like I said, it's a steady state. It's a tax you pay for the thing you really want, which is data availability.
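For anyone who wants to check the back-of-the-envelope numbers from that external analysis (their figures, not Google's real ones), a few lines of arithmetic reproduce them: 200,000 LTO 4 tapes at roughly two hours per tape, restored in under two days, works out to on the order of 8,000 drives, or around 100 large libraries. The 80-drives-per-library figure is an assumption picked from the "between 15 and 100" range mentioned above.

    # Back-of-the-envelope arithmetic for the publicized Gmail-restore estimate.
    tapes          = 200_000
    hours_per_tape = 2          # ~0.8 TB read at ~128 MB/s
    restore_hours  = 48         # "less than two days"
    drives_per_lib = 80         # assumed; a large library holds several dozen drives

    drive_hours   = tapes * hours_per_tape          # 400,000 drive-hours of reading
    drives_needed = drive_hours / restore_hours     # ~8,300 drives running non-stop
    libraries     = drives_needed / drives_per_lib  # ~100 large libraries

    days_with_one_drive = drive_hours / 24          # ~16,700 days with a single drive

    print(f"{drives_needed:,.0f} drives, ~{libraries:.0f} libraries, "
          f"or {days_with_one_drive:,.0f} days on one drive")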
On the other hand, when a restore happens, like the Gmail restore, we need to know that we can service that now and quickly. So we've adapted things a lot towards that. We may not make the most efficient use of media, actually, anymore. Because it turns out that taking two hours to read a tape is really bad. Increase the parallelism. How? Maybe only write half a tape. You write twice as many tapes, but you can read them all in parallel. I get the data back in half the time if I have twice as many drives. Right? Because if I fill up the tape, I can't take the tape, break it in the middle, and say to two drives, here, you take half. You take half. On the other hand, if I write the two tapes, side A and side B, I can put them in two drives. Read them in parallel. So we do that kind of optimization now. Towards fast, reliable restores.

We also have to look at the fact that a restore is a non-maskable interrupt. So one of the reasons we tell people, your backups, really, don't consider them done until we say they're done. When a restore comes in, that trumps everything. OK? Backups are suspended. Get the restore done now. That's another lesson we've learned. It's a restore system, not a backup system. OK? Restores win. Also, restores have to win for another reason. Backups, as I mentioned earlier, are portable. If I say back up your data from New York, and it goes to Chicago, what do you care? On the other hand, if your tape is in Chicago, the restore must happen in Chicago. Unless I'm going to FedEx the tape over to New York, somehow, magically. Call the Flash, OK? So we've learned how to balance these things. That we, honestly, we wish backups would go away. We want a restore system.

So quickly jumping to a summary. OK? We have found also, the more data we have, the more important it is to keep it. Odd but true. Right? There's no economy of scale here. The larger things are, the more important they are, as a rule. Back in the day when it was web search, right, in 2001, what was Google? It was a plain white page, right? With a text box. People would type in Britney Spears and see the result of a search. OK. Now Google is Gmail. It's stuff held in Drive, stuff held in Vault. Right? Docs. OK? It's larger and more important, which kind of stinks. OK. Now the backups are both harder and more important. So we've had to keep the efficiency improving. Utilization and efficiency have to skyrocket. Because, as I said earlier, twice as much data or 100 times the data can't require 100 times the staff power and machine resources. It won't scale. Right? The universe is finite. So we've had to improve utilization and efficiency a lot. OK?

Something else that's paid off enormously is having good infrastructure. Things like MapReduce. I guarantee when Jeff and Sanjay wrote MapReduce, they never thought it would be used for backups. OK? But it's really good. It's really good to have general purpose Swiss Army knives. Right? Like someone looked, and I really give a lot of credit, the guy who wrote our first backup system said, I'll bet I could write a MapReduce to do that. I hope that guy got a really good bonus that year, because that's awesome thinking. Right? That's visionary. And it's important to invest in infrastructure. If we didn't have MapReduce, it wouldn't have been there for this dire [INAUDIBLE].
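Since the talk keeps coming back to "write a MapReduce to do that," here is a toy, map/reduce-shaped sketch of the collection step described earlier: gather sharded files from many machines and funnel them into one cohesive copy to back up. It uses plain Python rather than Google's MapReduce, and the shard names and merge logic are invented purely for illustration.

    # Toy map/reduce-shaped sketch of collecting shards into one cohesive backup copy.
    from itertools import groupby
    from operator import itemgetter

    # Imagine each worker's shard is a list of (user, record) pairs on some machine.
    shards = {
        "tokyo-042":    [("alice", "mail-1"), ("carol", "mail-7")],
        "oregon-117":   [("alice", "mail-2"), ("bob",   "mail-3")],
        "brussels-009": [("bob",   "mail-4"), ("carol", "mail-8")],
    }

    def map_phase(shards):
        # Map: emit (key, value) pairs from every shard, wherever it lives.
        for shard in shards.values():
            for user, record in shard:
                yield user, record

    def reduce_phase(pairs):
        # Reduce: group by key and merge into one cohesive copy per user.
        ordered = sorted(pairs, key=itemgetter(0))
        return {user: [rec for _, rec in group]
                for user, group in groupby(ordered, key=itemgetter(0))}

    backup_image = reduce_phase(map_phase(shards))
    # {'alice': ['mail-1', 'mail-2'], 'bob': ['mail-3', 'mail-4'], 'carol': ['mail-7', 'mail-8']}

The point of the shape is the one made in the talk: the framework, not a human, worries about which machines hold which shards and about retrying when one of them dies mid-job.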
And this kind of thing has paid off for us enormously in a very short time. When I joined the tech infrastructure group that's responsible for this, we had maybe 1/5 or 1/6 the number of backup sites and the capacity that we do now. And maybe we doubled the staff. We certainly didn't quintuple it or sextuple it. We had an increase, but it's not linear at all. But scaling is really important, and you can't have any piece of it that doesn't scale. Like you can't say, I'm going to be able to deploy more tape drives. OK. But what about operations staff? Oh, I have to scale that up. Oh, let's hire twice as many people. OK. Do you have twice as many parking lots? Do you have twice as much space in the cafeteria? Do you have twice as much salary to give out? That last part is probably not the problem, actually. It's probably more likely the parking spots and the restrooms. Everything has to scale up. OK. Because if there's one bottleneck, you're going to hit it, and it'll stop you. Even the processes, like I mentioned, our shipping processes, had to scale and become more efficient. And they are.

And we don't take anything for granted. One thing that's great, one of our big SRE mantras, is hope is not a strategy. This will not get you through it. This will not get you through it. Sacrificing a goat to Minerva, I've tried, will not get you through it. OK. If you don't try it, it doesn't work. That's it. When you start backing up a service at Google-- I mean, force is not the right word-- but we require that they restore it and load it back into a serving system and test it. Why? Because it's not enough to say, oh, look, I'm pretty sure it made it this far. No. You know what? Take it the rest of the way, honestly. Oh, who knew that would break? Indeed, who knew. The morgue is full of people who knew that they could make that yellow light before it turned red. So until you get to the end, we don't consider it proven at all. And this has paid off enormously. We have found failures at the point of, what could go wrong? Who knew? Right? And it did. So unless it's gone through an experiment all the way to completion, we don't consider it proven. If there's anything unknown, we consider it a failure. That's it.

And towards that, I'm going to put in a plug for DiRT, one of my pet projects. So DiRT is something we publicized for the first time last year. It's been going on quite a while. Disaster Recovery Testing. Every n months at Google, where n is something less than 10 billion and more than zero, we have a disaster. It might be that Martians attack California. It might be Lex Luthor finally getting sick of all of us and destroying the Northeast. It might be cosmic rays. It might be solar flares. It might be the IRS. Some disaster happens. OK? On one of those cosmic orders of magnitude. And the rest of the company has to see how we will continue operations without California, North America, tax returns, whatever it is. And we simulate this at every level, down to the point where if you try to contact your teammate in Mountain View and say, hey, can you cover? The response will be, I'm underwater. Glub, glub. Look somewhere else. OK? Or, I'm lying under a building. God forbid. Right? But you have to see how the company will survive and continue operations without that. That being whatever's taken by the disaster. We don't know what the disaster will be. We find out when it happens. You'll come in one day, hey, I can't log on. Oops. I guess DiRT has started. OK. And you've got to learn to adapt.
And this finds enormous holes in our infrastructure, in physical security. Imagine something like, we've got a data center with one road leading to it. And we have trucks filled with fuel trying to bring fuel for the generators. That road is out. Gee, better have another road and another supplier of diesel fuel for your generators. From that level, down through simple software changes. Like, oh, you should run in two cells that are not in any way bound to each other. So we do this every year. It pays off enormously. It's a huge boon to the caffeine industry. I'm sure the local coffee suppliers love when we do this. They don't know why, but every year, around that time, sales spike. And it teaches a lot every year. And what's amazing is, after several years of doing this, we still find new kinds of problems every year. Because apparently the one thing that is infinite is trouble. There are always some new problems to encounter. Really, what's happening is you got through last year's problems. There's another one waiting for you just beyond the horizon. Like so. Disaster. Wow. It looks like I really did something. OK. So I'll just carry it. Ah. And, yes, there's no backup. But luckily, I'm an engineer. I can clip things.

So with that, that pretty much is what I had planned to talk about. And luckily, because that is my last slide. And I'm going to open up to any questions. Please come up to a mic or, I think this is the only mic. Right?
AUDIENCE: Hi.
RAYMOND BLUM: Hi.
AUDIENCE: Thanks. I've no shortage of questions.
RAYMOND BLUM: Start.
AUDIENCE: My question was, do you dedupe the files, or record only one copy of a file that you already have an adequate number of copies of?
RAYMOND BLUM: There's not a hard, set answer for that. And I'll point out why. Sometimes the process needed to dedupe is more expensive than keeping multiple copies. For example, I've got a copy in Oregon and a copy in Belgium. And they're really large. Well, you know what? For me to run the checksumming and the comparison-- you know what? Honestly, just put it on tape.
AUDIENCE: That's why I said an adequately backed up copy.
RAYMOND BLUM: Yes. On the other hand, there are some, like for example, probably things like Google Play Music have this, you can't dedupe. Sorry. The law says you may not. I'm not a lawyer, don't take it up with me.
AUDIENCE: You can back up, but you can't dedupe?
RAYMOND BLUM: You cannot dedupe. You cannot say, you're filing the same thing. Nuh-uh. If he bought a copy, and he bought a copy, I want to see two copies. I'm the recording industry. I'm not sure that's exactly the use case, but those sorts of things happen. But yeah. There's a balance. Right? Sometimes it's a matter of deduping, sometimes it's just backing up the file, sometimes.
AUDIENCE: But my question is, what do you do?
RAYMOND BLUM: It's a case by case basis. It's a whole spectrum.
AUDIENCE: Sometimes you dedupe, sometimes you back up.
RAYMOND BLUM: Right? There's deduping. Dedupe by region. Deduping by time stamp.
AUDIENCE: And when you have large copies of things which you must maintain an integral copy of, and they're changing out from under you, how do you back them up? Do you front-run the blocks?
RAYMOND BLUM: Sorry, you just crashed my parser.
AUDIENCE: Let's say you have a 10 gigabyte database, and you want an integral copy of it in the backups. But it's being written while you're backing it up.
RAYMOND BLUM: Oh. OK. Got it.
AUDIENCE: How do you back up an integral copy?
RAYMOND BLUM: We don't.
We look at all the mutations applied, and we take basically a low watermark. We say, you know what, I know that all the updates as of this time were there. Your backup is good as of then. There may be some trailing things after, but we're not guaranteeing that. We are guaranteeing that as of now, it has integrity. And you'll have to handle that somehow.
AUDIENCE: I don't understand. If you have a 10 gigabyte database, and you back it up linearly, at some point, you will come upon a point where someone will want to write something that you haven't yet backed up. Do you defer the write? Do you keep a transaction log? I mean, there are sort of standard ways of doing that. How do you protect the first half from being integral, and the second half from being inconsistent with the first half?
RAYMOND BLUM: No. There may be inconsistencies. We guarantee that there is consistency as of this point in time.
AUDIENCE: Oh, I see. So the first half could be integral.
RAYMOND BLUM: Ask Heisenberg. I have no idea. But as of then, I can guarantee that everything's cool.
AUDIENCE: You may never have an integral copy.
RAYMOND BLUM: Oh, there probably isn't. No, you can say this snapshot of it is good as of then.
AUDIENCE: It's not complete.
RAYMOND BLUM: Right. The whole thing is not. But I can guarantee that anything as of then is. After that, you're on your own.
AUDIENCE: Hi. You mentioned having backups in different media, like hard disk and tape.
RAYMOND BLUM: Cuneiform.
AUDIENCE: Adobe tablets. And you may recall that there was an issue in South Korea a while ago, where the supply of hard disks suddenly dropped, because there was a manufacturing issue. Do you have any supply chain management redundancy strategies?
RAYMOND BLUM: Yes. Sorry. I don't know that I can say more than that, but I can say yes. We do have some supply chain redundancy strategy.
AUDIENCE: OK. And one other question was, do you have, like, an Amazon Web Services Chaos Monkey-like strategy for your backup systems? In general, for testing them? Kind of similar to DiRT, but only for backups?
RAYMOND BLUM: You didn't quite crash it, but my parser is having trouble. Can you try again?
AUDIENCE: Amazon Web Services has this piece of software called Chaos Monkey that randomly kills processes. And that helps them create redundant systems. Do you have something like that?
RAYMOND BLUM: We do not go around sniping our systems. We find that failures occur quite fine on their own. No. We don't actively do that. But that's where I mentioned that we monitor the error rate. In effect, we know there are these failures. All right. Failures happen at n. As long as it's at n, it's cool. If there is a change in the failure rate, that is actually a failure. It's the derivative, let's say, of failures that's actually a failure. And the systems are expected to handle the constant failure rate.
AUDIENCE: But if it goes down?
RAYMOND BLUM: That's a big change in the failure rate.
AUDIENCE: If the rate goes down, is that OK, then?
RAYMOND BLUM: If what goes down?
AUDIENCE: The failure rate.
RAYMOND BLUM: Oh, yeah, that's still a problem.
AUDIENCE: Why is it a problem?
RAYMOND BLUM: It shouldn't have changed. That's it. So we will look and say, ah, it turns out that zero failures, a reduction in failures, means that half the nodes aren't reporting anything. Like that kind of thing happens. So we look at any change as bad.
AUDIENCE: Thank you.
RAYMOND BLUM: You're welcome.
AUDIENCE: Hi. Two questions. One is simple, kind of yes or no.
So I have this very important cat video that I made yesterday. Is that backed up?
RAYMOND BLUM: Pardon?
AUDIENCE: The important cat video that I made yesterday, is that backed up?
RAYMOND BLUM: Do you want me to check?
AUDIENCE: Yes or no?
RAYMOND BLUM: I'm going to say yes. I don't know the specific cat, but I'm going to say yes.
AUDIENCE: That's a big data set. You know what I mean?
RAYMOND BLUM: Oh, yeah. Cats. We take cats on YouTube very seriously. My mother knows that "Long Cat" loves her. No, we hold on to those cats.
AUDIENCE: So the second question. So if I figure out your shipping schedule, and I steal one of the trucks.
RAYMOND BLUM: You assume that we actually ship things through these three physical dimensions, in something as primitive as trucks.
AUDIENCE: You said so. Twice, actually.
RAYMOND BLUM: Oh, that was just a smoke screen.
AUDIENCE: You said, physical security of trucks.
RAYMOND BLUM: Well, no. I mean, the stuff is shipped around. And if that were to happen, both departure and arrival are logged and compared. And if there's a failure in arrival, we know about it. And until then, until arrival has been logged, we don't consider the data to be backed up. It actually has to get where it's going.
AUDIENCE: I'm more concerned about the tapes and stuff. Where's my data?
RAYMOND BLUM: I promise you that we have this magical thing called encryption that, despite what government propaganda would have you believe, works really, really well if you use it properly. Yeah. No, your cats are encrypted and safe. They look like dogs on the tapes.
AUDIENCE: As for the talk, it was really interesting. I have three quick questions. The first one is something that I always wondered. How many copies of a single email in my Gmail inbox exist in the world?
RAYMOND BLUM: I don't know.
AUDIENCE: Like three or 10?
RAYMOND BLUM: I don't know, honestly. I know it's enough that there's redundancy guaranteed, and it's safe. But, like I said, that's not a human's job to know. Like somebody in Gmail set up the parameters, and some system said, yeah, that's good. Sorry.
AUDIENCE: The second one is related to something that I read recently, you know Twitter is going to--
RAYMOND BLUM: I can say this. I can say more than one, and less than 10,000.
AUDIENCE: That's not a very helpful response, but it's fine.
RAYMOND BLUM: I really don't know.
AUDIENCE: I was reading about, you know, Twitter is going to file an IPO, and something I read in an article by someone from Wall Street said that actually, Twitter is not that impressive. Because the banks have these systems, and they never fail, like ever.
RAYMOND BLUM: Never, ever? Keep believing in that.
AUDIENCE: That's exactly what this guy said. He said, well, actually, Twitter, Google, these things sometimes fail. They lose data. So they are not actually that reliable. What do you think about that?
RAYMOND BLUM: OK. I'm going to give you a great analogy. You, with the glasses. Pinch my arm really hard. Go ahead. Really. I can take it. No, like really, with your nails. OK. Thanks. So you see that? He probably killed, like, what, 20,000 cells? I'm here. Yes. Google stuff fails all the time. It's crap. OK? But the same way these cells are. Right? So we don't even dream that there are things that don't die. We plan for it. So, yes, machines die all the time. Redundancy is the answer. Right now, I really hope there are other cells kicking in, and repair systems are at work. I really hope. Please? OK? And that's how our systems are built.
It's built on a biological model, you could say. We expect things to die.
AUDIENCE: I think at the beginning, I read that you also worked on something related to Wall Street. My question was also, is Google worse than Wall Street's systems--
RAYMOND BLUM: What does that mean, worse than?
AUDIENCE: Less reliable?
RAYMOND BLUM: I actually would say quite the opposite. So at one firm I worked at, that I, again, will not name, they bought the best. Let me say, I love Sun equipment. OK? Before they were Oracle, I really loved the company, too. OK? And they've got miraculous machines. Right? You could, like, open it up, pull out RAM, replace a [INAUDIBLE] board with a processor, and the thing stays up and running. It's incredible. But when that asteroid comes down, that is not going to save you. OK? So, yes, they have better machines than we do. But nonetheless, if I can afford to put machines in 50 locations, and the machines are half as reliable, I've got 25 times the reliability. And I'm protected from asteroids. So the overall effect is our stuff is much more robust. Yes, any individual piece is, like I said the word before, right? It's crap. Like this cell was crap. But luckily, I am more than that. And likewise, our entire system is incredibly robust. Because there's just so much of it.
AUDIENCE: And my last question is, could you explain about MapReduce? I don't really know how it works.
RAYMOND BLUM: So MapReduce is something I'm not-- he's probably better equipped, or she is, or that guy there. I don't really know MapReduce all that well. I've used it, but I don't know it that well. But you can Google it. There are white papers on it. It's publicly known. There are open source implementations of it. But it's basically a distributed processing framework that gives you two things to do. You can split up your data, and you can put your data back together. And MapReduce does all the semaphores, handles race conditions, handles locking. OK? All for you. Because that's the stuff that none of us really know how to get right. And, as I said, if you Google it, there's, like, a really, really good white paper on it from several years ago.
AUDIENCE: Thank you.
RAYMOND BLUM: You're welcome.
AUDIENCE: Thanks very much for the nice presentation. I was glad to hear you talk about 100% guaranteed data backup. And not just backup, but also recoverability. I think it's probably the same as the industry term, the RPO, recovery point objective, equals zero. My first question is, in the 2011 incident, were you able to get 100% of the data?
RAYMOND BLUM: Yes. Now, availability is different from recoverability. It wasn't all there on the first day. As I mentioned, it wasn't all there on the second day. But it was all there. And availability varied. But at the end of the period, it was all back.
AUDIENCE: So how could you get 100% of the data when your replication failed because of disk corruption or some data corruption? But the tape is a point-in-time copy, right? So how could you?
RAYMOND BLUM: Yes. OK. So what happened is-- without going into things that I can't, or shouldn't, or I'm not sure if I should or not-- the data is constantly being backed up. Right? So let's say we have the data as of 9:00 PM. Right? And let's say the corruption started at 8:00 PM, but hadn't made it to tapes yet. OK? And we ceased the corruption. We fall back to an earlier version of the software that doesn't have the bug. Pardon me. At 11:00. So at some point in the stack, all the data is there. There's stuff on tape. There's stuff being replicated.
There's stuff in the front end that's still not digested as logs. So we're able to reconstruct all of that. And there was overlap. So all of the logs had till then. The backups had till then. This other layer in between had that much data. So there was, I don't know how else to say it. I'm not articulating this well. But there was overlap, and that was built into it. The idea was you don't take it out of this stack until n hours after it's on this layer. Why? Just because. And we discovered that really paid off. AUDIENCE: So you keep the delta between those copies. RAYMOND BLUM: Yes. There's a large overlap between the strata, I guess is the right way to say it. AUDIENCE: I just have one more question. The speed at which data is increasing these days, it's going to double and triple within a few years. Who knows, right? Do you think there is a need for a new medium, other than tape, to support that backup? RAYMOND BLUM: It's got to be something. Well, I would have said yes. What days am I thinking about? In the mid '90s, I guess, when it was 250 gig on a cartridge, and those were huge. And then things like Zip disks. Right? And gigabytes. I would have said yes. But now, I'm always really surprised at how far either the laws of physics, the laws of mechanics, or the ingenuity of tape vendors can be stretched. I mean, the capacity is keeping up with Moore's law pretty well. So LTO 4 was 800 gig. LTO 5 is almost 1.5 T. LTO 6 tapes are 2.4 T, I think, or 2.3 T. So, yeah, they're climbing. The industry has called their bluff. When they said, this is as good as it can get, turns out, it wasn't. So it just keeps increasing. At some point, yes, but it'll be a lot more random-access media, and not things like tape. AUDIENCE: Is Google looking at something or doing research for that? RAYMOND BLUM: I can say the word yes. AUDIENCE: Thank you. RAYMOND BLUM: You're welcome. AUDIENCE: Thank you for your talk. RAYMOND BLUM: Thank you. AUDIENCE: From what I understand, Google owns its own data centers. But there are many other companies that cannot afford to have their own data centers, so many companies operate in the cloud. And they store their data in the cloud. So based on your experience, do you have any strategies for backing up and restoring data that's stored in the cloud? RAYMOND BLUM: I need a tighter definition of cloud. AUDIENCE: So, for example, someone operating completely using Amazon's resources. RAYMOND BLUM: I would hope that, then, Amazon provides-- I mean, they do, right? A fine backup strategy. Not as good as mine, I want to think. I'm just a little biased. AUDIENCE: But are there any other strategies that companies that operate completely in the cloud should consider? RAYMOND BLUM: Yeah. Just general purpose strategies. I think the biggest thing that people don't do that they can-- no matter what your resources, you can do a few things, right? Consider the dimensions that you can move sliders around on, right? There's location. OK? There's software. And I would view software as vertical and location as horizontal. Right? So I want to cover everything. That would mean I want a copy in, which one did I say? Yeah. Every location. And in every location, in different software, different layers in a software stack. And you can do that even if you just have VMs from some provider, like, I don't know who does that these days. But some VM provider. Provider x, right? And they say, our data centers are in Atlanta. Provider y says our data centers are in, where's far away from Atlanta? Northern California. OK. Fine.
There. I've got location. We store stuff on EMC SAN devices. We store our stuff on some other thing. OK. Fine. I'll say it again, it's vendor bugs. And just doing it that way, there's a little research involved, and I don't want to say hard work, but plotting it out on a big map, basically. And just doing that, I think, is a huge payoff. It's what we're doing, really. AUDIENCE: So increasing the redundancy factor. RAYMOND BLUM: Yes. But redundancy in different things. Most people think of redundancy only in location. And that's my point. It has to be redundancy in these different-- redundant software stacks and redundant locations. The product of those, at least. And also Alex? Yes. What he said. Right? Which was, he said something. I'm going to get it back in a second. Loading. Loading. OK, yes. Redundancy even in time. It was here in the stack, migrating here. You know what? Have it in both places for a while. Like don't just let it migrate through my stack. Don't make the stacks like this. Make them like that, so there's redundancy. Why? Because if I lose this and this, look, I've got enough overlap that I've got everything. So redundancy in time, location, and software. Hi again. AUDIENCE: Hi. So if I understand your description of the stacks, it sounds as if every version of every file will end up on tape, somewhere, sometime. RAYMOND BLUM: Not necessarily. Because there are some things that we don't. Because it turns out that in the time it would take to get it back from tape, I could reconstruct the data. I don't even have to open the tape. They don't bother me. AUDIENCE: Do you ever run tape faster than linear speed? In other words, faster than reading or writing the tape linearly, by fast-forwarding or seeking to a place? RAYMOND BLUM: We do what the drives allow. Yes. AUDIENCE: And how are you going to change the encryption key, if you need to? With all of these tapes and all of these drives? You have a gigantic key distribution problem. RAYMOND BLUM: Yes, we do. AUDIENCE: Have you worried about that in your DiRT? RAYMOND BLUM: We have. And it's been solved to our satisfaction, anyway. Sorry. I can say yes, though. And I'll say one more thing towards that, which is actually, think about this. A problem where a user says something like, I want my stuff deleted. OK. It's on a tape with a billion other things. I'm not recalling that tape and rewriting it just for you. I mean, I love you like a brother, but I'm not doing that. OK? So what do I do? Encryption. Yes. So our key management system is really good for reasons other than what you said. I mean, it's another one of those Swiss army knives that keeps paying off in the ways I just described. Anything else? Going once and twice. AUDIENCE: All right. So there's all these different layers of this that are all coming together. How do you coordinate all that? RAYMOND BLUM: Very well. Thank you. Can you give me something more specific? AUDIENCE: I mean between maintaining the data centers, coming up with all the different algorithms for caching in different places, things like that. Are there a few key people who know how everything fits together? RAYMOND BLUM: There are people-- so this is a big problem I had at first when I came to Google. So before I came to Google, and this is maybe not a statement about me as much as about the company I kept, but I was the smartest kid in the room. And then, when I came to Google, I was an idiot. And a big problem with working here is you have to accept that. You are an idiot, like all the other idiots around you.
Because it's just so big. So what we're really good at is trusting our colleagues and sharding. Like I know my part of the stack. I understand there is someone who knows about that part of the stack. I trust her to do her job. She hopefully trusts me to do mine. The interfaces are really well-defined. And there are lots of tests at every point of interaction. And there are people who maybe have the meta-picture, but I would pretty much say no one knows all of it in depth. People have broad views, people have deep, vertical slices. But no one's got it all. It's just not possible. AUDIENCE: Hi. So I wanted to ask how much effort is required to deal with local regulation? So you described your ability to back up in this very abstract way, like we'll just let the system decide whether it goes into Atlanta or London or Tokyo or wherever. But obviously, now we're living in a world where governments are saying, you can't do that. Or this needs to be deleted. Or this needs to be encrypted in a certain way. Or we don't want our user data to leave the country. RAYMOND BLUM: Yes. This happens a lot. This is something that I know is publicly known, so I'm going to be bold for a change. No, I'm bold all the time, actually, but I'll say this. So we actually have Apps. We won, some number of years and months ago, the contract to do a lot of, let's call it, IT systems for the US government. A lot of government agencies are on Gmail. Gmail for enterprise, in effect, but the enterprise is the US government. And that was the big thing. The data may not leave the country, period. If we say it's in the state of Oregon, it's in the state of Oregon. And we've had to go back and retrofit this into a lot of systems. It was a massive effort. We've had to go back and build that in. Luckily, all the systems were modular enough that it wasn't a terrible pain, because of the well-defined interfaces I stressed so much in response to, I think it was, Brad's question a few minutes ago. Yeah. It had to be done. And there are two kinds, right? There's whitelisting and blacklisting. Our data must stay here. And there's, our data must not be there. And most of our systems and infrastructure do that now. And what we've tried to do is push that stuff down as far as possible for the greatest possible benefit. So we don't have 100 services at Google that each know how to isolate the data. We know how to say, on the storage layer, it must have these whitelists or blacklists associated with it. And all the services that write to it just say, I need profile x or y. And hopefully, the right kind of magic happens. But, yeah, it's a huge problem. And we have had to deal with it, and we have. AUDIENCE: Could you go into a little more detail about using MapReduce for the backup and restore tasks? Just sort of like what parts of the backup and restore it applies to. RAYMOND BLUM: Oh, sure. Do you know what Big Table is? AUDIENCE: Yeah. RAYMOND BLUM: OK. So Big Table is like our "database" system. Quotes around database. It's a big hash map on disk, basically, and in memory. So if I've got this enormous Big Table, and remember the first syllable of Big Table. Right? I'm going to read this serially. OK, that'll take five years. On the other hand, I can shard it and say, you know what, I'm going to take the key distribution, slice it up into 5,000 roughly equidistant slices, and tell 5,000 workers, OK, you seek ahead to your position. You back that up. And then on the restore, likewise. OK? I'm going to start reading. OK, well, I've put this onto 20 tapes.
OK, give me 20 workers. What they're each doing is reading a tape, sending it off to n, let's say 5,000, workers up near Big Table whose job it is to get some compressed [INAUDIBLE], unpack it, stick it into the Big Table at the right key range. It's just that. It's a question of sharding into as small units as possible. AUDIENCE: So it's just a matter of creating that distribution. RAYMOND BLUM: Yeah. And that's what MapReduce is really, really good at. That's the only thing it's really good at, but that's enough. AUDIENCE: It's just simplifying it to that point that would be hard. RAYMOND BLUM: Yeah. I agree. It's very hard. And that's why I fell in love with MapReduce. I was like, wow. I just wrote that in an afternoon. AUDIENCE: And what's roughly the scale? You said for Gmail, it's an exabyte. I mean, for all of the data that Google's backing up. Is that also in the exabytes, or? RAYMOND BLUM: It's not yottabytes. It's certainly more than terabytes. Yeah. There's a whole spectrum there. AUDIENCE: Including all the petafiles? RAYMOND BLUM: I can't. You know I can't answer that. AUDIENCE: Thank you. RAYMOND BLUM: You're welcome. And I'll give one more round for questions. Going thrice. All right. Sold. Thank you all.
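The 2011-recovery answer earlier in the Q&A turns on one property of the stack: the layers overlap in time, and nothing is expired from one layer until the next layer down has held it for some number of hours. Here is a minimal sketch of that coverage check; the layer names and retention numbers are hypothetical, not Google's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class Stratum:
    name: str
    newest_hour: float  # newest data this layer holds, in hours ago (0 = right now)
    oldest_hour: float  # oldest data this layer still holds, in hours ago

def fully_covered(strata, window_hours):
    """True if the union of the layers covers the last `window_hours` hours,
    i.e. every moment we might need to recover exists in at least one layer."""
    intervals = sorted((s.newest_hour, s.oldest_hour) for s in strata)
    covered_until = 0.0
    for start, end in intervals:
        if start > covered_until:      # a gap: no layer holds this slice of time
            return False
        covered_until = max(covered_until, end)
    return covered_until >= window_hours

# Hypothetical layout at the moment a bad software push is detected:
strata = [
    Stratum("front-end logs, not yet digested", newest_hour=0,  oldest_hour=12),
    Stratum("live replicated storage",          newest_hour=6,  oldest_hour=48),
    Stratum("tape backups",                     newest_hour=36, oldest_hour=24 * 60),
]
print(fully_covered(strata, window_hours=72))   # True: the layers overlap
```

Shrink one window so it no longer reaches back to what the previous layer still holds and the check fails, which is exactly the gap the "n hours of overlap" policy exists to prevent.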
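The answer about deletion requests and shared tapes is, in effect, what is commonly called crypto-shredding: encrypt each user's data under its own key, and honor a deletion by destroying the key rather than recalling and rewriting tapes. A rough sketch of the idea, using the third-party cryptography package, with hypothetical names and no claim about how Google's key management actually works:

```python
from cryptography.fernet import Fernet

class KeyStore:
    """Hypothetical per-user key management service."""
    def __init__(self):
        self._keys = {}

    def key_for(self, user):
        # Create the user's key on first use, then keep returning it.
        return self._keys.setdefault(user, Fernet.generate_key())

    def shred(self, user):
        # Destroying the key is the "delete": the ciphertext already
        # written to shared tapes can never be decrypted again.
        self._keys.pop(user, None)

keystore = KeyStore()

def write_to_tape(user, blob):
    # Encrypt with the user's own key before the bytes reach shared media.
    return Fernet(keystore.key_for(user)).encrypt(blob)

def read_from_tape(user, ciphertext):
    return Fernet(keystore.key_for(user)).decrypt(ciphertext)

record = write_to_tape("alice", b"one very important cat video")
print(read_from_tape("alice", record))   # b'one very important cat video'
keystore.shred("alice")                  # user asked for deletion
# read_from_tape("alice", record) now raises InvalidToken:
# a freshly generated key cannot decrypt the old ciphertext.
```

Once the key is gone, every copy of the ciphertext, on however many tapes in however many locations, is effectively deleted at once, which is why a good key management system pays off in the way described above.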
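The Big Table backup and restore described just above boils down to cutting the key space into many contiguous slices and running one worker per slice in both directions. Here is a toy sketch of that sharding pattern, with an in-memory dict standing in for Big Table and a thread pool standing in for MapReduce; every name here is made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def split_keyspace(sorted_keys, n_shards):
    """Cut a sorted list of row keys into roughly equal, contiguous slices."""
    step = max(1, len(sorted_keys) // n_shards)
    return [sorted_keys[i:i + step] for i in range(0, len(sorted_keys), step)]

def backup_shard(table, shard_keys):
    """One worker: read only its own key range and copy it out ('to tape')."""
    return [(k, table[k]) for k in shard_keys]

def restore_shard(table, records):
    """One worker: take the records read off one tape and write them back
    into the right key range of the table."""
    for k, v in records:
        table[k] = v

# Toy stand-in for Big Table: a sorted map from row key to value.
table = {f"row{i:05d}": f"value{i}" for i in range(10_000)}
shards = split_keyspace(sorted(table), n_shards=20)

with ThreadPoolExecutor(max_workers=20) as pool:
    tapes = list(pool.map(lambda s: backup_shard(table, s), shards))

restored = {}
with ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(lambda t: restore_shard(restored, t), tapes))

assert restored == table   # every shard made it back intact
```

As in the talk, the only real work is choosing the split; once each worker owns a disjoint key range, both the backup and the restore parallelize trivially.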