  • SETH VARGO: Hi there.

  • And welcome to the third video in our series

  • on SRE and DevOps.

  • My name is Seth, and I'm a Developer Advocate

  • at Google focused on infrastructure and operations.

  • LIZ FONG-JONES: And I'm Liz, an SRE focused

  • on teaching Google Cloud customers how to build

  • and operate reliable services.

  • SETH VARGO: In the previous video,

  • we discussed the differences between SLIs, SLOs, and SLAs.

  • SLIs are quantitative measurements, like latency.

  • SLOs are the amount of time that an SLI

  • can be out of specification.

  • And SLAs are business agreements with explicit consequences

  • for failing to deliver service.

  • But what stops teams from breaking their agreed upon SLOs

  • and forcing SREs to work overtime?

  • It seems like a classic DevOps problem where

  • product teams want to ship new features,

  • but SREs need to maintain reliability.

  • Is there anything in the SRE program

  • that can help with this classic problem?

  • LIZ FONG-JONES: That's what we use error budgets for.

  • SETH VARGO: Error budgets?

  • LIZ FONG-JONES: Well, before we talk about error budgets,

  • let's talk about risk and availability.

  • As I mentioned in the previous episode, trying to go for 100%

  • availability just isn't a good idea because it's expensive.

  • It's technically complex.

  • And in a lot of cases, it winds up being the case

  • that users don't even see the benefits of it because

  • of unreliability somewhere else in the system.

  • SETH VARGO: I see.

  • That makes a lot of sense.

  • Because if my cellular network is only 99% reliable,

  • but my service is 99.9% reliable,

  • my users are never going to experience that additional 0.9%

  • of reliability because their cellular network

  • is likely to fail before my service does.
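
The arithmetic behind Seth's example can be sketched as follows. This is a minimal illustration, assuming the network and the service fail independently and sit in series, so their availabilities multiply. The function name `combined_availability` is hypothetical and does not come from the video.

```python
def combined_availability(*availabilities: float) -> float:
    """Availability seen by a user when every component in the chain must work.

    Assumption (not from the video): failures are independent, so the
    individual availabilities multiply.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


network = 0.99    # cellular network availability from the example
service = 0.999   # service availability from the example

end_to_end = combined_availability(network, service)
print(f"end-to-end availability: {end_to_end:.4%}")

# Making the service ten times more reliable barely moves the result,
# because the network still dominates what the user experiences.
improved = combined_availability(network, 0.9999)
print(f"with a 99.99% service:   {improved:.4%}")
```

Under these assumptions the user never sees more than the network's 99%, no matter how reliable the service becomes.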

  • LIZ FONG-JONES: Yes.

  • That's exactly correct.

  • So while we want to reduce the risk of system failures,

  • we have to accept some degree of risk

  • in order to deliver these products and features.

  • SETH VARGO: But how do we determine

  • how much risk a service is willing to tolerate?

  • LIZ FONG-JONES: So that's a product decision.

  • So we have to work with the product management team

  • to figure out what is our explicit goal

  • for the availability target of our service.

  • And there are many things to think about,

  • like how much is it going to cost to add extra fault

  • tolerance, or to add extra testing time,

  • or to reduce our frequency of pushes,

  • or to increase how long it takes for us to decide

  • that a release is good compared to the benefits to the user

  • of increased reliability.

  • SETH VARGO: I see.

  • So the acceptable risk of a system dictates the SLO,

  • and the SLO mathematically defines the error budget.

  • If the service incurs too much downtime,

  • we have to reduce the risk to remain within the SLO, which

  • might mean halting deployments.

  • If service owners want to deliver

  • a lot of risky features, they have

  • to be willing to accept a much looser SLO.

  • Because if they were to choose a strict SLO,

  • they would quickly exceed their error budget, which

  • could halt future deployments.

  • LIZ FONG-JONES: Exactly.

  • So the main benefit of an error budget

  • is that it's a quantitative measurement that's

  • shared between the product and SRE teams, which

  • means that we can balance innovation and stability

  • to an appropriate level.

  • SETH VARGO: So as long as the SLOs are met,

  • releases can continue.

  • But how do we know if an SLO breach is about to occur?

  • LIZ FONG-JONES: So when we defined earlier the expectation

  • of how much uptime a service is going to have

  • and how we're going to measure it,

  • well, we need to actually concretely implement

  • that using a neutral third party,

  • like a monitoring system.

  • SETH VARGO: Well, and the metrics

  • on that monitoring system, those are the SLIs, right?

  • LIZ FONG-JONES: Exactly.

  • And the difference between the actual uptime

  • and the calculated target uptime from our SLO

  • is the budget of how much unavailability

  • that we can tolerate for the system

  • to be stable over the entire window of the SLO.

  • So we call this the error budget.

  • If your SLIs are failing all the time,

  • then you're going to be burning through your error budget.

  • And then eventually, you need to stop your feature releases

  • in order to focus instead on making reliability improvements

  • and restructuring your application so that it can

  • meet your SLOs in the future.
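
The error budget Liz describes can be sketched as a small calculation. This is a hedged illustration, not Google's implementation: it assumes a time-based availability SLO, and the helper names (`error_budget`, `remaining_budget`, `releases_allowed`) and the 30-day window are assumptions made for the example.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Total downtime the SLO tolerates over the window."""
    return window * (1.0 - slo)


def remaining_budget(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Budget left after subtracting the downtime observed so far."""
    return error_budget(slo, window) - downtime


def releases_allowed(slo: float, window: timedelta, downtime: timedelta) -> bool:
    """Hypothetical policy: feature releases continue only while budget remains."""
    return remaining_budget(slo, window, downtime) > timedelta(0)


window = timedelta(days=30)   # assumed SLO window
slo = 0.999                   # 99.9% availability target

budget = error_budget(slo, window)
print(f"budget: {budget.total_seconds() / 60:.1f} minutes")

# 30 minutes of downtime leaves budget, 50 minutes exhausts it.
print(releases_allowed(slo, window, timedelta(minutes=30)))
print(releases_allowed(slo, window, timedelta(minutes=50)))
```

In practice the downtime figure would come from the monitoring system's SLIs rather than a hard-coded value.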

  • SETH VARGO: So who enforces those policies, though?

  • Because couldn't a product team just go over and break the SLO

  • and force the SREs to work overtime?

  • LIZ FONG-JONES: So this is why we

  • need to have executive buy-in for error budgets.

  • If the SRE teams don't have the ability

  • to enforce the error budgets, then the whole system

  • is going to break down.

  • So some teams just allow for a limited number

  • of tokens or golden bullets that you can hand out to a vice

  • president, for example.

  • So if a product team really wants

  • to get that critical feature out,

  • well, they're going to have to ask their vice

  • president for a one-time exception,

  • and they'll only get a certain number per year.

  • SETH VARGO: I see.

  • But what about things that are outside of the product team

  • that aren't necessarily buggy code for my developers,

  • like someone cuts an undersea cable,

  • or there's a catastrophic failure at a data center?

  • Those shouldn't impact my error budget.

  • It wasn't my fault.

  • LIZ FONG-JONES: So this is why it's

  • important to have error budgets from top to bottom

  • for everything in your stack.

  • That way you can figure out how much error budget you allocate

  • to your dependencies and how much error budget is reserved

  • for your developers to spend.

  • And this is another reason why targeting 100% availability

  • isn't realistic, because all of your dependencies

  • are not 100% available either.
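
One way to picture allocating budget between dependencies and developers is the sketch below. It is an assumption-laden illustration: it treats each dependency's allowed unavailability as simply additive (a conservative approximation), and the function name `developer_budget`, the dependency names, and their targets are hypothetical.

```python
from typing import Mapping


def developer_budget(slo: float, dependency_slos: Mapping[str, float]) -> float:
    """Fraction of unavailability left for the team's own changes.

    Assumption: dependency unavailability adds up linearly, which slightly
    overestimates the budget consumed by dependencies.
    """
    total_budget = 1.0 - slo
    reserved_for_dependencies = sum(1.0 - target for target in dependency_slos.values())
    return total_budget - reserved_for_dependencies


# Hypothetical dependency targets, chosen only for illustration.
dependencies = {"network": 0.9995, "datastore": 0.9999}

left_over = developer_budget(0.999, dependencies)
print(f"budget left for developers: {left_over:.4%}")

# A negative result means the SLO is stricter than the dependencies allow.
too_strict = developer_budget(0.9999, dependencies)
print(f"with a 99.99% SLO:          {too_strict:.4%}")
```

A negative remainder is a signal to loosen the SLO or to pick more reliable dependencies before promising the target.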

  • SETH VARGO: That makes a lot of sense.

  • But what about other things like restarting a failed service

  • or other kinds of manual tasks?

  • Are those considered part of the error budget?

  • LIZ FONG-JONES: Yeah.

  • So Seth, when you have to do manual action

  • to keep your system from failing, in the wait

  • before you actually do that manual action,

  • you'll start burning through your error budget.

  • But the actual act of doing that manual work,

  • we track that separately.

  • And that's a concept that we call toil.

  • So we'll talk about that more in detail in the next video.

  • SETH VARGO: Great.

  • So risk and error budgets are directly related

  • to many of the DevOps principles that we've

  • discussed in earlier episodes.

  • It clearly defines that accidents

  • are normal by quantifying accidents

  • and risk through error budgets.

  • It also enforces that change should be gradual

  • because a non-gradual change could quickly

  • burn through the error budget for a particular product,

  • breaking the SLO and preventing further deployment

  • for the quarter or for the year.

  • This has really helped a lot.

  • I think it's really clear why we say that class SRE implements

  • DevOps.

  • LIZ FONG-JONES: Thanks, everyone, for watching.

  • Check the description below for links and more information.

  • Don't forget to subscribe to the channel

  • and stay tuned for our next video

  • where we talk about toil budgets.

Risk and Error Budgets

Posted by Marsen Lin on January 14, 2021