SETH VARGO: Hi there, and welcome to the third video in our series on SRE and DevOps. My name is Seth, and I'm a Developer Advocate at Google focused on infrastructure and operations.
LIZ FONG-JONES: And I'm Liz, an SRE focused on teaching Google Cloud customers how to build and operate reliable services.
SETH VARGO: In the previous video, we discussed the differences between SLIs, SLOs, and SLAs. SLIs are quantitative measurements, like latency. SLOs are the amount of time that an SLI can be out of specification. And SLAs are business agreements with explicit consequences for failing to deliver service. But what stops teams from breaking their agreed-upon SLOs and forcing SREs to work overtime? It seems like a classic DevOps problem, where product teams want to ship new features, but SREs need to maintain reliability. Is there anything in the SRE program that can help with this classic problem?
LIZ FONG-JONES: That's what we use error budgets for.
SETH VARGO: Error budgets?
LIZ FONG-JONES: Well, before we talk about error budgets, let's talk about risk and availability. As I mentioned in the previous episode, trying to go for 100% availability just isn't a good idea, because it's expensive and it's technically complex. And in a lot of cases, users don't even see the benefits of it because of unreliability somewhere else in the system.
SETH VARGO: I see. That makes a lot of sense. Because if my cellular network is only 99% reliable, but my service is 99.9% reliable, my users are never going to experience that additional 0.9% of reliability, because their cellular network is likely to fail before my service does.
LIZ FONG-JONES: Yes. That's exactly correct. So while we want to reduce the risk of system failures, we have to accept some degree of risk in order to deliver these products and features.
SETH VARGO: But how do we determine how much risk a service is willing to tolerate?
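As a rough illustration of the cellular-network example above (not part of the spoken discussion), the sketch below multiplies the availabilities of components a request must pass through. It assumes the components sit in series and fail independently, which is a simplification. The function name `combined_availability` and the 99.999% figure are hypothetical, chosen only for the example.

```python
def combined_availability(*availabilities: float) -> float:
    """Availability seen by a user whose request must pass through every
    component in turn, assuming the components fail independently."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


network = 0.99               # cellular network from the example
service_three_nines = 0.999  # the service as described in the example
service_five_nines = 0.99999 # hypothetical, much more expensive target

# Both figures come out close to 99%, because the network dominates.
print(f"{combined_availability(network, service_three_nines):.4%}")
print(f"{combined_availability(network, service_five_nines):.4%}")
```

Under these assumptions the user sees roughly 98.9% availability in both cases, so the extra nines on the service side are mostly invisible to them.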
LIZ FONG-JONES: So that's a product decision. We have to work with the product management team to figure out what our explicit goal is for the availability target of our service. And there are many things to think about, like how much it is going to cost to add extra fault tolerance, or to add extra testing time, or to reduce our frequency of pushes, or to increase how long it takes for us to decide that a release is good, compared to the benefits to the user of increased reliability.
SETH VARGO: I see. So the acceptable risk of a system dictates the SLO, and the SLO mathematically defines the error budget. If the service incurs too much downtime, we have to reduce the risk to remain within the SLO, which might mean halting deployments. If service owners want to deliver a lot of risky features, they have to be willing to accept a much looser SLO. Because if they were to choose a strict SLO, they would quickly exceed their error budget, which could halt future deployments.
LIZ FONG-JONES: Exactly. So the main benefit of an error budget is that it's a quantitative measurement that's shared between the product and SRE teams, which means that we can balance innovation and stability to an appropriate level.
SETH VARGO: So as long as the SLOs are met, releases can continue. But how do we know if an SLO breach is about to occur?
LIZ FONG-JONES: So earlier we defined the expectation of how much uptime a service is going to have and how we're going to measure it. Well, we need to actually implement that concretely using a neutral third party, like a monitoring system.
SETH VARGO: Well, and the metrics on that monitoring system, those are the SLIs, right?
LIZ FONG-JONES: Exactly. And the difference between the actual uptime and the calculated target uptime from our SLO is the budget of how much unavailability we can tolerate for the system to be stable over the entire window of the SLO. So we call this the error budget.
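The arithmetic behind that definition can be sketched as follows. This is a minimal illustration rather than a description of any particular monitoring system: it assumes a time-based availability SLI, and the 99.9% target, the 30-day window, and the function names `error_budget` and `budget_remaining` are all hypothetical.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Total unavailability the SLO permits over the whole window."""
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta,
                     downtime_so_far: timedelta) -> timedelta:
    """Budget left after the downtime already observed in the window.
    A negative result means the SLO has been breached."""
    return error_budget(slo, window) - downtime_so_far


window = timedelta(days=30)  # assumed SLO window
slo = 0.999                  # assumed availability target

total = error_budget(slo, window)
left = budget_remaining(slo, window, timedelta(minutes=30))

print(f"total budget: {total.total_seconds() / 60:.1f} minutes")
print(f"remaining:    {left.total_seconds() / 60:.1f} minutes")
```

With these assumed numbers the budget is about 43 minutes per 30 days, so a single 30-minute outage consumes most of it.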
If your SLIs are failing all the time, then you're going to be burning through your error budget. And then eventually, you need to stop your feature releases in order to focus instead on making reliability improvements and restructuring your application so that it can meet your SLOs in the future.
SETH VARGO: So who enforces those policies, though? Because couldn't a product team just go over and break the SLO and force the SREs to work overtime?
LIZ FONG-JONES: So this is why we need to have executive buy-in for error budgets. If the SRE teams don't have the ability to enforce the error budgets, then the whole system is going to break down. So some teams just allow for a limited number of tokens or golden bullets that you can hand out to a vice president, for example. So if a product team really wants to get that critical feature out, well, they're going to have to ask their vice president for a one-time exception, and they'll only get a certain number per year.
SETH VARGO: I see. But what about things that are outside of the product team, that aren't necessarily buggy code for my developers, like someone cuts an undersea cable, or there's a catastrophic failure at a data center? Those shouldn't impact my error budget. It wasn't my fault.
LIZ FONG-JONES: So this is why it's important to have error budgets from top to bottom for everything in your stack. That way you can figure out how much error budget you allocate to your dependencies and how much error budget is reserved for your developers to spend. And this is another reason why targeting 100% availability isn't realistic, because none of your dependencies are 100% available either.
SETH VARGO: That makes a lot of sense. But what about other things, like restarting a failed service or other kinds of manual tasks? Are those considered part of the error budget?
LIZ FONG-JONES: Yeah.
So Seth, when you have to do manual action to keep your system from failing, in the wait before you actually do that manual action, you'll start burning through your error budget. But the actual act of doing that manual work, we track that separately. And that's a concept that we call toil. So we'll talk about that in more detail in the next video.
SETH VARGO: Great. So risk and error budgets are directly related to many of the DevOps principles that we've discussed in earlier episodes. It clearly defines that accidents are normal by quantifying accidents and risk through error budgets. It also enforces that change should be gradual, because a non-gradual change could quickly burn through the error budget for a particular product, breaking the SLO and preventing further deployment for the quarter or for the year. This has really helped a lot. I think it's really clear why we say that class SRE implements DevOps.
LIZ FONG-JONES: Thanks, everyone, for watching. Check the description below for links and more information. Don't forget to subscribe to the channel and stay tuned for our next video where we talk about toil budgets.
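Seth's point that a non-gradual change can quickly burn through the error budget can be illustrated with some simple arithmetic. The sketch below is not from the video: it assumes a request-based SLO, uniform traffic across the window, and a bad release that is rolled back after a fixed time. The function name `budget_fraction_burned` and every number used are hypothetical.

```python
def budget_fraction_burned(slo: float, window_minutes: float,
                           traffic_share: float, failure_rate: float,
                           minutes_live: float) -> float:
    """Fraction of the window's error budget consumed by one bad release.

    Assumes traffic is uniform over the window, so the share of all
    requests that fail is (time live / window) * traffic share * failure rate.
    """
    failed_share = (minutes_live / window_minutes) * traffic_share * failure_rate
    allowed_share = 1.0 - slo  # share of requests the SLO allows to fail
    return failed_share / allowed_share


WINDOW_MINUTES = 30 * 24 * 60  # assumed 30-day SLO window
SLO = 0.999                    # assumed target

# A release that fails half of the requests it serves, rolled back after 10 minutes.
full_rollout = budget_fraction_burned(SLO, WINDOW_MINUTES, 1.00, 0.5, 10)
canary = budget_fraction_burned(SLO, WINDOW_MINUTES, 0.01, 0.5, 10)

print(f"100% rollout burns {full_rollout:.1%} of the budget")
print(f"1% canary burns    {canary:.2%} of the budget")
```

Under these assumptions the same bad release costs one hundred times less budget when it is first exposed to 1% of traffic, which is the quantitative case for gradual change.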