Site reliability engineering

Site reliability engineering (SRE) is a discipline that originated at Google in 2003 and was codified in the 2016 SRE book. It applies software-engineering practices to operations. Rather than treating uptime as an absolute, SRE expresses reliability as service level objectives (SLOs) and uses an error budget, the amount of unreliability the target permits (100% minus the SLO), to govern trade-offs between reliability and feature work.

SRE's central insight is that 100% uptime is the wrong goal: it is both unachievable and unnecessarily expensive. Instead, define an SLO (e.g., 99.9% request success), measure it with service level indicators (SLIs), and treat the gap between the SLO and 100% as a budget. A 99.9% SLO, for example, leaves a budget of 0.1% of requests that may fail. While budget remains, ship features; once it is exhausted, halt feature work and invest in reliability. Several SRE practices have spread beyond Google: toil reduction (engineering effort to eliminate repeated manual work), blameless postmortems, error-budget policies, and on-call rotations with strict guardrails (no more than 25% of an engineer's time on call per quarter). The model has been adopted at scale by Meta, Netflix, LinkedIn, and most modern infrastructure organisations.
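
The budget arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a standard implementation: it assumes a request-based SLI, and the names ErrorBudget and release_decision are hypothetical, chosen only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (hypothetical helper)."""

    slo_target: float     # e.g. 0.999 for "99.9% of requests succeed"
    total_requests: int   # requests observed in the window
    failed_requests: int  # requests that counted against the SLI

    @property
    def sli(self) -> float:
        """Measured success ratio; treated as 1.0 when there was no traffic."""
        if self.total_requests == 0:
            return 1.0
        return 1.0 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """The budget: the share of requests the SLO permits to fail."""
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        """Failures still affordable in this window (negative if overspent)."""
        return self.allowed_failures - self.failed_requests

    @property
    def exhausted(self) -> bool:
        return self.failed_requests > self.allowed_failures


def release_decision(budget: ErrorBudget) -> str:
    """Assumed policy: ship while budget remains, otherwise stop and fix."""
    return "halt feature work" if budget.exhausted else "ship features"


if __name__ == "__main__":
    window = ErrorBudget(slo_target=0.999, total_requests=1_000_000,
                         failed_requests=400)
    print(f"SLI: {window.sli:.4%}")
    print(f"Budget remaining: {window.remaining:.0f} failed requests")
    print(release_decision(window))
```

With a 99.9% target and one million requests, the budget is about 1,000 failed requests, so a window with 400 failures still leaves room to ship. Real error-budget policies usually add more nuance, such as rolling windows or burn-rate alerts; the two-way decision here is a deliberate simplification.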