
Toil

Toil, as defined in Google's SRE practice, is operational work that is manual, repetitive, automatable, and tactical (not strategic), and that scales linearly with service growth. Typical examples are patching servers, manually resolving alerts, and hand-editing configs. SRE teams target an upper bound on toil (Google: 50% of an SRE's time) so that the remaining time goes to engineering work that reduces future toil.

The definition matters because it draws a sharp line between recurring operational work that could be automated (toil) and engineering work that pays down toil. Two team practices make the distinction actionable: tracking toil hours (SREs log time against toil categories versus engineering categories), and capping the toil percentage (when toil exceeds the cap, the team pauses feature work to automate). Anti-patterns: counting all on-call time as toil by default (much of it is incident response, which is engineering); treating one-off operational work as toil (it isn't; toil is repetitive); and using 'toil reduction' as the SRE team's excuse to avoid customer feature requests (the goal is to reduce toil, not to refuse work).
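
The tracking and capping practices can be sketched in a few lines of Python. This is a minimal illustration, not a standard tool: the `TimeEntry` record, the category labels, the sample hours, and the function names are all hypothetical, and the 0.50 cap is Google's stated bound, which a team may set differently.

```python
from dataclasses import dataclass

# Google's stated upper bound on toil; treat as a team-configurable assumption.
TOIL_CAP = 0.50


@dataclass
class TimeEntry:
    """One logged block of time (hypothetical record shape)."""
    engineer: str
    category: str   # free-form label, e.g. "manual patching"
    hours: float
    is_toil: bool   # True for toil categories, False for engineering work


def toil_fraction(entries):
    """Return toil hours divided by total logged hours (0.0 if nothing is logged)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    toil = sum(e.hours for e in entries if e.is_toil)
    return toil / total


def exceeds_cap(entries, cap=TOIL_CAP):
    """True when the logged toil share is above the cap."""
    return toil_fraction(entries) > cap


# Illustrative week for one engineer (made-up numbers).
entries = [
    TimeEntry("ana", "manual patching", 12.0, True),
    TimeEntry("ana", "alert triage", 10.0, True),
    TimeEntry("ana", "automation project", 18.0, False),
]

print(f"toil share: {toil_fraction(entries):.0%}")
print("over cap, prioritize automation:", exceeds_cap(entries))
```

In this sample, 22 of 40 logged hours are toil, so the share sits above the 50% cap and the team would pause feature work to automate. One-off operational tasks would be logged with `is_toil=False`, in line with the repetitiveness requirement in the definition.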