We need to measure our incidents.
- me, several times, usually after a bad outage.
The most common and classic metrics that come to mind are:
- Mean Time to Detect (MTTD)
- Mean Time Between Failures (MTBF)
- Mean Time to Resolve (MTTR)
MTTD, MTBF, and MTTR make a lot of sense for hardware.
Imagine that I am running a data center with 1000 hard disks, all purchased from the same manufacturer. The manufacturer has provided a probability of failure and manuals for recovery. By measuring Mean Time to Detect (MTTD), Mean Time Between Failures (MTBF), and Mean Time to Repair (MTTR), I can assess the quality of our installation and the effectiveness of our recovery processes. The failure events are repeatable, and approximately independent and identically distributed (iid).
These metrics allow me to:
- Experiment with different manufacturers or combinations of manufacturers.
- Track the lifespan and quality of these disks over time, models, and vintages.
- Track the effectiveness of different RAID configurations.
These MTT* metrics use the mean, which is most useful when I can assume a normal distribution.
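To make the hardware case concrete, here is a minimal sketch, with entirely made-up timestamps, of how those fleet-level means could be computed from a log of disk failures (whether MTTR is counted from failure or from detection is a definitional choice):

```python
from datetime import datetime
from statistics import mean

# Made-up failure records for a disk fleet: when the fault occurred, when
# monitoring detected it, and when the disk was replaced/rebuilt.
failures = [
    {"failed": datetime(2024, 1, 3, 2, 15), "detected": datetime(2024, 1, 3, 2, 40), "repaired": datetime(2024, 1, 3, 6, 5)},
    {"failed": datetime(2024, 1, 9, 14, 0), "detected": datetime(2024, 1, 9, 14, 10), "repaired": datetime(2024, 1, 9, 17, 30)},
    {"failed": datetime(2024, 1, 20, 23, 50), "detected": datetime(2024, 1, 21, 0, 20), "repaired": datetime(2024, 1, 21, 5, 0)},
]

def hours(delta):
    return delta.total_seconds() / 3600

mttd = mean(hours(f["detected"] - f["failed"]) for f in failures)    # detection lag
mttr = mean(hours(f["repaired"] - f["detected"]) for f in failures)  # repair time, counted from detection
mtbf = mean(hours(b["failed"] - a["failed"]) for a, b in zip(failures, failures[1:]))  # gap between failures

print(f"MTTD: {mttd:.1f}h  MTTR: {mttr:.1f}h  MTBF: {mtbf:.1f}h")
```

With a fleet of 1000 disks producing many roughly iid failures, these averages are stable enough to compare manufacturers, vintages, and RAID configurations.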
In software development, repeatable bugs are quite rare. When a bug can be easily reproduced, it is usually a trivial task to fix and prevent. One effective technique is to add unit tests as you address bug fixes, which helps prevent any regressions. Additionally, most teams also implement some form of test coverage requirement.
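For instance, a regression test added alongside a bug fix might look like the sketch below; the function, the bug, and the pytest-style tests are all hypothetical:

```python
# Hypothetical bug fix: the original function crashed on an empty sample set.
# The accompanying tests pin the fix down so the behavior cannot silently regress.

def average_latency_ms(samples: list[float]) -> float:
    """Average latency in milliseconds; an empty sample set averages to 0.0 (the fix)."""
    if not samples:
        return 0.0
    return sum(samples) / len(samples)


def test_average_latency_handles_empty_samples():
    # Regression test for the former division-by-zero bug.
    assert average_latency_ms([]) == 0.0


def test_average_latency_of_known_samples():
    assert average_latency_ms([10.0, 20.0]) == 15.0
```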
Bugs that result in painful incidents do not follow a normal distribution:
- AWS SQS changes its wire protocol from XML to JSON, and all queue pollers start failing.
- AWS suffers from a network partition in us-west-2 between 2a and 2c. All of your resources are still up (hence no AZ failover was triggered), but some can no longer communicate with each other.
- Your bank changes their return code for a specific ACH error and it is not part of the NACHA standard.
- An innocuous bug introduced 6 months ago has been sitting silently in your codebase. One day, another equally innocuous change triggers it, resulting in a catastrophic outage and an extended resolution. Oh, and the engineer who introduced the first change is currently on vacation in a different timezone.
Measuring the mean makes a lot less sense to me.
This is valid data about these incidents, but they (and other common measurements around incidents) tend to generate very little insight about these events, or what directions a team or company might take in order to make improvements.
In fact, I believe that the industry as a whole is giving this shallow data much more attention than it warrants. Certainly, filtering this shallow data as means, medians, and other central tendency metrics obscures more about the incidents than it actually reveals.
- John Allspaw, in Moving Past Shallow Incident Data
MTT* metrics also distract engineering leadership from focusing on systemic risks. Of the examples I listed above, we have the most control over the last one, so let us dive into that using a specific example from Spotify.
- Sometime in 2022: a change was introduced to the Domain Name Service (DNS) component, removing a safeguard against invalid configurations and introducing a new failure mode.
- Jan 14, 2023: a GitHub outage triggered that failure mode, causing invalid DNS configurations to be loaded and resulting in a 3.5-hour outage.
Using the traditional MTT* metrics, the mean would be skewed, possibly obscuring improvements made in other areas. Suppose that, in response to a rash of deployment-related outages in the summer of 2022, Spotify had implemented circuit breakers and automatic rollbacks in December, and the engineering team was eager to show off their impact. January would have been a terrible month for that, and the (supposedly sharp) improvement in February can only be attributed to luck: GitHub did not have an outage in February. What conclusions can be drawn from the metrics?
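A back-of-the-envelope calculation shows how a single long outage dominates the monthly mean; the durations below are invented for illustration:

```python
from statistics import mean

# Invented incident durations, in minutes. December's deploy-related incidents
# are cut short by the new circuit breakers and automatic rollbacks; January
# includes the single 3.5-hour outage triggered by the external dependency.
december = [12, 8, 15, 10]
january = [9, 11, 210]

print(f"December MTTR: {mean(december):.0f} min")  # ~11 min
print(f"January MTTR:  {mean(january):.0f} min")   # ~77 min
```

Nothing about the team's day-to-day resolution competence changed between the two months, yet the headline number is roughly seven times worse.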
Furthermore, should the measurement of Time to Detect start from the time the first change was introduced in 2022, or from the time GitHub suffered an outage? (Trick question: neither choice yields any insights.) The Time to Recover for this incident is long, but given the uniqueness of this incident, does it tell us anything about the overall quality of Spotify’s engineering and incident response?
Stepping back from those MTT* metrics: what insights should management draw from this incident to improve the organization?
We want to prevent better:
- Which systems lack clear ownership?
- Do we have good coverage of subject matter experts who can enumerate failure modes that the systems are vulnerable to?
- Are the right leaders in place who can make trade-offs between short-term measures (adding validation) and long-term investments (reducing the complexity of the DNS infrastructure)?
We want to detect & recover better:
- What tooling do we need to detect and triage DNS issues?
- How can we quickly narrow down the cause to DNS?
Why It Is Tempting
The traditional MTT* metrics are tempting because engineering management needs some way to know:
- Are we getting better at preventing incidents?
- Are we getting better at detecting incidents?
- Are we getting better at resolving incidents?
It is convenient to apply General Electric-style Key Performance Indicators (KPIs) and use MTT* to track the competence of the organization over multiple quarters. Unfortunately, I have also seen such KPIs drive perverse incentives down the chain.
Two of my past companies experimented with a centralized frontline response team, commonly known as a Network Operations Center (NOC). Both times, it drove down our Mean Time to Acknowledge (MTTA). However, service quality suffered because the first responders often lacked domain context, could not interpret the errors, and were poor incident commanders. Our Employee Net Promoter Score (eNPS) also fell due to burnout and growing resentment between the NOC and the product teams. This consequence resembles that of organizations that rely on Quality Assurance (QA) teams. I suspect that trends for NOCs and QAs will converge: prevalent in very large companies that need such specialization, but non-existent everywhere else.
Instead of traditional KPIs, I prefer to use Service Level Objectives (SLOs) to describe the quality of the services and give teams the autonomy to prioritize improvements against these SLOs. Individual teams have the right information, context, and intuition to negotiate trade-offs. Assign objectives, and let them choose. This includes letting them create and calibrate their own alarms, and determine appropriate detection and response expectations.
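As a rough illustration, an availability SLO boils down to an error-budget calculation a team can own end to end; the target, window, and request counts below are invented:

```python
# Minimal error-budget sketch for a request-based availability SLO.
SLO_TARGET = 0.999             # 99.9% of requests succeed over a 30-day window
total_requests = 52_400_000
failed_requests = 38_000

sli = 1 - failed_requests / total_requests        # measured service level
error_budget = 1 - SLO_TARGET                     # fraction of requests allowed to fail
budget_spent = (failed_requests / total_requests) / error_budget

print(f"SLI: {sli:.5f}")                             # 0.99927
print(f"error budget consumed: {budget_spent:.0%}")  # 73%
```

How the remaining budget gets spent, whether on riskier changes or on reliability work, is exactly the trade-off the owning team is best placed to make.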
Nonetheless, I am empathetic to the challenges of running large organizations. Perhaps, at ≥100 engineers with ≥20 specialized teams, a coarse MTT* metric is the only feasible and scalable option.
Going Deeper, And Applying Context
Over the past decade of observing teams resolve incidents, I have formed a hypothesis: perhaps aggregate metrics are not useful for incidents because they assume that organizational competence lies on a continuous spectrum. Perhaps this form of competence is discrete.
For detection, for each failure mode, an organization could:
- detect and prevent it before the deploy
- detect it immediately or shortly after the deploy
- be unable to detect it at all
Within each discrete level, for each failure mode, a metric might make sense. For example, if database failovers are a common occurrence, I would want to measure the number of seconds needed to detect a database failover and rebuild all connection pools. Driving this down improves our Availability SLO. I conjecture that it would be more insightful than an MTTD metric that aggregates across all 3 discrete levels.
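A sketch of what that per-failure-mode measurement could look like, with invented samples:

```python
from statistics import median

# Invented samples, in seconds, for one specific and frequent failure mode:
# from the moment the primary database fails over until every connection pool
# has been rebuilt.
failover_recovery_seconds = [42, 38, 55, 61, 40, 47]

print(f"median: {median(failover_recovery_seconds)}s, "
      f"worst observed: {max(failover_recovery_seconds)}s")
```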
Similarly, there could be 3 discrete levels of competence for resolution:
- the incident was resolved by automation
- a runbook existed, and we used it
- a runbook did not exist, and we made up a response on the fly
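If incidents were tagged with these levels, the trend worth watching would be the shift between buckets rather than any average duration. A minimal sketch, with hypothetical tags:

```python
from collections import Counter

# Hypothetical incident log, each entry tagged with the discrete resolution
# level it landed on.
incidents = [
    {"id": "INC-101", "resolution": "automation"},
    {"id": "INC-102", "resolution": "runbook"},
    {"id": "INC-103", "resolution": "improvised"},
    {"id": "INC-104", "resolution": "runbook"},
]

print(Counter(i["resolution"] for i in incidents))
# Counter({'runbook': 2, 'automation': 1, 'improvised': 1})
```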
I do not know how widely applicable this is, but it is my current working model. It might also require too much analysis and could be hard to scale.
🌶️ Take: “You Can’t Improve What You Don’t Measure” is Lazy Thinking
It is also a misquote.
We tend to only improve what we measure, and hence we must be very selective about which metrics we choose to highlight.