DORA metrics are four numbers that tell you whether your team ships software fast and safely: deployment frequency, lead time for changes, change failure rate, and failed deployment recovery time. In the 2024 DORA State of DevOps report, elite performers deploy on demand with a lead time under a day and a change failure rate near 5 percent, while low performers can take one to six months to ship a single change.
The Four Keys exist because intuition lies. "We feel fast" is not data. DORA, born from the Accelerate research led by Nicole Forsgren, Jez Humble, and Gene Kim, turned a decade of survey science into a balanced scorecard that pairs speed with stability so you cannot cheat one by sacrificing the other. This guide explains each metric, the latest benchmarks, how to calculate them, and how the Four Keys fit alongside SPACE, the DevEx framework, and DX Core 4. ⚡
TL;DR: DORA's Four Keys are Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Failed Deployment Recovery Time. The first two measure throughput; the last two measure stability. 2024 elite teams (~19% of respondents) deploy on demand with sub-1-day lead time and ~5% failure rate. Read all four together, pair them with developer experience signals, and never tie them to individual pay. Start your delivery dashboard on Taskade.
What are DORA metrics?
DORA metrics are four software-delivery measurements produced by the DevOps Research and Assessment (DORA) program: deployment frequency, lead time for changes, change failure rate, and failed deployment recovery time. They were validated across more than 39,000 survey responses over the program's lifetime and predict organizational performance, not just engineering output.
The genius of the Four Keys is the split. Deployment frequency and lead time measure throughput — how quickly value moves from a developer's keyboard to a user's screen. Change failure rate and recovery time measure stability — what happens when something breaks. Optimize throughput alone and you ship chaos; optimize stability alone and you ship nothing. DORA's research showed elite teams refuse the tradeoff: they are fast and safe at the same time. That counterintuitive finding is the entire reason the metrics matter.
DORA sits at the start of a measurement lineage that Nicole Forsgren threads through: Accelerate/DORA (2018) → SPACE (2021) → the DevEx framework (2023) → DX Core 4 (2024). Each generation widened the lens from delivery outcomes to the human experience of the engineers producing them. If you want the broad picture of why measuring the experience matters, read our pillar on what developer experience (DevEx) is; this post goes deep on the delivery numbers.
Lead time spans commit to stable release; deployment frequency counts how often you reach the green node; change failure rate is how often you hit the orange one; recovery time is how fast you escape it. One diagram, all four keys.
The Four Keys, one by one
The Four Keys break into two throughput metrics and two stability metrics, and DORA insists you read them as a set. A single key in isolation invites gaming: you can inflate deployment frequency with trivial commits or flatter change failure rate by shipping nothing. Together, they form a balanced scorecard where improving one without wrecking another is the actual signal of an elite team.
Here is the canonical definition table, redundant on purpose with the diagram above and the calculations below — the same four facts encoded three ways so they stick.
| Metric | Type | Measures | Elite target |
|---|---|---|---|
| Deployment Frequency | Throughput | How often you release to prod | On demand |
| Lead Time for Changes | Throughput | Commit to running in prod | < 1 day |
| Change Failure Rate | Stability | % of deploys that degrade prod | ~ 5% |
| Failed Deploy Recovery | Stability | Time to restore after failure | < 1 hour |
Deployment Frequency
Deployment frequency counts how often a team successfully releases to production over a window. A team shipping 40 successful deploys across 20 working days has a deployment frequency of 2 per day. DORA reports it as a band — on demand, daily-to-weekly, weekly-to-monthly, or monthly-to-every-six-months — because the goal is small, frequent, reversible changes, not a vanity number.
High deployment frequency is a proxy for small batch size. When you deploy many times a day, each change is tiny, easy to review, and easy to roll back. When you deploy once a quarter, every release is a high-stakes event that bundles a hundred risky changes. The metric rewards the discipline of shrinking the unit of work — exactly the discipline that good project-management methodology reinforces.
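The worked number above (40 deploys across 20 working days) can be reproduced with a small sketch. It assumes "working days" means Monday through Friday and ignores public holidays; the helper names and the sample window are hypothetical, chosen only for illustration.

```python
from datetime import date, timedelta

def working_days(start: date, end: date) -> int:
    """Count Monday-Friday days in the inclusive range [start, end]."""
    days = 0
    current = start
    while current <= end:
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            days += 1
        current += timedelta(days=1)
    return days

def deployment_frequency(successful_deploys: int, start: date, end: date) -> float:
    """Successful production deploys per working day in the window."""
    window = working_days(start, end)
    if window == 0:
        raise ValueError("window contains no working days")
    return successful_deploys / window

# Hypothetical four-week window: Monday 2024-06-03 through Friday 2024-06-28.
freq = deployment_frequency(40, date(2024, 6, 3), date(2024, 6, 28))
print(freq)  # 2.0
```

Counting only successful production deploys in the numerator matters: failed or rolled-back attempts belong to change failure rate, not to frequency.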
Lead Time for Changes
Lead time for changes is the elapsed time from a code commit to that code running successfully in production. It is the single clearest measure of delivery friction because it spans the entire pipeline: review queues, CI duration, flaky tests, manual approvals, and deploy gates all add to it. Elite teams keep lead time under a day; low performers can take one to six months.
This is the metric that exposes the gap between "you can write the code in a day" and "you ship it in a month." The code is rarely the bottleneck. The bottleneck is the system the code has to travel through — and that system is what developer experience work optimizes.
Change Failure Rate
Change failure rate is the percentage of deployments that cause a degraded service requiring a hotfix, rollback, or patch. If 8 of 100 deploys cause an incident, your CFR is 8 percent. In the 2024 report, elite performers sat near 5 percent and low performers near 40 percent. A good CFR is under 15 percent, but chasing zero usually means you have stopped shipping.
CFR is the honesty check on deployment frequency. Anyone can deploy 50 times a day if half of them break. The two metrics only mean something together — high frequency at low CFR is the signature of an elite team, while high frequency at high CFR is just fast failure.
Failed Deployment Recovery Time
Failed deployment recovery time (formerly Time to Restore Service, sometimes called MTTR) measures how long it takes to recover when a deployment degrades production. Elite teams recover in under an hour; low performers can take a week to a month. It is the metric that captures resilience: not whether you fail, but how gracefully you recover.
Recovery time rewards automation and good runbooks. Teams with fast rollback, feature flags, and clear on-call ownership recover in minutes. Teams that need a war room and a manual database fix recover in days. This is precisely where a tight agentic workflow and rehearsed runbooks earn their keep.
The state machine above shows why the four keys interlock: every trip from Stable through Deploying back to Stable contributes to deployment frequency and lead time, while every detour into Degraded contributes to change failure rate and recovery time.
The fifth signal: reliability
After the original four keys, DORA research added a fifth signal — reliability, sometimes called operational performance — to capture whether a service actually meets the expectations users have of it. Reliability is not a single number like the others; it is the degree to which your service hits its availability, latency, and performance targets, often expressed as adherence to service-level objectives. It was added because a team can ace all four keys and still run a product that users experience as flaky.
Reliability closes a real gap. Deployment frequency, lead time, change failure rate, and recovery time all describe the delivery pipeline. Reliability describes the lived product. A team that deploys on demand with a 4 percent change failure rate but whose service times out under load has excellent delivery and poor reliability. DORA treats reliability as a capability that the four keys enable rather than a fifth peer metric, which is why most dashboards still lead with the original four and add reliability as context. Treat it as the user-facing sanity check on an otherwise internal scorecard.
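One common way to express "adherence to service-level objectives" is the share of events that met the objective, compared against a target. The sketch below is an illustration under assumptions: the function names, the 99.9 percent target, and the request counts are all made up, and DORA does not prescribe this exact calculation.

```python
def slo_adherence(good_events: int, total_events: int) -> float:
    """Fraction of events that met the objective (e.g. successful requests)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events

def meets_slo(good_events: int, total_events: int, target: float) -> bool:
    """True when observed adherence is at or above the target."""
    return slo_adherence(good_events, total_events) >= target

# Hypothetical month: 1,000,000 requests, 1,800 of them failed or timed out.
availability = slo_adherence(998_200, 1_000_000)
print(f"{availability:.4f}")                 # 0.9982
print(meets_slo(998_200, 1_000_000, 0.999))  # False
```

A team in this position could still show an excellent change failure rate, which is the gap the reliability signal is meant to expose.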
2024 DORA benchmarks: elite, high, medium, low
The 2024 DORA State of DevOps report buckets teams into four performance clusters (elite, high, medium, and low) with concrete bands for each of the four keys. Elite performers represented about 19 percent of respondents and deploy on demand with sub-day lead time, ~5 percent change failure rate, and under-an-hour recovery. The benchmarks below are a common reference point for calibrating your own numbers.
| Cluster | Deploy Freq | Lead Time | Change Failure | Recovery |
|---|---|---|---|---|
| Elite | On demand | < 1 day | ~ 5% | < 1 hour |
| High | Daily–weekly | 1 day–1 week | ~ 20% | < 1 day |
| Medium | Weekly–monthly | 1 week–1 month | ~ 10% | < 1 day |
| Low | Monthly–6 mo | 1–6 months | ~ 40% | 1 week–1 month |
Two findings from the 2024 report are worth flagging. First, the high-performance cluster shrank from 31 percent to 22 percent of respondents, while the low cluster grew from 17 percent to 25 percent — a reminder that delivery performance is not a one-way ratchet. Second, for the first time the medium cluster posted a lower change failure rate (~10%) than the high cluster (~20%), breaking the usual pattern where all four keys move together. That anomaly is a caution against reading any single number in isolation.
The chart makes the squeeze visible: a fat middle, a contested top, and a growing tail. Sources for these numbers: the 2024 DORA report and the DX summary of its findings.
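If you keep the benchmark table in code, the medium-versus-high anomaly is easy to check mechanically. The sketch below copies the bands from the table above; the dictionary layout and the function name are assumptions for illustration, and the change failure rates are the approximate figures quoted in this post.

```python
# Bands copied from the 2024 benchmark table; CFR values are approximate.
BENCHMARKS_2024 = {
    "elite":  {"deploy_freq": "on demand",      "lead_time": "< 1 day",
               "cfr_pct": 5,  "recovery": "< 1 hour"},
    "high":   {"deploy_freq": "daily-weekly",   "lead_time": "1 day-1 week",
               "cfr_pct": 20, "recovery": "< 1 day"},
    "medium": {"deploy_freq": "weekly-monthly", "lead_time": "1 week-1 month",
               "cfr_pct": 10, "recovery": "< 1 day"},
    "low":    {"deploy_freq": "monthly-6 mo",   "lead_time": "1-6 months",
               "cfr_pct": 40, "recovery": "1 week-1 month"},
}
CLUSTER_ORDER = ["elite", "high", "medium", "low"]  # best to worst

def cfr_is_monotonic(benchmarks: dict, order: list) -> bool:
    """True if change failure rate never decreases from best to worst cluster."""
    rates = [benchmarks[name]["cfr_pct"] for name in order]
    return all(a <= b for a, b in zip(rates, rates[1:]))

print(cfr_is_monotonic(BENCHMARKS_2024, CLUSTER_ORDER))  # False
```

The `False` result is the anomaly in data form: a change failure rate of about 10 percent alone cannot tell you whether a team sits in the medium cluster or somewhere better.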
How to calculate each metric
Calculating the Four Keys is mechanical once you have two event streams: deployment events and incident events. Deployment frequency and lead time come from your version-control and CI/CD systems; change failure rate and recovery time come from your incident tracker joined back to deploys. You do not need a heavy platform to start — webhooks and a spreadsheet get you a first signal in an afternoon.
Here is the arithmetic, in plain form, with worked examples:
DEPLOYMENT FREQUENCY
= successful_prod_deploys / time_window
e.g. 40 deploys / 20 working days = 2.0 / day -> "on demand" band
LEAD TIME FOR CHANGES (median, not mean; outliers skew the mean)
= median(deploy_timestamp - first_commit_timestamp) per change
e.g. median of [3h, 6h, 9h, 30h] = 7.5h -> "< 1 day" band
CHANGE FAILURE RATE
= deploys_causing_incident / total_deploys * 100
e.g. 8 incident-causing / 100 deploys = 8% -> healthy
FAILED DEPLOY RECOVERY TIME (median)
= median(service_restored_ts - incident_start_ts)
e.g. median of [12m, 25m, 40m] = 25m -> "< 1 hour" band
Two rules keep the numbers honest. Use the median, not the mean, for lead time and recovery — a single 200-hour outlier wrecks an average and tells you nothing about the typical change. And define "failure" once, explicitly: a degradation requiring a hotfix, rollback, or patch counts; a cosmetic bug that ships next sprint does not. Write that definition into your team's shared knowledge so every reviewer scores incidents the same way.
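The arithmetic and both rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the `FAILURE_REMEDIATIONS` labels, and the sample list with a 200-hour outlier are assumptions invented for the example.

```python
from statistics import mean, median

# Assumption: the remediation labels your incident tracker records.
FAILURE_REMEDIATIONS = {"hotfix", "rollback", "patch"}

def counts_as_failure(remediation: str) -> bool:
    """Apply one explicit definition of 'failure' to every incident."""
    return remediation in FAILURE_REMEDIATIONS

def deployment_frequency(successful_prod_deploys: int, working_days: int) -> float:
    """Successful production deploys per working day."""
    return successful_prod_deploys / working_days

def change_failure_rate(deploys_causing_incident: int, total_deploys: int) -> float:
    """Percentage of deploys that caused a qualifying incident."""
    return 100 * deploys_causing_incident / total_deploys

# The four worked examples from the arithmetic above.
print(deployment_frequency(40, 20))  # 2.0 per day
print(median([3, 6, 9, 30]))         # 7.5 hours of lead time
print(change_failure_rate(8, 100))   # 8.0 percent
print(median([12, 25, 40]))          # 25 minutes to recover

# Rule 1: one 200-hour outlier drags the mean far from the typical change.
lead_times_hours = [3, 6, 9, 30, 200]
print(mean(lead_times_hours))        # 49.6
print(median(lead_times_hours))      # 9

# Rule 2: a cosmetic bug fixed next sprint is not a change failure.
print(counts_as_failure("rollback"))     # True
print(counts_as_failure("next-sprint"))  # False
```

With the outlier included, the mean reports roughly two days while the median still reports nine hours, which is much closer to what a typical change experienced.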
The entity model above is the minimum data you must capture: deployments bundle commits (giving you lead time), deployments may cause incidents (giving you change failure rate), and incidents resolve via a recovery (giving you recovery time). Capture those four tables and every key falls out as a query.
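As a rough sketch of that entity model, the four record types can be written as dataclasses with each key expressed as a query over them. The class and field names here are hypothetical, not a schema from DORA or any tool, and the sample data is invented to match the worked lead-time example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional

@dataclass
class Commit:
    sha: str
    committed_at: datetime

@dataclass
class Recovery:
    restored_at: datetime

@dataclass
class Incident:
    started_at: datetime
    recovery: Optional[Recovery] = None  # None while the incident is open

@dataclass
class Deployment:
    deployed_at: datetime
    commits: list[Commit]
    incident: Optional[Incident] = None  # set when the deploy caused one

def lead_time(deploys: list[Deployment]) -> timedelta:
    """Median of (deploy time - earliest bundled commit time)."""
    return median(
        d.deployed_at - min(c.committed_at for c in d.commits)
        for d in deploys if d.commits
    )

def change_failure_rate(deploys: list[Deployment]) -> float:
    """Percentage of deployments linked to an incident."""
    failed = sum(1 for d in deploys if d.incident is not None)
    return 100 * failed / len(deploys)

def recovery_time(deploys: list[Deployment]) -> timedelta:
    """Median of (restored - started) over resolved incidents."""
    return median(
        d.incident.recovery.restored_at - d.incident.started_at
        for d in deploys
        if d.incident is not None and d.incident.recovery is not None
    )

# Hypothetical data: four deploys; the second causes a 25-minute incident.
t0 = datetime(2024, 6, 3, 9, 0)

def deploy(day: int, hours: int, incident: Optional[Incident] = None) -> Deployment:
    committed = t0 + timedelta(days=day)
    return Deployment(committed + timedelta(hours=hours),
                      [Commit(f"c{day}", committed)], incident)

bad_start = t0 + timedelta(days=1, hours=6, minutes=10)
deploys = [
    deploy(0, 3),
    deploy(1, 6, Incident(bad_start, Recovery(bad_start + timedelta(minutes=25)))),
    deploy(2, 9),
    deploy(3, 30),
]
print(lead_time(deploys))            # 7:30:00
print(change_failure_rate(deploys))  # 25.0
print(recovery_time(deploys))        # 0:25:00
```

Deployment frequency is then just `len(deploys)` divided by the number of working days in the window, so it needs no extra record type.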
DORA in context: SPACE, DevEx, and DX Core 4
DORA measures delivery outcomes, but it does not explain why a team is slow — for that you need frameworks that measure the human system. The lineage runs DORA → SPACE → DevEx framework → DX Core 4, each adding resolution. Nicole Forsgren co-authored the first three, which is why they compose rather than compete. Understanding where DORA stops and the others begin prevents the classic mistake of treating four delivery numbers as a complete picture of engineering health.
The table below maps the four frameworks to what each actually measures and when it appeared.
| Framework | Year | Dimensions | Focus |
|---|---|---|---|
| DORA / Four Keys | 2018 | 4 delivery metrics | Delivery outcomes |
| SPACE | 2021 | 5 human dimensions | Holistic productivity |
| DevEx framework | 2023 | 3 (Feedback, Cognitive Load, Flow) | Daily experience |
| DX Core 4 | 2024 | 4 (Speed, Effectiveness, Quality, Impact) | Unified scorecard |
SPACE (2021) widened DORA's lens to five dimensions: Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow. Its core argument is that productivity is multidimensional and you should pick at least one metric from several dimensions, never measuring activity alone.
The DevEx framework (2023), published in ACM Queue, distilled 25 sociotechnical factors into three measurable dimensions: Feedback Loops (how fast tools respond), Cognitive Load (how much you must hold in your head), and Flow State (how often you reach deep focus). It is the most actionable model for improving the daily experience — read our pillar on developer experience for the full breakdown.
DX Core 4 (2024), by Laura Tacho and Abi Noda, unifies all three into one scorecard: Speed (diffs per engineer), Effectiveness (the DXI, a 14-question survey), Quality (change failure rate — DORA lives on here), and Impact (percent of time on new capabilities). Notably, DX — the company behind getdx.com, founded by Abi Noda and Greyson Junggren — was acquired by Atlassian in an agreement announced September 18, 2025, valued near $1 billion, a sign of how seriously the industry now takes delivery measurement.
DORA tells you the result; the DevEx framework tells you the cause; DX Core 4 puts them on one page. That is the whole story of the measurement lineage in a single graph.
DORA metrics in the AI agent era
DORA metrics matter more in the AI agent era, not less, because agents change who writes code but not the path code travels to production. When AI coding agents generate ten times the diffs, the review queue, the CI pipeline, and the deploy gate become the binding constraints — and those are exactly what lead time and change failure rate measure. Volume without stability is just faster risk.
There is a hard truth worth stating plainly: the friction that hurts humans hurts agents too. A slow CI pipeline, a flaky test suite, and an unclear deploy process throttle an agent just as they throttle a person. And the LLM will not get paged at 3am; a human still owns the pull request, the incident, and the recovery. So as agents raise deployment frequency, change failure rate and recovery time become the guardrails that keep the velocity from turning into outages. (This argument echoes a Beyond Coding episode with Bas de Groot, hosted by Patrick Akil; the Gregor Hohpe framing in it is a paraphrase.)
This is where agentic engineering platforms and durable automation workflows come in: they let you put an agent on the measurable parts of the loop — drafting the PR, summarizing the incident, posting the weekly DORA digest — while a human keeps ownership of the risky parts. For a deeper look at how agents fit production systems, see our guide to multi-agent systems and what AI agents are.
There is also a measurement subtlety to watch as agents arrive. Deployment frequency and diffs-per-engineer can spike overnight when agents start generating code, but that spike is not automatically progress — it can be noise. The right move is to anchor on the stability keys during an agent rollout: if change failure rate and recovery time hold steady (or improve) while throughput climbs, the agents are genuinely accelerating delivery. If stability degrades as volume rises, the agents are manufacturing rework, and your lead time will quietly balloon as the review queue clogs with diffs nobody trusts. Read the four keys as a ratio during transitions, not as four separate dials, and let the stability side veto the velocity side. The same caution applies to any tooling change that promises more output — measure whether the output is shippable, not just whether there is more of it.
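The "stability vetoes velocity" rule can be written down as a simple before-and-after comparison. The sketch below is one possible reading, not a DORA-defined procedure: the `Period` fields, the verdict labels, and the 10 percent noise tolerance are assumptions you would tune for your own team.

```python
from dataclasses import dataclass

@dataclass
class Period:
    deploys_per_day: float
    change_failure_rate_pct: float
    median_recovery_minutes: float

def rollout_verdict(before: Period, after: Period, tolerance: float = 0.10) -> str:
    """Let the stability keys veto a throughput gain.

    tolerance is a hypothetical relative allowance for week-to-week noise.
    """
    cfr_held = (after.change_failure_rate_pct
                <= before.change_failure_rate_pct * (1 + tolerance))
    recovery_held = (after.median_recovery_minutes
                     <= before.median_recovery_minutes * (1 + tolerance))
    if not (cfr_held and recovery_held):
        return "rework"        # stability degraded: extra volume is suspect
    if after.deploys_per_day > before.deploys_per_day:
        return "accelerating"  # throughput up while stability held
    return "inconclusive"      # no throughput change to judge yet

baseline = Period(2.0, 8.0, 25.0)
print(rollout_verdict(baseline, Period(5.0, 8.5, 24.0)))   # accelerating
print(rollout_verdict(baseline, Period(5.0, 16.0, 60.0)))  # rework
```

Checking stability first is the design choice that matters: a throughput gain is only evaluated after both stability keys have passed.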
How to build a DORA dashboard
Building a DORA dashboard takes three inputs and one habit. The inputs are a deployment event source (CI/CD or version control), an incident source (your tracker), and a place to aggregate and visualize the four keys. The habit is a recurring delivery review where a human reads the numbers, decides what to fix, and assigns the work. The dashboard without the review is wallpaper.
A practical, build-it-this-week approach looks like this:
- Wire deployment events. Capture every successful production deploy with a timestamp and the commits it bundles. GitHub, GitLab, or your CI emit these as webhooks.
- Wire incident events. Tag incidents that were caused by a deploy and record start and restored timestamps. This gives you change failure rate and recovery time.
- Aggregate the four keys on a weekly cadence, using medians for lead time and recovery.
- Run a 30-minute weekly delivery review. Read the four keys, pick the one that drifted, and turn the fix into an owned task.
- Capture decisions as memory so next quarter's team knows why you changed the deploy gate.
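The weekly aggregation step above can be sketched as a grouping by ISO week followed by a median. The function name and the sample timestamps are hypothetical, and the input is assumed to be pairs of (first commit time, deploy time) that you have already extracted from your own deploy events.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

def weekly_lead_time_hours(changes):
    """Group (first_commit_at, deployed_at) pairs by the ISO week of the
    deploy and return {(iso_year, iso_week): median lead time in hours}."""
    buckets = defaultdict(list)
    for committed_at, deployed_at in changes:
        iso = deployed_at.isocalendar()  # (year, week, weekday)
        hours = (deployed_at - committed_at).total_seconds() / 3600
        buckets[(iso[0], iso[1])].append(hours)
    return {week: median(values) for week, values in sorted(buckets.items())}

# Hypothetical changes across two consecutive weeks.
changes = [
    (datetime(2024, 6, 3, 9), datetime(2024, 6, 3, 12)),    # 3 hours
    (datetime(2024, 6, 4, 9), datetime(2024, 6, 4, 18)),    # 9 hours
    (datetime(2024, 6, 10, 9), datetime(2024, 6, 11, 15)),  # 30 hours
]
print(weekly_lead_time_hours(changes))
# {(2024, 23): 6.0, (2024, 24): 30.0}
```

A jump like the one from week 23 to week 24 is the kind of drift the weekly review is meant to catch and turn into an owned task.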
Where does Taskade fit? Here is the honest guardrail: Taskade is not a DORA metrics platform, not a CI/CD system, and not a code-review tool. Your raw signals come from your delivery and incident tooling. Taskade is the planning, knowledge, and automation layer around engineering work — the system of record for running the delivery practice on top of the data. That is its whole job, and it is a different job than the platforms that compute the metrics.
Concretely, Taskade's Workspace DNA — Memory (Projects), Intelligence (AI Agents), and Execution (Automations) — maps onto the delivery-review loop:
- Memory: Clone a DORA-style metrics board, a sprint board, or an on-call rotation from the Community gallery so the weekly review has a home and a history. Pull project data across 7 views — List, Board, Calendar, Table, Mind Map, Gantt, and Org Chart (Timeline lives inside Gantt) — so the same delivery data reads as a board to engineers and a Gantt to leadership.
- Intelligence: Point an AI Agent — built on EVE, with 33 built-in tools and a choice of 15+ frontier models from OpenAI, Anthropic, Google, and open-weight providers — at your weekly numbers to draft the delivery digest, flag the metric that moved, and suggest the fix.
- Execution: Use Automations and Taskade's 100+ bidirectional integrations — triggers pull GitHub deploy and incident events in, actions push reminders and digests out — to nudge the on-call owner when change failure rate crosses your threshold.
The sequence above is the redundant-encoding payoff: the same loop described in the list of steps, drawn as a message flow. Raw metrics live in your delivery tooling; the running of the practice (the digest, the assignment, the memory, the nudge) lives in the workspace.
Teams that want to start from a template can clone a board with one prompt using Taskade Genesis, which builds live apps from a description, or browse project-management automations and task-automation patterns for the recurring-review machinery. For the broader playbook on connecting agents to tools, see agent memory and connected tools and our guide to AI agent tools.
Common DORA mistakes (and how to avoid them)
The most common DORA mistake is treating the four keys as individual KPIs to maximize rather than a balanced set to read together. Maximizing deployment frequency alone produces trivial deploys; minimizing change failure rate alone produces a team afraid to ship. DORA's own research is explicit: the metrics only mean something as a system, where moving one without harming the others is the signal.
Three failure modes to avoid:
| Mistake | Symptom | Fix |
|---|---|---|
| Single-metric tunnel vision | Gamed deploy counts | Read all four as a scorecard |
| Individual leaderboards | Fear, hoarding, sandbagging | Measure teams, never people |
| Numbers without narrative | Metrics ignored | Add a weekly human review |
Never tie DORA metrics to individual compensation or rank engineers on a leaderboard. Goodhart's law guarantees that the moment a metric becomes a target for an individual, it stops measuring reality and starts measuring fear. Measure teams and systems, pair the quantitative four keys with qualitative DevEx signals like cognitive load, and keep a human in the loop to interpret what the numbers mean — the same discipline good agile and scrum practice already teaches. For teams distributed across time zones, the remote scrum playbook shows how to run that review asynchronously.
Frequently asked questions
What are the four DORA metrics?
The four DORA metrics, called the Four Keys, are Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Failed Deployment Recovery Time (formerly Time to Restore Service). The first two measure throughput or velocity; the last two measure stability. A fifth signal, Reliability or operational performance, was added in later DORA research to capture whether services meet user expectations.
What are the 2024 DORA benchmarks for elite performers?
In the 2024 DORA State of DevOps report, elite performers deploy on demand, have a lead time for changes of less than one day, a change failure rate around 5 percent, and recover from failed deployments in under one hour. Elite teams were roughly 19 percent of respondents. High performers deploy daily to weekly with lead times of one day to one week.
How do you calculate deployment frequency?
Deployment frequency is the count of successful production releases over a time window divided by that window. If a team shipped 40 successful deployments in 20 working days, the deployment frequency is 2 per day. DORA reports it as a band — on demand, daily to weekly, weekly to monthly, or monthly to every six months — rather than a single precise number.
What is lead time for changes?
Lead time for changes is the elapsed time from when code is committed to when that code runs successfully in production. It captures the entire path through code review, continuous integration, testing, and deployment. Elite teams keep lead time under one day; low performers can take one to six months. It is the clearest measure of delivery friction.
What is a good change failure rate?
A good change failure rate is under 15 percent. In the 2024 DORA report, elite performers sat near 5 percent, high performers near 20 percent, and low performers near 40 percent. Change failure rate is the percentage of deployments that cause a degraded service requiring a hotfix, rollback, or patch. Lower is better, but chasing zero usually means shipping too slowly.
What is the difference between DORA, SPACE, and the DevEx framework?
DORA measures software delivery outcomes with four system-level metrics. SPACE, published in 2021, broadens the lens to five human dimensions: Satisfaction, Performance, Activity, Communication, and Efficiency. The DevEx framework from 2023 distills sociotechnical research into three dimensions: Feedback Loops, Cognitive Load, and Flow State. DX Core 4 from 2024 unifies all three into Speed, Effectiveness, Quality, and Impact.
Are DORA metrics still relevant in the AI agent era?
Yes. DORA metrics stay relevant because AI coding agents change who writes code but not the path code travels to production. Friction in review, integration, and deployment still gates delivery. As agents generate more diffs, change failure rate and lead time become more important guardrails, not less, because volume without stability creates risk.
Can DORA metrics be gamed?
Yes, any single metric can be gamed. Teams can inflate deployment frequency with trivial deploys or lower change failure rate by shipping less. The defense is to read the four keys together as a balanced scorecard, pair them with qualitative DevEx signals like cognitive load, and never tie individual compensation to a single number.
What tools do I need to build a DORA dashboard?
You need a source of deployment events, usually your CI/CD system or version control, an incident source for failures and recovery times, and a place to aggregate and visualize the four keys. Many teams use a dedicated platform, while others assemble a lightweight dashboard from version-control webhooks and a shared project workspace that tracks the running cadence around the data.
Is Taskade a DORA metrics platform?
No. Taskade is not a DORA metrics platform, a CI/CD system, or a code-review tool. It is the planning, knowledge, and automation layer around engineering work. Teams use it to clone a metrics dashboard board, run the weekly delivery review, capture decisions as memory, and trigger reminders when a number drifts, while the raw signals come from your delivery and incident tooling.
Start running your delivery practice
DORA's Four Keys remain the most durable way to measure software delivery because they refuse the false choice between speed and stability. Deployment frequency and lead time tell you how fast value flows; change failure rate and recovery time tell you what happens when it breaks. Read them together, benchmark against the 2024 elite/high/medium/low bands, and pair them with the human signals from SPACE, the DevEx framework, and DX Core 4.
The metrics are computed by your delivery tooling — but the practice of running them lives in a workspace. Clone a delivery board from the Community gallery, point an AI Agent at the weekly numbers, wire Automations to your GitHub deploy events, and keep a human in the loop on the review. Build the whole loop with Taskade Genesis from a single prompt, or start free and bring your team along. The fastest teams are not the ones with the most dashboards — they are the ones who actually read them, decide, and ship.