Operational resilience for CTOs: Meeting FCA and DORA without turning engineering into paperwork
In the EU, the Digital Operational Resilience Act applies from 17 January 2025. In the UK, the FCA, PRA, and Bank of England set a key operational resilience deadline of 31 March 2025 for firms to stay within impact tolerances during severe disruptions. That puts a date on work plenty of teams have kicked down the road for years. And it turns resilience from “good engineering” into something you have to demonstrate.
My thesis is simple. Operational resilience is a management system for technology risk. FCA and DORA push you to agree what matters, test it under stress, fix the weak spots, and keep evidence that you did.
What is operational resilience (and what FCA and DORA are trying to achieve)?
Operational resilience means you can keep delivering your most important services on a bad day. Not a normal incident. A bad day with multiple failures, missing people, and suppliers that aren’t picking up the phone.
The FCA defines the core unit as an important business service. You set an impact tolerance for each service. That tolerance is the maximum tolerable disruption. The FCA, PRA, and Bank of England describe this model in their policy statements and supervisory statements, including PS21/3 and SS1/21: FCA PS21/3, Bank of England SS1/21.
DORA takes a wider view. It sets rules for ICT risk management, incident reporting, resilience testing, and third party risk. It also creates EU oversight for critical ICT providers. The legal text is Regulation (EU) 2022/2554: EUR-Lex DORA text.
Both regimes are aiming at the same outcomes.
- Protect customers and markets. Outages hurt people, not just dashboards.
- Force clarity on what matters. You can’t defend everything equally.
- Reduce hidden coupling. Big outages usually come from dependencies you didn’t model.
- Make testing real. Tabletop exercises don’t find brittle systems.
- Make accountability real. Regulators want named owners and board oversight.
Here’s a definition I use with boards because it lands quickly.
Operational resilience is the ability to stay within agreed customer harm limits for critical services during severe but plausible disruption, backed by tested controls and audit-ready evidence.
That definition changes the conversation. It’s not “five nines uptime.” It’s “what level of disruption causes unacceptable harm, and can we stay inside that boundary?”
What do these regulations include in practice?
- Service identification. A short list of services that matter most.
- Impact tolerances. Time based limits, plus volume and data integrity limits.
- Mapping. People, process, tech, facilities, and third parties.
- Testing. Scenario tests that match severe but plausible events.
- Remediation. Investment plans tied to mapped weak points.
- Governance and evidence. Clear ownership, reporting, and records.
The framing I give other CTOs is this. FCA and DORA turn resilience into a product with requirements, tests, and release criteria.
FCA operational resilience requirements: important business services, impact tolerances, and mapping
Most CTOs trip on one thing. They start from systems. FCA starts from services.
How to define an “important business service” without boiling the ocean
An important business service is not “payments platform” or “Kubernetes cluster.” It’s a customer facing service that, if disrupted, causes intolerable harm.
Examples I’ve seen work well:
- “Card payments authorization for UK retail customers.”
- “Online mortgage overpayment submission and confirmation.”
- “Same day BACS file submission for SME payroll customers.”
Examples that fail:
- “Core banking.” Too broad.
- “Mobile app.” Too vague.
- “AWS.” That’s a supplier, not a service.
A practical filter is customer harm. Ask questions you can put numbers against:
- How many customers get blocked per hour of outage?
- What’s the financial exposure per hour?
- What’s the legal or safety exposure?
- What’s the reputational blast radius?
If you can’t answer those, you don’t have a service definition. You have a system label.
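Those four questions can be turned into a rough scoring sketch. Everything here is illustrative: the field names, the thresholds, and the idea of a single boolean filter are my assumptions, not an FCA method. Calibrate the thresholds against your own harm analysis.

```python
from dataclasses import dataclass

@dataclass
class HarmEstimate:
    """Per-hour harm estimate for one candidate service (illustrative fields)."""
    customers_blocked_per_hour: int
    financial_exposure_per_hour: float  # e.g. GBP
    legal_or_safety_exposure: bool
    reputational_reach: int  # customers who see the failure externally

def is_important_business_service(h: HarmEstimate,
                                  customer_threshold: int = 1_000,
                                  financial_threshold: float = 50_000.0) -> bool:
    # Thresholds are placeholders; set them from your own harm analysis.
    return (h.customers_blocked_per_hour >= customer_threshold
            or h.financial_exposure_per_hour >= financial_threshold
            or h.legal_or_safety_exposure)

# "Card payments authorization for UK retail customers" clears the bar easily:
card_auth = HarmEstimate(25_000, 120_000.0, False, 25_000)
```

The point of the exercise is not the code. It is that every field forces a number you can defend in front of a regulator.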
How to set impact tolerances that engineering can actually use
Impact tolerances have to be measurable. Time is the anchor, but time alone is a weak spec.
I like a three part tolerance per service:
- Maximum outage time. Example: 2 hours.
- Maximum degraded mode time. Example: 24 hours at 60 percent throughput.
- Maximum data integrity breach. Example: 0 lost transactions, 0 duplicated postings.
You can also add a volume metric:
- Maximum queued work. Example: no more than 50,000 pending payment instructions.
The catch is that tolerances need to reflect harm, not comfort. If your current RTO is 8 hours and harm starts at 2 hours, the tolerance is 2 hours. Now you’ve got a gap to close, and that gap needs funding.
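The three part tolerance above can be written down as data, which makes breach checks mechanical after an incident. A minimal sketch; the field names and units are my own, not a regulatory schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImpactTolerance:
    """Three-part tolerance per service (illustrative units: minutes, count)."""
    max_outage_minutes: int           # hard-down limit
    max_degraded_minutes: int         # time allowed below full throughput
    min_degraded_throughput_pct: int  # the floor the degraded limit assumes
    max_lost_transactions: int        # data integrity limit

def breaches(t: ImpactTolerance, outage_min: int, degraded_min: int,
             lost_txns: int) -> list[str]:
    """Return which parts of the tolerance an incident breached."""
    hits = []
    if outage_min > t.max_outage_minutes:
        hits.append("outage time")
    if degraded_min > t.max_degraded_minutes:
        hits.append("degraded mode time")
    if lost_txns > t.max_lost_transactions:
        hits.append("data integrity")
    return hits

# Example from the text: 2h outage, 24h at >=60% throughput, zero lost transactions.
card_auth = ImpactTolerance(120, 1440, 60, 0)
```

A 2.5 hour outage with no data loss breaches only the outage limit; a slow burn at half throughput for two days breaches the degraded mode limit. Either way, the post incident report writes itself against the tolerance.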
Mapping that is useful, not a diagram graveyard
FCA mapping includes people and process, not just tech. Engineers sometimes roll their eyes at that. Then they get hit by an incident where the runbook is wrong, the escalation path is unclear, and the one person who knows the system is on a plane.
A mapping output that works in audits has two layers:
- Service map. The end to end flow, from customer request to final confirmation.
- Dependency register. The components that can break the service.
Your dependency register should include:
- Applications and services. Names, owners, tier.
- Data stores. RPO, replication mode, restore steps.
- Queues and event buses. Backlog limits, replay plan.
- Identity and access. SSO, MFA, break glass.
- Networks and DNS. External dependencies, failover.
- People and runbooks. On call coverage, escalation paths.
- Third parties. Cloud, SaaS, payment processors, telecoms.
If you run microservices, automated discovery speeds this up. But you still need a human-reviewed service boundary. Our own Microservices Dependency Mapper can help you build a living dependency view: /tools/microservices-dependency-mapper.
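A dependency register doesn't need heavyweight tooling to start. Here's a sketch of the register as plain records, with a check that flags untested failover on critical tiers; the fields and the tiering rule are assumptions, not a standard.

```python
from dataclasses import dataclass

@dataclass
class Dependency:
    """One row in the dependency register (illustrative fields)."""
    name: str
    owner: str
    tier: int           # 0 = most critical
    has_failover: bool  # failover exists AND has been tested
    third_party: bool = False

def single_points_of_failure(deps: list[Dependency]) -> list[str]:
    """Flag tier 0 and tier 1 dependencies without tested failover."""
    return [d.name for d in deps if d.tier <= 1 and not d.has_failover]

register = [
    Dependency("fraud-scoring-api", "payments-team", 0,
               has_failover=False, third_party=True),
    Dependency("auth-db", "platform-team", 0, has_failover=True),
]
```

Even a register this thin surfaces the question that matters: which critical dependency has no tested way back?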
A real world scenario:
A bank sets a 2 hour tolerance for “card authorization.” The map shows a single point of failure in a fraud scoring API that calls a third party model endpoint. The endpoint has no regional failover. The bank didn’t know that. The map makes it obvious, and now the remediation plan has a clear target.
DORA compliance for CTOs: ICT risk management, incident reporting, testing, and third party oversight
DORA is broader than FCA operational resilience. It covers resilience, but it also sets baseline ICT controls and reporting.
ICT risk management: what auditors will ask you to show
DORA requires an ICT risk management framework. In practice, auditors want to see that you run a loop:
- You identify ICT risks.
- You set controls.
- You monitor.
- You test.
- You fix gaps.
This overlaps with ISO 27001 and SOC 2, but DORA is stricter on resilience testing and third party risk. The European Banking Authority and other ESAs publish technical standards and guidance that shape what “good” looks like: EBA DORA page.
If you already run SOC 2, you can reuse a lot. Our SOC 2 Readiness tool can help you map controls and evidence: /tools/soc2-readiness.
Incident reporting: build the pipeline before you need it
DORA sets incident reporting requirements for major ICT related incidents. The exact thresholds and timelines depend on the final RTS and your regulator, but the operational need doesn’t change much.
You need:
- A clear incident classification scheme.
- A way to collect required fields fast.
- A workflow that produces regulator ready reports.
If your incident process lives in Slack and memory, it falls apart under pressure.
A practical build:
- Create an incident intake form with required fields.
- Auto pull timestamps from your paging tool.
- Auto pull affected services from your service catalog.
- Store reports in a controlled repository.
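The intake and report steps above can start as a small script rather than a platform. A sketch with illustrative field names; the actual required fields come from the final RTS and your regulator, so treat this schema as a placeholder.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class IncidentReport:
    """Minimal intake record. Field names are assumptions, not the DORA RTS schema."""
    incident_id: str
    classification: str           # e.g. "major" per your own scheme
    detected_at: str              # ISO 8601, pulled from your paging tool
    affected_services: list[str]  # pulled from your service catalog
    customer_impact: str
    status: str = "open"

def to_regulator_draft(report: IncidentReport) -> str:
    """Serialize to a stable JSON draft for the controlled repository."""
    return json.dumps(asdict(report), indent=2, sort_keys=True)

r = IncidentReport(
    incident_id="INC-2025-0042",
    classification="major",
    detected_at=datetime(2025, 3, 1, 9, 15, tzinfo=timezone.utc).isoformat(),
    affected_services=["card-authorization"],
    customer_impact="Card authorizations failing for UK retail customers",
)
```

The value is that the structure exists before the bad day, so the 3 a.m. incident commander fills in fields instead of inventing a format.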
Our Incident Response Planner can help you standardize roles and runbooks: /tools/incident-response. And our Incident Postmortem tool helps you keep action items and evidence in one place: /tools/incident-postmortem.
Digital operational resilience testing: move past tabletop exercises
DORA expects testing that matches your risk profile. For many financial entities, that includes threat led penetration testing (TLPT), which several jurisdictions align with the TIBER-EU framework.
Even if TLPT doesn’t apply to you, the direction is clear. You need to test failure modes that actually happen.
A testing ladder I’ve seen work:
- Component tests. Restore a database from backup. Prove RPO and RTO.
- Service failover tests. Region failover for a tier 1 service.
- Scenario tests. Ransomware plus loss of a key SaaS.
- Red team exercises. Privilege escalation and lateral movement.
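The first rung of that ladder, a timed restore test, is easy to automate so it produces evidence on every run. A sketch where `restore_fn` and `verify_fn` are placeholders for your own restore procedure and integrity check, not a real API:

```python
import time
from datetime import timedelta

def run_restore_test(restore_fn, rto: timedelta, verify_fn) -> dict:
    """Time a backup restore against the RTO and check data integrity.
    restore_fn and verify_fn are hooks you supply (assumptions, not a library)."""
    start = time.monotonic()
    restore_fn()  # e.g. restore the latest backup into a scratch environment
    elapsed = timedelta(seconds=time.monotonic() - start)
    data_ok = verify_fn()  # e.g. row counts or checksums -> proves RPO
    return {
        "elapsed_seconds": round(elapsed.total_seconds(), 1),
        "within_rto": elapsed <= rto,
        "data_integrity_ok": data_ok,
    }
```

Run it monthly, keep the returned record, and "12 of 12 restore tests passed" becomes a claim with receipts instead of an assertion.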
Google’s SRE book has a solid view on why you need error budgets and controlled risk in production: Google SRE book.
Third party risk: DORA makes vendor management a CTO problem
DORA puts third party ICT risk front and center. It also creates oversight for critical ICT providers.
CTOs feel this in three places:
- Contract clauses and audit rights.
- Concentration risk, like one cloud region or one identity provider.
- Exit plans that work in real time, not in a slide deck.
If you want one place to start, run a structured vendor review for your top 20 ICT suppliers by service criticality. Our Vendor Risk Assessment tool can help you standardize the questions and evidence: /tools/vendor-risk-assessment.
How to meet operational resilience requirements: a CTO playbook with metrics, tests, and evidence
Most CTOs I talk to struggle with the same tension. Regulators want rigor, and teams fear bureaucracy. You can satisfy both if you treat resilience like product delivery.
Here’s the framework I use.
The STAMP framework for operational resilience
STAMP stands for Services, Tolerances, Architecture, Monitoring, Proof.
The name nods to Nancy Leveson's STAMP (Systems-Theoretic Accident Model and Processes) work at MIT, which treats safety as a control problem rather than a component failure problem. Leveson's insight that accidents emerge from interactions between components, not just broken parts, maps directly to operational resilience. The STPA Handbook is worth reading if you want the full systems-theoretic foundation. I've adapted the thinking into five pillars that fit how CTOs actually run compliance programmes.
- Services. Define 10 to 30 important services, not 300 systems. The FCA's concept of important business services gives you the scoping model.
- Tolerances. Set time, throughput, and integrity limits per service. The Bank of England's impact tolerance guidance (SS1/21) defines what "tolerable" means in practice.
- Architecture. Remove single points of failure that break tolerances. The NIST Cybersecurity Framework 2.0 covers the governance and protection functions that underpin resilience architecture.
- Monitoring. Measure service health against tolerances, not host metrics. Google's SRE book on service level objectives is the best operational reference for measuring what matters.
- Proof. Keep evidence from tests, incidents, and fixes. DORA's regulatory technical standards set the bar for what evidence auditors expect.
You can run STAMP as a quarterly cycle, with monthly checkpoints. Use our STAMP Operational Resilience Assessment to score your maturity across all five pillars and identify gaps: /tools/stamp-framework.
Immediate actions CTOs can take in the next 30 days
- Pick the top services. Create a list of 10 candidate services. Tie each to a customer group and a product P&L.
- Write draft tolerances. Put numbers on time, volume, and integrity. Use your last 12 months of incidents to ground it.
- Build a thin service map. One page per service. Include owners and top dependencies.
- Run one severe scenario test. Example: primary region loss plus IAM outage. Time the response.
- Create an evidence folder. Store maps, test plans, results, and remediation tickets with immutable timestamps.
If you want a place to track all of this, use Command Center as your system of record for risks, incidents, SLOs, and capacity: /command-center.
Policy framework: what to standardize so teams move faster
- Service ownership. Each important service has a named business owner and a named tech owner.
- Resilience acceptance criteria. No tier 1 release ships without rollback steps, tested restores, and SLO dashboards.
- Change risk rules. High risk changes require a pre change review and a backout plan.
- Third party onboarding. New ICT vendors need security controls, resilience claims, and exit steps.
This is where leadership shows up. You need to make these rules boring. Teams follow boring rules.
Architecture principles: design choices that map to FCA and DORA expectations
- Failure isolation. Use bulkheads. One noisy tenant can’t take down the whole service.
- Deterministic recovery. Practice restores until they’re scripted. Manual restores fail at 3 a.m.
- Graceful degradation. Keep core flows working with reduced features. Example: allow balance checks even if insights fail.
- Dependency tiering. Tier your dependencies. A tier 0 dependency must have multi region and tested failover.
A simple decision matrix helps teams choose where to invest.
| Decision | Good for | Bad for | Evidence you need |
|---|---|---|---|
| Active active across regions | Tier 0 services with low tolerance | Complex data consistency | Failover test logs, RPO proof |
| Active passive with warm standby | Tier 1 services with 1 to 4 hour tolerance | Longer recovery | DR run results, restore scripts |
| Single region with backups | Tier 2 services | Tier 0 and tier 1 | Backup restore tests, RTO metrics |
| SaaS for critical function | Fast delivery | Vendor outage risk | Vendor SLA, exit plan, concentration analysis |
For build vs buy calls that affect resilience, use our Build vs Buy Matrix to make trade offs explicit: /tools/build-vs-buy-matrix.
Proving you met the requirements: the evidence pack auditors respect
Audits fail on evidence, not intent. You need a repeatable evidence pack per service.
I recommend a “service resilience dossier” with these artifacts:
- Service definition. Customer group, channels, volumes, harm statement.
- Impact tolerances. Numbers, rationale, approval date, owner.
- Service map. Dependencies, owners, third parties.
- Control mapping. Which controls protect which failure modes.
- Test plan and results. Dates, scope, outcomes, gaps.
- Remediation plan. Tickets, budgets, deadlines, owners.
- Incident history. Major incidents, time to detect, time to recover, lessons.
Metrics that show maturity:
- MTTD and MTTR per important service.
- Restore success rate. Example: 12 of 12 monthly restore tests passed.
- RPO achieved. Example: 0 data loss in quarterly failover tests.
- Change failure rate for tier 0 and tier 1 services.
- On call load. Pages per engineer per week. Track burnout risk.
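MTTR per important service falls straight out of your incident records. A sketch assuming simple record fields (`service`, `detected`, `recovered`); the names are mine, not a standard, so map them to whatever your incident tool exports.

```python
from datetime import datetime
from statistics import mean

def mttr_minutes(incidents: list[dict]) -> dict[str, float]:
    """Mean time to recover, in minutes, per important service."""
    by_service: dict[str, list[float]] = {}
    for inc in incidents:
        minutes = (inc["recovered"] - inc["detected"]).total_seconds() / 60
        by_service.setdefault(inc["service"], []).append(minutes)
    return {svc: round(mean(vals), 1) for svc, vals in by_service.items()}

incidents = [
    {"service": "card-authorization",
     "detected": datetime(2025, 1, 5, 9, 0),
     "recovered": datetime(2025, 1, 5, 9, 45)},   # 45 minutes
    {"service": "card-authorization",
     "detected": datetime(2025, 2, 2, 14, 0),
     "recovered": datetime(2025, 2, 2, 15, 15)},  # 75 minutes
]
```

The useful move is plotting this per service against its impact tolerance, not averaging across the estate, because the tolerance is per service.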
For root cause work, I like graph based methods that connect symptoms to shared dependencies. Our Split Cause (RCA) tool is built for that style of investigation: /splitcause.
And for architecture evidence, keep diagrams current. Use ArchiMate Modeler to maintain service maps that non engineers can read: /tools/archimate.
Enterprise implications: why operational resilience changes budgets, org design, and vendor strategy
- It changes what “done” means for tier 0 services. A feature isn’t done until you can recover it under stress. That affects roadmaps and release gates.
- It forces a service catalog and ownership model. Shadow services and unclear ownership fail audits. They also fail customers. Use this work to clean up your org chart and your on call map.
- It raises the bar for third party contracts. You need audit rights, incident notification terms, and exit steps. Procurement can’t do this alone. Engineering has to validate the claims.
- It makes resilience a board reporting topic. You’ll report tolerances, test results, and remediation progress. That pushes you to translate engineering metrics into customer harm metrics.
A common scenario:
A fintech runs on one cloud region and one managed database. They have good uptime, but no tested restore. Under FCA style scrutiny, they can’t show they can stay within a 2 hour tolerance for “payments initiation.” The fix isn’t a new tool. The fix is a funded plan: multi AZ, tested restores, and a quarterly failover drill with evidence.
Bigger picture: resilience is now shaped by regulation, supply chains, and geopolitics
DORA and FCA rules land in a world where outages come from more than bugs. Cloud concentration, SaaS dependencies, and cyber attacks drive real customer harm. The NIST Cybersecurity Framework 2.0 points in the same direction. You need clear governance, tested response, and measured recovery.
I also see a talent angle. Teams that practice resilience ship faster over time. They spend less time in surprise outages and more time in planned work. But you have to protect the humans. Sustainable on call is part of resilience. Our guide to designing fair on call rotations fits directly into this work: /tools/oncall-rotation-optimizer.
The question is whether your firm treats operational resilience as a compliance project, or as a way to run technology like it matters. Which one will your engineers believe after the next major incident?
Sources:
- https://www.fca.org.uk/publication/policy/ps21-3.pdf
- https://www.bankofengland.co.uk/prudential-regulation/publication/2021/march/operational-resilience-impact-tolerances-for-important-business-services
- https://eur-lex.europa.eu/eli/reg/2022/2554/oj
- https://www.eba.europa.eu/regulation-and-policy/digital-operational-resilience-dora
- https://sre.google/sre-book/table-of-contents/
- https://www.nist.gov/cyberframework
- https://direct.mit.edu/books/oa-monograph/2908/Engineering-a-Safer-WorldSystems-Thinking-Applied
- http://www.flighttestsafety.org/images/STPA_Handbook.pdf