Why tech leaders should track service level objectives (SLOs) in load testing campaigns

Diego Salinas

Enterprise Content Manager

Last updated on Friday, June 2026

When Canal+ needed to guarantee its streaming platform could handle millions of concurrent viewers during a major live football broadcast, the team didn't simply run a load test and hope for the best.

They ran progressive, iterative load campaigns against explicit performance targets, identified and resolved bottlenecks in caching and licensing APIs, and optimised machine sizing before a single viewer tuned in. The result: zero incidents during the broadcast. Not "fewer incidents than last time." Zero.

That outcome didn't come from running harder tests. It came from running smarter ones — anchored to Service Level Objectives that defined, in user-relevant terms, exactly what "good enough" meant before go-live.

For tech leaders, this is the core argument: load testing without SLOs is activity. Load testing with SLOs is governance.

The framework: SLIs, SLOs, SLAs, and error budgets

Before getting into practice, the terminology needs to be precise — because sloppy definitions lead to sloppy governance.

Google's SRE literature provides the clearest foundation:

SLI (Service Level Indicator): A quantitative measure of service behaviour — request latency, error rate, throughput, availability.
SLO (Service Level Objective): The target or acceptable range for that SLI. For example: "99.9% of checkout requests complete within 300 ms over a 30-day window."
SLA (Service Level Agreement): The external commitment to customers, usually with financial penalties attached.
Error budget: The allowable unreliability implied by the SLO. At 99.9%, that's roughly 43 minutes of downtime per month. At 99.99%, it drops to about 4 minutes.
Burn rate: How quickly that budget is being consumed, the key signal for operational urgency.
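The error-budget figures above are simple arithmetic. Here is a minimal sketch, assuming a 30-day window and an availability-style SLO; the helper name `error_budget_minutes` is made up for illustration:

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full unavailability an availability SLO allows over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


for target in (99.9, 99.99):
    print(f"{target}% -> {error_budget_minutes(target):.1f} min per 30 days")
```

Under these assumptions, 99.9% works out to about 43.2 minutes and 99.99% to about 4.3 minutes, which matches the rounded figures in the list.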

One leadership principle follows immediately from this structure: your internal SLO should be stricter than your public SLA. Google Cloud's own guidance illustrates this with a 99.95% internal SLO paired with a 99.9% SLA. That gap is a deliberate safety buffer — and running load tests against the internal SLO means you surface contractual risk while there's still time to fix it.

The second principle is equally important: SLOs must be user-centred, not infrastructure-centred. A load test that only reports CPU utilisation and median response time is measuring what's convenient, not what customers experience. The right SLI is the one that, if barely met, still keeps the typical user satisfied.

How SLOs change the design of load tests

Most load testing today still asks the wrong question: "What was the maximum RPS we achieved in the lab?" SLO-driven load testing asks a more useful set of questions:

At what request rate do we stop meeting the user-relevant objective?
How quickly are we burning error budget when we miss it?
What component saturates first and how does the system behave when it does?

That reframing has four concrete effects on how campaigns are designed.

1. Pass/fail becomes explicit. A load test without SLOs may report that p95 latency was 280 ms and CPU reached 78%, but it doesn't answer whether the system is ready to release. Tools like k6, Gatling, and Azure Load Testing all support encoding user-relevant thresholds directly in test execution, producing a true pass/fail signal rather than a dashboard someone must interpret later.

2. Load shapes become more realistic. Google Cloud explicitly recommends open-loop load patterns for this reason: production clients don't self-throttle the way closed-loop generators do. Open-loop tests send requests at a steady rate regardless of response times, which better mimics real traffic. A test that passes under artificially polite load can still fail catastrophically when production traffic arrives without courtesy.

3. Overload behaviour becomes a first-class objective. SLO-driven testing doesn't just ask "what's our capacity?" It asks "what happens when we exceed it?" Does the system shed load cleanly? Does it recover without cascading failures? These are the questions that matter on launch days and during demand spikes — and they're the questions that "peak RPS in the lab" benchmarks never answer.

4. Short tests connect to long-horizon budgets. A production SLO is measured over days or weeks; a load test runs for minutes or hours. The bridge is burn rate: you don't need to recreate an entire month to show that current error rates would exhaust your monthly budget unacceptably fast. That calculation turns a single test run into a release signal.
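The burn-rate bridge in the fourth point can be sketched in a few lines. This is an illustration, not a prescribed method: it assumes a request-based SLO, a 30-day window, and made-up request counts, and the function names are hypothetical.

```python
def burn_rate(failed: int, total: int, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    allowed_error_rate = 1 - slo_percent / 100
    return (failed / total) / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Days until the window's budget is gone if this burn rate were sustained."""
    return float("inf") if rate <= 0 else window_days / rate


# Hypothetical test run: 600 failed requests out of 200,000 against a 99.9% SLO.
rate = burn_rate(failed=600, total=200_000, slo_percent=99.9)
print(f"burn rate {rate:.1f}x, budget gone in {days_to_exhaustion(rate):.1f} days")
```

A burn rate of 1 means the budget lasts exactly the SLO window. In this made-up run the rate is about 3, so a 30-day budget would be gone in roughly 10 days if production behaved like the test.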

The technical upside: five benefits engineers should know

Realistic target-setting

SLOs prevent teams from optimising for the wrong number. Lab-only peak throughput figures are internally satisfying but commercially irrelevant. The SLO focuses attention on the tail latency and success rate of the journeys customers actually take.

Better prioritization

Google's error-budget policy explicitly uses budget consumption to redirect effort from features to reliability. When a load test shows your checkout service is burning budget at 3× the sustainable rate, that's a data-driven argument for investing in caching or query optimisation, not a matter of opinion.

Stronger root-cause analysis

When a latency SLO fails during a test, the investigation has a starting point: which resource, dependency, or code path saturated first? Correlating load test output with traces, logs, and server-side metrics compresses the time between "something's wrong" and "here's why."

Protection from average-only blindness

Google's "Tail at Scale" research shows why large systems are dominated by latency tails as scale and utilisation increase. The Home Depot's SLO programme explicitly chose percentile latency over arithmetic averages for exactly this reason. If your release gates use averages while your users feel the p99, you're under-measuring risk.
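A small, made-up sample shows how an average can hide the tail. The sketch assumes a nearest-rank percentile definition; load testing tools may interpolate differently. The latency values are invented for illustration.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Invented latencies in ms: 97 fast requests and 3 slow ones.
latencies = [100.0] * 97 + [2500.0] * 3

print(f"mean = {statistics.mean(latencies):.0f} ms")
print(f"p50  = {percentile(latencies, 50):.0f} ms")
print(f"p99  = {percentile(latencies, 99):.0f} ms")
```

Here the mean is 172 ms and would pass a 300 ms gate, while the p99 is 2,500 ms. A gate based on the average would approve a release in which 3% of users wait two and a half seconds.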

Automation and repeatability

SLOs expressed as code-based assertions, as in Gatling, make performance testing suitable for CI/CD in the same way unit tests are. For instance, LoginRadius moved away from a JMeter-based approach that wasn't integrated into its pipeline, and reported latency dropping from 500 ms to 250 ms alongside an 80%+ reduction in production issues.
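Stripped of any particular tool, an assertion of this kind compares the run's summary statistics with the SLO thresholds and turns the result into an exit code. The sketch below is tool-agnostic and is not the Gatling or k6 API; `SloThresholds`, `evaluate_run`, and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloThresholds:
    p95_latency_ms: float  # hypothetical: slowest acceptable p95 latency
    max_error_rate: float  # hypothetical: failed / total, e.g. 0.001 for 99.9%


def evaluate_run(p95_ms: float, failed: int, total: int, slo: SloThresholds) -> list[str]:
    """Return the violated thresholds; an empty list means the run passes."""
    violations = []
    if p95_ms > slo.p95_latency_ms:
        violations.append(f"p95 {p95_ms:.0f} ms > {slo.p95_latency_ms:.0f} ms")
    # A run with no requests is treated as a failure rather than a pass.
    error_rate = failed / total if total else 1.0
    if error_rate > slo.max_error_rate:
        violations.append(f"error rate {error_rate:.4f} > {slo.max_error_rate:.4f}")
    return violations


checkout_slo = SloThresholds(p95_latency_ms=300, max_error_rate=0.001)
violations = evaluate_run(p95_ms=280, failed=450, total=150_000, slo=checkout_slo)
for violation in violations:
    print("FAIL:", violation)

# In a real pipeline this value would be passed to sys.exit() to fail the stage.
exit_code = 1 if violations else 0
```

In this invented run the p95 of 280 ms is within its threshold, but the 0.3% error rate is not, so the stage fails even though the latency looks healthy.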

The business case: five benefits leaders should own

Customer experience protection

SLOs formalise what "acceptable" means in terms customers feel, not in terms that are easy to instrument. Every load test run against an SLO is a forward-looking commitment to that experience under pressure.

SLA risk reduction

If a service can't pass its internal SLO under expected peak conditions, the risk of breaching its public SLA in production is already real, and the stakes are high: 54% of significant outages cost over $100,000. Load testing against the internal SLO functions as an early-warning system for commercial exposure, before it becomes a legal conversation.

Infrastructure right-sizing

Canal+'s gains included improved machine sizing: not over-provisioning "just in case," but provisioning to the SLO boundary. Google's tail-latency research notes that tail-tolerant techniques can allow higher utilisation without lengthening the tail, meaning SLO-driven testing often surfaces headroom that naive capacity planning leaves on the table.

Release confidence with teeth

Houghton Mifflin Harcourt now runs all 50 of its load simulations together before release, including campaigns at four to five times normal traffic before peak periods. They report fewer performance issues in production as a direct result. That's what release confidence looks like when it's backed by data rather than optimism.

Velocity preservation, not velocity reduction

This is the counterintuitive point that matters most for CTO-level conversations. Google's error-budget guidance is explicit: exhausting budget may temporarily slow release cadence, but the purpose is to restore safe release speed, not to punish teams. DORA's research consistently shows that speed and stability are not structural trade-offs for most organisations. SLO-driven load testing is not anti-delivery; it's what makes delivery sustainable at scale.

Scaling it: the organizational dimension

The most important lesson from The Home Depot's SLO program isn't technical. Before adopting a common SLO framework — covering volume, availability, latency, errors, and tickets — their monitoring was fragmented, root causes were hard to pinpoint, and teams wasted "countless hours" working backwards from user-facing symptoms.

After implementing the framework with training, automation, and executive reporting, they scaled from approximately 50 services reporting SLOs to 800 within a year. Around 50 new services were being onboarded per month. They also integrated SLOs into destructive testing, automatically recording the effect of chaos experiments on service metrics.

That's not a tooling story. It's an operating-model story. SLOs gave engineering, SRE, product, and leadership a shared language — and that language made reliability visible, discussable, and governable at scale.

Evernote's experience also reinforces the cross-team effect. Working with Google's CRE team, they adopted an error-budget approach and within nine months were already on version 3 of their SLO practice. Monthly SLO reviews replaced ad hoc outage conversations, and both Evernote and Google had a common, data-driven way to discuss service quality. SLOs improved supplier management and internal prioritisation simultaneously.

Where to start: a practical roadmap

The highest-confidence starting point is narrow scope and high relevance: pick two or three critical user journeys, define SLIs for them, set internal SLOs that are stricter than your SLAs, and encode them as test thresholds.

Then connect those thresholds to runtime telemetry and attach burn-rate alerts and release-gate policies.

A five-phase performance testing maturity model emerges consistently from the literature:

Define: Identify critical user journeys and existing telemetry. Draft SLIs, internal SLOs, and SLA buffer policy.
Instrument: Add percentile histograms, error counters, and saturation metrics to your services.
Automate: Encode SLO thresholds in load tests and CI/CD pipelines. Connect traces, logs, and server-side metrics.
Operate: Run regular SLO reviews. Add fast-burn and slow-burn alerts. Use SLOs for canary releases and peak-readiness drills.
Expand: Roll out to more services and teams. Build executive dashboards alongside service-owner dashboards.
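The fast-burn and slow-burn alerts in the Operate phase are often built as multi-window burn-rate checks. The sketch below uses the example thresholds from Google's SRE Workbook for a 30-day SLO (14.4x over 1 hour, confirmed over 5 minutes; 1x over 3 days, confirmed over 6 hours). The class and function names and the observed values are hypothetical, and the thresholds would need tuning for each service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRule:
    name: str
    long_window: str
    short_window: str
    threshold: float


RULES = (
    BurnRule("fast-burn", long_window="1h", short_window="5m", threshold=14.4),
    BurnRule("slow-burn", long_window="3d", short_window="6h", threshold=1.0),
)


def firing_alerts(burn_by_window: dict[str, float]) -> list[str]:
    """Names of rules whose long AND short windows both exceed the threshold."""
    return [
        rule.name
        for rule in RULES
        if burn_by_window[rule.long_window] > rule.threshold
        and burn_by_window[rule.short_window] > rule.threshold
    ]


# Hypothetical burn rates computed from telemetry for each lookback window.
observed = {"5m": 20.0, "1h": 16.0, "6h": 3.0, "3d": 0.5}
print(firing_alerts(observed))
```

Requiring the short window to agree with the long one keeps an alert from firing long after a spike has passed. With the values above, only the fast-burn rule fires.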

The most common pitfalls are worth naming explicitly: setting 100% SLO targets (which eliminates the error budget entirely), using averages as pass criteria (which hides tail failures), copying another company's thresholds (which produces governance that doesn't fit your architecture or user expectations), and treating SLOs as dashboards without consequences (which fails to change engineering prioritisation).

The strategic call to action

The diagnostic question for any CTO is simple: if your load testing program isn't tied to SLO attainment, error-budget consumption, and release decisions, what decisions is it actually driving?

Canal+ answered that question before a major broadcast and served millions of viewers without a single incident. The Home Depot answered it and scaled reliable service delivery across 800 systems. LoginRadius answered it and halved its production latency.

The technology to do this is mature, well-documented, and largely open-source. The organizational will to tie test outcomes to release decisions and infrastructure investment is the harder part, and it is where the risk sits: four in five serious outages are attributed to preventable process failures, not missing technology.

But that's exactly what separates performance engineering that generates activity from performance engineering that generates governance value.

SLOs don't make load testing more complicated. They make it more useful.

FAQ

What is SLO testing?

SLO testing is load testing anchored to Service Level Objectives: quantitative targets that define acceptable service behavior in user-relevant terms, such as "99.9% of checkout requests complete within 300 ms over a 30-day window."