40% of production issues escape traditional testing.
Picture Formula 1 teams. Wind tunnel tests. Practice track sessions. Perfect preparation.
Race day arrives. Twenty cars. 200 mph speeds. Reality delivers surprises no controlled testing predicted.
The same can be said about software; just replace the cars with users and the 200 mph speeds with sub-2-second load times.
Applications perform flawlessly in staging, but production tells a different story. The software industry has reached an inflection point.
Traditional testing approaches are no longer sufficient for complex, distributed systems.
Shift right testing acknowledges a fundamental truth: production is the ultimate testing environment.
All other testing environments? Approximations with gaps. For instance, GitLab discovered their observability-based testing framework identified performance trends in production that were invisible in staging environments. This realization sparked their shift right approach.
Instead of perfecting approximations, shift right testing embraces production as the primary testing environment.
Production validation takes multiple forms, from canary deployments and feature flags to A/B tests and synthetic monitoring.
Each technique provides insights impossible in controlled environments. Enterprise companies should test actual transaction patterns. Real user behavior. Genuine system loads.
The power? Real data from real users.
Software testing divides into two complementary philosophies.
Shift left testing pushes quality assurance earlier in the development cycle. Teams write unit tests alongside code. Run integration tests in CI pipelines. Catch bugs before production. It's like checking your parachute before jumping.
Shift right testing takes the opposite approach. Instead of anticipating every scenario in controlled environments, teams test in production. Real users. Real data. Real infrastructure. It's monitoring your descent. Being ready to deploy reserve chutes if something goes awry.
| | Shift left testing | Shift right testing |
|---|---|---|
| Testing phase | During development | After deployment |
| Feedback speed | Minutes to hours | Hours to days |
| Issue detection | 60-70% of problems | Remaining 30-40% |
| Best for | Known scenarios | Unknown scenarios |
| Key metrics | Code coverage, test pass rate | MTTR, error budgets, real user metrics |
The software industry built testing foundations on manufacturing principles.
Manufacturers test goods before factory departure. Software teams created elaborate quality gates before production releases. This approach made sense for quarterly releases and monolithic applications.
Traditional testing follows a predictable sequence of phases: unit testing during development, integration testing in CI, system testing in staging, and user acceptance testing before release.
Each phase has clear boundaries. Clear responsibilities. This waterfall approach provided structure but came with significant limitations.
The fundamental assumption? We can predict and simulate production conditions.
We create staging environments mirroring production. Generate synthetic load mimicking user behavior. Run test suites covering known scenarios.
Yet production environments surprise us.
Unexpected combinations of user behavior. Data patterns. System interactions.
That's why the waterfall method no longer cuts it, and why software teams transitioned to continuous delivery testing.
Amazon deploys code every 11.7 seconds. Netflix pushes thousands of daily changes. These companies achieved velocity by testing differently.
Continuous delivery demanded new quality assurance approaches. Teams automated everything possible. Adopted feature flags and canary deployments. Reduced the blast radius of problems.
Most importantly: they recognized some quality aspects only validate in production.
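As a concrete illustration of reducing blast radius, here is a minimal sketch of percentage-based canary routing. The class, method names, and checkout flows are hypothetical; real deployments typically delegate this decision to a feature flag service.

```java
public class CanaryRouter {
    private final int canaryPercent; // e.g. 5 sends roughly 5% of users down the new path

    public CanaryRouter(int canaryPercent) {
        this.canaryPercent = canaryPercent;
    }

    public String handleCheckout(String userId) {
        // Hashing the user id keeps each user consistently on one side of the flag.
        int bucket = Math.floorMod(userId.hashCode(), 100);
        return bucket < canaryPercent
            ? newCheckoutFlow(userId)      // canary path, watched closely in production
            : stableCheckoutFlow(userId);  // existing path for everyone else
    }

    private String newCheckoutFlow(String userId) { return "v2:" + userId; }

    private String stableCheckoutFlow(String userId) { return "v1:" + userId; }
}
```

If error rates or latency on the canary path deviate from the stable path, the percentage drops back to zero and the problem never reaches most users.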
Observability emerged to answer a critical question: how do we understand complex, distributed systems?
Unlike traditional monitoring that tracks known metrics and predetermined thresholds, observability provides the ability to ask arbitrary questions about system behavior.
Three pillars work like a detective's toolkit:

- Metrics reveal that something is wrong and how severe it is
- Logs record what each component reported at the time
- Traces follow a single request on its path across services
Together, they enable understanding not just what happened, but why.
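To make "asking arbitrary questions" tangible, here is a minimal sketch using the OpenTelemetry Java API. The tracer name and attribute keys are illustrative; the point is that rich attributes on spans let you slice behavior later by region, order size, or any dimension you did not anticipate.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;

public class CheckoutInstrumentation {
    private static final Tracer tracer =
        GlobalOpenTelemetry.getTracer("checkout-service"); // illustrative instrumentation name

    public void processOrder(String orderId, String region) {
        Span span = tracer.spanBuilder("process-order").startSpan();
        try {
            // Attributes recorded here become the dimensions you can query later.
            span.setAttribute("order.id", orderId);
            span.setAttribute("order.region", region);
            // ... business logic ...
        } finally {
            span.end();
        }
    }
}
```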
Also, keep in mind that modern DevOps practices depend on observability for feedback loops.
Developers deploy code. They need immediate visibility into behavior. Systems experience problems. Teams need data for quick diagnosis and fixes.
Research from Technische Universität Berlin demonstrates organizations with comprehensive observability achieve 73% improvement in software quality through reduced downtime and real-time issue detection.
Shift left and shift right testing present a false dichotomy. Successful organizations employ both strategies and understand their complementary strengths.
The difference between successful shift right testing and production disasters? Following established patterns.
These practices emerged from hard-won lessons at companies that learned what works when testing in production.
Successful shift right testing begins with baby steps.
Organizations attempting overnight transformation often fail spectacularly. Instead, start with low-risk services. Gradually expand. Choose services with good observability. Clear ownership. Forgiving SLAs for initial experiments.
Risk mitigation strategies evolve with maturity: feature flags first, then canary releases, then progressively larger experiments on live traffic.
This progression builds technical capabilities and organizational confidence.
Documentation and runbooks prove essential.
Every shift right testing activity needs clear procedures. Initiation. Monitoring. Rollback. Teams document what to watch. When to worry. How to respond.
Observability comes with costs.
Research from Umeå University found comprehensive instrumentation can add up to 71% CPU overhead with naive implementations. This overhead affects application performance and infrastructure costs.
Teams must balance diagnostic value against collection cost.
Sampling strategies provide solutions:

- Head-based sampling decides up front whether a request's trace is recorded
- Tail-based sampling keeps only the traces that turn out to be slow or erroneous
- Adaptive sampling adjusts collection rates as traffic and error conditions change
The key lies in dynamic adjustment.
During normal operations, minimal instrumentation suffices. When problems appear, teams temporarily increase collection rates for deeper investigation.
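A minimal sketch of head-based sampling with the OpenTelemetry Java SDK, assuming the ratio comes from configuration so it can be raised temporarily during an investigation:

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

public class TracingConfig {
    public static SdkTracerProvider tracerProvider(double ratio) {
        // Keep roughly `ratio` of new traces, but always follow the parent's
        // decision so distributed traces stay complete end to end.
        return SdkTracerProvider.builder()
            .setSampler(Sampler.parentBased(Sampler.traceIdRatioBased(ratio)))
            .build();
    }
}
```

During normal operations a ratio like 0.01 keeps overhead low; while debugging, the same configuration knob can be turned up and then back down.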
The volume of data from shift right testing overwhelms human analysis.
Modern platforms employ machine learning for anomaly detection. Pattern recognition. Root cause analysis. These systems learn normal behavior. Alert on deviations.
For instance, Datadog's AI-driven detection demonstrates the potential. What once required hours of manual investigation now happens in seconds. The models improve over time. Learning from false positives and confirmed issues.
However, AI isn't magic.
Effective anomaly detection requires careful feature engineering. Appropriate algorithms. Continuous tuning. Teams must understand tool limitations. Maintain human oversight.
The goal: Augment human judgment with machine insights.
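Commercial platforms use far richer models, but a rolling z-score captures the core idea of learning normal behavior and alerting on deviations. This is a simplified, illustrative sketch; the window size and threshold are not tuned values.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RollingAnomalyDetector {
    private final Deque<Double> window = new ArrayDeque<>();
    private final int windowSize;
    private final double zThreshold;

    public RollingAnomalyDetector(int windowSize, double zThreshold) {
        this.windowSize = windowSize;
        this.zThreshold = zThreshold;
    }

    public boolean isAnomalous(double value) {
        boolean anomalous = false;
        if (window.size() == windowSize) {
            // "Normal" is the mean and spread of the recent window.
            double mean = window.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            double variance = window.stream()
                .mapToDouble(v -> (v - mean) * (v - mean)).average().orElse(0);
            double stdDev = Math.sqrt(variance);
            // Alert when the new value deviates too far from that baseline.
            anomalous = stdDev > 0 && Math.abs(value - mean) / stdDev > zThreshold;
            window.removeFirst();
        }
        window.addLast(value);
        return anomalous;
    }
}
```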
Every organization attempting shift-right testing hits the same walls. The technical hurdles seem daunting, the cultural resistance feels insurmountable, and the risks appear too high.
Yet within each challenge lies an opportunity to build better systems and stronger teams. Understanding these obstacles and their hidden potential separates organizations that abandon shift-right testing from those that transform their entire approach to quality.
Instrumentation overhead remains a persistent challenge.
Every metric collected, log written, and trace recorded consumes resources. In high-throughput systems, this overhead affects the very performance we're measuring. The observer effect in software systems is real and significant.
Solutions require sophisticated approaches: sampling, asynchronous export, and aggregating data close to where it is produced.
Data correlation across distributed systems presents another challenge.
When users report slowness, teams must correlate their experience with metrics from dozens of services. Traditional approaches using timestamps break down due to clock skew and network delays.
Modern solutions use distributed trace context propagation and correlation IDs that travel with each request, as sketched below.
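A minimal sketch of injecting W3C trace context into an outgoing java.net.http request with the OpenTelemetry API; the helper class is illustrative. Because the downstream service joins the same trace, correlation no longer depends on timestamps.

```java
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapSetter;

import java.net.http.HttpRequest;

public class TraceHeaderPropagation {
    // Writes each propagation field (e.g. "traceparent") onto the request builder.
    private static final TextMapSetter<HttpRequest.Builder> SETTER =
        (builder, key, value) -> builder.header(key, value);

    public static HttpRequest.Builder withTraceContext(HttpRequest.Builder builder) {
        W3CTraceContextPropagator.getInstance().inject(Context.current(), builder, SETTER);
        return builder;
    }
}
```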
The shift to production testing challenges traditional organizational boundaries.
Developers who never worried about production must understand operational concerns. Operations teams comfortable with stability must embrace continuous change. This cultural shift often proves harder than technical implementation.
Successful transformations invest in education and a gradual transition.
Over time, culture shifts from risk avoidance to risk management.
Engineers need an understanding of distributed systems. Statistical analysis. Observability tools. They must read dashboards. Interpret metrics. Make decisions under uncertainty. Organizations investing in training see faster adoption and better outcomes.
Regulated industries face unique challenges with shift right testing.
Financial services must protect customer data. Healthcare organizations must maintain HIPAA compliance. These requirements don't prohibit production testing but demand careful implementation.
Successful approaches use data masking, synthetic or carefully scoped test data, and tightly controlled access.
Rather than viewing regulations as obstacles, mature organizations integrate compliance into shift right testing strategies. Document procedures. Implement controls. Demonstrate that production testing improves overall system reliability and security.
Shift right testing generates enormous data volumes.
Medium-sized applications produce gigabytes of observability data daily. Storage costs mount quickly. Query performance degrades without proper management.
Tiered storage strategies address these challenges:

- Hot tier: recent, full-resolution data on fast storage for live debugging
- Warm tier: the last few weeks at reduced resolution for trend analysis
- Cold tier: long-term archives on inexpensive object storage for audits and capacity planning
This tiering balances cost with accessibility.
Data reduction techniques prove essential: sampling high-volume traces, aggregating raw events into summaries, and compressing what remains.
These techniques reduce storage requirements by 90% without losing critical insights.
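As an illustration of aggregation, here is a hypothetical sketch that collapses raw per-request latency samples into per-minute summaries before they move to long-term storage; the record and method names are made up for the example.

```java
import java.time.Instant;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class LatencyDownsampler {
    public record Sample(Instant timestamp, double latencyMs) {}

    public static Map<Long, DoubleSummaryStatistics> perMinuteSummaries(List<Sample> samples) {
        return samples.stream().collect(Collectors.groupingBy(
            s -> s.timestamp().getEpochSecond() / 60,           // 1-minute buckets
            Collectors.summarizingDouble(Sample::latencyMs)));   // count, min, max, avg per bucket
    }
}
```

Thousands of raw samples per minute become a handful of numbers, which is where most of the storage savings come from.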
Load testing metrics in production environments require careful selection and interpretation.
Integration of load testing metrics into observability dashboards transforms raw data into actionable insights. Successful implementations follow layered approaches. High-level dashboards show overall system health. Detailed views enable deep investigation of specific issues.
Grafana and Datadog have become the de facto standards for visualization, offering powerful capabilities for creating dynamic dashboards. Teams build dashboards correlating load testing metrics with business outcomes.
E-commerce companies display conversion rates alongside response times. Making the business impact of performance issues immediately visible.
But the key to effective dashboards lies in context and correlation.
Isolated metrics tell incomplete stories. When response times increase, dashboards should show related metrics. CPU usage. Database query times. Error rates. This correlation enables rapid root cause analysis and informed decision-making.
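One way to make that correlation possible is to emit the business metric and the technical metric side by side, so the dashboard can plot them on the same timeline. This Micrometer sketch uses illustrative metric names; the registry would normally come from your framework and be exported to Grafana or Datadog.

```java
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

public class CheckoutMetrics {
    private final Counter conversions;
    private final Timer checkoutLatency;

    public CheckoutMetrics(MeterRegistry registry) {
        this.conversions = registry.counter("business.checkout.conversions"); // business signal
        this.checkoutLatency = registry.timer("http.checkout.duration");      // technical signal
    }

    public void recordCheckout(Runnable checkout) {
        checkoutLatency.record(checkout); // how long the checkout took
        conversions.increment();          // a completed purchase
    }
}
```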
Understanding observability requires thinking beyond traditional monitoring.
Imagine investigating a crime scene:

- Logs are the witness statements: detailed accounts of what each component saw
- Metrics are the case statistics: how often incidents occur and how severe they are
- Traces reconstruct the suspect's movements: the path one request took through the system

You need all three to solve the case.
Gatling stands out in the load testing landscape for its performance and programmability.
Gatling handles thousands of concurrent users on modest hardware. Unlike traditional load testing tools relying on threading models, Gatling uses asynchronous, non-blocking architecture mirroring how modern applications handle load.
Gatling's code-based DSL, available in Java, Kotlin, and Scala, makes complex scenarios readable and maintainable. Teams can model realistic user journeys. Complete with think times. Conditional logic. Data feeders. This expressiveness proves crucial for shift right testing, where scenarios must reflect actual user behavior rather than simplified patterns.
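A minimal sketch of such a journey in Gatling's Java DSL. The base URL, endpoints, and products.csv feeder are placeholders for illustration.

```java
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.*;
import io.gatling.javaapi.http.*;

public class BrowseAndCheckoutSimulation extends Simulation {

    HttpProtocolBuilder httpProtocol = http
        .baseUrl("https://shop.example.com")   // placeholder base URL
        .acceptHeader("application/json");

    FeederBuilder<String> products = csv("products.csv").random(); // data feeder

    ScenarioBuilder browseAndCheckout = scenario("Browse and checkout")
        .feed(products)
        .exec(http("Home").get("/"))
        .pause(1, 4)                           // think time between pages
        .exec(http("Product").get("/products/#{productId}"))
        .pause(2)
        .exec(http("Add to cart").post("/cart")
            .body(StringBody("{\"productId\":\"#{productId}\"}")));

    public BrowseAndCheckoutSimulation() {
        setUp(
            browseAndCheckout.injectOpen(rampUsers(100).during(300))
        ).protocols(httpProtocol);
    }
}
```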
Gatling's reporting capabilities provide immediate insights into test results. HTML reports generated after each run visualize response time distributions. Error rates. Throughput patterns. These reports integrate seamlessly with CI/CD pipelines. Enabling automated performance regression detection.
Implementing Gatling for shift right testing requires careful architectural decisions.
The goal isn't hammering production systems. It's validating performance characteristics under realistic conditions. This means running tests from locations mirroring user geography. Using data reflecting production patterns. Generating load matching actual usage.
Infrastructure considerations play a crucial role: where load injectors run, how much network capacity they have, and how their traffic is isolated from real users.
Security and isolation require special attention.
Production load testing must not compromise sensitive data or impact real users. Teams implement careful scoping. Use feature flags routing synthetic traffic differently from real user requests. Ensure test data doesn't pollute production databases. Load tests respect rate limits and security controls.
Start implementing Gatling in your CI/CD pipeline. Not just for pre-production testing but for continuous production validation. Begin with simple scenarios mirroring your most critical user journeys. Export Gatling metrics to your observability platform. Build dashboards correlating load with system behavior.
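A hedged sketch of what such a production validation run might look like. The header name, endpoints, load level, and thresholds are assumptions to adapt to your own service; the assertions are what let a CI job fail automatically when a budget is breached.

```java
import static io.gatling.javaapi.core.CoreDsl.*;
import static io.gatling.javaapi.http.HttpDsl.*;

import io.gatling.javaapi.core.*;
import io.gatling.javaapi.http.*;

public class ProductionSmokeLoadSimulation extends Simulation {

    HttpProtocolBuilder httpProtocol = http
        .baseUrl("https://api.example.com")     // placeholder base URL
        .header("X-Synthetic-Test", "gatling"); // hypothetical header so filters and flags can spot test traffic

    ScenarioBuilder criticalJourney = scenario("Critical journey smoke load")
        .exec(http("Health").get("/healthz"))
        .pause(1)
        .exec(http("Search").get("/search?q=test"));

    public ProductionSmokeLoadSimulation() {
        setUp(
            criticalJourney.injectOpen(constantUsersPerSec(2).during(600)) // gentle, steady load
        ).protocols(httpProtocol)
         .assertions(
             global().responseTime().max().lt(2000),          // no request slower than 2 s
             global().successfulRequests().percent().gt(99.0) // error budget guard
         );
    }
}
```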
Take the first step today.
Choose one service. Implement basic observability. Run your first production load test with Gatling. Start small, perhaps during off-peak hours with minimal load. Monitor carefully. Learn from results. Gradually expand.
Within months, you'll wonder how you ever operated without production testing.
The future belongs to teams embracing production as their primary testing environment.