April 9, 2026· 10 min read

How do AI agents perform under realistic production stress?

Oviya Seeniraj, Mihir Kachroo, Gary Shen

We evaluated how AI agent performance degrades under realistic stress—timeouts, auth errors, and flaky APIs—on EnterpriseOps-Gym. Performance drops significantly, exposing gaps in agent testing that rely only on clean end-to-end tests.

Series · Adversarially stress testing AI agents using environments

Intro

Most AI agent evaluations measure performance on clean, end-to-end tasks. But real enterprise environments are messy—APIs timeout, authentication fails, and responses break.

As a result, strong benchmark performance often overestimates how agents behave in production.

Question: If an agent performs well in ideal conditions, how much does it degrade under realistic friction?

To answer this, we evaluate agents against a baseline EnterpriseOps-Gym and a stressed version based on Shah et al., 2026 of the same tasks. We selected an enterprise environment that is representative of the real world and a stress profile that is realistic and resolvable. to test high-impact, realistic agentic applications.

What we built

We evaluate robustness by running the same 100 tasks under two conditions:

Baseline: no failures
Stress: injected, recoverable faults

Our stress profile focuses on realistic, fixable failures. Essentially, the same tasks are run, but the agent has two obstacles rather than one. Not only must it complete the agentic enterprise task, but it must also resolve the following injected faults that it runs into:

Timeouts
Auth errors
Malformed responses
Tool/API friction

Each task is run three times per condition to account for variability. To keep experiments tractable, we sample 100 statistically representative tasks (from ~1,150 total), with a budget of 8,192 tokens per task.

The evaluation spans eight enterprise environments ("gyms"), capturing behavior across different workflows—not just a single tool.

Agents can retry, replan, and recover -- so success remains achievable.

We evaluate agents on two metrics: full-task success and verifier pass. Full-task success is the agent's ability to complete the task, while verifier pass is the agent's ability to make progress by taking the correct actions towards the correct end state.

Key takeaways

Horizontal bars for mean full-task success and mean verifier pass baseline versus stress — Both “full pass” and “partial credit” drop when stress is on.

The core takeaway: performance in clean, expected conditions doesn’t translate to realistic environments. When exposed to representative failure modes, success rates drop sharply—enough to materially impact deployment.

Viewed through retention, only a fraction of baseline performance survives under stress: full-task success drops to 19.6% of its baseline level, while verifier pass retains 43.9%. This gap shows that agents often make partial progress, but far less often reach a correct end state.

Gym	Tasks × runs (n)	Base succ	Stress succ	Base verifier	Stress verifier
calendar	33	42.4%	15.2%	73.3%	41.4%
csm	42	2.4%	0.0%	25.8%	10.4%
drive	33	27.3%	9.1%	46.2%	29.4%
email	36	38.9%	2.8%	60.4%	22.6%
hr	42	9.5%	0.0%	29.7%	15.6%
hybrid	39	25.6%	5.1%	51.0%	26.9%
itsm	42	28.6%	2.4%	45.7%	12.1%
teams	33	24.2%	6.1%	52.5%	11.8%

I. Stress sharply lowers both end-to-end success and verifier-level progress

Mean baseline vs stress for success and verifier pass — Average baseline vs stress on the 100-task cross-gym slice.

Across three runs, full-task success falls from 24.0% to 4.7% (-19.3 pp) and average verifier pass falls from 46.9% to 20.6% (-26.3 pp).

It is clear that both the ability of the agent to complete its task and to make progress towards it is stunted by our stress profile. By tracking both metrics rather than simple success, we can assess exactly where the agent begins to break down and behave incorrectly upon encountering production failures.

II. Taxonomy-grounded stressors make the failure actionable

Horizontal bars of conditional task success by injected fault type on stress runs — Which injected stressors correlate with the lowest full-task success (stress runs only).

Just knowing that your agent doesn't complete its task is not actionable. Viewing failures from a taxonomy granularity allows us to analyze and resolve the root causes of major failures.

Some stressors are consistently more damaging than others (for example, auth-related failures are often among the harshest). Tool-call failures are summarized as one “tool error” family in the chart, but the underlying logs still contain familiar subtypes like timeouts, validation errors, auth errors, and rate limits.

Viewing failures from all granularities is important and allows us to bridge the gap between evaluation and engineering work.

III. One average can hide environment-specific risk (use the per-gym breakdown sparingly)

Horizontal bars of baseline versus stress full-task success by gym — Environment-by-environment success: where stress hurts most in this slice.

Aggregate numbers are great for tracking progress, but they can hide risk if you only deploy into one slice of enterprise work. The per-gym table is a simple “where to look first” breakdown useful for multi-app agentic workflows.

We can use this as a diagnostic: pick the environments closest to your deployment surface and validate robustness there before you ship.

IV. Stress results bounce more than baseline results, so repeat runs matter

Per-run baseline versus stress metrics for three runs — Per-run bars: stress wiggles more than baseline.

Baseline results are comparatively steady run-to-run, but stress results vary more. That’s expected: stress introduces stochastic “when and where” friction and surfaces brittle recovery policies.

In practice, it is important to quote stress numbers with multiple runs and compare means (and, if needed, variance) rather than one-off runs.

Next steps

Our next blog will go beyond static stress tests by generating harder tasks and environments based on where the agent fails, applying unsupervised environment design (UED) methods.

Instead of a fixed stress profile, we’ll adapt scenarios to target known weaknesses by different granularities such as specific gyms and failure types. We turn the benchmark into a feedback loop that focuses on fragile areas and tracks whether fixes improve performance under stress.

Interested in our work, access to this experiment, or looking to collaborate? Reach us at [email protected].