March 12, 2026 · 8 min read

Benchmarking AI Agents using RL Environments

Oviya Seeniraj, Gary Shen, Mihir Kachroo

Can non-deterministic systems arrive at deterministic solutions? We present our findings after evaluating frontier AI models across 180 episodes in our custom RL environment.

1

Intro

The promise of AI agents is alluring: systems that assess their environment, plan, and execute multi-step workflows autonomously. Infinitely scalable productivity, in theory.

But can an inherently non-deterministic system arrive at a deterministically correct solution? That's what RL environments test. Broadly, agents reason (LLM inference), recall (memory retrieval), and act (tool calls). In our environment, each action returns a state and reward, with a final binary verdict per task.

Our goal: Benchmark frontier lab models on our new RL environment and see how they perform. Spoiler: the gap between "finished" and "correct" is enormous.

📌 Note

This post is the first in a series. Here, we benchmark, evaluate, and analyze results. In our next post, we share post-training results.

2

The Environment We Built

At a high level, the environment has three parts:

I. A World of Constrained Tasks

A world that gives agents tasks requiring reasoning under constraints: not just simple retrievals, but augmented, multi-step problems.

Task taxonomy showing the spectrum of task difficulty
Task taxonomy: from simple retrieval to multi-step reasoning

II. A Defined Action Space

A defined action space that lets agents interact with and evaluate the world, so that they can complete tasks and we can assess their abilities:

searchStores
openStore
addToCart
applyCoupon
beginCheckout
placeOrder
getState
clearCart

💡 Insight

As is convention, each action maps to a state and reward. (The names above are environment-specific; they generalize to search, select, apply, commit, and observe primitives in any RL setup.) We also add evaluation notes.
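A minimal sketch of what "each action maps to a state and reward" can look like in code. Only the action names come from the list above; `ToyOrderEnv`, `StepResult`, the state fields, and the choice to return a reward of 0.0 on every step are assumptions made for illustration, not the actual environment.

```python
from dataclasses import dataclass, field

# Action names are taken from the post; everything else is a hypothetical sketch.
ACTIONS = ["searchStores", "openStore", "addToCart", "applyCoupon",
           "beginCheckout", "placeOrder", "getState", "clearCart"]


@dataclass
class StepResult:
    state: dict      # observation returned to the agent
    reward: float    # per-step reward (assumed 0.0 here; the verifier scores the episode)
    done: bool       # True once a terminal state is reached
    note: str = ""   # evaluation note attached to the step


@dataclass
class ToyOrderEnv:
    cart: list = field(default_factory=list)
    ordered: bool = False

    def step(self, action: str, **args) -> StepResult:
        if action not in ACTIONS:
            return StepResult(self._state(), 0.0, self.ordered, f"unknown action: {action}")
        if self.ordered:
            return StepResult(self._state(), 0.0, True, "episode already finished")
        if action == "addToCart":
            self.cart.append(args["item"])
        elif action == "clearCart":
            self.cart.clear()
        elif action == "placeOrder":
            self.ordered = True  # terminal state; correctness is judged separately
        return StepResult(self._state(), 0.0, self.ordered, f"handled {action}")

    def _state(self) -> dict:
        return {"cart": list(self.cart), "ordered": self.ordered}


env = ToyOrderEnv()
env.step("addToCart", item="burrito")
result = env.step("placeOrder")
print(result.done, result.state)
```

Keeping the per-step reward separate from the final verdict mirrors the split described above: the agent can reach a terminal state without that state being the right one.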

III. A Verifier

A verifier determines the ground-truth solution and evaluates two outcomes:

  • action completed: the agent reaches a terminal state, correct or not (a wrong outcome scores '0', the same as an incomplete run)
  • task satisfied: the agent reaches the correct terminal state, i.e. the actual solution (scores '1')
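The two outcomes can be sketched as a small scoring function. This is an illustration under assumptions: the `Verdict` type, the `verify` function, and the `"ordered"` and `"order"` state fields are hypothetical names, and the real verifier may compare states differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    action_completed: int  # 1 if the agent reached any terminal state
    task_satisfied: int    # 1 only if that terminal state matches the ground truth


def verify(final_state: dict, ground_truth: dict) -> Verdict:
    """Score one episode. The state field names are hypothetical."""
    completed = bool(final_state.get("ordered"))
    satisfied = completed and final_state.get("order") == ground_truth
    return Verdict(int(completed), int(satisfied))


truth = {"store": "A", "items": ["burrito"], "coupon": "SAVE10"}
wrong = {"ordered": True, "order": {"store": "B", "items": ["burrito"], "coupon": "SAVE10"}}
right = {"ordered": True, "order": truth}
stuck = {"ordered": False, "order": None}

for name, state in [("wrong store", wrong), ("correct", right), ("never ordered", stuck)]:
    print(name, verify(state, truth))
```

In this sketch the "wrong store" episode finishes but earns a task score of 0, exactly like the episode that never placed an order.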

Our workflow:

Environment workflow diagram showing the end-to-end pipeline
End-to-end workflow: task generation → agent interaction → binary verification

3

Observations & Key Takeaways

10 tasks, 6 models from three frontier labs (Anthropic, OpenAI, Google), 3 runs each: 180 episodes. Same environment, same tools. Here's what happened:

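| Model            | Provider  | Runs | Episodes | Task Satisfied | Action Completed |
|------------------|-----------|------|----------|----------------|------------------|
| Claude Opus 4.6  | Anthropic | 3    | 30       | 93% (28/30)    | 93% (28/30)      |
| Gemini 3.1 Pro   | Google    | 3    | 30       | 90% (27/30)    | 90% (27/30)      |
| Gemini 3 Flash   | Google    | 3    | 30       | 63% (19/30)    | 90% (27/30)      |
| GPT-5.4          | OpenAI    | 3    | 30       | 50% (15/30)    | 63% (19/30)      |
| Claude Haiku 4.5 | Anthropic | 3    | 30       | 23% (7/30)     | 83% (25/30)      |
| GPT-5 Mini       | OpenAI    | 3    | 30       | 23% (7/30)     | 50% (15/30)      |
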
Action Completed vs Task Satisfied rates across all models
The gap between completing an action and completing it correctly

Models that complete workflows aren't necessarily completing them correctly. The gap between Action Completed and Task Satisfied reveals how often agents confidently reach the wrong answer.

I. Completing a workflow ≠ completing it correctly

Correctness gap between action completion and task satisfaction
Fluent execution vs. correct execution: the gap is the story

Claude Haiku completes 83% of workflows, but only 23% correctly. A 60-point gap! Gemini 3 Flash: 90% completed, 63% correct. These agents confidently navigate, act, and reach terminal states. They're just wrong: wrong source, missing add-on, suboptimal discount.
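The gap for every model can be recomputed from the raw counts in the results table. The counts below are copied from that table; the `gap_points` helper is just a name chosen for this sketch.

```python
# (task_satisfied, action_completed) counts out of 30 episodes, from the results table.
RESULTS = {
    "Claude Opus 4.6": (28, 28),
    "Gemini 3.1 Pro": (27, 27),
    "Gemini 3 Flash": (19, 27),
    "GPT-5.4": (15, 19),
    "Claude Haiku 4.5": (7, 25),
    "GPT-5 Mini": (7, 15),
}
EPISODES = 30


def gap_points(satisfied: int, completed: int, episodes: int = EPISODES) -> float:
    """Percentage-point gap between finishing and finishing correctly."""
    return 100 * (completed - satisfied) / episodes


for model, (satisfied, completed) in RESULTS.items():
    print(f"{model:<18} gap = {gap_points(satisfied, completed):4.0f} points")
```

For Claude Haiku 4.5 this gives (25 - 7) / 30, the 60-point gap quoted above; for the top two models the gap is zero.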

🔍 General Pattern

LLMs are fluent executors. They chain tool calls and reach the end without getting stuck. But fluent ≠ correct. Any benchmark that only asks "did it finish?" will overstate capability.

II. Agents follow explicit instructions but fail along a reasoning hierarchy

Reasoning hierarchy showing progressive failure across task types
The reasoning hierarchy: from instruction-following to analysis

Explicit commands? Almost never fail. But when a task requires thinking, performance drops quickly:

  • Optimization reasoning (78% avg): "Cheapest item". Agents can search, compare, and execute well when the objective is explicit.
  • Instruction following (65%): "Fixed qty". If the task is spelled out, they usually comply.
  • Multi-store coordination (53%): "Multi-store + discount". Errors compound across sources and constraints; in 23 episodes, agents failed to apply the best available discount.
  • Quantitative comparison (50%): "Best value item". Split-brain: top models solve it reliably, while smaller models score 0%.
  • Goal decomposition (39%): "Closest source". Nothing is spelled out; many agents pick a store without checking proximity.
  • Constraint planning (28%): "Max items under $XXX". Working backward from a budget and performing mathematical optimization collapses for most models.
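To make the constraint-planning case concrete, here is one way a ground-truth answer for a "max items under a budget" task could be computed. This is a sketch under stated assumptions, not the environment's actual task definition: it assumes each item can be bought at most once, that the objective is the item count, and that no coupons apply. The menu, prices, and budget are made up. Under those assumptions, taking the cheapest items first is optimal.

```python
def max_items_under_budget(prices: dict[str, float], budget: float) -> list[str]:
    """Pick the largest set of distinct items whose total stays within budget."""
    chosen, total = [], 0.0
    # Cheapest-first: any k affordable items cost at least as much as the k cheapest.
    for name, price in sorted(prices.items(), key=lambda kv: kv[1]):
        if total + price > budget:
            break
        chosen.append(name)
        total += price
    return chosen


# Hypothetical menu and budget, for illustration only.
menu = {"taco": 3.50, "burrito": 9.00, "chips": 2.25, "soda": 1.75, "bowl": 11.00}
print(max_items_under_budget(menu, 10.00))  # ['soda', 'chips', 'taco']
```

An agent that adds the burrito first has already ruled out the three-item answer, which is the kind of backward planning the task is probing.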

🔍 General Pattern

There is a reasoning hierarchy: instruction-following → optimization → planning → analysis. Current models excel at the first step and collapse progressively through the rest. The boundary between "execute a command" and "make a decision" is where agent capability drops off, and that boundary is sharp.

III. Non-determinism means a single success proves nothing

Non-determinism across runs showing variance in agent outcomes
Same model, same task, different outcome every time

GPT-5.4 on the multi-store task: 2/3 runs succeed, 1/3 fail. Same task, same model. Even Opus on "closest source": 1/3. Flip a (weighted) coin.

🔍 General Pattern

Because LLMs sample stochastically, multi-step runs can drift from attempt to attempt. One success doesn't prove reliability, so RL environments rerun tasks and report success rate.
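With only three runs per task, a reported success rate carries wide uncertainty. One way to show this is a Wilson score interval for a binomial proportion. The interval is a standard statistical tool brought in here for illustration; it is not something the benchmark itself reports, and the 95% level (z = 1.96) is an assumption.

```python
import math


def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 is about 95%)."""
    if runs <= 0:
        raise ValueError("need at least one run")
    p = successes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Run counts quoted in the text above.
observed = {"GPT-5.4, multi-store": (2, 3), "Opus, closest source": (1, 3)}
for label, (k, n) in observed.items():
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n} succeeded, 95% interval {low:.2f} to {high:.2f}")
```

For 2 successes in 3 runs the interval spans roughly 0.21 to 0.94, so three runs are enough to show that outcomes vary but not enough to pin down a rate.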

IV. Scale buys capability, not consistency

Task heatmap showing model performance across task types and scale
Scale buys capability, but hits a ceiling on planning and reliability

93% at the top (Opus) vs. 23% at the bottom (Haiku, GPT-5 Mini). Bigger models do coordinate across stores, compare value, and solve the budget task. But consistency still breaks on underspecified goals ("closest source" is 39% overall; even Opus is 1/3). The cheaper models you'd actually deploy? Wrong 77% of the time.

🔍 General Pattern

Bigger models do more, but they still struggle when the goal is implicit and the agent must plan under constraints. RL environments make that fail/succeed boundary measurable.

4

Next Steps

This environment was a first step. Even in a small, well-defined environment with relatively straightforward tasks, frontier models don't always arrive at the correct solution. We've captured and analyzed detailed performance data. Where we're heading:

1. Train Agents, Then Re-Benchmark

We started zero-shot on purpose: no learning, just a baseline. Next, we'll use the verifier as training signal and try three concrete upgrades:

  • Few-shot: show 1–2 successful examples before the agent acts.
  • SFT: fine-tune on verifier-approved tool trajectories.
  • DPO: train the agent to prefer correct next-actions over common mistakes.
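As a rough sketch of how verifier verdicts could be turned into training data for the last two options: approved trajectories become supervised examples, and approved and failed attempts at the same task become preference pairs. The `Trajectory` type, its field names, and the record layouts are hypothetical. The sketch also pairs whole trajectories for simplicity, whereas the plan above describes preferences over next actions.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trajectory:
    task_id: str
    prompt: str
    actions: list          # ordered tool calls, e.g. ["searchStores", "openStore"]
    task_satisfied: bool   # verifier verdict for the episode


def build_sft_examples(trajs):
    """Keep only verifier-approved trajectories as supervised targets."""
    return [{"prompt": t.prompt, "completion": t.actions} for t in trajs if t.task_satisfied]


def build_preference_pairs(trajs):
    """Pair each approved trajectory with each failed one from the same task."""
    by_task = defaultdict(lambda: {"good": [], "bad": []})
    for t in trajs:
        by_task[t.task_id]["good" if t.task_satisfied else "bad"].append(t)
    pairs = []
    for group in by_task.values():
        for good in group["good"]:
            for bad in group["bad"]:
                pairs.append({"prompt": good.prompt,
                              "chosen": good.actions,
                              "rejected": bad.actions})
    return pairs


# Made-up trajectories for illustration.
trajs = [
    Trajectory("t1", "Order from the closest store", ["searchStores", "openStore", "placeOrder"], True),
    Trajectory("t1", "Order from the closest store", ["openStore", "placeOrder"], False),
    Trajectory("t2", "Buy the cheapest item", ["searchStores", "placeOrder"], False),
]
print(len(build_sft_examples(trajs)), len(build_preference_pairs(trajs)))
```

Tasks with no successful run (like `t2` here) contribute nothing to either dataset, which is one reason the hardest task categories may need many more runs before training on them is possible.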

Then we'll rerun the exact same benchmark and report how much each method improves Task Satisfied (correctness), not just "Action Completed" (finishing).

2. Build Higher-Impact, High-Distribution Environments

Food ordering is a clean, low-stakes testbed. The goal is environments that mirror real deployment: enterprise platforms (Slack, Jira, Salesforce, AWS) and deep research workflows where mistakes are costly. We'll bring the same recipe (clear tasks, real tool APIs, and a strict verifier) to these domains so agent performance is measured and improved on work that actually matters.

The thesis: Can non-deterministic systems find deterministic solutions? The answer: sometimes, on some tasks, with the best models. RL environments make that answer precise and improvable. That's the point.

Interested in our work, access to the dataset, or looking to collaborate? Reach us at [email protected].