March 22, 2026 · 6 min read

Implementing In-Context Learning in a Custom Environment

Oviya Seeniraj, Gary Shen, Mihir Kachroo

Can agents learn from past successes? We added a trajectory bank and few-shot learning. Same tasks, same models — but agents now receive demonstrations from prior wins. Here's what happened.

Prev. in series: RL agents in a food-order environment

Benchmarking AI Agents using RL Environments

Can non-deterministic systems arrive at deterministic solutions? We present our findings after evaluating frontier AI models across 180 episodes in our custom RL environment.

Intro

In our first post, we tested top AI models in a food-ordering environment without giving them any prior examples. They couldn’t learn from past mistakes or successes. This raised a key question: could performance improve if agents were allowed to learn from previous runs?

In this blog, we explore in-context learning (ICL): improving agent behavior without updating model weights, simply by providing better context at inference time. We intentionally chose ICL over supervised fine-tuning (SFT) because, in practice, fine-tuning flagship models is often impractical, costly, and unnecessary.

To test this, we built a lightweight few-shot learning loop: successful task completions are stored, indexed, and surfaced as demonstrations before the agent tackles a new task.

We then reran the same benchmark from our first post in the same environment (10 tasks, 3 runs per model). This post answers a central question: can ICL improve frontier agents—and where does it fall short?

The Learning Loop

At a high level, the system has four parts that form a loop:

Past successes — We collect successful task completions (right store, right items, best discount applied).
Knowledge bank — These are indexed by task similarity so we can retrieve relevant examples.
Agent — Before acting, the agent receives one or two similar successful demonstrations in its prompt.
Environment — The agent then acts on the live task. New successes feed back into the bank.

Figure: the learning loop, from past successes to knowledge bank to agent to environment, with successes feeding back. Agents improve by learning from past successes.

Same agent, same environment — the only change from baseline to few-shot is that we add context from previous similar runs to the prompt.
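The loop above can be sketched in a few lines. This is a minimal illustration, not our production code: the names `Trajectory`, `TrajectoryBank`, `similarity`, and `build_prompt` are hypothetical, the word-overlap score is a stand-in for whatever similarity index is actually used, and the tasks are made-up examples.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    task: str            # natural-language task description
    actions: list[str]   # steps the agent took in the successful run


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets; a simple stand-in for a real similarity index."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


@dataclass
class TrajectoryBank:
    items: list[Trajectory] = field(default_factory=list)

    def record(self, traj: Trajectory, satisfied: bool) -> None:
        """Store a run only if it was a success (the feedback edge of the loop)."""
        if satisfied:
            self.items.append(traj)

    def retrieve(self, task: str, k: int = 2) -> list[Trajectory]:
        """Return the k stored successes most similar to the new task."""
        ranked = sorted(self.items, key=lambda t: similarity(task, t.task), reverse=True)
        return ranked[:k]


def build_prompt(task: str, demos: list[Trajectory]) -> str:
    """Prepend retrieved demonstrations to the live task."""
    blocks = []
    for i, demo in enumerate(demos, start=1):
        steps = "\n".join(f"  {n}. {a}" for n, a in enumerate(demo.actions, start=1))
        blocks.append(f"Example {i}: {demo.task}\n{steps}")
    blocks.append(f"Task: {task}")
    return "\n\n".join(blocks)


# Hypothetical tasks, for illustration only.
bank = TrajectoryBank()
bank.record(Trajectory("order a large pepperoni pizza with the best discount",
                       ["pick store", "add large pepperoni pizza",
                        "apply best discount", "place order"]), satisfied=True)
bank.record(Trajectory("order two veggie burgers with fries",
                       ["pick store", "add two veggie burgers",
                        "add fries", "place order"]), satisfied=True)
bank.record(Trajectory("order a small cheese pizza",
                       ["pick store", "add small cheese pizza", "place order"]),
            satisfied=True)
bank.record(Trajectory("order a medium pepperoni pizza", ["place order"]),
            satisfied=False)  # a failed run is never stored

new_task = "order a medium pepperoni pizza with a discount"
demos = bank.retrieve(new_task, k=2)
prompt = build_prompt(new_task, demos)
print(prompt)
```

Only the prompt changes between the baseline and few-shot conditions in this sketch; the agent and environment are untouched, which mirrors the setup described above.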

The key idea: demonstrations teach correctness, not just completion. Zero-shot agents often finish the workflow (place an order) but get details wrong — wrong store, suboptimal discount, missing add-on. Showing them how someone solved a similar task addresses those failure modes.

Observations & Results

We ran the benchmark with 3 runs × 10 tasks per model = 30 baseline and 30 in-context episodes each. Here are our results:

📌 Note

TC = Task Completed. Agents completed SOME order and arrived at SOME end state, regardless of correctness.
TS = Task Satisfied. Agents completed the order CORRECTLY, with right store/items/discount.
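To make the two metrics concrete, here is a small sketch of how TC and TS could be computed from per-episode records. The `Episode` fields are assumptions chosen to match the correctness criteria named in this post (store, items, discount), and the four episodes are toy data, not benchmark results.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    order_placed: bool    # agent reached some end state with some order
    right_store: bool
    right_items: bool
    best_discount: bool


def task_completed(ep: Episode) -> bool:
    """TC: an order was placed, regardless of correctness."""
    return ep.order_placed


def task_satisfied(ep: Episode) -> bool:
    """TS: an order was placed and every correctness check passed."""
    return ep.order_placed and ep.right_store and ep.right_items and ep.best_discount


def rates(episodes: list[Episode]) -> tuple[float, float]:
    """Return (TC rate, TS rate) over a list of episodes."""
    n = len(episodes)
    tc = sum(task_completed(e) for e in episodes)
    ts = sum(task_satisfied(e) for e in episodes)
    return tc / n, ts / n


# Toy data for illustration only.
toy = [
    Episode(True, True, True, True),      # completed and correct
    Episode(True, True, True, False),     # completed, suboptimal discount
    Episode(True, True, True, True),      # completed and correct
    Episode(False, False, False, False),  # no order placed
]
tc_rate, ts_rate = rates(toy)
print(f"TC={tc_rate:.0%} TS={ts_rate:.0%}")
```

Under these definitions TS can never exceed TC, since a satisfied task is by construction a completed one.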

Model	Provider	Baseline TC	Baseline TS	ICL TC	ICL TS
Gemini 3.1 Pro	Google	93% (28/30)	90% (27/30)	90% (27/30)	90% (27/30)
GPT-5.4	OpenAI	57% (17/30)	43% (13/30)	97% (29/30)	90% (27/30)
GPT-5 Mini	OpenAI	60% (18/30)	33% (10/30)	93% (28/30)	87% (26/30)
Gemini 3 Flash	Google	100% (30/30)	80% (24/30)	100% (30/30)	100% (30/30)
Claude Opus 4.6	Anthropic	90% (27/30)	80% (24/30)	93% (28/30)	83% (25/30)
Claude Haiku 4.5	Anthropic	87% (26/30)	23% (7/30)	93% (28/30)	80% (24/30)

The gap between task completion and correctness — and the changes from baseline to ICL — is the story.
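The baseline-to-ICL change in task satisfaction can be recomputed directly from the episode counts in the table. The snippet below is a small worked calculation; the counts are copied from the table, and the helper name `delta_points` is ours.

```python
# (baseline TS, ICL TS) correct-episode counts out of 30, copied from the table.
TS_COUNTS = {
    "Gemini 3.1 Pro": (27, 27),
    "GPT-5.4": (13, 27),
    "GPT-5 Mini": (10, 26),
    "Gemini 3 Flash": (24, 30),
    "Claude Opus 4.6": (24, 25),
    "Claude Haiku 4.5": (7, 24),
}
EPISODES = 30


def delta_points(baseline: int, icl: int, n: int = EPISODES) -> int:
    """Change in task-satisfied rate, in percentage points, rounded to an integer."""
    return round(100 * (icl - baseline) / n)


deltas = {model: delta_points(b, i) for model, (b, i) in TS_COUNTS.items()}
for model, d in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:18s} {d:+d} points")
```

Working from counts rather than the rounded percentages avoids off-by-one rounding differences (for example, 16 extra episodes out of 30 is 53.3 points, while subtracting the rounded 33% from 87% gives 54).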

I. Frontier models learn prudence: Abstention is a feature, not a bug

Gemini 3.1 Pro baseline: 90% task satisfied. Gemini 3.1 Pro ICL: 90% task satisfied (no change in correctness).

But look at task completion: it drops from 93% (28/30) to 90% (27/30), a 3-point decrease. This is not a regression; it's learned calibration. Demonstrations taught the model to abstain: when it's unsure, it skips the action rather than confidently completing an incorrect order.

Zero-shot, Gemini 3.1 Pro completes 93% of workflows (28/30), but one of them is wrong. Few-shot, it completes 90% (27/30), and all of them are correct. This abstention was not a failure: it was a learned strategy to reduce hallucination.
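One way to quantify this trade is to ask how often a completed order was also a correct one, that is, TS divided by TC. The sketch below uses the Gemini 3.1 Pro counts from the results table; the function name `correct_given_completed` is ours.

```python
def correct_given_completed(satisfied: int, completed: int) -> float:
    """Share of completed orders that were also correct (TS / TC)."""
    return satisfied / completed if completed else 0.0


# Gemini 3.1 Pro counts from the results table (out of 30 episodes).
baseline_tc, baseline_ts = 28, 27
icl_tc, icl_ts = 27, 27

baseline_precision = correct_given_completed(baseline_ts, baseline_tc)
icl_precision = correct_given_completed(icl_ts, icl_tc)
wrong_orders_baseline = baseline_tc - baseline_ts
wrong_orders_icl = icl_tc - icl_ts

print(f"baseline: {baseline_precision:.1%} correct, {wrong_orders_baseline} wrong order(s)")
print(f"ICL:      {icl_precision:.1%} correct, {wrong_orders_icl} wrong order(s)")
```

Note that the difference here is a single episode out of 30, so it is best read as a signal worth following up rather than a settled effect.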

🔍 General Pattern

Few-shot in-context learning lets agents trade raw completion for accuracy, reducing hallucinated orders.

II. Smaller models gain most from demonstrations; scaling down becomes practical

Figure: bar chart of few-shot task satisfaction rates for all models with trajectory bank demonstrations. Claude Haiku reaches 80% and GPT-5 Mini 87%, matching or exceeding mid-tier baselines.

Claude Haiku gains 17 correct episodes out of 30 (about +57 points), GPT-5 Mini 16 (about +53 points), and GPT-5.4 14 (about +47 points). The pattern holds: models with the largest baseline gaps gain the most from few-shot.

This has a practical implication: cheap models get significantly smarter when given the right examples. Claude Haiku at 80% ICL task satisfaction matches Claude Opus's 80% baseline and comes close to Opus's 83% with ICL, all without fine-tuning. GPT-5 Mini at 87% rivals frontier models' zero-shot performance.

The trajectory bank acts as a knowledge transfer mechanism: expensive models' successes lift cheaper models to comparable performance levels.

🔍 General Pattern

Few-shot learning gains can help compact models (Claude Haiku, GPT-5 Mini, and Gemini Flash) reach flagship model (Opus, Pro) accuracy, a large cost and compute win if explored correctly.

III. Few-shot dramatically improves smaller models; ceiling effects for frontier

Figure: bar chart of few-shot improvement deltas by model. Claude Haiku +17 episodes (about +57 points), GPT-5 Mini +16 (about +53 points), GPT-5.4 +14 (about +47 points); frontier models plateau near the ceiling.

Claude Haiku 4.5: 23% baseline → 80% few-shot (about +57 points, 17 more correct episodes). GPT-5 Mini: 33% → 87% (about +53 points, 16 more). GPT-5.4: 43% → 90% (about +47 points, 14 more).

All three close the correctness gap significantly.

In contrast, Gemini 3.1 Pro, already at 90% baseline, shows no improvement. Claude Opus 4.6 moves from 80% to 83% (+3 points, one more correct episode). The pattern is clear: few-shot helps where zero-shot leaves room for improvement; near the ceiling (high-performance models), demonstrations offer little or no benefit.

🔍 General Pattern

In-context learning in this case is most effective on compact or lower-performing models, where it drives substantial gains, while flagship models show little to no improvement.

IV. Simply put, in-context learning works without fine-tuning

Figure: horizontal bar chart of baseline vs. few-shot task satisfaction for all models. Few-shot significantly improves smaller models; frontier models show diminishing returns.

We don’t modify model weights—the entire learning signal lives in the prompt. By showing agents similar successful runs before they act, we keep the system lightweight and easy to reproduce. The tradeoff is that improvements are purely in-context: cold-start performance doesn’t improve, and we remain constrained by limited context windows.
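The context-window constraint can be handled by capping how many demonstrations go into the prompt. The sketch below is one simple way to do that, not a description of our implementation: `rough_tokens` is a crude whitespace count standing in for a real tokenizer, `fit_demos` and the budget value are hypothetical, and the demonstrations are placeholder strings assumed to be sorted most-similar first.

```python
def rough_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (a real tokenizer would differ)."""
    return len(text.split())


def fit_demos(demos: list[str], budget: int) -> list[str]:
    """Greedily keep demonstrations, in the given order, while they fit the budget."""
    chosen: list[str] = []
    used = 0
    for demo in demos:
        cost = rough_tokens(demo)
        if used + cost > budget:
            continue  # skip this one; a shorter, later demonstration may still fit
        chosen.append(demo)
        used += cost
    return chosen


# Placeholder demonstrations, assumed sorted from most to least similar.
ranked_demos = [
    "pick store add pizza apply discount checkout",    # 6 words
    "pick store add burger add fries apply discount",  # 8 words
    "pick store checkout",                             # 3 words
]
selected = fit_demos(ranked_demos, budget=10)
print(selected)
```

With a budget of 10, the first and third demonstrations are kept and the second is skipped. A greedy pass like this favors the most similar examples first, at the cost of sometimes leaving part of the budget unused.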

🔍 General Pattern

Few-shot demonstrations from a trajectory bank improve agentic workflows, but the effect is not permanent: it lasts only while the demonstrations are in the prompt.

Next Steps

1. Higher-impact environments: scaling up to longer horizons and enterprise workflows

Today, AI agents are increasingly used to automate multi-application workflows. We will now extend our environment-based benchmarking approach to complex, multi-step, multi-agent enterprise workflows (Slack, Jira, Gmail, Workday, etc.) to evaluate real-world impact.

2. Creating dynamic, self-generating environments

Environments can reveal a wide range of agent failure modes, but their effectiveness depends on the quality of the environment and task bank—both of which are highly specific to each agent. To address this, we are moving toward adversarially generated tasks and environments, enabling automated, intelligent stress-testing through Unsupervised Environment Design (UED).

Stay tuned: our next blog reveals the evaluation results when UED methodologies are brought to high-impact enterprise environments.

Recap:

The thesis: Agents can learn from past successes with ICL, without any fine-tuning.
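Results: A trajectory bank plus few-shot demonstrations raised task satisfaction for five of the six models and left Gemini 3.1 Pro unchanged at 90%. Gemini 3.1 Pro, the strongest baseline model, stopped placing incorrect orders, while smaller models reached frontier-level baseline performance.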

Interested in our work, access to this experiment, or looking to collaborate? Reach us at [email protected].