Notes on MedAgentBench: Benchmarking Medical LLM Agents

By Abul Hasan – AI/ML Scientist, University of Oxford + ChatGPT 5, OpenAI

24 September 2025

Reference:
Su J, Luo S, Yang S, et al. MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents. NEJM AI. 2025;1(3). DOI: 10.1056/AIdbp2500144

🌐 What is MedAgentBench?

MedAgentBench is a benchmark designed to test medical large language model (LLM) agents inside a virtual electronic health record (EHR) environment. Instead of just answering static multiple-choice or QA questions, it evaluates how well models can act as agents: retrieving information, writing to an EHR, and completing multi-step workflows.

πŸ—οΈ How it Works

📊 Findings

In an evaluation of 12 models, Claude 3.5 Sonnet v2 was the best performer (~69.7% success rate). Agents did well on information-retrieval tasks but struggled with multi-step reasoning and structured writes. Common failures included malformed JSON, incomplete responses, and poor planning across multiple steps.

💡 Why it Matters

MedAgentBench shifts benchmarking from static QA toward interactive, workflow-based evaluation. It surfaces deployment challenges that matter in healthcare: handling structure, reliability, and multi-step reasoning. It also provides a public benchmark to accelerate research into safer and more effective medical agents.

🤔 Common Questions

1. What is the difference between agent orchestration and graphs?

- Agent orchestration refers to the logic that coordinates multiple agents. It decides which agent runs, in what order, and how they exchange information. Think of it like a conductor guiding musicians in an orchestra.

- Graphs (like state graphs or workflow graphs) are one way to implement orchestration. Each node can represent an agent (retriever, verifier, summariser), and edges define how the state flows between them. In short: orchestration is the concept; a graph is one tool to achieve it.
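As a toy illustration, the graph idea can be sketched in plain Python. The node names (retriever, verifier, summariser) mirror the examples above, but the state fields and logic are invented for this sketch, not taken from any framework:

```python
# Minimal state-graph sketch: each node is a function that updates a shared
# state dict and returns the name of the next node (or None to stop).
def retriever(state):
    state["docs"] = ["record A", "record B"]   # pretend retrieval step
    return "verifier"                          # edge: retriever -> verifier

def verifier(state):
    state["verified"] = [d for d in state["docs"] if "record" in d]
    return "summariser"                        # edge: verifier -> summariser

def summariser(state):
    state["summary"] = f"{len(state['verified'])} verified records"
    return None                                # terminal node

NODES = {"retriever": retriever, "verifier": verifier, "summariser": summariser}

def run_graph(entry, state):
    """Walk the graph: the orchestration logic is just 'follow the edges'."""
    node = entry
    while node is not None:
        node = NODES[node](state)
    return state

state = run_graph("retriever", {})
```

The orchestration here is the `run_graph` loop; the graph (the `NODES` table plus the return values acting as edges) is merely the data structure it walks, which is the concept/tool distinction made above.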

2. How is orchestration built in MedAgentBench?

MedAgentBench uses a simple round-based loop instead of a graph framework. At each round, the system gives the agent the task and history, receives an action proposal (GET/POST/FINISH), validates it, executes it in the EHR environment, and updates the history. This loop is repeated until the task is finished or the step limit is reached.
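The round-based loop described above can be sketched roughly as follows. `agent`, `ehr`, and the stub classes are hypothetical stand-ins for this illustration, not MedAgentBench's actual API:

```python
def run_episode(agent, ehr, task, max_rounds=8):
    """Round-based orchestration sketch: propose, validate, execute, update.
    Assumed interfaces: agent.propose(task, history) -> dict,
    ehr.execute(proposal) -> result."""
    history = []
    for _ in range(max_rounds):
        # 1. Give the agent the task plus the interaction history so far.
        proposal = agent.propose(task, history)

        # 2. Validate the proposed action before touching the EHR.
        if proposal.get("action") not in {"GET", "POST", "FINISH"}:
            history.append({"error": "invalid action", "proposal": proposal})
            continue

        if proposal["action"] == "FINISH":
            return proposal.get("answer"), history

        # 3. Execute GET/POST in the EHR environment and record the result.
        result = ehr.execute(proposal)
        history.append({"proposal": proposal, "result": result})

    return None, history  # step limit reached without FINISH


class EchoAgent:
    """Toy agent: one GET, then FINISH."""
    def __init__(self):
        self.turn = 0
    def propose(self, task, history):
        self.turn += 1
        if self.turn == 1:
            return {"action": "GET", "url": "/Patient/123"}
        return {"action": "FINISH", "answer": "done"}

class FakeEHR:
    """Toy EHR that accepts any request."""
    def execute(self, proposal):
        return {"status": 200, "body": {}}

answer, hist = run_episode(EchoAgent(), FakeEHR(), task="demo")
```

Note that validation happens before execution in each round; this is where the format failures discussed below would be caught.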

3. Why do strict formats matter?

Many failures came from agents generating malformed JSON or invalid payloads. In high-stakes fields like healthcare, even a small format error can break a workflow. Benchmarks like MedAgentBench highlight the need for validators, sanitisers, or stricter schema enforcement when deploying LLMs as agents.
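A minimal validator along these lines can be written with only the standard library. The required fields below are illustrative, not the benchmark's real schema:

```python
import json

# Hypothetical required fields for an agent's action proposal.
REQUIRED_FIELDS = {"action": str, "url": str, "payload": dict}

def validate_proposal(raw: str):
    """Return (ok, parsed_or_error). Rejects malformed JSON and
    missing or mistyped fields before anything reaches the EHR."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return False, f"malformed JSON: {e}"
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in data:
            return False, f"missing field: {field}"
        if not isinstance(data[field], ftype):
            return False, f"wrong type for {field}: expected {ftype.__name__}"
    return True, data

# Truncated JSON (a common agent failure mode) is caught before execution.
ok, err = validate_proposal('{"action": "POST", "url": "/Observation"')

# A well-formed proposal passes through.
good, parsed = validate_proposal(
    '{"action": "POST", "url": "/Observation", "payload": {"value": 98.6}}'
)
```

In production one would likely reach for a full schema library rather than hand-rolled checks, but the principle is the same: reject bad structure at the boundary, before it touches patient data.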
