24 September 2025
Reference:
Su J, Luo S, Yang S, et al. MedAgentBench: A Virtual EHR Environment to Benchmark Medical LLM Agents.
NEJM AI. 2025;1(3). DOI: 10.1056/AIdbp2500144
MedAgentBench is a benchmark designed to test medical large language model (LLM) agents inside a virtual electronic health record (EHR) environment. Instead of just answering static multiple-choice or QA questions, it evaluates how well models can act as agents: retrieving information, writing to an EHR, and completing multi-step workflows.
At each step, the agent interacts with the environment through GET, POST, and FINISH actions. In an evaluation of 12 models, Claude 3.5 Sonnet v2 was the best performer (~69.7% success). Agents did well on information-retrieval tasks but struggled with multi-step reasoning and structured writes. Common failures included malformed JSON, incomplete responses, and poor planning across multiple steps.
MedAgentBench shifts benchmarking from static QA toward interactive, workflow-based evaluation. It surfaces deployment challenges that matter in healthcare: handling structure, reliability, and multi-step reasoning. It also provides a public benchmark to accelerate research into safer and more effective medical agents.
- Agent orchestration refers to the logic that coordinates multiple agents.
It decides which agent runs, in what order, and how they exchange information.
Think of it like a conductor guiding musicians in an orchestra.
- Graphs (like state graphs or workflow graphs) are one way to implement orchestration.
Each node can represent an agent (retriever, verifier, summariser), and edges define how the state flows between them.
In short: orchestration is the concept; a graph is one tool to achieve it (see the sketch below).
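To make that concrete, here is a minimal, hypothetical sketch of graph-style orchestration in Python: nodes are agent functions that transform a shared state, and edges decide which agent runs next. None of the names here come from MedAgentBench; it is just the concept in code.

```python
# Minimal sketch of graph-style orchestration (illustrative only):
# nodes are agent functions over a shared state dict, and edges map
# each node to the next node to run.

from typing import Callable, Dict, Optional

State = Dict[str, object]
Node = Callable[[State], State]

def retriever(state: State) -> State:
    # Hypothetical agent: fetch context for the question.
    state["context"] = f"docs about {state['question']}"
    return state

def summariser(state: State) -> State:
    # Hypothetical agent: draft an answer from the context.
    state["answer"] = f"summary of {state['context']}"
    return state

def verifier(state: State) -> State:
    # Hypothetical agent: check the draft; here it always passes.
    state["verified"] = True
    return state

NODES: Dict[str, Node] = {
    "retriever": retriever,
    "summariser": summariser,
    "verifier": verifier,
}
EDGES: Dict[str, Optional[str]] = {
    "retriever": "summariser",
    "summariser": "verifier",
    "verifier": None,  # terminal node
}

def run_graph(start: str, state: State) -> State:
    node: Optional[str] = start
    while node is not None:
        state = NODES[node](state)  # each node updates the shared state
        node = EDGES[node]          # edges decide which agent runs next
    return state

print(run_graph("retriever", {"question": "What is MedAgentBench?"}))
```

Frameworks like LangGraph implement the same idea with more machinery (conditional edges, persistence), but the core pattern is just this: a state, a set of nodes, and a routing rule.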
MedAgentBench uses a simple round-based loop instead of a graph framework. At each round, the system gives the agent the task and history, receives an action proposal (GET/POST/FINISH), validates it, executes it in the EHR environment, and updates the history. This loop is repeated until the task is finished or the step limit is reached.
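A rough sketch of that loop, with hypothetical Agent and EHR stand-ins (MedAgentBench's real interfaces will differ):

```python
# Sketch of the round-based loop described above. The Agent and EHR
# classes are toy stand-ins for illustration, not the benchmark's API.

import json

MAX_ROUNDS = 8  # assumed step limit

class EHR:
    """Toy stand-in for the virtual EHR environment."""
    def get(self, url):
        return {"url": url, "resource": "stub record"}
    def post(self, url, payload):
        return {"url": url, "status": "created"}

class Agent:
    """Toy stand-in for the LLM agent: finishes on its first round."""
    def propose(self, task, history):
        return json.dumps({"action": "FINISH", "result": "done"})

def run_task(task, agent, ehr):
    history = []
    for _ in range(MAX_ROUNDS):
        # 1. Give the agent the task plus the interaction history.
        raw = agent.propose(task, history)

        # 2. Validate the proposal; malformed JSON is a recorded failure.
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            history.append({"error": "malformed JSON", "raw": raw})
            continue

        # 3. Execute the validated action in the environment.
        kind = action.get("action")
        if kind == "FINISH":
            return action.get("result")  # task complete
        elif kind == "GET":
            observation = ehr.get(action.get("url"))
        elif kind == "POST":
            observation = ehr.post(action.get("url"), action.get("payload"))
        else:
            observation = {"error": f"unknown action {kind!r}"}

        # 4. Update the history and start the next round.
        history.append({"action": action, "observation": observation})
    return None  # step limit reached without FINISH

print(run_task("retrieve latest labs", Agent(), EHR()))
```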
Many failures came from agents generating malformed JSON or invalid payloads. In high-stakes fields like healthcare, even a small format error can break a workflow. Benchmarks like MedAgentBench highlight the need for validators, sanitizers, or stricter schema enforcement when deploying LLMs as agents.
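One lightweight defence is to validate every proposed write against a JSON Schema before it touches the EHR. A sketch using the jsonschema library, with an illustrative schema (the required fields and URL pattern here are assumptions, not the benchmark's spec):

```python
# Sketch of schema enforcement before executing a write: parse the
# agent's raw output, validate it against a JSON Schema, and reject it
# on any failure instead of forwarding a broken payload to the EHR.

import json
from jsonschema import validate, ValidationError  # pip install jsonschema

POST_SCHEMA = {
    "type": "object",
    "required": ["action", "url", "payload"],
    "properties": {
        "action": {"const": "POST"},
        "url": {"type": "string", "pattern": "^/fhir/"},  # assumed URL shape
        "payload": {"type": "object"},
    },
    "additionalProperties": False,
}

def safe_parse_post(raw: str):
    """Return a validated POST action, or None if it must be rejected."""
    try:
        action = json.loads(raw)                       # catches malformed JSON
        validate(instance=action, schema=POST_SCHEMA)  # catches bad structure
        return action
    except (json.JSONDecodeError, ValidationError) as err:
        print(f"rejected action: {err}")
        return None

print(safe_parse_post('{"action": "POST", "url": "/fhir/Observation", "payload": {}}'))
print(safe_parse_post('{"action": "POST"'))  # malformed JSON -> rejected
```

Rejected actions can be logged and fed back to the agent as an error observation, which turns a silent workflow break into a recoverable step.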
← Back to Blogs