Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval.
We introduce DEEPSYNTH, a novel benchmark of 120 tasks across 7 domains and 67 countries, designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning. When evaluated on DEEPSYNTH, 12 state-of-the-art LLMs and deep research agents achieve a maximum F1 of only 8.97. Our analysis reveals that current agents struggle with hallucinations and with reasoning over large information spaces.
DEEPSYNTH evaluates agents on their ability to navigate multiple websites, extract information from both structured and unstructured sources, and reason effectively to produce correct solutions. Each task yields a concise JSON output enabling straightforward verification. The design of DEEPSYNTH tasks is driven by five criteria:
1. Tasks require identifying connections across multiple data sources and combining information into a coherent solution.
2. Tasks are designed so that their insights could plausibly inform the decisions of policy makers, travel agents, political scientists, and other practitioners.
3. Each task has a closed-form JSON answer that can be automatically verified and remains stable over time, enabling reproducible evaluation.
4. Tasks span 67 countries and 7 domains, covering temporal analyses, comparative evaluations, and relational reasoning.
5. Gold-standard answers are intentionally non-retrievable through verbatim lookup, compelling agents to plan and perform multi-step reasoning to derive the correct output.
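The closed-form JSON answers described above lend themselves to automatic verification. The sketch below shows one way such a verifier could work; the normalization convention (lower-cased keys and strings, order-insensitive lists) is our assumption for illustration, not the benchmark's official matching rule.

```python
import json

def normalize(value):
    """Recursively normalize a parsed JSON value so that comparison is
    case-insensitive for strings/keys and order-insensitive for lists
    (a hypothetical convention; the benchmark's matcher may differ)."""
    if isinstance(value, dict):
        return {k.lower(): normalize(v) for k, v in value.items()}
    if isinstance(value, list):
        # Serialize normalized elements so heterogeneous lists sort stably.
        return sorted(json.dumps(normalize(v), sort_keys=True) for v in value)
    if isinstance(value, str):
        return value.strip().lower()
    return value

def exact_match(pred_json: str, gold_json: str) -> bool:
    """Closed-form verification: parse both answers and compare them."""
    try:
        return normalize(json.loads(pred_json)) == normalize(json.loads(gold_json))
    except json.JSONDecodeError:
        return False  # malformed model output counts as a miss

print(exact_match('{"country": "Kenya", "years": [2019, 2021]}',
                  '{"Country": " kenya ", "Years": [2021, 2019]}'))  # True
```

Because the answer is a single JSON object, verification stays deterministic and reproducible, which is what makes the exact-match (EM) column in the results table well defined.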
Data Collection Pipeline. Building DEEPSYNTH involved four key stages: (a) identifying data sources, (b) gathering hypotheses, (c) validating hypotheses through analysis, and (d) formulating tasks with intermediate steps. 16 human experts (81.25% PhD holders) proposed 223 data sources across 7 domains. All tasks underwent independent double-annotation; only tasks with agreement were retained, yielding the final 120 tasks.
Required Capabilities. Web search and browsing are needed for 100% of tasks, while 45% require diverse filetype reading, 43% need code execution, and 3% involve multi-modal inputs.
| Model | F1 | Prec. | Recall | EM | LLM Judge |
|---|---|---|---|---|---|
| **LLM Baselines** | | | | | |
| o4-mini | 3.05 | 2.33 | 4.39 | 0.0 | 0.0 |
| GPT-4.1 | 3.46 | 2.86 | 4.39 | 0.0 | 0.0 |
| o3 | 3.29 | 2.85 | 3.90 | 0.0 | 0.0 |
| GPT-5.1 | 3.83 | 2.98 | 5.37 | 0.0 | 0.0 |
| Gemini-Pro-2.5 | 6.25 | 4.71 | 9.27 | 0.0 | 5.0 |
| GPT-5.2-Pro | 8.70 | 8.45 | 8.96 | 6.25 | 6.67 |
| DeepSeek-R1-Chat | 3.23 | 2.75 | 3.90 | 1.67 | 2.5 |
| DeepSeek-R1-Reasoner | 2.80 | 2.73 | 2.87 | 2.50 | 6.67 |
| **Framework-based Agents** | | | | | |
| o3-deep-research | 8.97 | 7.73 | 10.69 | 2.50 | 17.5 |
| Smolagent (GPT-4.1) | 3.75 | 3.27 | 4.39 | 2.50 | 7.5 |
| Smolagent (GPT-5) | 6.42 | 6.34 | 6.50 | 1.67 | 2.5 |
| OWL (GPT-4.1) | 5.41 | 4.62 | 6.52 | 1.67 | 12.5 |
We evaluate on the DEEPSYNTH-Dev (Lite) subset. Among standalone LLMs, GPT-5.2-Pro achieves the highest F1 (8.70) and, tied with DeepSeek-R1-Reasoner, the highest LLM-Judge score (6.67). Among agents, o3-deep-research attains both the best overall F1 (8.97) and the best LLM-Judge score (17.5), reinforcing that tool augmentation benefits synthesis-heavy tasks.
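The F1, precision, and recall columns can be read as bag-of-items overlap between predicted and gold answer fields. The snippet below is a generic sketch of that computation; the official scorer may tokenize or weight fields differently.

```python
from collections import Counter

def f1_prec_recall(pred_items, gold_items):
    """Bag-of-items precision/recall/F1 between predicted and gold answer
    fields (an illustrative sketch, not the benchmark's official scorer)."""
    pred, gold = Counter(pred_items), Counter(gold_items)
    overlap = sum((pred & gold).values())  # multiset intersection size
    if overlap == 0:
        return 0.0, 0.0, 0.0
    prec = overlap / sum(pred.values())
    rec = overlap / sum(gold.values())
    return 2 * prec * rec / (prec + rec), prec, rec

# Prediction recovers 2 of 3 gold items with no spurious extras.
f1, p, r = f1_prec_recall(["kenya", "2019"], ["kenya", "2021", "2019"])
print(round(f1, 2), round(p, 2), round(r, 2))  # 0.8 1.0 0.67
```

Under this reading, a model can score nonzero F1 while still failing exact match, which explains why the EM column is so much lower than F1 throughout the table.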
Error Propagation. Evaluating intermediate-step accuracy on 40 tasks reveals steep decay: retrieval steps achieve only 2.6–12.4% F1, and computation steps collapse to near zero. When a step fails, the next step also fails 97–100% of the time.
| Step | DeepSeek-R1 | GPT-4.1 | GPT-5.2 | Prop. (%) |
|---|---|---|---|---|
| Step 1 | 11.2 | 10.0 | 4.1 | — |
| Step 2 | 12.4 | 9.8 | 2.6 | 97.0 |
| Step 3 | 3.9 | 3.3 | 0.5 | 100.0 |
| Step 4 | 1.4 | 2.4 | 0.0 | 100.0 |
| Step 5+ | 0.0–0.2 | 0.0 | 0.0 | 100.0 |
| Final Answer | 20.1 | 18.5 | 16.7 | — |
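The Prop. (%) column above can be interpreted as a conditional failure rate: among tasks where step *i* fails, the fraction where step *i*+1 also fails. The helper below computes that quantity under this reading; the paper's exact definition may differ.

```python
def propagation_rates(tasks):
    """tasks: per-task lists of step outcomes, e.g. [True, False, False].
    Returns, for each step i, the percentage of tasks where a failure at
    step i is followed by a failure at step i+1 (our interpretation of
    the Prop. (%) column), or None if no task failed at step i."""
    depth = max(len(t) for t in tasks)
    rates = []
    for i in range(depth - 1):
        failed_here = [t for t in tasks if len(t) > i + 1 and not t[i]]
        if not failed_here:
            rates.append(None)  # nothing to condition on at this step
            continue
        also_failed_next = sum(1 for t in failed_here if not t[i + 1])
        rates.append(100.0 * also_failed_next / len(failed_here))
    return rates

# Toy run: 3 tasks, each with 3 step outcomes (True = step solved).
print(propagation_rates([[False, False, True],
                         [False, False, False],
                         [True, False, False]]))
```

Rates near 100%, as in the table, mean errors almost never self-correct: once a retrieval step goes wrong, every downstream computation inherits the mistake.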
Error Types. We manually analyze 32 errors produced by OWL (GPT-4.1).
Geographic Bias. All models score an F1 of 0.0 on Africa-related tasks (8.3% of the benchmark). Performance varies sharply by region:
| Region | % | GPT-4.1 | o3-deep-res. | Gemini-2.5 | Smolagents |
|---|---|---|---|---|---|
| Africa | 8.3 | 0.0 | 0.0 | 0.0 | 0.0 |
| North America | 11.7 | 4.65 | 8.00 | 12.00 | 8.33 |
| South America | 5.0 | 0.0 | 25.00 | 0.0 | 0.0 |
| Asia | 29.2 | 3.36 | 12.70 | 6.50 | 11.88 |
| Europe | 38.3 | 3.45 | 10.83 | 4.91 | 5.28 |
| Oceania | 10.8 | 8.96 | 14.43 | 6.67 | 24.00 |
DEEPSYNTH is released in two splits: Dev (Lite) for prototyping and Test for evaluation.
@inproceedings{paul2026deepsynth,
title = {{DEEPSYNTH}: A Benchmark for Deep Information Synthesis},
author = {Debjit Paul and Daniel Murphy and Milan Gritta and Ronald Cardenas and Victor Prokhorov
and Lena Sophia Bolliger and Aysim Toker and Roy Miles and Andreea-Maria Oncescu and
Jasivan Alex Sivakumar and Philipp Borchert and Ismail Elezi and Meiru Zhang and
Ka Yiu Lee and Guchun Zhang and Jun Wang and Gerasimos Lampouras},
booktitle = {The Fourteenth International Conference on
Learning Representations (ICLR)},
year = {2026},
url = {https://openreview.net/forum?id=0Dhpt9aY3n}
}