DEEPSYNTH

A Benchmark for Deep Information Synthesis

Debjit Paul¹, Daniel Murphy², Milan Gritta¹, Ronald Cardenas¹, Victor Prokhorov¹, Jun Wang³, Gerasimos Lampouras¹

Dataset Contributors: Lena Sophia Bolliger⁴, Aysim Toker¹, Roy Miles¹, Andreea-Maria Oncescu¹,
Jasivan Alex Sivakumar⁵, Philipp Borchert¹, Ismail Elezi¹, Meiru Zhang⁶, Ka Yiu Lee¹, Guchun Zhang¹

¹Huawei Noah's Ark Lab ²Imperial College London ³UCL Centre for AI ⁴University of Zurich ⁵University of Sheffield ⁶University of Cambridge

Published at ICLR 2026

Paper Code Data Cite

Abstract

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval.

We introduce DEEPSYNTH, a novel benchmark of 120 tasks across 7 domains and 67 countries, designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning. When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 of only 8.97. Our analysis reveals that current agents struggle with hallucinations and reasoning over large information spaces.

How DEEPSYNTH Works

Figure 1. A sample task illustrating the multi-step agent pipeline: web search → browse multiple sources → extract & filter data → reason → generate structured JSON answer.

The DEEPSYNTH Benchmark

DEEPSYNTH evaluates agents on their ability to navigate multiple websites, extract information from both structured and unstructured sources, and reason effectively to produce correct solutions. Each task yields a concise JSON output enabling straightforward verification. The design of DEEPSYNTH tasks is driven by five criteria:

a) Multi-source Synthesis

Tasks require identifying connections across multiple data sources and combining information to produce a coherent solution.

b) Real-World Inspired

Tasks are designed so that insights would conceivably shape decisions of policy makers, travel agents, political scientists, etc.

c) Verifiable Answers

Each task has a closed-form JSON answer that can be automatically verified and is stable over time for reproducible evaluation.

d) Diversity

Tasks span 67 countries and 7 domains with temporal analyses, comparative evaluations, and relational reasoning.

e) Robust Against Memorisation

Gold-standard answers are intentionally non-retrievable through verbatim lookup, compelling agents to plan and perform multi-step reasoning to derive the correct output.

Data Collection Pipeline. Building DEEPSYNTH involved four key stages: (a) identifying data sources, (b) gathering hypotheses, (c) validating hypotheses through analysis, and (d) formulating tasks with intermediate steps. 16 human experts (81.25% PhD holders) proposed 223 data sources across 7 domains. All tasks underwent independent double-annotation; only tasks with agreement were retained, yielding the final 120 tasks.

Figure 2. Overview of the four-stage data collection process: data source identification, hypothesis gathering, hypothesis validation, and task formulation.

Required Capabilities. Web search and browsing are needed for 100% of tasks, while 45% require diverse filetype reading, 43% need code execution, and 3% involve multi-modal inputs.

Tool capabilities required for DEEPSYNTH tasks

Figure 3. Percentage of tasks per capability required to solve DEEPSYNTH.

Main Results

Model	F1	Prec.	Recall	EM	LLM Judge
LLM Baselines
o4-mini	3.05	2.33	4.39	0.0	0.0
GPT-4.1	3.46	2.86	4.39	0.0	0.0
o3	3.29	2.85	3.90	0.0	0.0
GPT-5.1	3.83	2.98	5.37	0.0	0.0
Gemini-Pro-2.5	6.25	4.71	9.27	0.0	5.0
GPT-5.2-Pro	8.70	8.45	8.96	6.25	6.67
DeepSeek-R1-Chat	3.23	2.75	3.90	1.67	2.5
DeepSeek-R1-Reasoner	2.80	2.73	2.87	2.50	6.67
Framework-based Agents
o3-deep-research	8.97	7.73	10.69	2.50	17.5
Smolagent (GPT-4.1)	3.75	3.27	4.39	2.50	7.5
Smolagent (GPT-5)	6.42	6.34	6.50	1.67	2.5
OWL (GPT-4.1)	5.41	4.62	6.52	1.67	12.5

Key finding: Under strict exact-match, almost all LLM baselines score zero. GPT-5.2-Pro achieves the highest F1 score of 8.70, and both GPT-5.2-Pro and DeepSeek-R1-Reasoner achieve the highest LLM Judge score of 6.67, indicating substantial room for improvement.

DEEPSYNTH-Dev Results

We evaluate on the DEEPSYNTH-Dev (Lite) subset. Among standalone LLMs, GPT-5.2 achieves the highest F1 (15.6), while Gemini-Pro-3 leads on LLM-Judge (15.0). Among agents, o3-deep-research attains the highest LLM-Judge score (20.0), reinforcing that tool augmentation benefits synthesis-heavy tasks.

Figure 4a. Pass@1 performance on DEEPSYNTH-Dev.

Figure 4b. Best@N and Self-Consistency@5 on DEEPSYNTH-Dev.

Best@N vs Self-Consistency: Under Best@5, Smolagents reaches 25.0% LLM-Judge accuracy vs. only 5.0% with majority voting — current agents exhibit high output variance where occasional runs succeed but models lack reliability.

Analysis

Error Propagation. Evaluating intermediate step accuracy on 40 tasks reveals steep decay: retrieval steps achieve 2–12% F1, computation steps collapse to near zero. When a step fails, the next step also fails 91–100% of the time.

Step	DeepSeek-R1	GPT-4.1	GPT-5.2	Prop. (%)
Step 1	11.2	10.0	4.1	—
Step 2	12.4	9.8	2.6	97.0
Step 3	3.9	3.3	0.5	100.0
Step 4	1.4	2.4	0.0	100.0
Step 5+	0.0–0.2	0.0	0.0	100.0
Final Answer	20.1	18.5	16.7	—

Error Types. Manual analysis of 32 errors from OWL (GPT-4.1):

Synthesis — wrong conclusions despite correct data

Navigation — failed to locate the correct source

No answer produced

Technical / tool failures

Geographic Bias. All models score F1 0.0 on Africa-related tasks (8.3% of benchmark). Performance varies sharply by region:

Region	%	GPT-4.1	o3-deep-res.	Gemini-2.5	Smolagents
Africa	8.3	0.0	0.0	0.0	0.0
North America	11.7	4.65	8.00	12.00	8.33
South America	5.0	0.0	25.00	0.0	0.0
Asia	29.2	3.36	12.70	6.50	11.88
Europe	38.3	3.45	10.83	4.91	5.28
Oceania	10.8	8.96	14.43	6.67	24.00

Planning is the bottleneck. Providing ground-truth intermediate steps (without answers) boosts GPT-4.1 from 3.46 → 9.36 F1 and Smolagent from 3.75 → 10.50 F1 — current agents lack effective planning, not reasoning ability.

Dataset & Code

DEEPSYNTH is released in two splits: Dev (Lite) for prototyping and Test for evaluation.

📄 Paper

Full paper on OpenReview

OpenReview

💻 Code

Eval scripts & baselines

GitHub

🧪 Dev (Lite)

Quick iteration split

Coming Soon

📊 Test Set

Full 120-task benchmark

Coming Soon

BibTeX

@inproceedings{paul2026deepsynth,
  title     = {{DEEPSYNTH}: A Benchmark for Deep Information Synthesis},
  author    = {Debjit Paul and Daniel Murphy and Milan Gritta and Ronald Cardenas and Victor Prokhorov 
               and Lena Sophia Bolliger and Aysim Toker and Roy Miles and Andreea-Maria Oncescu and
               Jasivan Alex Sivakumar and Philipp Borchert and Ismail Elezi and Meiru Zhang and 
               Ka Yiu Lee and Guchun Zhang and Jun Wang and Gerasimos Lampouras},
  booktitle = {The Fourteenth International Conference on
               Learning Representations (ICLR)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=0Dhpt9aY3n}
}