DEEPSYNTH

A Benchmark for Deep Information Synthesis

Debjit Paul1, Daniel Murphy2, Milan Gritta1, Ronald Cardenas1, Victor Prokhorov1, Jun Wang3, Gerasimos Lampouras1
Dataset Contributors: Lena Sophia Bolliger4, Aysim Toker1, Roy Miles1, Andreea-Maria Oncescu1,
Jasivan Alex Sivakumar5, Philipp Borchert1, Ismail Elezi1, Meiru Zhang6, Ka Yiu Lee1, Guchun Zhang1
1Huawei Noah's Ark Lab   2Imperial College London   3UCL Centre for AI   4University of Zurich   5University of Sheffield   6University of Cambridge  
Published at ICLR 2026

Abstract

Large language model (LLM)-based agents are increasingly used to solve complex tasks involving tool use, such as web browsing, code execution, and data analysis. However, current evaluation benchmarks do not adequately assess their ability to solve real-world tasks that require synthesizing information from multiple sources and inferring insights beyond simple fact retrieval.

We introduce DEEPSYNTH, a novel benchmark of 120 tasks across 7 domains and 67 countries, designed to evaluate agents on realistic, time-consuming problems that combine information gathering, synthesis, and structured reasoning. When evaluated on DEEPSYNTH, 11 state-of-the-art LLMs and deep research agents achieve a maximum F1 of only 8.97. Our analysis reveals that current agents struggle with hallucinations and reasoning over large information spaces.


How DEEPSYNTH Works

DEEPSYNTH pipeline animation
Figure 1. A sample task illustrating the multi-step agent pipeline: web search → browse multiple sources → extract & filter data → reason → generate structured JSON answer.

The DEEPSYNTH Benchmark

DEEPSYNTH evaluates agents on their ability to navigate multiple websites, extract information from both structured and unstructured sources, and reason effectively to produce correct solutions. Each task yields a concise JSON output enabling straightforward verification. The design of DEEPSYNTH tasks is driven by five criteria:

a) Multi-source Synthesis

Tasks require identifying connections across multiple data sources and combining information to produce a coherent solution.

b) Real-World Inspired

Tasks are designed so that insights would conceivably shape decisions of policy makers, travel agents, political scientists, etc.

c) Verifiable Answers

Each task has a closed-form JSON answer that can be automatically verified and is stable over time for reproducible evaluation.

d) Diversity

Tasks span 67 countries and 7 domains with temporal analyses, comparative evaluations, and relational reasoning.

e) Robust Against Memorisation

Gold-standard answers are intentionally non-retrievable through verbatim lookup, compelling agents to plan and perform multi-step reasoning to derive the correct output.
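The closed-form JSON answers of criterion (c) admit simple automatic scoring. Below is a minimal sketch of one way such answers might be verified, assuming flat JSON objects and counting each key-value pair as one item; the official DEEPSYNTH eval scripts may define matching differently.

```python
def score(pred: dict, gold: dict) -> dict:
    """Field-level precision/recall/F1 plus strict exact match.

    Hypothetical scorer: assumes each task answer is a flat JSON
    object and treats every key-value pair as one scorable item.
    """
    pred_items = set(pred.items())
    gold_items = set(gold.items())
    tp = len(pred_items & gold_items)  # pairs correct in both key and value
    prec = tp / len(pred_items) if pred_items else 0.0
    rec = tp / len(gold_items) if gold_items else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"precision": prec, "recall": rec,
            "f1": f1, "exact_match": float(pred == gold)}

# One of two fields matches: precision 0.5, recall 0.5, f1 0.5, exact match 0.0
print(score({"country": "Kenya", "year": 2019},
            {"country": "Kenya", "year": 2021}))
```

A scorer of this shape also explains why exact match is so punishing in the results below: a single wrong field zeroes EM while F1 still earns partial credit.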

Data Collection Pipeline. Building DEEPSYNTH involved four key stages: (a) identifying data sources, (b) gathering hypotheses, (c) validating hypotheses through analysis, and (d) formulating tasks with intermediate steps. Sixteen human experts (81.25% of whom hold PhDs) proposed 223 data sources across 7 domains. Every task was independently double-annotated, and only tasks on which annotators agreed were retained, yielding the final set of 120 tasks.

Data collection pipeline for DEEPSYNTH
Figure 2. Overview of the four-stage data collection process: data source identification, hypothesis gathering, hypothesis validation, and task formulation.

Required Capabilities. Web search and browsing are needed for 100% of tasks, while 45% require diverse filetype reading, 43% need code execution, and 3% involve multi-modal inputs.

Tool capabilities required for DEEPSYNTH tasks
Figure 3. Percentage of tasks per capability required to solve DEEPSYNTH.

Main Results

| Model | F1 | Prec. | Recall | EM | LLM Judge |
|---|---|---|---|---|---|
| LLM Baselines | | | | | |
| o4-mini | 3.05 | 2.33 | 4.39 | 0.0 | 0.0 |
| GPT-4.1 | 3.46 | 2.86 | 4.39 | 0.0 | 0.0 |
| o3 | 3.29 | 2.85 | 3.90 | 0.0 | 0.0 |
| GPT-5.1 | 3.83 | 2.98 | 5.37 | 0.0 | 0.0 |
| Gemini-Pro-2.5 | 6.25 | 4.71 | 9.27 | 0.0 | 5.0 |
| GPT-5.2-Pro | 8.70 | 8.45 | 8.96 | 6.25 | 6.67 |
| DeepSeek-R1-Chat | 3.23 | 2.75 | 3.90 | 1.67 | 2.5 |
| DeepSeek-R1-Reasoner | 2.80 | 2.73 | 2.87 | 2.50 | 6.67 |
| Framework-based Agents | | | | | |
| o3-deep-research | 8.97 | 7.73 | 10.69 | 2.50 | 17.5 |
| Smolagent (GPT-4.1) | 3.75 | 3.27 | 4.39 | 2.50 | 7.5 |
| Smolagent (GPT-5) | 6.42 | 6.34 | 6.50 | 1.67 | 2.5 |
| OWL (GPT-4.1) | 5.41 | 4.62 | 6.52 | 1.67 | 12.5 |
Key finding: Under strict exact-match, almost all LLM baselines score zero. GPT-5.2-Pro achieves the highest F1 score of 8.70, and both GPT-5.2-Pro and DeepSeek-R1-Reasoner achieve the highest LLM Judge score of 6.67, indicating substantial room for improvement.

DEEPSYNTH-Dev Results

We evaluate on the DEEPSYNTH-Dev (Lite) subset. Among standalone LLMs, GPT-5.2 achieves the highest F1 (15.6), while Gemini-Pro-3 leads on LLM-Judge (15.0). Among agents, o3-deep-research attains the highest LLM-Judge score (20.0), reinforcing that tool augmentation benefits synthesis-heavy tasks.

DEEPSYNTH-Dev Pass@1 results
Figure 4a. Pass@1 performance on DEEPSYNTH-Dev.
Best@N and Self-Consistency results
Figure 4b. Best@N and Self-Consistency@5 on DEEPSYNTH-Dev.
Best@N vs Self-Consistency: Under Best@5, Smolagents reaches 25.0% LLM-Judge accuracy versus only 5.0% with majority voting. Current agents thus exhibit high output variance: occasional runs succeed, but no single run is reliable.
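The gap between the two aggregation schemes is easy to see in code. The sketch below contrasts Best@N (credit if any sampled run is judged correct) with Self-Consistency@N (majority-vote the answers, then judge only the most common one); the judge function and answer format are hypothetical stand-ins, not the paper's exact protocol.

```python
from collections import Counter
import json

def best_at_n(runs, gold, judge):
    # Best@N: credit if ANY of the N sampled runs is judged correct.
    return float(any(judge(r, gold) for r in runs))

def self_consistency(runs, gold, judge):
    # Self-Consistency@N: pick the most common answer across runs
    # (serialised so dicts are hashable), then judge only that one.
    majority, _ = Counter(
        json.dumps(r, sort_keys=True) for r in runs
    ).most_common(1)[0]
    return float(judge(json.loads(majority), gold))

# Toy illustration with exact match as a stand-in judge: one run out
# of five hits the gold answer, but it is not the majority answer.
judge = lambda pred, gold: pred == gold
runs = [{"a": 1}, {"a": 2}, {"a": 2}, {"a": 1}, {"a": 3}]
gold = {"a": 3}
print(best_at_n(runs, gold, judge))        # 1.0
print(self_consistency(runs, gold, judge)) # 0.0
```

When correct runs are rare outliers rather than the mode, Best@N rises while majority voting stays flat, which is exactly the variance pattern reported above.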

Analysis

Error Propagation. Evaluating intermediate-step accuracy on 40 tasks reveals steep decay: retrieval steps achieve only 2–12% F1, and computation steps collapse to near zero. When a step fails, the following step also fails 91–100% of the time.

| Step | DeepSeek-R1 | GPT-4.1 | GPT-5.2 | Propagation (%) |
|---|---|---|---|---|
| Step 1 | 11.2 | 10.0 | 4.1 | |
| Step 2 | 12.4 | 9.8 | 2.6 | 97.0 |
| Step 3 | 3.9 | 3.3 | 0.5 | 100.0 |
| Step 4 | 1.4 | 2.4 | 0.0 | 100.0 |
| Step 5+ | 0.0–0.2 | 0.0 | 0.0 | 100.0 |
| Final Answer | 20.1 | 18.5 | 16.7 | |
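The propagation column is a conditional failure rate, which a few lines make precise. This is a hypothetical helper, assuming per-task boolean correctness judgments for each step; the paper's exact protocol may differ.

```python
def propagation_rate(step_ok: list[list[bool]], step: int) -> float:
    """P(step `step` fails | step `step - 1` failed) over trajectories.

    step_ok[i][t] is True when task i's step t was judged correct.
    Hypothetical helper for illustrating the metric only.
    """
    # Trajectories where the previous step already failed.
    failed_prev = [t for t in step_ok if not t[step - 1]]
    if not failed_prev:
        return 0.0
    # Fraction of those where the current step fails as well.
    return sum(1 for t in failed_prev if not t[step]) / len(failed_prev)

traj = [
    [False, False, False],  # fails at step 0 and never recovers
    [False, False, True],   # fails at step 0, recovers at step 2
    [True, True, False],    # healthy until the last step
]
print(propagation_rate(traj, 1))  # both step-0 failures propagate -> 1.0
```

Rates near 100%, as in the table above, mean trajectories essentially never recover once a step goes wrong.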

Error Types. Manual analysis of 32 errors from OWL (GPT-4.1):

- Synthesis (16): wrong conclusions despite correct data
- Navigation (15): failed to locate the correct source
- No answer produced (4)
- Technical / tool failures (4)

Geographic Bias. All models score F1 0.0 on Africa-related tasks (8.3% of benchmark). Performance varies sharply by region:

| Region | % of tasks | GPT-4.1 | o3-deep-res. | Gemini-2.5 | Smolagents |
|---|---|---|---|---|---|
| Africa | 8.3 | 0.0 | 0.0 | 0.0 | 0.0 |
| North America | 11.7 | 4.65 | 8.00 | 12.00 | 8.33 |
| South America | 5.0 | 0.0 | 25.00 | 0.0 | 0.0 |
| Asia | 29.2 | 3.36 | 12.70 | 6.50 | 11.88 |
| Europe | 38.3 | 3.45 | 10.83 | 4.91 | 5.28 |
| Oceania | 10.8 | 8.96 | 14.43 | 6.67 | 24.00 |
Planning is the bottleneck. Providing ground-truth intermediate steps (without their answers) boosts GPT-4.1 from 3.46 → 9.36 F1 and Smolagent from 3.75 → 10.50 F1, indicating that current agents lack effective planning rather than reasoning ability.
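The oracle-plan intervention amounts to prepending the gold steps to the task prompt and asking the agent only to execute them. A minimal sketch, assuming a plain-text prompt template (the paper's actual template and step format are not shown here):

```python
def with_gold_plan(task_question: str, gold_steps: list[str]) -> str:
    """Build the oracle-plan prompt variant: the agent receives the
    ground-truth intermediate steps (but not their answers) and only
    has to execute them. Hypothetical prompt format for illustration.
    """
    plan = "\n".join(f"{i}. {s}" for i, s in enumerate(gold_steps, 1))
    return (
        f"Task: {task_question}\n\n"
        f"Follow these intermediate steps exactly:\n{plan}\n\n"
        "Return the final answer as a JSON object."
    )

print(with_gold_plan(
    "Which country had the largest rise in renewable share, 2015-2020?",
    ["Find renewable energy share per country for 2015 and 2020",
     "Compute the change for each country",
     "Return the country with the largest increase"],
))
```

Because only the plan changes between the two conditions, the F1 gain isolates planning quality from the agent's retrieval and reasoning machinery.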

Dataset & Code

DEEPSYNTH is released in two splits: Dev (Lite) for prototyping and Test for evaluation.

📄 Paper

Full paper on OpenReview

OpenReview

💻 Code

Eval scripts & baselines

GitHub

🧪 Dev (Lite)

Quick iteration split

Coming Soon

📊 Test Set

Full 120-task benchmark

Coming Soon

BibTeX

@inproceedings{paul2026deepsynth,
  title     = {{DEEPSYNTH}: A Benchmark for Deep Information Synthesis},
  author    = {Debjit Paul and Daniel Murphy and Milan Gritta and Ronald Cardenas and Victor Prokhorov 
               and Lena Sophia Bolliger and Aysim Toker and Roy Miles and Andreea-Maria Oncescu and
               Jasivan Alex Sivakumar and Philipp Borchert and Ismail Elezi and Meiru Zhang and 
               Ka Yiu Lee and Guchun Zhang and Jun Wang and Gerasimos Lampouras},
  booktitle = {The Fourteenth International Conference on
               Learning Representations (ICLR)},
  year      = {2026},
  url       = {https://openreview.net/forum?id=0Dhpt9aY3n}
}