DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

Jinxiang Meng1,2*, Shaoping Huang1,2*, Fangyu Lei1,2*, Jingyu Guo1, Haoxiang Liu1, Jiahao Su1, Sihan Wang1, Yao Wang2, Enrui Wang1, Ye Yang1, Hongze Chai1, Jinming Lv1, Anbang Yu1, Huangjing Zhang1, Yitong Zhang3, Yiming Huang1, Zeyao Ma4, Shizhu He1, Jun Zhao1, Kang Liu1†

1Institute of Automation, CAS, 2University of Chinese Academy of Sciences, 3NUS, 4Renmin University of China

*Equal Contribution, †Corresponding Author

DV-World benchmarks data visualization agents in realistic settings. It includes 260 tasks across DV-Sheet, DV-Evolution, and DV-Interact, with hybrid evaluation of both values and visual quality.


Main Framework Figure

Abstract

DV-World evaluates data visualization agents in more realistic workflows.

Abstract Summary

Many existing benchmarks focus on simplified chart generation. DV-World tests agents in software-grounded settings that require editing spreadsheets, adapting reference visuals, and clarifying ambiguous requests.

  • DV-Sheet focuses on spreadsheet charting, repair, and dashboard tasks.
  • DV-Evolution tests visual adaptation across data and frameworks.
  • DV-Interact measures clarification and intent alignment.
  • Current leading models still score below 50% overall.

Evaluation Framework

260 tasks across realistic workflows.
3 domains: DV-Sheet, DV-Evolution, DV-Interact.
Native: grounded in real software environments.
Hybrid: combines table-value checks with semantic judgment.
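The hybrid protocol can be sketched as follows. This is an illustrative toy, not the benchmark's actual scoring code: the function names, the 50/50 weighting, and the dict-of-values format are assumptions.

```python
# Hypothetical sketch of hybrid evaluation: exact table-value checks
# blended with a semantic-quality judgment (e.g. from an LLM judge).

def value_score(pred_values, gold_values, tol=1e-6):
    """Fraction of gold data points the prediction reproduces exactly
    (within a numeric tolerance)."""
    if not gold_values:
        return 1.0
    hits = sum(
        1 for key, gold in gold_values.items()
        if key in pred_values and abs(pred_values[key] - gold) <= tol
    )
    return hits / len(gold_values)

def semantic_score(judge_ratings):
    """Average of judge ratings (readability, layout, ...), each in [0, 1]."""
    return sum(judge_ratings) / len(judge_ratings)

def hybrid_score(pred_values, gold_values, judge_ratings, w_value=0.5):
    """Weighted blend of value accuracy and visual-quality judgment."""
    return (w_value * value_score(pred_values, gold_values)
            + (1 - w_value) * semantic_score(judge_ratings))
```

A chart can thus fail on either axis independently: perfect numbers with an unreadable layout, or a clean layout encoding the wrong values.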
Benchmark Construction

DV-World covers three core settings: spreadsheet execution, visual adaptation, and interactive clarification.

DV-Sheet

Native spreadsheet manipulation

Agents work directly in spreadsheets to create, repair, and organize visualizations.

  • Direct spreadsheet editing.
  • Precise chart repair.
  • Dashboard composition.
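The "precise chart repair" requirement can be illustrated with a toy: fix exactly the broken property and leave everything else untouched. The chart-spec schema below is hypothetical, not the benchmark's actual format.

```python
# Illustrative only: a targeted, side-effect-free repair of a spreadsheet
# chart whose data range has gone stale after rows were appended.

def repair_data_range(chart_spec, n_data_rows):
    """Return a copy of the spec whose data range covers all data rows;
    every other property (title, type, ...) is left untouched."""
    fixed = dict(chart_spec)
    rng = dict(chart_spec["data_range"])
    rng["last_row"] = rng["first_row"] + n_data_rows - 1
    fixed["data_range"] = rng
    return fixed

broken = {
    "type": "bar",
    "title": "Sales by Region",
    "data_range": {"col": "B", "first_row": 2, "last_row": 3},  # misses a row
}
fixed = repair_data_range(broken, n_data_rows=3)  # now covers rows 2..4
```

The error atlas below shows what happens when agents fail this discipline: the fix "leaks" into chart properties that should have stayed stable.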
DV-Evolution

Cross-modal logic evolution

Agents adapt reference visuals into correct outputs for new data and new frameworks.

  • Reference-to-code transfer.
  • Cross-framework adaptation.
  • Layout and semantic fidelity.
DV-Interact

Proactive clarification under ambiguity

Agents must ask the right questions before acting when user intent is unclear.

  • User-simulator evaluation.
  • Clarification quality.
  • Intent alignment.
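The user-simulator loop can be sketched as below. This is a minimal stand-in, not DV-Interact's actual harness: the slot-filling model of ambiguity, the agent interface, and the greedy policy are all assumptions.

```python
# Hypothetical user-simulator loop: the agent must resolve every
# ambiguous slot of the user's intent before it is allowed to act.

def simulate(agent, user_answers, required_slots):
    """Run clarification rounds; return (resolved_spec, n_questions)."""
    spec, n_questions = {}, 0
    while True:
        missing = [s for s in required_slots if s not in spec]
        if not missing:
            return spec, n_questions      # intent fully resolved: act
        question = agent(missing)         # agent chooses what to ask
        spec[question] = user_answers[question]
        n_questions += 1

def greedy_agent(missing_slots):
    """Ask about the first unresolved slot (stand-in for a real model)."""
    return missing_slots[0]
```

A real harness would additionally score *which* questions the agent asks and penalize both over-asking and the "interactive avoidance" failure shown in the error atlas, where the agent simply proceeds without asking.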
Leaderboard

Task-specific leaderboard

Each tab shows the main paper-reported score for the selected task.

Error Atlas

Browse error cases by task instead of scanning one long mixed gallery.

Create task cases

Readability, structure, and data accuracy problems inside spreadsheet-native chart creation.

SC-1

Readability collapse

Labels and layout fight each other until the visual becomes difficult to read.

SC-2

Structure and data accuracy

A creation case where both chart structure and encoded values drift away from the intended result.

Fix task cases

Integrity failures, destructive regressions, and diagnostic blindness in targeted chart repair.

SF-1

Integrity and consequence errors

The intended fix leaks side effects into parts of the chart that should remain stable.

SF-2

Destructive regression

The repair spreads damage instead of containing it.

Dashboard task cases

Cases where the layout exists, but the business framing, completeness, or design logic still fails.

SD-1

Missing business insight

The dashboard exists, but does not answer the business question well.

SD-2

Logic and design mismatch

Visual composition and analytical logic are no longer aligned.

Evolution task cases

Wrong-vs-right library transfer examples that show whether visual semantics survive migration.

Python

Wrong vs better transfer

The side-by-side pair makes semantic drift visible immediately.
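The kind of drift these pairs expose can be reduced to a toy example. The data and functions below are made up for illustration: the "wrong" port keeps every value intact yet silently discards the reference chart's descending-sales ordering, a semantic choice, while the "better" port preserves it.

```python
# Illustrative toy of semantic drift during library transfer.

reference = {
    "categories": ["South", "East", "North"],  # sorted by value, descending
    "values":     [150, 120, 95],
}

def wrong_transfer(ref):
    """Values survive, but categories come out in arbitrary
    (alphabetical) order: the sorting semantics are lost."""
    pairs = sorted(zip(ref["categories"], ref["values"]))
    return {"categories": [c for c, _ in pairs],
            "values":     [v for _, v in pairs]}

def better_transfer(ref):
    """Same data, same ordering: the semantics survive the migration."""
    return {"categories": list(ref["categories"]),
            "values":     list(ref["values"])}
```

Value-level checks alone would pass both ports; only a visual or semantic judgment catches the reordering, which is exactly what the hybrid evaluation is for.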

D3.js

D3 transfer

Shows how layout and semantics can break even in expressive libraries.

Interaction task cases

Ambiguity, avoidance, and inquiry failures under user-simulator pressure.

IN-1

Interactive avoidance

The agent proceeds instead of clarifying unresolved ambiguity.

IN-2

Inquiry deficit

The system fails to ask the questions it should have asked before acting.

Citation

BibTeX

The entry below uses anonymous placeholder metadata; it will be updated once the public version of the paper is finalized.

@article{dvworld2026,
  title   = {DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios},
  author  = {Anonymous Authors},
  journal = {arXiv preprint arXiv},
  year    = {2026},
  url     = {./8190_DV_World_Benchmarking_Dat.pdf}
}