Data science benchmarks for AI systems
Some benchmarks that focus on getting insight directly from data using LLMs / LLM agents (they require the models to interact with the data through code). I find these really compelling: the tasks are genuinely hard, useful for real-world applications, and an extensible stepping stone toward accelerating scientific discovery.
ScienceAgentBench (chen…huan sun, 2024) - 102 scientific coding tasks (from 44 papers in 4 disciplines, validated by 9 subject-matter experts)
- target output for every task is a self-contained Python file
- each task has (a) a task instruction, (b) dataset info, (c) expert-provided info, and (d) a ground-truth annotated program
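To make the task structure concrete, here is a minimal sketch of what one such task record might look like. The field names and example values are my own illustration, not the benchmark's actual schema:

```python
from dataclasses import dataclass

# Illustrative sketch: each ScienceAgentBench-style task bundles the four
# components (a)-(d) above; the target output is a self-contained Python file.
@dataclass
class SciCodingTask:
    task_instruction: str   # (a) natural-language description of the analysis
    dataset_info: str       # (b) description of the input data files
    expert_info: str        # (c) expert-provided domain knowledge
    gold_program_path: str  # (d) path to the ground-truth annotated program

# Hypothetical example task (contents invented for illustration).
task = SciCodingTask(
    task_instruction="Fit a growth model to the cell-count data.",
    dataset_info="counts.csv: columns time_h, cell_count",
    expert_info="Counts are approximately log-normally distributed.",
    gold_program_path="gold/fit_growth_model.py",
)
```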
AutoSDT: Scaling Data-Driven Discovery Tasks Toward Open Co-Scientists (li…huan sun, 2025) - 5k scientific coding tasks automatically scraped from github repos for papers (as a sanity check, they manually verified that a subset were reasonable)
DiscoveryBench: Towards Data-Driven Discovery with Large Language Models (majumder…clark, 2024) - 264 tasks collected across 6 diverse domains, such as sociology and engineering, by manually deriving discovery workflows from papers
- each task has datasets, metadata, natural-language discovery goal
BLADE: Benchmarking Language Model Agents for Data-Driven Science (gu…althoff, 2024) - 12 tasks, each with a (fairly open-ended) research question, a dataset, and a ground-truth expert-conducted analysis
MLAgentBench: Benchmarking LLMs As AI Research Agents (huang, vora, liang, & leskovec, 2023) - 13 prediction tasks, e.g. CIFAR-10, BabyLM, kaggle (evaluated via test-set prediction performance)
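Evaluation by test-set prediction performance is simple to sketch: the agent produces predictions for held-out examples, and the harness just scores them against the labels. A minimal accuracy scorer under assumed input shapes (id-to-label dicts; the real harness is task-specific):

```python
# Minimal sketch of scoring a submission against held-out test labels.
# Input format (dicts mapping example id -> label) is an illustrative
# assumption, not the benchmark's actual interface.
def accuracy(preds: dict, labels: dict) -> float:
    """Fraction of test examples where the predicted label matches."""
    correct = sum(preds.get(i) == y for i, y in labels.items())
    return correct / len(labels)

score = accuracy({"a": "cat", "b": "dog"}, {"a": "cat", "b": "bird"})  # 0.5
```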
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis (li…jordan, 2025) - scraped 25 notebooks from recent kaggle competitions, each parsed into a goal plus reference insights that incorporate domain knowledge
- the paper emphasizes the interactive setting: it uses the instruction materials to build a knowledgeable user simulator, then tests a data science agent's ability to help that simulated user improve predictive performance
InfiAgent-DABench: Evaluating Agents on Data Analysis Tasks (hu…wu, 2024) - 257 precise (relatively easy) questions, each answerable from one of 52 csv datasets
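The InfiAgent-DABench task shape is the simplest of the bunch: a precise question whose answer falls out of running a few lines of code against a single csv file. A toy sketch (the data, column name, and question are invented for illustration, not taken from the benchmark):

```python
import csv
import io
import statistics

# Toy stand-in for one of the csv datasets; in the benchmark this would be a
# real file on disk.
csv_text = """age
22
38
26
"""

# Question (hypothetical): "What is the mean age, rounded to 2 decimals?"
rows = list(csv.DictReader(io.StringIO(csv_text)))
answer = round(statistics.mean(float(r["age"]) for r in rows), 2)  # 28.67
```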