ProgramBench: Can Language Models Rebuild Programs from Scratch?

hackernews May 07, 2026

arXiv:2605.03546 (cs)

[Submitted on 5 May 2026]

Title: ProgramBench: Can Language Models Rebuild Programs From Scratch?

Authors: John Yang, Kilian Lieret, Jeffrey Ma, Parth Thakkar, Dmitrii Pedchenko, Sten Sootla, Emily McMilin, Pengcheng Yin, Rui Hou, Gabriel Synnaeve, Diyi Yang, Ofir Press

Abstract: Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable's behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.
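
To make the evaluation idea concrete, the following is a minimal sketch of an end-to-end behavioral comparison and of the "fraction of tasks at or above a pass-rate threshold" statistic quoted in the abstract. It is not the authors' harness, which this page does not describe. The function names (`run`, `pass_rate`, `fraction_of_tasks_at_or_above`) are hypothetical, and treating a test as passed when stdout and exit code match the reference is an assumption made for illustration.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Externally visible behavior of one run (assumed: stdout and exit code only)."""
    stdout: bytes
    exit_code: int


def run(cmd: list[str], stdin: bytes = b"", timeout: float = 10.0) -> Observation:
    """Run a command with the given stdin and record what it printed and returned."""
    proc = subprocess.run(cmd, input=stdin, capture_output=True, timeout=timeout)
    return Observation(proc.stdout, proc.returncode)


def pass_rate(
    reference: list[str],
    candidate: list[str],
    cases: list[tuple[list[str], bytes]],
) -> float:
    """Fraction of (args, stdin) cases where the candidate behaves like the reference."""
    passed = sum(
        run(reference + args, stdin) == run(candidate + args, stdin)
        for args, stdin in cases
    )
    return passed / len(cases)


def fraction_of_tasks_at_or_above(rates: list[float], threshold: float = 0.95) -> float:
    """Share of tasks whose per-task pass rate reaches the threshold."""
    return sum(r >= threshold for r in rates) / len(rates)


if __name__ == "__main__":
    # Toy stand-ins for a reference executable and a rebuilt candidate.
    reference = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
    candidate = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
    cases = [(["ABC"], b""), (["abc"], b"")]
    print(pass_rate(reference, candidate, cases))
```

Because the comparison only inspects observable behavior, it places no constraint on the candidate's language, file layout, or architecture, which is the property the abstract attributes to behavioral testing.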

Subjects:	Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.03546 [cs.SE]
(or arXiv:2605.03546v1 [cs.SE] for this version)
https://doi.org/10.48550/arXiv.2605.03546

Submission history

From: John Yang B
[v1] Tue, 5 May 2026 09:17:02 UTC (1,752 KB)

Source: hackernews
