Abstract: AI coding agents powered by large language models can read codebases and produce functional code, but they routinely violate team-specific product decisions that are invisible in the source code alone. We introduce a controlled benchmark measuring decision compliance, the rate at which an AI coding agent follows established product, design, and engineering decisions, across 8 realistic software engineering tasks containing 41 weighted decision points. We compare a baseline configuration (Claude Code with codebase access only) against an augmented configuration that adds Brief, a product-context retrieval system providing spec generation, mid-build consultation, and retrieval of recorded decisions, persona pain points, customer signals, and competitive intelligence. On identical prompts and the same repository, the augmented configuration achieves 95% decision compliance versus 46% for the baseline, a 49 percentage point improvement. Per-decision analysis reveals that the baseline achieves 100% compliance on decisions visible in the codebase and 0-33% on decisions requiring product context, suggesting that product-context retrieval is a key driver of the improvement. We release the benchmark repository, all 16 pull requests, and the scoring harness for independent reproduction.
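The abstract does not give the scoring formula, so the following is only a minimal sketch of one plausible reading of "decision compliance" over weighted decision points: the weight of the decisions an agent followed divided by the total weight. The `DecisionPoint` type, the function names, and the sample weights are all hypothetical and are not taken from the paper or its scoring harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionPoint:
    """One scored decision (hypothetical structure, not the paper's schema)."""
    name: str
    weight: float
    followed: bool


def decision_compliance(points):
    """Weighted share of decisions the agent followed, in [0, 1].

    Assumption: compliance = followed weight / total weight.
    """
    total = sum(p.weight for p in points)
    if total <= 0:
        raise ValueError("total weight must be positive")
    followed = sum(p.weight for p in points if p.followed)
    return followed / total


def percentage_point_gap(augmented, baseline):
    """Difference between two rates, expressed in percentage points."""
    return round((augmented - baseline) * 100, 1)


# Made-up illustration data, not results from the benchmark.
sample = [
    DecisionPoint("visible-in-code convention", 3.0, True),
    DecisionPoint("recorded product decision", 2.0, False),
    DecisionPoint("design guideline", 1.0, True),
]

sample_rate = decision_compliance(sample)
reported_gap = percentage_point_gap(0.95, 0.46)
```

Under this reading, the reported 95% and 46% rates differ by 49 percentage points, which is an absolute difference between rates and not a relative (percent) increase over the baseline.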
| Comments: | 16 pages, 3 figures, 16 tables. Benchmark repository: this https URL |
| Subjects: | Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Computational Engineering, Finance, and Science (cs.CE); Machine Learning (cs.LG); Logic in Computer Science (cs.LO) |
| ACM classes: | D.2.1; I.2.2; D.2.5 |
| Cite as: | arXiv:2605.08112 [cs.SE] (or arXiv:2605.08112v1 [cs.SE] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2605.08112 (arXiv-issued DOI via DataCite) |
Submission history
From: Kasyap Varanasi
[v1] Mon, 27 Apr 2026 20:38:55 UTC (23 KB)
