Abstract: Drug-information question answering is a high-stakes setting where hallucinated facts can mislead clinical decision-making and the provenance of each cited fact matters as much as the fact itself. We present DrugClaw, a multi-agent retrieval-augmented system that queries a registry of drug and pharmacovigilance skills via a reflection-driven state-machine workflow and returns answers grounded in primary regulatory or peer-reviewed records. We also contribute DrugAudit, a 3,772-item authority-aware benchmark with an evaluation panel that scores upstream-of-gold source match, token-level semantic snippet overlap, and citation faithfulness under a dual-judge LLM-as-judge protocol with inter-judge kappa = 0.88 (almost-perfect agreement). Across DrugAudit plus drug-related subsets of MedQA (751) and PubMedQA (512), DrugClaw is top-1 on every column of the headline table: composite Evidence Index under both judges, judge-mediated answer correctness, primary-source rate (0.918, +10.1 pp over next-best), faithfulness (0.887, +5.9 pp), MedQA (0.920), and PubMedQA (0.693).
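To make the inter-judge agreement figure concrete, the following is a minimal sketch of how a two-rater kappa could be computed. It assumes the reported statistic is Cohen's kappa over categorical verdicts, which the abstract does not state; the function name `cohen_kappa` and the toy verdict lists are hypothetical and are not taken from DrugAudit.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: verdicts are categorical (e.g. "correct" / "incorrect").
    This is an illustrative sketch, not the paper's evaluation code.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")

    n = len(labels_a)

    # Observed agreement: fraction of items where both judges agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over categories of the product of each
    # judge's marginal label frequency.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    # Degenerate case: both judges always use the same single label.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Hypothetical toy verdicts from two judges on ten items.
judge_a = ["correct"] * 5 + ["incorrect"] * 5
judge_b = ["correct"] * 4 + ["incorrect"] * 6
kappa = cohen_kappa(judge_a, judge_b)
```

In this toy case the judges agree on 9 of 10 items and chance agreement is 0.5, so kappa is about 0.8. On the commonly used Landis and Koch scale, values above 0.80 are labelled "almost perfect", which is consistent with the label attached to the reported 0.88.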
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2606.01434 [cs.CL] |
| (or arXiv:2606.01434v1 [cs.CL] for this version) | |
| https://doi.org/10.48550/arXiv.2606.01434 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: Qing Wang
[v1]
Sun, 31 May 2026 20:11:05 UTC (1,959 KB)
