Abstract: Across self-consistency samples from an LLM, vote agreement tracks instance difficulty: on SemEval-2026 Task 4 (Narrative Story Similarity), supermajority cases (>= 7/8 votes) resolve at 85 percent accuracy, split votes at 67 percent, and perfect ties at 61 percent, a monotone gradient that holds across the development set. We exploit this in CascadeMind, which routes eight Gemini 2.5 Flash votes by consensus, escalates split votes to additional sampling rounds, and falls through to a symbolic ensemble of theory-inspired narrative signals only on perfect ties (5 percent of cases). The system reached 72.75 percent on Track A test, placing 10th of 44 teams. Ablations show that the symbolic component contributes negligibly end-to-end and that nearly all gains come from confidence-aware routing. The takeaway is methodological: for narrative similarity, calibrating when to spend more compute on a hard instance matters more than adding auxiliary representations to reason about it.
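The routing described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `route_by_consensus`, the callables `sample_votes` and `tie_breaker`, the binary labels "A"/"B", the pooling of escalation votes with the first round, and the `max_extra_rounds` default are all assumptions made for illustration. Only the eight-vote round, the >= 7/8 supermajority threshold, escalation of split votes, and the fallback on perfect ties come from the abstract.

```python
from collections import Counter
from typing import Callable, List, Tuple


def route_by_consensus(
    sample_votes: Callable[[int], List[str]],  # hypothetical: draws n LLM votes
    tie_breaker: Callable[[], str],            # hypothetical: symbolic fallback
    n_votes: int = 8,                          # votes per round (from the abstract)
    supermajority: int = 7,                    # >= 7 of 8 (from the abstract)
    max_extra_rounds: int = 1,                 # assumed escalation budget
) -> Tuple[str, str]:
    """Return (label, route), where route names the branch that decided."""
    votes = list(sample_votes(n_votes))
    extra_rounds = 0
    while True:
        ranked = Counter(votes).most_common()
        top_label, top_count = ranked[0]
        runner_up_count = ranked[1][1] if len(ranked) > 1 else 0

        # Perfect tie: defer to the fallback component.
        if top_count == runner_up_count:
            return tie_breaker(), "tie_fallback"

        # Supermajority, compared as integers to avoid float rounding:
        # top_count / len(votes) >= supermajority / n_votes.
        if top_count * n_votes >= supermajority * len(votes):
            return top_label, "consensus"

        # Split vote with no budget left: accept the pooled majority.
        if extra_rounds == max_extra_rounds:
            return top_label, "escalated_majority"

        # Split vote: escalate by pooling another round of samples.
        votes.extend(sample_votes(n_votes))
        extra_rounds += 1
```

In this sketch a 7-1 first round returns immediately, a 4-4 round goes straight to the fallback, and a 5-3 round triggers one more batch of samples before the pooled majority is accepted. Whether the actual system pools votes across rounds or re-votes from scratch is not stated in the abstract.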
| Comments: | 7 pages, 2 figures, 5 tables. Accepted paper for SemEval-2026 Task 4 at ACL. Code: this https URL |
| Subjects: | Computation and Language (cs.CL) |
| ACM classes: | I.2.7 |
| Cite as: | arXiv:2601.19931 [cs.CL] (or arXiv:2601.19931v3 [cs.CL] for this version) |
| DOI: | https://doi.org/10.48550/arXiv.2601.19931 (arXiv-issued DOI via DataCite) |
Submission history
From: Sebastien Kawada
[v1] Mon, 12 Jan 2026 00:30:38 UTC (27 KB)
[v2] Mon, 2 Mar 2026 06:18:00 UTC (27 KB)
[v3] Wed, 13 May 2026 02:52:26 UTC (31 KB)
