Abstract: Reference-free evaluation of large language model (LLM) creativity relies on perplexity, entropy, and top-1 margin. We show that a much stronger signal lives one step earlier in the pipeline: in how sampling temperature \emph{reshapes} the model's token distribution before the next token is drawn. On Llama-3.1-8B-Instruct generations of 500 open-ended creative prompts at $T \in \{0.3, 0.8, 1.5\}$, a single per-token feature derived from this reshaping predicts the within-prompt creativity rank at Spearman $\rho{=}0.918$ against an averaged gpt-4o\,/\,gemini-2.5-pro judge ($n{=}500$) and $\rho{=}0.870$ against a three-rater human-majority ranking ($n{=}150$). Each of four standard reference-free baselines (self-perplexity, mean predictive entropy, top-1 margin, gzip compression ratio) tops out at $|\rho|\!\approx\!0.76$ on both ground truths: a gap of $+0.165$ on averaged-LLM and $+0.110$ on human-majority, both far larger than the spread among the baselines themselves. The two ground-truth panels agree with each other at $\rho{=}0.83$, above the inter-human ceiling of $\rho{=}0.77$, so the comparison is not bottlenecked by judge noise. Mechanistically, the win comes from a sharp distributional signature of the incoherence regime: at $T{=}1.5$ the cumulative-mass width $n_{95}(q)$ inflates from ${\sim}\!1$ to ${\sim}\!131$ tokens, and post-temperature mass leaks off the pre-temperature top-$90\%$ plausible set by about $13$ percentage points. The per-token aggregates do not separate $T{=}0.8$ from $T{=}0.3$; discriminating the two coherent regimes is left to sequence-level features.
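The two distributional quantities named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that "pre-temperature" means the $T{=}1$ softmax $p$, that $q$ is the softmax of the same logits divided by $T$, and that the "plausible set" is the smallest set of top-$p$ tokens covering $90\%$ of the mass. The function names `softmax`, `mass_width`, and `leaked_mass` and the toy logits are hypothetical, and the single per-token feature used in the paper is not reproduced here.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mass_width(probs, mass=0.95):
    """Smallest number of top tokens whose cumulative probability reaches `mass`."""
    cum = 0.0
    for count, p in enumerate(sorted(probs, reverse=True), start=1):
        cum += p
        if cum >= mass:
            return count
    return len(probs)  # fallback if rounding keeps the total just below `mass`

def leaked_mass(logits, temperature, plausible=0.90):
    """Post-temperature mass falling outside the pre-temperature plausible set.

    Assumption: the plausible set is the smallest set of tokens, ranked by
    the T=1 distribution p, whose p-mass reaches `plausible`.
    """
    p = softmax(logits, 1.0)           # pre-temperature distribution (assumed T=1)
    q = softmax(logits, temperature)   # post-temperature distribution
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    k = mass_width(p, plausible)
    return 1.0 - sum(q[i] for i in order[:k])

# Toy logits for one decoding step (hypothetical values).
toy_logits = [5.0, 2.0, 1.0, 0.0, -1.0]
for T in (0.3, 0.8, 1.5):
    q = softmax(toy_logits, T)
    print(T, mass_width(q, 0.95), round(leaked_mass(toy_logits, T), 4))
```

On real model outputs, these per-token values would be averaged over the generated sequence; how the paper aggregates them is not stated in the abstract.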
| Comments: | Submitted to NGEN-AI 2026 |
| Subjects: | Computation and Language (cs.CL) |
| Cite as: | arXiv:2606.01451 [cs.CL] |
| | (or arXiv:2606.01451v1 [cs.CL] for this version) |
| | https://doi.org/10.48550/arXiv.2606.01451 arXiv-issued DOI via DataCite (pending registration) |
Submission history
From: V Subrahmanya Raghu Ram Kishore Parupudi
[v1] Sun, 31 May 2026 21:13:47 UTC (40 KB)
