Abstract: The rapid advancement of large language models (LLMs) has made machine-generated text increasingly difficult to distinguish from human-written text. While recent studies explore leveraging internal representations of language models to uncover deeper detection signals, these raw features often exhibit substantial overlap between classes, limiting their discriminative power. To address this challenge, we propose Steer-to-Detect (\texttt{S2D}), a two-stage framework for detecting LLM-generated text. In the first stage, \texttt{S2D} learns a steering vector that is injected into the hidden states of a frozen observer LLM, producing representations with improved class separability. In the second stage, detection is performed via a hypothesis testing procedure based on the steered representations. We establish finite-sample, high-probability guarantees for Type I and Type II errors, providing a theoretical characterization of the procedure. Empirically, \texttt{S2D} achieves strong and consistent performance across a range of settings, including out-of-distribution scenarios and adversarial perturbations.
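To make the two-stage idea concrete, the following is a minimal toy sketch, not the authors' implementation. The abstract does not specify the training objective, the injection layer, or the test statistic, so everything here is an assumption for illustration: the "observer" is a single frozen random layer `W`, the steering vector `v` is a random placeholder rather than a learned one, the readout direction `u` is hypothetical, and the test simply thresholds a scalar score at an empirical quantile of scores from human-written (null) samples.

```python
import numpy as np

def steered_features(h, v, W):
    """Inject steering vector v into hidden states h, then apply a frozen toy layer W."""
    return np.tanh((h + v) @ W)

def calibrate_threshold(null_scores, level=0.05):
    """Empirical (1 - level) quantile of scores computed on human-written (null) samples."""
    return float(np.quantile(null_scores, 1.0 - level))

def detect(scores, threshold):
    """Flag a sample as LLM-generated when its score exceeds the threshold."""
    return scores > threshold

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d)) / np.sqrt(d)  # frozen toy "observer" weights (assumption)
v = rng.normal(size=d)                    # placeholder for a learned steering vector
u = rng.normal(size=d)                    # hypothetical readout direction

human_h = rng.normal(size=(200, d))          # synthetic hidden states, human class
test_h = rng.normal(loc=0.5, size=(50, d))   # synthetic hidden states to be tested

human_scores = steered_features(human_h, v, W) @ u
tau = calibrate_threshold(human_scores, level=0.05)
flags = detect(steered_features(test_h, v, W) @ u, tau)
```

In this sketch the steering vector is added before a nonlinearity, so it can change how the two classes are spread out rather than merely shifting both by the same amount. Calibrating the threshold on null samples keeps the false-positive rate on those samples at or below the chosen level; the paper's finite-sample guarantees concern its own procedure, not this toy.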
Subjects: Applications (stat.AP); Machine Learning (cs.LG)
Cite as: arXiv:2605.12890 [stat.AP] (or arXiv:2605.12890v1 [stat.AP] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.12890 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Luxu Liang
[v1] Wed, 13 May 2026 02:14:21 UTC (2,565 KB)
