Text Degeneration: A Production Failure Mode That Most Benchmarks Do Not Track

A self-reinforcing failure mode of autoregressive language models, with measurable consequences for inference cost and throughput, and a structural fix grounded in the training distribution.

In our recent work specializing a small language model for domain-specific OCR — detailed in the DharmaOCR paper and available on HuggingFace (with demo space) — we show that in real-world PDF document OCR scenarios, fewer than three percent of pages can consume nearly half of the total wall-clock time.

The requests responsible were the ones that had hit the configured maximum-token limit and exhibited an n-gram repetition pattern at their tail. They had not produced a complete output. They had never emitted an end-of-sequence token; instead they had repeated a fragment and kept repeating it until the system’s hard limit cut them off.

Start and end time of each request (in submission order). Each request is represented by a bar whose left edge marks start time and whose right edge marks end time. Degenerate requests are highlighted in red.

We re-ran the experiment on a second dataset, and a third. The same shape appeared, with varying intensity. A small minority of requests was responsible for a measurable share of the total wall-clock time of the batches they were in. This phenomenon is called Text Degeneration.

The Anomaly in the Inference Log

What we were measuring was not noise. Text degeneration is a known phenomenon, described in the language-modeling literature since Holtzman and colleagues’ 2020 paper and characterized in subsequent work as a self-reinforcing failure of autoregressive generation.

The shape was always the same. A small number of requests would enter a generation loop. The model would repeat a token, then a fragment, then the same token again, until the system’s max-tokens guard cut it off.

Pictorial example of token and sequence-level text degeneration, in which a single token (or token sequence) dominates the conditional distribution, producing repetitions indefinitely.

The reason is structural. A healthy request ends when the model emits an EOS token — the model’s signal that the output is complete. A degenerate request never reaches that signal. It loops, filling its allocated context with repeated tokens or sentences until the hard max-tokens limit forcibly terminates it.

The difference in output length between the two is not marginal. And because inference time scales directly with the number of tokens generated, a degenerate request occupies the GPU for a multiple of the time a well-formed request of comparable input would.

Source: Dong et al, 2026.

The instinct, watching one of these requests in real time, is to treat it as a tuning problem. Raise the repetition penalty. Lower the temperature. Switch the decoder. Add a streaming check that aborts a request once it begins to repeat. These instincts are reasonable, and they all help. They do not address the cause.

The cause is older than any of these decoders, and it is built into the optimization objective that produced the model in the first place.

Why Degeneration Is Structural, Not Configurable

A language model trained with maximum-likelihood — which is to say, almost every model in production today — is trained on a single, narrow imperative: given everything that has come before, assign high probability to whatever came next. Minimize the negative log-likelihood of the reference sequence, token by token, across the entire corpus.

Because the model is autoregressive, it never sees the full sequence it will eventually produce: it only ever predicts one token at a time, conditioned on what precedes it. The objective does not care what the model generates as a whole. It cares only that, at each step, the model assigns high probability to the next token in the reference corpus.
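Written out, the objective described above is the standard token-level negative log-likelihood. The notation is ours rather than the paper's: \(x_t\) is the t-th token of a reference sequence \(x\), \(x_{<t}\) is the reference prefix before it, \(\theta\) are the model parameters, and \(\mathcal{D}\) is the training corpus.

```latex
\mathcal{L}(\theta) \;=\; -\sum_{x \in \mathcal{D}} \; \sum_{t=1}^{|x|} \log p_\theta\!\left(x_t \mid x_{<t}\right)
```

Every term conditions on the reference prefix, never on text the model generated itself, so nothing in this loss directly penalizes a sequence-level property such as a loop.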

This produces models that are extraordinarily good at continuation. It also produces a side effect that has been documented in the literature for years and that, despite a steady record of inference-time mitigations, remains structurally unresolved.

The effect was first formalized by Holtzman and colleagues in 2020 and can be stated, in the form most useful here, as a self-reinforcement of the conditional distribution:

The more often a token or a fragment has appeared in recent context, the more probable it becomes on the next step. Once the model enters such a region, the gradient of probability points back into it, not out of it. The end-of-sequence token, which would normally close the generation, sits at a vanishingly low probability relative to the repeated fragment. The loop sustains itself until something external — a max-tokens cap, a streaming abort, an exhausted KV cache — finally interrupts it.
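The self-reinforcement can be made concrete with a deliberately artificial toy. The sketch below is not a language model: it assumes, by construction, that each recent occurrence of a token adds a fixed `boost` to that token's logit, and then shows what that assumption does to the end-of-sequence probability. The vocabulary, `boost`, and `window` values are all hypothetical.

```python
import math


def softmax(logits):
    """Convert a {token: logit} mapping into a {token: probability} mapping."""
    peak = max(logits.values())
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def next_token_probs(context, base_logits, boost=1.5, window=8):
    """Toy conditional distribution (an assumption, not a real model):
    every occurrence of a token in the last `window` positions adds
    `boost` to that token's logit."""
    recent = context[-window:]
    logits = {
        tok: base + boost * recent.count(tok)
        for tok, base in base_logits.items()
    }
    return softmax(logits)


# Hypothetical three-token vocabulary in which <eos> starts out as the favourite.
BASE = {"<eos>": 1.0, "a": 0.0, "b": 0.0}


def eos_prob_after_repeats(n):
    """Probability of <eos> after the token 'a' has been emitted n times in a row."""
    return next_token_probs(["a"] * n, BASE)["<eos>"]


for n in (0, 1, 2, 4, 8):
    print(n, f"{eos_prob_after_repeats(n):.6f}")
```

In this toy, the end-of-sequence token is the most likely continuation before any repetition, and its probability shrinks with every additional repeat. That is the qualitative shape of the loop described above, under an assumption far simpler than a real model's behaviour.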

This is what makes degeneration structural. The loop is not a defect of the decoding strategy. It is a high-probability region of the distribution itself, produced by the training objective, reinforced by repetitive patterns in the empirical training data, and embedded in the geometry of the model’s internal activations — a description supported in successive analyses since 2020 (Source: Holtzman et al, 2020).

Decoding strategies — temperature, top-p, repetition penalties, beam search variants — operate on top of that distribution. They can make the loop less likely to be entered. They cannot remove it. The same is true of specialized models and general-purpose models alike: each inherits this geometry from the optimization that produced it.

This is the part of the problem that has been discussed in research papers. What is much less discussed — and what we addressed directly in our recent work — is what happens to the rest of the system while one of these loops is running.

The cost of a degenerate request is not contained inside the request. That the output is broken is obvious: no one wants a model that has locked itself into printing the same fragment until the hard limit cuts it off. The problem that has received far less attention is what that loop costs the system around it.

The Cost Multiplier Hiding in Plain Sight

We replaced the degenerate requests with synthetic requests of average duration in our experiment, a simple way to estimate the cost the loops had imposed. Total inference time fell from 7.3 minutes to 4.2 minutes, a reduction of 42.47%. Put differently, a small minority of degenerate requests accounted for 42.47% of the batch’s wall-clock time, which made the batch take about 74% longer than it would have without them.
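The arithmetic behind these figures can be checked directly from the two rounded durations. The sketch below separates two readings of the same 3.1-minute gap: its share of the observed 7.3 minutes, and its size relative to the 4.2-minute counterfactual. The function name is ours.

```python
def time_share_and_inflation(observed_minutes, counterfactual_minutes):
    """Return (share of observed time lost, inflation over the counterfactual)."""
    excess = observed_minutes - counterfactual_minutes
    share_of_observed = excess / observed_minutes
    inflation_over_baseline = excess / counterfactual_minutes
    return share_of_observed, inflation_over_baseline


share, inflation = time_share_and_inflation(7.3, 4.2)
print(f"share of observed wall-clock time: {share:.2%}")    # 42.47%
print(f"inflation over the counterfactual: {inflation:.2%}")  # 73.81%
```

The 42.47% figure is the first quantity. Because the inputs are rounded to one decimal place, the second figure should be read as approximate.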

This is not a story about the failed request. The failed request’s runtime was, in some sense, secondary. What mattered was that during the time the loop was alive, every other request running on the same GPU paid for it.

We measured this directly. Across three datasets, the duration distribution of healthy requests shifted whenever a degenerate request was active in parallel. The mean duration of a healthy request rose by at least 15%, and in one dataset by more than 71%, when at least one degenerate sequence was occupying the same machine. The healthy requests had not become more difficult. The system serving them had become measurably slower.

Distribution of healthy-request durations for the three datasets, contrasting periods with at least one degenerate request running in parallel versus periods with no degenerate request running. The mean of each distribution is marked in black with an “x” inside each box, with its value next to it.
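A comparison of this kind can be computed from per-request start and end timestamps. The sketch below is a simplified stand-in for the paper's analysis, under one assumption of ours: a healthy request counts as "exposed" if its interval overlaps any degenerate request's interval. The `Request` type and the sample log are synthetic.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Request:
    start: float       # seconds since the batch began
    end: float         # seconds since the batch began
    degenerate: bool   # True if the request was flagged as a degeneration loop

    @property
    def duration(self):
        return self.end - self.start


def overlaps(a, b):
    """True if the two requests were running at the same time at any point."""
    return a.start < b.end and b.start < a.end


def healthy_means(requests):
    """Mean duration of healthy requests, split by exposure to a degenerate one."""
    bad = [r for r in requests if r.degenerate]
    healthy = [r for r in requests if not r.degenerate]
    exposed = [r.duration for r in healthy if any(overlaps(r, d) for d in bad)]
    clean = [r.duration for r in healthy if not any(overlaps(r, d) for d in bad)]
    return (mean(exposed) if exposed else None, mean(clean) if clean else None)


# Synthetic log: two healthy requests finish before the loop starts,
# two healthy requests run while it is active.
log = [
    Request(0.0, 10.0, False),
    Request(10.0, 20.0, False),
    Request(30.0, 90.0, True),
    Request(35.0, 50.0, False),
    Request(40.0, 55.0, False),
]
exposed_mean, clean_mean = healthy_means(log)
print(exposed_mean, clean_mean)
```

A per-request overlap test is cruder than splitting time into periods with and without an active loop, but it needs only the timestamps an inference server already logs.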

The mechanism is mundane and material. Modern inference servers (vLLM in our experiments) extract throughput by holding many requests in a dynamic batch and serving them in parallel through paged memory. The amount of memory occupied by a sequence grows roughly linearly with the number of tokens it has produced. When a sequence enters a degeneration loop and approaches the configured token cap, it occupies a disproportionate share of the available memory, for a disproportionately long time. The scheduler has less room to admit new sequences to the batch. Parallelism falls. Throughput across the batch falls with it (Source: Kwon et al, 2026).
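The linear growth can be sketched with the usual back-of-the-envelope formula for key-value cache size in a transformer decoder. The dimensions and token counts below are illustrative placeholders chosen by us, not the configuration of any model or run in the paper, and the formula ignores allocator details such as page granularity.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache footprint of one sequence.

    The factor 2 accounts for storing both a key and a value vector
    per token, per layer, per KV head. `bytes_per_value=2` assumes
    16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


# Hypothetical decoder dimensions (assumptions for illustration only).
DIMS = dict(num_layers=28, num_kv_heads=4, head_dim=128)

healthy = kv_cache_bytes(1_000, **DIMS)   # a request that stops at EOS
looping = kv_cache_bytes(8_192, **DIMS)   # a request that runs to a token cap

print(f"healthy: {healthy / 2**20:.1f} MiB")
print(f"looping: {looping / 2**20:.1f} MiB")
print(f"ratio:   {looping / healthy:.2f}x")
```

Under these assumed numbers the looping sequence holds about eight times the memory of the healthy one, and it holds it until the cap is reached.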

Which means the cost of a single degenerate request is not paid by the request. It is paid by the queue.

That cost is measurable. The degeneration rate of the model used in this experiment (Qwen2.5-VL-7B-Instruct vanilla) was 2.42% on this benchmark. A failure rate that small, on a workload of any meaningful volume, was sufficient to account for 42.47% of wall-clock time on the most affected dataset.

This raises a question about evaluation. If degeneration is a structural property of the training objective, and if its production cost is large, sustained, and contagious — why does it not appear on the standard benchmarks used to compare these models?

It does not appear on any benchmark we are aware of — OCR-specific or otherwise. The metric is absent from every standard evaluation suite used to compare these models. The omission is worth taking seriously.

The Benchmark Blind Spot

The explanation is likely straightforward: benchmark designers focus on measuring output quality, and standard evaluations tend to capture average response quality rather than pathological edge cases. Failure modes fall outside that frame. But in the case of text degeneration, that omission has real consequences — even occurring in fewer than 3% of requests, its impact on system throughput is disproportionate enough to matter.

A consequence visible in our results is that two models can produce nearly identical quality scores while differing substantially in degeneration rate, and therefore in production cost. In our Table 1, several pairs of fine-tuned models illustrate this. A model with a marginally higher quality score is not necessarily the better model to deploy. The benchmark cannot tell which is which. It was not designed to.

Results of the models evaluated in the DharmaOCR paper.

The argument we make explicitly in the DharmaOCR paper (Source: Cardoso et al, 2026) is that this is a methodological gap. Studies that propose models and benchmarks for autoregressive generation should track degeneration rate as a first-class metric, alongside accuracy and cost. The omission is structural. The consequences are operational and economic.

A reasonable response to all of this is that the benchmark omission may not matter, because the failure mode itself is solvable at the inference layer. Detect repetition early. Abort the request. Retry. Route to a fallback. The system, the argument goes, can be made resilient even if the benchmark cannot be made complete.

It is a reasonable response. It is also, by the paper’s evidence, partial.

Why Mitigation Is Itself a Tax

The mitigations described in the literature operate at the inference layer. Real-time repetition detection screens for loops as they form. Retry mechanisms reissue affected requests, sometimes against a different model or decoding configuration. Both are real interventions, and both reduce the visible footprint of degeneration in production.

Both also have a cost. Real-time detection runs on every output, not only the ones that fail — it is an online monitoring mechanism running alongside inference, which carries its own latency and compute overhead. Retries multiply the inference cost of the requests they handle. And degeneration does not always manifest as simple, easily recognizable repetition patterns: heuristics broad enough to catch pathological loops will also penalize legitimate outputs that contain natural repetition. The mitigations contain the worst-case behaviour of any single request. They do not reduce the compute already spent on requests that have entered the loop, and they do not address the source of the loop.
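A minimal sketch of such an online check makes both costs visible. The class below is our own illustration, not the detector used in the paper or a feature of any inference server: it flags a stream once its tail consists of one block of at most `max_period` tokens repeated `min_repeats` times. Both thresholds are hypothetical.

```python
from collections import deque


class StreamingLoopGuard:
    """Flags a token stream whose tail is a short block repeated back to back."""

    def __init__(self, max_period=16, min_repeats=4):
        self.max_period = max_period
        self.min_repeats = min_repeats
        # Only the most recent tokens are needed to test every candidate period.
        self.buf = deque(maxlen=max_period * min_repeats)

    def push(self, token_id):
        """Record one generated token; return True if the request should be aborted."""
        self.buf.append(token_id)
        tail = list(self.buf)
        for period in range(1, self.max_period + 1):
            span = period * self.min_repeats
            if len(tail) < span:
                break  # longer periods need even more tokens
            window = tail[-span:]
            # The window is periodic if every token equals the one `period` slots into the first block.
            if all(window[i] == window[i % period] for i in range(span)):
                return True
        return False


def tokens_until_abort(stream):
    """Number of tokens consumed before the guard fires, or None if it never does."""
    guard = StreamingLoopGuard()
    for count, token_id in enumerate(stream, start=1):
        if guard.push(token_id):
            return count
    return None


print(tokens_until_abort([1, 2, 3, 4, 5] + [7, 8] * 10))  # loop of period 2
print(tokens_until_abort(list(range(50))))                # no repetition
print(tokens_until_abort([0, 0, 0, 0]))                   # e.g. a row of identical cells
```

The check runs on every token of every request, healthy or not. The last call also shows the false-positive problem: four identical tokens in a row trigger the guard even though a table row of repeated values can be a perfectly legitimate output.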

The inference-layer fix is partial. Total throughput, measured under realistic load, remains depressed by the contagion effect even when individual requests are aborted on detection. A more complete fix has to be applied earlier, in the model itself.

The Specialization–Stability Link

If the failure is structural, the fix has to be structural too — which means it cannot live in the decoder. It has to live in the distribution itself.

The intervention we evaluated was a two-stage training pipeline. The first stage was supervised fine-tuning on domain-aligned examples — the canonical move when adapting a general-purpose model to a narrower task. SFT pulls the model toward the target distribution. It is necessary, and on its own, it is not sufficient. Even after SFT, the high-probability loop regions inherited from the pretraining objective remained present in the models we tested. The models were better at the task. They still occasionally fell into the same wells.

The second stage was Direct Preference Optimization, applied to a curated set of preference pairs. DPO is usually framed as an alignment technique for chat — training a model to prefer better responses over worse ones in a curated set of pairs (Source: Rafailov et al, 2023). We used it differently. We constructed pairs in which the rejected example was a degenerate generation drawn from the same model, and the chosen example was a healthy one. The training signal pushed the model away from the geometry of its own failure mode.
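The pair construction can be sketched as follows. This is an illustration under our own assumptions, not the paper's pipeline: it assumes several sampled completions per prompt, takes the chosen example from the healthy samples for the same prompt, and uses a toy word-level predicate to decide what counts as degenerate. The `prompt` / `chosen` / `rejected` field names follow the convention commonly used by DPO tooling.

```python
def toy_is_degenerate(text, tail_words=12, max_distinct=2):
    """Hypothetical predicate: the last `tail_words` words use at most `max_distinct` distinct words."""
    tail = text.split()[-tail_words:]
    return len(tail) >= tail_words and len(set(tail)) <= max_distinct


def build_preference_pairs(samples, is_degenerate=toy_is_degenerate):
    """Turn (prompt, completion) samples into DPO-style preference pairs.

    A pair is emitted only for prompts that have at least one healthy and
    at least one degenerate completion; the degenerate one is always rejected.
    """
    by_prompt = {}
    for prompt, completion in samples:
        bucket = by_prompt.setdefault(prompt, {"healthy": [], "degenerate": []})
        key = "degenerate" if is_degenerate(completion) else "healthy"
        bucket[key].append(completion)

    pairs = []
    for prompt, bucket in by_prompt.items():
        if not bucket["healthy"]:
            continue  # no healthy completion to prefer for this prompt
        for rejected in bucket["degenerate"]:
            pairs.append(
                {"prompt": prompt, "chosen": bucket["healthy"][0], "rejected": rejected}
            )
    return pairs


# Synthetic samples: page-1 has one healthy and one looping completion.
samples = [
    ("page-1", "Invoice total: 42.00"),
    ("page-1", "Invoice " + "total " * 20),
    ("page-2", "Name: Ada"),
]
pairs = build_preference_pairs(samples)
print(len(pairs), pairs[0]["chosen"])
```

Prompts with no observed failure contribute no pairs in this sketch, so the resulting dataset concentrates on exactly the inputs where the model has been seen to loop.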

The effect was substantial. Across five model families, ranging from 3B to 7B parameters, DPO reduced the degeneration rate by 37% to 87% relative to SFT alone. The strongest result was on a 3B model — Nanonets-OCR2 — which fell from 1.61% to 0.20%, an 87.6% reduction. The same intervention applied to general-purpose models in the 7B range produced reductions in the 37–56% range. The average reduction across families was 59.4%.

Text degeneration rate (%) across alignment stages. SFT reduces degeneration relative to vanilla models in most cases, whereas DPO further reduces it, even compared to the SFT-tuned model.
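The percentages quoted for DPO are relative reductions in degeneration rate, not differences in percentage points. A short check of the strongest case, using the two rates reported above for Nanonets-OCR2 (the helper name is ours):

```python
def relative_reduction(before, after):
    """Fraction by which `after` is lower than `before`."""
    return (before - after) / before


# Degeneration rate in percent: after SFT alone vs. after SFT followed by DPO.
reduction = relative_reduction(1.61, 0.20)
print(f"{reduction:.1%}")  # 87.6%
```

In absolute terms the same change is 1.41 percentage points, which is why the relative figure is the more informative one at rates this low.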

There are two ways to read these numbers. The first is that DPO works as a degeneration-mitigation technique, which it does. The second, which we think is the more important reading, is that the smallest specialized model in our experiments achieved the lowest degeneration rate of any model we tested — including substantially larger general-purpose ones. The variable that mattered most for stability was not the size of the model. It was the distance between its training history and the task.

This is what specialization changes. Not only the average quality of the output, but the geometry of the failure itself. A model that has been moved closer to the target distribution by SFT, and then explicitly pushed away from the failure regions by DPO with degenerate-rejected pairs, is operating in a different probabilistic landscape than the model it started as. The loop is not simply less likely to be sampled. It is, by the empirical evidence, less present.

In the comparisons we conducted, training history mattered more than parameter count.

Which raises a final question, suggested directly by the paper’s argument.

If degeneration can be measured, if it can be reduced structurally, and if the difference between models on this dimension is large enough to matter — what should change about how these models are evaluated and observed?

Reframing Evaluation and Observability

The first change is observability. Degeneration rate is computable from data the inference server already produces: a request that hits the configured token cap with n-gram repetition at its tail is a degenerate request, and the rate is the fraction of those over a window. The paper argues, and the empirical results support, that this metric should be tracked alongside latency, throughput, and quality. Adding it costs little. Not adding it leaves the largest cost-shape in the system invisible.
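As a rough sketch of how the metric could be computed offline from logged outputs: the code below treats a request as degenerate when it ended by hitting the token cap and the tail of its output contains very few distinct n-grams. It assumes the log exposes an OpenAI-style `finish_reason` whose value is `"length"` for capped requests; the record layout and the thresholds `n`, `tail_len`, and `max_distinct_ratio` are hypothetical and would need calibrating on real data.

```python
def distinct_ngram_ratio(tokens, n):
    """Distinct n-grams divided by total n-grams (1.0 means no repetition)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 1.0


def is_degenerate(finish_reason, tokens, n=4, tail_len=128, max_distinct_ratio=0.2):
    """Capped by the token limit AND repetitive at the tail."""
    if finish_reason != "length":
        return False  # the request ended on its own
    return distinct_ngram_ratio(tokens[-tail_len:], n) <= max_distinct_ratio


def degeneration_rate(records):
    """Fraction of logged requests flagged as degenerate over the given window."""
    if not records:
        return 0.0
    flagged = sum(is_degenerate(r["finish_reason"], r["tokens"]) for r in records)
    return flagged / len(records)


# Synthetic window of four logged requests.
window = [
    {"finish_reason": "stop", "tokens": list(range(300))},
    {"finish_reason": "stop", "tokens": list(range(120))},
    {"finish_reason": "length", "tokens": list(range(300))},                     # long but not repetitive
    {"finish_reason": "length", "tokens": list(range(50)) + [7, 8, 9] * 100},    # loop at the tail
]
print(degeneration_rate(window))
```

Requiring both conditions keeps legitimately long outputs that were merely truncated out of the count, at the price of missing loops that an upstream abort has already cut short.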

The second change is evaluation. A model’s score on a benchmark that does not measure stability is incomplete. Two models can score nearly identically on a quality benchmark and differ by an order of magnitude in degeneration rate, and therefore in inference stability. The benchmark score cannot reveal that. Comparisons are most useful when made against an evaluation that includes stability and cost on workloads representative of the deployment, not only quality on synthetic ones.

What this amounts to, taken together, is the methodological argument the paper makes directly: text degeneration belongs in the evaluation frame, not outside it.

What Changes When You Start Measuring This

The argument of this article — and of the paper it draws on — has been narrow.

A failure mode characterized in the language-modeling literature for years. Treated, in most prior work, as a generation-quality issue. Routed around, in production, at the inference layer. Measured directly under realistic inference load, it produces costs and throughput effects that the conventional metrics do not surface.

It is structurally produced by the same training objective that makes language models useful. It is contagious across the requests sharing a GPU with it. It is not currently tracked in the major OCR benchmarks used to compare models. And, in the experiments we report, it is reducible — by close to an order of magnitude in the strongest case — by training models in a way that explicitly pushes the failure regions out of their distribution rather than waiting to detect them at runtime.

Each of those claims is empirically grounded. Taken together, they argue for a different way of evaluating autoregressive generation systems and a different way of observing them in production. Once degeneration is measured, much of what depends on it changes with it.

Sources:

Cardoso, Gabriel Pimenta de Freitas, et al. “DharmaOCR: Specialized Small Language Models for Structured OCR that outperform Open-Source and Commercial Baselines.” arXiv preprint arXiv:2604.14314 (2026).
Holtzman, Ari, et al. “The curious case of neural text degeneration.” arXiv preprint arXiv:1904.09751 (2020).
Dong, Sixun, et al. “Rethinking Model Efficiency: Multi-Agent Inference with Large Models.” arXiv preprint arXiv:2604.04929 (2026).
Kwon, Yongchan, et al. “What LLMs think when you don’t tell them what to think about?” arXiv preprint arXiv:2602.01689 (2026).
Rafailov, Rafael, et al. “Direct preference optimization: Your language model is secretly a reward model.” Advances in Neural Information Processing Systems (NeurIPS). arXiv preprint arXiv:2305.18290 (2023).