feat: repetition detector for degenerate token loops #65

Open

janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:feature/repetition-detector

Conversation

@janhilgard
Collaborator

Summary

  • Adds a lightweight repetition detector to the scheduler that monitors the last 32 generated tokens per request
  • Stops generation with finish_reason="stop" when degenerate patterns are detected:
    • Single-token repetition (8+ identical tokens, e.g. 0 0 0 0 0 0 0 0)
    • Short sequence repetition (2-4 token patterns repeated 6+ times, e.g. ab ab ab ab ab ab)
  • Ring buffer per UID with automatic cleanup on request finish/abort
  • Near-zero overhead on normal generation (a list append plus a periodic check on a 32-token window)
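The detection rules above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation; the class name, constants, and method signatures are all assumptions:

```python
from collections import deque

# Illustrative sketch of the detector described above. All names here
# (RepetitionDetector, WINDOW, SINGLE_RUN, ...) are assumptions.
WINDOW = 32           # tokens tracked per request
SINGLE_RUN = 8        # 8+ identical tokens => degenerate
PATTERN_REPEATS = 6   # a 2-4 token pattern repeated 6+ times => degenerate

class RepetitionDetector:
    def __init__(self):
        self._buffers = {}  # uid -> ring buffer of recent token ids

    def push(self, uid: str, token_id: int) -> bool:
        """Record a generated token; return True if a loop is detected."""
        buf = self._buffers.setdefault(uid, deque(maxlen=WINDOW))
        buf.append(token_id)
        return self._is_degenerate(list(buf))

    def finish(self, uid: str) -> None:
        """Drop the buffer when the request finishes or aborts."""
        self._buffers.pop(uid, None)

    @staticmethod
    def _is_degenerate(tokens: list) -> bool:
        # Case 1: single-token repetition (e.g. 0 0 0 0 0 0 0 0)
        if len(tokens) >= SINGLE_RUN and len(set(tokens[-SINGLE_RUN:])) == 1:
            return True
        # Case 2: short-sequence repetition (e.g. ab ab ab ab ab ab)
        for plen in (2, 3, 4):
            need = plen * PATTERN_REPEATS
            if len(tokens) >= need:
                tail = tokens[-need:]
                pattern = tail[:plen]
                if all(tail[i:i + plen] == pattern for i in range(0, need, plen)):
                    return True
        return False
```

Using `deque(maxlen=WINDOW)` gives the ring-buffer behavior for free: appends past the window silently evict the oldest token.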

Split out from PR #53 per review feedback — this touches the scheduler hot path and is independent of the GPT-OSS reasoning parser.

Test plan

  • 15 unit tests covering all detection patterns and edge cases (tests/test_repetition_detector.py)
  • Manual testing with models known to produce degenerate output
  • Verify no performance regression on normal generation
pytest tests/test_repetition_detector.py -v
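For illustration, the unit tests might look something like this. The helper and test names below are hypothetical stand-ins; the real tests live in tests/test_repetition_detector.py and may differ:

```python
# Hypothetical example of the kind of case covered by the test plan;
# has_trailing_run is a stand-in, not the PR's actual detector API.
def has_trailing_run(tokens, run=8):
    """True if the last `run` tokens are all identical."""
    return len(tokens) >= run and len(set(tokens[-run:])) == 1

def test_single_token_loop_is_flagged():
    assert has_trailing_run([0] * 8)

def test_varied_output_is_not_flagged():
    assert not has_trailing_run(list(range(32)))
```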

🤖 Generated with Claude Code

Adds a lightweight repetition detector to the scheduler that monitors
the last 32 generated tokens per request and stops generation when
degenerate patterns are detected:

- Single-token repetition (8+ identical tokens)
- Short sequence repetition (2-4 token patterns repeated 6+ times)

This prevents runaway generation when models enter degenerate loops,
saving compute and improving reliability for long-running requests.

Includes 15 unit tests covering all detection patterns and edge cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request on Feb 11, 2026
Moves repetition detection logic to feature/repetition-detector branch
(PR waybarrios#65) per review feedback on PR waybarrios#53.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@TomLucidor

How is this different from repetition penalties or DRY?

@janhilgard
Collaborator Author

Good question! They solve different problems:

Repetition penalty / DRY are preventative — they modify logits during sampling to discourage repetition before it happens. They work well most of the time.

This detector is a safety net — it doesn't touch sampling at all. It monitors output and terminates generation when degenerate loops have already formed. Think of it as a circuit breaker for the server.

Why both are needed:

  • Repetition penalties don't always prevent loops, especially with aggressively quantized models (4-bit, 6-bit) or certain MoE architectures where expert routing can get stuck
  • Without a detector, a stuck request burns compute indefinitely until max_tokens — potentially hundreds of seconds of wasted GPU time on a serving endpoint
  • vllm-mlx is an inference server, not a chat frontend — we can't rely on users configuring sampling params correctly. This catches the cases where penalties fail or aren't set

The overhead is near-zero (list append + periodic check on a 32-token window), so it's cheap insurance.
