feat: repetition detector for degenerate token loops #65

Open

janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:feature/repetition-detector

Conversation

@janhilgard
Collaborator

Summary

  • Adds a lightweight repetition detector to the scheduler that monitors the last 32 generated tokens per request
  • Stops generation with finish_reason="stop" when degenerate patterns are detected:
    • Single-token repetition (8+ identical tokens, e.g. 0 0 0 0 0 0 0 0)
    • Short sequence repetition (2-4 token patterns repeated 6+ times, e.g. ab ab ab ab ab ab)
  • Ring buffer per UID with automatic cleanup on request finish/abort
  • Near-zero overhead on normal generation (a list append plus a periodic check on a 32-token window)
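The detection rules above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation; the class name, constants, and method signatures are all assumptions:

```python
from collections import deque

# Illustrative sketch of the detector described above. All names here
# (RepetitionDetector, WINDOW, SINGLE_RUN, ...) are assumptions.
WINDOW = 32           # tokens tracked per request
SINGLE_RUN = 8        # 8+ identical tokens => degenerate
PATTERN_REPEATS = 6   # a 2-4 token pattern repeated 6+ times => degenerate

class RepetitionDetector:
    def __init__(self):
        self._buffers = {}  # uid -> ring buffer of recent token ids

    def push(self, uid: str, token_id: int) -> bool:
        """Record a generated token; return True if a loop is detected."""
        buf = self._buffers.setdefault(uid, deque(maxlen=WINDOW))
        buf.append(token_id)
        return self._is_degenerate(list(buf))

    def finish(self, uid: str) -> None:
        """Drop the buffer when the request finishes or aborts."""
        self._buffers.pop(uid, None)

    @staticmethod
    def _is_degenerate(tokens: list) -> bool:
        # Case 1: single-token repetition (e.g. 0 0 0 0 0 0 0 0)
        if len(tokens) >= SINGLE_RUN and len(set(tokens[-SINGLE_RUN:])) == 1:
            return True
        # Case 2: short-sequence repetition (e.g. ab ab ab ab ab ab)
        for plen in (2, 3, 4):
            need = plen * PATTERN_REPEATS
            if len(tokens) >= need:
                tail = tokens[-need:]
                pattern = tail[:plen]
                if all(tail[i:i + plen] == pattern for i in range(0, need, plen)):
                    return True
        return False
```

Using `deque(maxlen=WINDOW)` gives the ring-buffer behavior for free: appends past the window silently evict the oldest token.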

Split out from PR #53 per review feedback — this touches the scheduler hot path and is independent of the GPT-OSS reasoning parser.

Test plan

  • 15 unit tests covering all detection patterns and edge cases (tests/test_repetition_detector.py)
  • Manual testing with models known to produce degenerate output
  • Verify no performance regression on normal generation
pytest tests/test_repetition_detector.py -v
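For illustration, the unit tests might look something like this. The helper and test names below are hypothetical stand-ins; the real tests live in tests/test_repetition_detector.py and may differ:

```python
# Hypothetical example of the kind of case covered by the test plan;
# has_trailing_run is a stand-in, not the PR's actual detector API.
def has_trailing_run(tokens, run=8):
    """True if the last `run` tokens are all identical."""
    return len(tokens) >= run and len(set(tokens[-run:])) == 1

def test_single_token_loop_is_flagged():
    assert has_trailing_run([0] * 8)

def test_varied_output_is_not_flagged():
    assert not has_trailing_run(list(range(32)))
```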

🤖 Generated with Claude Code

Adds a lightweight repetition detector to the scheduler that monitors
the last 32 generated tokens per request and stops generation when
degenerate patterns are detected:

- Single-token repetition (8+ identical tokens)
- Short sequence repetition (2-4 token patterns repeated 6+ times)

This prevents runaway generation when models enter degenerate loops,
saving compute and improving reliability for long-running requests.

Includes 15 unit tests covering all detection patterns and edge cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
janhilgard added a commit to janhilgard/vllm-mlx that referenced this pull request on Feb 11, 2026
Moves repetition detection logic to feature/repetition-detector branch
(PR waybarrios#65) per review feedback on PR waybarrios#53.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@TomLucidor

How is this different from repetition penalties or DRY?

@janhilgard
Collaborator Author

Good question! They solve different problems:

Repetition penalty / DRY are preventative — they modify logits during sampling to discourage repetition before it happens. They work well most of the time.

This detector is a safety net — it doesn't touch sampling at all. It monitors output and terminates generation when degenerate loops have already formed. Think of it as a circuit breaker for the server.

Why both are needed:

  • Repetition penalties don't always prevent loops, especially with aggressively quantized models (4-bit, 6-bit) or certain MoE architectures where expert routing can get stuck
  • Without a detector, a stuck request burns compute indefinitely until max_tokens — potentially hundreds of seconds of wasted GPU time on a serving endpoint
  • vllm-mlx is an inference server, not a chat frontend — we can't rely on users configuring sampling params correctly. This catches the cases where penalties fail or aren't set

The overhead is near-zero (list append + periodic check on a 32-token window), so it's cheap insurance.
