Fix Metal resource leak under high concurrency by janhilgard · Pull Request #92 · waybarrios/vllm-mlx

janhilgard · 2026-02-16T08:30:45Z

Summary

Fixes #91 — Metal buffer leak under high concurrency.

Root cause: batch.tokens grows via mx.concatenate() each generation step but is never evaluated, so computation graph nodes hold AGXAllocation handles indefinitely. Under high concurrency this exhausts Metal resource handles.
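The mechanism can be illustrated with a toy model in plain Python. This is only a sketch of the general lazy-evaluation pattern, not MLX's actual implementation: `LazyNode`, `concatenate`, `build_chain`, and `chain_depth` are hypothetical names, and Python lists stand in for Metal buffers.

```python
class LazyNode:
    """Toy stand-in for a lazily evaluated array (NOT MLX's real implementation)."""

    def __init__(self, value=None, inputs=()):
        self.value = value            # concrete data once evaluated
        self.inputs = tuple(inputs)   # an unevaluated node keeps its inputs alive

    def evaluate(self):
        """Materialize the value, then drop references to the input nodes."""
        if self.value is None:
            out = []
            for node in self.inputs:
                out.extend(node.evaluate())
            self.value = out
            self.inputs = ()          # graph references released here
        return self.value


def concatenate(a, b):
    """Record a pending concatenation without computing anything."""
    return LazyNode(inputs=(a, b))


def build_chain(steps):
    """Append one token per step, as a generation loop would, without evaluating."""
    tokens = LazyNode(value=[0])
    for step in range(1, steps + 1):
        tokens = concatenate(tokens, LazyNode(value=[step]))
    return tokens


def chain_depth(node):
    """Count how many pending concatenations are still reachable from node."""
    depth = 0
    while node.inputs:
        node = node.inputs[0]
        depth += 1
    return depth
```

In this model, `chain_depth(build_chain(5))` is 5 until `evaluate()` is called, after which the node holds only its own result and the depth is 0. The fix relies on the same idea: evaluating the tokens each step keeps the chain from growing.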

Changes:

- Add mx.async_eval(*batch.tokens) after each generation step to eagerly evaluate accumulated token concatenations and release Metal buffers
- Make cache clear interval adaptive: scales inversely with active sequence count (min interval 8, base interval 32) to prevent Metal resource handle exhaustion under high-concurrency workloads
- Add explicit mx.eval(*tokens) during periodic cache clear to collapse any remaining lazy concatenation chains

How it works

Fix A — Eager evaluation (line ~202):

mx.async_eval(batch.y, batch.logprobs)
# NEW: evaluate accumulated tokens to prevent Metal buffer buildup
if batch.tokens:
    mx.async_eval(*batch.tokens)

Fix B — Adaptive cache clearing (line ~2239):

- Base interval: 32 steps (unchanged for single-user)
- At 8+ concurrent sequences: interval halves to 16
- At 16+ concurrent sequences: interval drops to 8 (minimum)
- Each clear also evaluates batch.tokens to collapse lazy chains
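The thresholds above can be sketched as a small step function. This is a minimal illustration, not the code in the PR: the function names, the parameter names, and the choice of a step function (rather than a continuous formula) are assumptions. Only the three stated values (32, 16, 8) come from the description.

```python
def cache_clear_interval(active_sequences, base_interval=32, min_interval=8):
    """Return how many generation steps to wait between cache clears.

    Hypothetical helper: reproduces the thresholds listed in the PR
    description, with the interval shrinking as concurrency grows.
    """
    if active_sequences >= 16:
        return max(min_interval, base_interval // 4)
    if active_sequences >= 8:
        return max(min_interval, base_interval // 2)
    return base_interval


def should_clear_cache(step, active_sequences):
    """True on steps where a periodic cache clear would run (assumed modulo check)."""
    return step > 0 and step % cache_clear_interval(active_sequences) == 0
```

With a single active sequence the interval stays at 32, so single-user behavior is unchanged. At 16 or more sequences a clear runs every 8 steps.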

Test plan

- Verify ioclasscount | grep AGXAllocation stays stable under sustained high-concurrency load
- Benchmark single-user throughput (should be unchanged)
- Benchmark multi-user throughput with 8+ concurrent requests
- Verify no regression in generation quality
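The first check could be automated along the following lines. This is a rough sketch under stated assumptions: it assumes ioclasscount prints entries in the form `ClassName = count`, and the helper names and the 10% growth tolerance are hypothetical, not part of the PR.

```python
import re
import subprocess


def parse_agx_allocations(ioclasscount_output):
    """Extract the AGXAllocation count from ioclasscount text, or None if absent.

    Assumes entries look like 'AGXAllocation = 1234'.
    """
    match = re.search(r"\bAGXAllocation\s*=\s*(\d+)", ioclasscount_output)
    return int(match.group(1)) if match else None


def sample_agx_allocations():
    """Run ioclasscount (macOS only) and return the current AGXAllocation count."""
    result = subprocess.run(
        ["ioclasscount"], capture_output=True, text=True, check=True
    )
    return parse_agx_allocations(result.stdout)


def is_stable(samples, tolerance=0.10):
    """True if the last sample is within `tolerance` growth of the first.

    The threshold is an arbitrary illustration; pick one that suits the workload.
    """
    if len(samples) < 2:
        return True
    return samples[-1] <= samples[0] * (1 + tolerance)
```

Calling `sample_agx_allocations()` at intervals during a load test and passing the collected counts to `is_stable` gives a pass/fail signal for the leak check.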

🤖 Generated with Claude Code

Addresses Metal buffer leak where batch.tokens grows via mx.concatenate() each generation step without evaluation, causing computation graph nodes to hold AGXAllocation handles indefinitely.

Changes:
- Add mx.async_eval(*batch.tokens) after each generation step to eagerly evaluate accumulated token concatenations and release Metal buffers
- Make cache clear interval adaptive: scales inversely with active sequence count (min interval 8) to prevent Metal resource handle exhaustion under high-concurrency workloads
- Add explicit mx.eval(*tokens) during periodic cache clear to collapse any remaining lazy concatenation chains before clearing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:fix/metal-resource-leak
