
Fix Metal resource leak under high concurrency#92

Open
janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:fix/metal-resource-leak


@janhilgard
Collaborator

Summary

Fixes #91 — Metal buffer leak under high concurrency.

Root cause: batch.tokens grows via mx.concatenate() on every generation step but is never evaluated. Because MLX evaluates lazily, each step adds another node to an unevaluated computation graph, and those nodes hold AGXAllocation handles indefinitely. Under high concurrency this exhausts the available Metal resource handles.

Changes:

  • Add mx.async_eval(*batch.tokens) after each generation step to eagerly evaluate accumulated token concatenations and release Metal buffers
  • Make cache clear interval adaptive: scales inversely with active sequence count (min interval 8, base interval 32) to prevent Metal resource handle exhaustion under high-concurrency workloads
  • Add explicit mx.eval(*tokens) during periodic cache clear to collapse any remaining lazy concatenation chains

How it works

Fix A — Eager evaluation (line ~202):

mx.async_eval(batch.y, batch.logprobs)
# NEW: evaluate accumulated tokens to prevent Metal buffer buildup
if batch.tokens:
    mx.async_eval(*batch.tokens)

Fix B — Adaptive cache clearing (line ~2239):

  • Base interval: 32 steps (unchanged for single-user)
  • At 8+ concurrent sequences: interval halves to 16
  • At 16+ concurrent sequences: interval drops to 8 (minimum)
  • Each clear also evaluates batch.tokens to collapse lazy chains
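The thresholds listed above can be sketched as a small step function. This is an illustrative sketch only: the names `adaptive_clear_interval`, `should_clear_cache`, `BASE_INTERVAL`, and `MIN_INTERVAL` are hypothetical, and the actual code in this PR may compute the interval differently (for example with a continuous formula rather than fixed steps).

```python
# Hypothetical constants mirroring the values stated in the description.
BASE_INTERVAL = 32  # steps between cache clears for low concurrency
MIN_INTERVAL = 8    # floor applied under heavy concurrency


def adaptive_clear_interval(active_sequences: int) -> int:
    """Return the number of steps between cache clears.

    Assumption: a step function matching the three documented tiers
    (fewer than 8 -> 32, 8 to 15 -> 16, 16 or more -> 8).
    """
    if active_sequences >= 16:
        return MIN_INTERVAL
    if active_sequences >= 8:
        return BASE_INTERVAL // 2
    return BASE_INTERVAL


def should_clear_cache(step: int, active_sequences: int) -> bool:
    """True when the current step lands on a clear boundary."""
    return step % adaptive_clear_interval(active_sequences) == 0
```

With this shape, a single-user run keeps the original 32-step cadence, so any throughput cost from more frequent clears is confined to high-concurrency workloads.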

Test plan

  • Verify ioclasscount | grep AGXAllocation stays stable under sustained high-concurrency load
  • Benchmark single-user throughput (should be unchanged)
  • Benchmark multi-user throughput with 8+ concurrent requests
  • Verify no regression in generation quality
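For the first item, one way to make "stays stable" checkable is to sample the count periodically and compare it against the first sample. The sketch below assumes that `ioclasscount` prints one `ClassName = count` pair per line; that format, the helper names, and the 10% tolerance are assumptions for illustration, not part of this PR.

```python
import re


def parse_class_counts(text: str, needle: str = "AGXAllocation") -> dict:
    """Extract counts for class names containing `needle`.

    Assumes lines shaped like "ClassName = 123"; other lines are ignored.
    """
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\S+)\s*=\s*(\d+)\s*$", line)
        if match and needle in match.group(1):
            counts[match.group(1)] = int(match.group(2))
    return counts


def is_stable(samples: list, tolerance: float = 0.10) -> bool:
    """True if no sample exceeds the first one by more than `tolerance`."""
    if not samples:
        return True
    baseline = samples[0]
    return max(samples) <= baseline * (1 + tolerance)
```

In practice the text passed to `parse_class_counts` would come from running `ioclasscount` at fixed intervals during the load test, with the resulting counts collected into the list given to `is_stable`.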

🤖 Generated with Claude Code

Addresses Metal buffer leak where batch.tokens grows via mx.concatenate()
each generation step without evaluation, causing computation graph nodes
to hold AGXAllocation handles indefinitely.

Changes:
- Add mx.async_eval(*batch.tokens) after each generation step to eagerly
  evaluate accumulated token concatenations and release Metal buffers
- Make cache clear interval adaptive: scales inversely with active
  sequence count (min interval 8) to prevent Metal resource handle
  exhaustion under high-concurrency workloads
- Add explicit mx.eval(*tokens) during periodic cache clear to collapse
  any remaining lazy concatenation chains before clearing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Metal resource leak
