Replies: 7 comments 1 reply
Aha, if only I could read:
Still not sure about the double memory consumption though.
@jundot hi, could it be the same #582 as before? Or am I completely wrong in trying to pair my own quantized Qwen3.5-35B-A3B with z-lab/Qwen3.5-35B-A3B-DFlash in BF16?
Here are some of my own benchmarking results, which are also perplexing. Model: Qwen3.5-35B-A3B-mlx-lm-mxfp4
Works with commit 58b3ca5:
DFLASH_MAX_CTX=32768 uv run python -m omlx.cli serve

It starts to hurt the throughput the more context I give to speculative decoding, which is understandable. Will try to use
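For reference, a sketch of that knob in use. The 8192 below is a hypothetical value — only 32768 appears in this thread, and I'm assuming the variable accepts other caps:

```shell
# Cap the context handed to the DFlash draft model during speculative decoding.
# 8192 is hypothetical; the thread only demonstrates DFLASH_MAX_CTX=32768.
DFLASH_MAX_CTX=8192 uv run python -m omlx.cli serve
```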
From https://huggingface.co/z-lab/Qwen3.5-27B-DFlash:

No mention of that in the 35B-A3B-DFlash README, but I think it's similar. The draft model has never seen such long contexts during training, so I'd expect the acceptance rate of the predicted tokens to drop.
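To put a rough number on why the acceptance rate matters: under the standard speculative-decoding analysis (not DFlash-specific, and assuming an i.i.d. per-token acceptance probability, which is a simplification), the expected number of tokens emitted per target-model verification step with draft length γ and acceptance probability α is (1 − α^(γ+1)) / (1 − α). A small back-of-the-envelope sketch:

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification call,
    assuming i.i.d. per-token acceptance probability `alpha` and
    `gamma` drafted tokens (standard speculative-decoding formula)."""
    if alpha >= 1.0:
        return gamma + 1  # every draft token accepted, plus the bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# How the per-step yield degrades as the draft's acceptance rate drops,
# e.g. once the prompt exceeds the context the draft model was trained on.
for alpha in (0.8, 0.6, 0.4):
    print(f"acceptance {alpha:.1f} -> "
          f"{expected_tokens_per_step(alpha, 4):.2f} tokens/step")
```

So dropping from 0.8 to 0.4 acceptance roughly halves the tokens you get per verification call, which is consistent with throughput falling off at long contexts.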
Benchmarks are fun. Now try it in a chat... something is wrong.
All benchmarks were run on an M2 Max (12-core CPU, 30-core GPU, 64 GB) with Qwen3.5-35B-A3B-MLX-MXFP4-FP16 as the target model and z-lab/Qwen3.5-35B-A3B-DFlash for DFlash.
Configurations tested:

- Raw
- DFlash bf16
- DFlash 4-bit
- DFlash 4-bit + TurboQuant 4-bit
So far I find it all quite confusing…