Replies: 7 comments 1 reply
Aha, if only I could read:
Still not sure about the double memory consumption though.
@jundot hi, could it be the same #582 as before? Or am I completely wrong in trying to pair my own quantized Qwen3.5-35B-A3B with z-lab/Qwen3.5-35B-A3B-DFlash in BF16?
Here are some of my own benchmarking results, which are also perplexing. Model: Qwen3.5-35B-A3B-mlx-lm-mxfp4
Works with commit 58b3ca5:
DFLASH_MAX_CTX=32768 uv run python -m omlx.cli serve

It starts to hurt the throughput the more context I give to speculative decoding, which is understandable. Will try to use
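For reference, a sketch of that knob in use. The 8192 below is a hypothetical value — only 32768 appears in this thread, and I'm assuming the variable accepts other caps:

```shell
# Cap the context handed to the DFlash draft model during speculative decoding.
# 8192 is hypothetical; the thread only demonstrates DFLASH_MAX_CTX=32768.
DFLASH_MAX_CTX=8192 uv run python -m omlx.cli serve
```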
From https://huggingface.co/z-lab/Qwen3.5-27B-DFlash:

No mention of that in the 35B-A3B-DFlash README, but I think it's similar. The draft model has never seen such long contexts during training, so I'd expect the acceptance rate of the predicted tokens to drop.
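To put a rough number on why the acceptance rate matters: under the standard speculative-decoding analysis (not DFlash-specific, and assuming an i.i.d. per-token acceptance probability, which is a simplification), the expected number of tokens emitted per target-model verification step with draft length γ and acceptance probability α is (1 − α^(γ+1)) / (1 − α). A small back-of-the-envelope sketch:

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification call,
    assuming i.i.d. per-token acceptance probability `alpha` and
    `gamma` drafted tokens (standard speculative-decoding formula)."""
    if alpha >= 1.0:
        return gamma + 1  # every draft token accepted, plus the bonus token
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# How the per-step yield degrades as the draft's acceptance rate drops,
# e.g. once the prompt exceeds the context the draft model was trained on.
for alpha in (0.8, 0.6, 0.4):
    print(f"acceptance {alpha:.1f} -> "
          f"{expected_tokens_per_step(alpha, 4):.2f} tokens/step")
```

So dropping from 0.8 to 0.4 acceptance roughly halves the tokens you get per verification call, which is consistent with throughput falling off at long contexts.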
Benchmarks are fun. Now try it in a chat... something is wrong.
All benchmarks were run on an M2 Max (12-core CPU, 30-core GPU, 64 GB) with Qwen3.5-35B-A3B-MLX-MXFP4-FP16 as the target model and z-lab/Qwen3.5-35B-A3B-DFlash for DFlash.
Configurations tested:

- Raw
- DFlash bf16
- DFlash 4-bit
- DFlash 4-bit + TurboQuant 4-bit
So far I find it all quite confusing…