Conversation

@GandalfTea
Contributor

@GandalfTea GandalfTea commented Oct 20, 2025

fixes #10

This PR implements:

  • a REPL for easier node management, with a backing API server.
  • a dynamic tracing framework that wraps sections of the code in frames tracking execution time and state, batches them per worker node, and sends them back to the REPL for easy debugging and performance analytics.
  • two trace-frame aggregators: one annotates and sorts each registered frame across nodes by execution time and percentage of total time (like perf annotate), with control over scope and nested frames; the other tracks per-request runtime statistics/metrics (like TTFT, TPS, etc.) across nodes.

@GandalfTea GandalfTea changed the title from "Implement REPL" to "Trace and improve" Oct 21, 2025
@erhant
Member

erhant commented Oct 21, 2025

@GandalfTea heyo, #12 introduces some conflicts with this one, but we can merge your PR later together; it's just some code locations that were changed and some typings.

@GandalfTea
Contributor Author

Should be an easy fix, but I'll need to wrap more code in tracer frames to get detailed timings. I'll resolve with master later tonight; still have quite a few local changes to push.

@GandalfTea
Contributor Author

The llama3 model spends most of its time stalling on the comm queue; the same is true for gpt_oss. qwen3 doesn't have this problem. Looking into it.

dnet > trace annotate

  name                                   total      max       mean      p50       p90       p99       samples
 ......................................  ........  ........   ........  ........  ........  ........  .......
  grpc.ingress.get                       206147.08  47231.65   2021.05   1469.84   1892.08   2947.89    102.00
  compute.dequeue                        200638.13   1157.19    751.45   1008.63   1010.10   1064.74    267.00
  compute.thread.grpc.send               141505.76   5033.23   1401.05   1272.46   1646.91   2642.69    101.00
  grpc.send_activations.cast_to_dtype    141425.41   4996.80   1400.25   1272.10   1646.61   2641.82    101.00
  compute.dequeue.wait                   70287.18    1045.03    468.58    214.18   1005.07   1016.86    150.00
  grpc.ingress.get.wait                  67293.22   44723.10    659.74    209.04    295.69    502.80    102.00
  compute.thread.mdns.send               19892.35     313.46    195.02    186.38    239.33    296.10    102.00
  memory.memory.load.api_layers           7998.83    4243.05   3999.41   3755.78   4243.05   4243.05      2.00
  grpc.token_request                      1523.49     119.22     14.94     10.29     19.20    116.82    102.00
  compute.thread.execute                  1117.97      33.41      5.51      1.48     11.07     28.48    203.00
  startup.grpc                             417.65     208.98    208.83    208.68    208.98    208.98      2.00
  memory.model.load_metadata               317.23     164.00    158.61    153.23    164.00    164.00      2.00
  compute.forward                          307.76     198.32      1.52      0.37      1.27      3.92    203.00
  startup.discovery                        307.60     180.29    153.80    127.30    180.29    180.29      2.00
  grpc.send_activation.next                263.94      61.97      2.61      2.01      2.34      2.82    101.00
  grpc.ingress.prepare                     159.38       4.86      0.78      0.77      1.45      2.95    204.00
  grpc.ingress.prepare_activation.tokens    77.92       4.48      0.38      0.36      0.60      1.85    204.00
  compute.thread.activations.process        38.61       5.52      0.19      0.11      0.22      1.23    204.00
  grpc.ingress                              31.19       4.15      0.15      0.07      0.20      1.48    204.00
  memory.warmup                             30.52      27.46     15.26      3.06     27.46     27.46      2.00
  compute.thread.weights.prepare            25.43      14.04      0.13      0.02      0.03      1.30    203.00
  grpc.ingress.queue                        25.01       1.24      0.12      0.10      0.15      0.90    204.00
  grpc.ingress.connect_next_node            14.61       0.56      0.07      0.06      0.12      0.21    204.00
  memory.weight_cache.init                  14.27       8.10      7.13      6.17      8.10      8.10      2.00
  compute.thread.cleanup                    12.33       4.67      0.06      0.04      0.05      0.15    203.00
  memory.model.load                          6.05       4.21      3.03      1.84      4.21      4.21      2.00
  grpc.ingress.kvcache.init                  4.22       0.91      0.02      0.01      0.01      0.57    204.00
  grpc.ingress.prefetch                      4.17       0.67      0.02      0.01      0.02      0.57    204.00
  compute.thread.kvcache.init                2.86       0.35      0.01      0.01      0.02      0.10    204.00
  compute.thread.execute.enqueue_prefetch    2.55       0.05      0.01      0.01      0.02      0.03    203.00
  compute.thread.execute.evict_and_unload    2.04       0.02      0.01      0.01      0.02      0.02    203.00
  grpc.send_activation.final                 2.01       0.20      0.02      0.01      0.02      0.14    102.00
  compute.thread.activations.load            1.48       0.09      0.01      0.01      0.01      0.03    204.00
  network.connect.next_node                  1.12       0.86      0.56      0.26      0.86      0.86      2.00
  grpc.stream.flush                          0.80       0.30      0.01      0.01      0.01      0.01    101.00
  startup.compute                            0.51       0.34      0.25      0.17      0.34      0.34      2.00
  memory.prefetch                            0.36       0.27      0.18      0.09      0.27      0.27      2.00
  memory.memory_pools.init                   0.17       0.10      0.08      0.06      0.10      0.10      2.00
  startup.workers                            0.09       0.06      0.04      0.03      0.06      0.06      2.00
  memory.cache.reset                         0.03       0.03      0.03      0.03      0.03      0.03      1.00
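
The aggregation behind `trace annotate` can be sketched as below: group raw frame events by name, compute the per-name statistics shown in the columns above, and sort by total time. The `(name, ms)` event shape is assumed for illustration, not the actual dnet wire format.

```python
import statistics

def annotate(events):
    """Aggregate (name, duration_ms) events into per-name rows sorted by total time."""
    by_name = {}
    for name, ms in events:
        by_name.setdefault(name, []).append(ms)
    rows = []
    for name, samples in by_name.items():
        if len(samples) > 1:
            # 99 cut points; index 49 is p50, 89 is p90, 98 is p99
            qs = statistics.quantiles(samples, n=100, method="inclusive")
        else:
            qs = [samples[0]] * 99  # degenerate case: one sample
        rows.append({
            "name": name,
            "total": sum(samples),
            "max": max(samples),
            "mean": sum(samples) / len(samples),
            "p50": qs[49], "p90": qs[89], "p99": qs[98],
            "samples": len(samples),
        })
    return sorted(rows, key=lambda r: r["total"], reverse=True)

rows = annotate([("compute.forward", 1.0), ("compute.forward", 3.0),
                 ("grpc.ingress.get", 10.0)])
print(rows[0]["name"])  # grpc.ingress.get — largest total first
```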

@GandalfTea GandalfTea force-pushed the oto/perf branch 2 times, most recently from 7c3934b to 9456d61 on October 22, 2025 07:40
@GandalfTea GandalfTea changed the title from "Trace and improve" to "REPL, tracing and performance counters" Oct 25, 2025
@GandalfTea
Contributor Author

Added a perf command for runtime metrics annotation:

dnet > perf stats
Performance stats for request 'chatcmpl-15f7710c-2204-4524-8699-9d47b1bf63c3':

                  51            prompt_tokens
                1051            generated_tokens
                1102            total_tokens

              167.80 ms         ttft
               55.86 tok/s      tokens_per_second       # 0.018 s/tok
              0.0179 ms         mean_inter_token_latency (ITL)
                                   0.019 (p90),  0.021 (p99)
                                   0.016 (min), 0.186 (max)

             UNKNOWN GFLOPs     estimated_flops

    node0 [shard-d06eff0e-dnet1-48494.local]
            17967.79ms          compute_time   # [25.10%]
            53622.37ms          network_time   # [74.90%]
                0.00ms          memory_time    # [0.00%]

For now, frames are aggregated into fixed groups like compute and network, but later on we should allow exact symbols, e.g. perf add [name], to create custom groups. I'll fix some async counter-tracking bugs on multiple nodes, clean up the formatting to get a better diff, and then finish this PR. To add runtime memory stats I'll look into using sysctl and maybe get some OS-level metrics annotated to specific parts of the code.

@erhant, @andthattoo let me know if you have any other default metrics to add here. I have more info I don't surface yet like network statistics that I would like to add.
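
The compute/network/memory grouping described above could be sketched as a prefix bucketing over frame names, with each group reported as a share of the node's total. The mapping rules here (e.g. counting grpc frames as network time) are assumptions for illustration, not the actual dnet grouping.

```python
def group_stats(frames):
    """Bucket (name, ms) frames by top-level prefix and report (ms, percent) per group."""
    groups = {"compute": 0.0, "network": 0.0, "memory": 0.0}
    for name, ms in frames:
        prefix = name.split(".", 1)[0]
        if prefix == "grpc":  # assumed: grpc frames count toward network time
            prefix = "network"
        if prefix in groups:
            groups[prefix] += ms
    total = sum(groups.values()) or 1.0  # avoid division by zero
    return {g: (ms, 100.0 * ms / total) for g, ms in groups.items()}

stats = group_stats([("compute.forward", 25.0),
                     ("grpc.ingress.get", 70.0),
                     ("memory.warmup", 5.0)])
print(stats["network"])  # (70.0, 70.0)
```

A user-defined group (the perf add [name] idea) would then just be another key in the prefix-to-group mapping.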

@GandalfTea
Contributor Author

GandalfTea commented Oct 27, 2025

dnet > perf stats
Performance stats for request 'chatcmpl-288fe360-2578-42ef-95f9-447904d37955':

                  51            prompt_tokens
                1000            generated_tokens
                1051            total_tokens

               33.34 ms         ttft
               29.03 tok/s      tokens_per_second       # 0.032 s/tok
             32.2025 ms         mean_inter_token_latency (ITL)
                                   43.368 (p90),  119.929 (p99)
                                   26.883 (min), 129.225 (max)

             UNKNOWN GFLOPs     estimated_flops

    node0 [shard-5a4f95b7-dnet1-55663.local]
            17425.80 ms          compute_time   # [50.68%]
            16634.20 ms          network_time   # [48.38%]
              322.86 ms          memory_time    # [0.94%]

    node1 [shard-aef510b0-dnet0-13578.local]
            12326.78 ms          compute_time   # [20.78%]
            46792.39 ms          network_time   # [78.89%]
              192.85 ms          memory_time    # [0.33%]

The lower-level frames are now correctly aggregated to get the time distribution per node. There's also a staging buffer for events that arrive before the request.start event that registers the request in the aggregator. Some frames are still dropped, though the numbers should be insignificant for now; I might look at it again later.
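
The staging-buffer idea can be sketched as below: events for a request id that hasn't seen its request.start yet are parked and replayed once the request is registered. Class and method names are illustrative, not the actual dnet aggregator.

```python
class PerfAggregator:
    """Sketch: park events that race ahead of request.start, replay on registration."""
    def __init__(self):
        self.requests = {}  # request_id -> accepted events
        self.staging = {}   # request_id -> events that arrived early

    def on_event(self, request_id, event):
        if event.get("name") == "request.start":
            self.requests[request_id] = []
            # replay anything that arrived before the request was registered
            for early in self.staging.pop(request_id, []):
                self.requests[request_id].append(early)
        elif request_id in self.requests:
            self.requests[request_id].append(event)
        else:
            self.staging.setdefault(request_id, []).append(event)

agg = PerfAggregator()
agg.on_event("r1", {"name": "compute.forward", "ms": 1.2})  # arrives early
agg.on_event("r1", {"name": "request.start"})
agg.on_event("r1", {"name": "network.tx", "ms": 0.4})
print(len(agg.requests["r1"]))  # 2 — the early event was not dropped
```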

@GandalfTea
Contributor Author

Rough metrics for the default config, llama 3.3 70b 4-bit, on 2 Macs (M4, M4 Pro) and a MacBook (M3), 56 GB total RAM, TB4:

Performance stats for request 'chatcmpl-e0b89d06-52b0-4813-ae81-2690911f1413':

                  79            prompt_tokens
                   4            generated_tokens
                  83            total_tokens

           120813.57 ms         ttft
                0.01 tok/s      tokens_per_second       # 122.210 s/tok
         121511.5810 ms         mean_inter_token_latency (ITL)
                                   123186.812 (p90),  123563.739 (p99)
                                   120813.568 (min), 122209.594 (max)

             UNKNOWN GFLOPs     estimated_flops

    node0 [shard-43559743-dnet1-56725.local]
           303454.09ms          compute_time   # [40.00%]
           454759.81ms          network_time   # [59.95%]
              384.58ms          memory_time    # [0.05%]

    node1 [shard-700c9685-dnet0-14290.local]
             1095.01ms          compute_time   # [38.56%]
             1407.14ms          network_time   # [49.55%]
              337.65ms          memory_time    # [11.89%]

    node2 [shard-84ce4446-abriel-41351.local]
             1231.04ms          compute_time   # [29.16%]
             2618.43ms          network_time   # [62.03%]
              371.74ms          memory_time    # [8.81%]

name                                     total     max       mean      p50       p90       p99       samples
.......................................  ........  ........  ........  ........  ........  ........  .......
  network.rx.wait                        6948628.64  447819.37  25546.43  23790.94  27055.47  247484.08    272.00
  network.tx                             1635264.17  27899.77   6079.05     34.95  24281.77  26332.01    269.00
  network.tx.enque                       700490.09  10021.14   2604.05     35.96  10004.82  10009.83    269.00
  compute.thread.sampling                436511.04  32235.27  25677.12  26545.60  28549.98  32235.27     17.00
  startup.profile.device                 152082.69  64478.76  50694.23  55016.40  64478.76  64478.76      3.00
  network.send_activation.next            6393.57   3010.25     25.37      3.96     36.23    128.80    252.00
  memory.model.load                       1717.23    912.05    286.20      2.61    788.41    912.05      6.00
  memory.model.load_metadata               958.71    356.78    319.57    306.67    356.78    356.78      3.00
  compute.thread.execute                   453.85     45.66      1.67      0.50      3.96     14.48    271.00
  network.rx.prepare                       316.85     16.19      1.17      0.93      2.12      4.39    271.00
  network.rx.prepare_activation.default    199.71     16.10      0.79      0.56      1.38      3.64    252.00
  memory.weights.cache.search              147.01     16.88      0.10      0.01      0.12      1.52   1457.00
  compute.thread.weights.prepare           111.49     22.68      0.41      0.04      0.26     10.92    271.00
  network.token_request                     89.29     49.99      5.25      0.98      6.43     49.99     17.00
  network.rx                                58.66      4.32      0.22      0.14      0.39      1.49    271.00
  network.rx.enque                          49.88      1.79      0.18      0.12      0.42      0.89    271.00
  memory.weights.cache.load                 44.37      5.72      0.55      0.04      1.85      4.28     80.00
  compute.forward                           41.88      3.16      0.16      0.10      0.20      1.27    269.00
  compute.thread.activations.process        31.63      3.10      0.12      0.09      0.22      0.72    271.00
  memory.weights.cache.init                 30.73     21.39     10.24      9.03     21.39     21.39      3.00
  compute.thread.cleanup                    25.52      7.86      0.09      0.05      0.09      0.23    269.00
  memory.warmup                             19.50     14.75      6.50      3.68     14.75     14.75      3.00
  compute.thread.kvcache.init               13.27      3.16      0.05      0.02      0.03      1.06    271.00
  network.send_activation.final             11.94     11.49      0.70      0.03      0.04     11.49     17.00
  network.rx.prepare_activation.tokens       8.21      0.78      0.43      0.37      0.56      0.78     19.00
  network.rx.prefetch                        4.71      0.24      0.02      0.01      0.03      0.12    272.00
  compute.thread.execute.enqueue_prefetch    4.59      1.43      0.02      0.01      0.02      0.02    270.00
  compute.thread.activations.load            2.80      0.26      0.01      0.01      0.01      0.14    271.00
  network.send_activations.default           1.94      0.09      0.01      0.01      0.01      0.02    252.00
  compute.thread.execute.evict_and_unload    1.79      0.17      0.01      0.01      0.01      0.01    271.00
  network.tx.send                            1.20      0.09      0.00      0.00      0.01      0.01    270.00
  network.connect.next_node                  0.87      0.41      0.29      0.25      0.41      0.41      3.00
  memory.cache.reset                         0.66      0.18      0.17      0.17      0.18      0.18      4.00
  memory.prefetch                            0.40      0.19      0.13      0.13      0.19      0.19      3.00
  memory.pools.init                          0.28      0.11      0.09      0.10      0.11      0.11      3.00
  memory.cache.make_cache                    0.08      0.03      0.03      0.03      0.03      0.03      3.00
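
On the estimated_flops field still reporting UNKNOWN: a common back-of-the-envelope estimate for a dense transformer forward pass is ~2 FLOPs per parameter per token. The sketch below uses that rule of thumb; it ignores the quadratic attention term and any sparsity, so treat it as an order-of-magnitude figure, not what dnet will necessarily implement.

```python
def estimate_gflops(n_params, total_tokens):
    """Rough dense-transformer forward-pass cost: ~2 FLOPs per parameter per token."""
    return 2.0 * n_params * total_tokens / 1e9

# llama 3.3 70b over the 83 total tokens in the run above:
print(round(estimate_gflops(70e9, 83)))  # 11620 GFLOPs
```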

@GandalfTea GandalfTea marked this pull request as ready for review October 27, 2025 08:24
@GandalfTea GandalfTea marked this pull request as draft November 23, 2025 19:38
Successfully merging this pull request may close these issues.

feat: REPL with tracing and isolated benchmarking