Conversation

@GandalfTea
Contributor

@GandalfTea GandalfTea commented Oct 20, 2025

fixes #10

This PR implements:

  • a REPL for easier node management, with a backing API server.
  • a dynamic tracing framework that wraps sections of the code in frames tracking execution time and state, batches them per worker node, and sends them back to the REPL for easy debugging and performance analytics.
  • two trace-frame aggregators: one annotates and sorts each registered frame across nodes by execution time and percentage of total time (like perf annotate), with control over scope and nested frames; the other tracks per-request runtime statistics/metrics (like TTFT, TPS, etc.) across nodes.

@GandalfTea GandalfTea changed the title from "Implement REPL" to "Trace and improve" Oct 21, 2025
@erhant
Member

erhant commented Oct 21, 2025

@GandalfTea heyo, #12 introduces some conflicts with this one, but we can merge your PR later together; it's just some code locations that were changed and some typings.

@GandalfTea
Contributor Author

Should be an easy fix, but I'll need to wrap more code in tracer frames to get detailed timings. I'll resolve with master later tonight; still have quite a few local changes to push.

@GandalfTea
Contributor Author

The llama3 model spends most of its time stalling on the comm queue; the same is true for gpt_oss. qwen3 doesn't have this problem. Looking into it.

dnet > trace annotate

  name                                   total      max       mean      p50       p90       p99       samples
 ......................................  ........  ........   ........  ........  ........  ........  .......
  grpc.ingress.get                       206147.08  47231.65   2021.05   1469.84   1892.08   2947.89    102.00
  compute.dequeue                        200638.13   1157.19    751.45   1008.63   1010.10   1064.74    267.00
  compute.thread.grpc.send               141505.76   5033.23   1401.05   1272.46   1646.91   2642.69    101.00
  grpc.send_activations.cast_to_dtype    141425.41   4996.80   1400.25   1272.10   1646.61   2641.82    101.00
  compute.dequeue.wait                   70287.18    1045.03    468.58    214.18   1005.07   1016.86    150.00
  grpc.ingress.get.wait                  67293.22   44723.10    659.74    209.04    295.69    502.80    102.00
  compute.thread.mdns.send               19892.35     313.46    195.02    186.38    239.33    296.10    102.00
  memory.memory.load.api_layers           7998.83    4243.05   3999.41   3755.78   4243.05   4243.05      2.00
  grpc.token_request                      1523.49     119.22     14.94     10.29     19.20    116.82    102.00
  compute.thread.execute                  1117.97      33.41      5.51      1.48     11.07     28.48    203.00
  startup.grpc                             417.65     208.98    208.83    208.68    208.98    208.98      2.00
  memory.model.load_metadata               317.23     164.00    158.61    153.23    164.00    164.00      2.00
  compute.forward                          307.76     198.32      1.52      0.37      1.27      3.92    203.00
  startup.discovery                        307.60     180.29    153.80    127.30    180.29    180.29      2.00
  grpc.send_activation.next                263.94      61.97      2.61      2.01      2.34      2.82    101.00
  grpc.ingress.prepare                     159.38       4.86      0.78      0.77      1.45      2.95    204.00
  grpc.ingress.prepare_activation.tokens    77.92       4.48      0.38      0.36      0.60      1.85    204.00
  compute.thread.activations.process        38.61       5.52      0.19      0.11      0.22      1.23    204.00
  grpc.ingress                              31.19       4.15      0.15      0.07      0.20      1.48    204.00
  memory.warmup                             30.52      27.46     15.26      3.06     27.46     27.46      2.00
  compute.thread.weights.prepare            25.43      14.04      0.13      0.02      0.03      1.30    203.00
  grpc.ingress.queue                        25.01       1.24      0.12      0.10      0.15      0.90    204.00
  grpc.ingress.connect_next_node            14.61       0.56      0.07      0.06      0.12      0.21    204.00
  memory.weight_cache.init                  14.27       8.10      7.13      6.17      8.10      8.10      2.00
  compute.thread.cleanup                    12.33       4.67      0.06      0.04      0.05      0.15    203.00
  memory.model.load                          6.05       4.21      3.03      1.84      4.21      4.21      2.00
  grpc.ingress.kvcache.init                  4.22       0.91      0.02      0.01      0.01      0.57    204.00
  grpc.ingress.prefetch                      4.17       0.67      0.02      0.01      0.02      0.57    204.00
  compute.thread.kvcache.init                2.86       0.35      0.01      0.01      0.02      0.10    204.00
  compute.thread.execute.enqueue_prefetch    2.55       0.05      0.01      0.01      0.02      0.03    203.00
  compute.thread.execute.evict_and_unload    2.04       0.02      0.01      0.01      0.02      0.02    203.00
  grpc.send_activation.final                 2.01       0.20      0.02      0.01      0.02      0.14    102.00
  compute.thread.activations.load            1.48       0.09      0.01      0.01      0.01      0.03    204.00
  network.connect.next_node                  1.12       0.86      0.56      0.26      0.86      0.86      2.00
  grpc.stream.flush                          0.80       0.30      0.01      0.01      0.01      0.01    101.00
  startup.compute                            0.51       0.34      0.25      0.17      0.34      0.34      2.00
  memory.prefetch                            0.36       0.27      0.18      0.09      0.27      0.27      2.00
  memory.memory_pools.init                   0.17       0.10      0.08      0.06      0.10      0.10      2.00
  startup.workers                            0.09       0.06      0.04      0.03      0.06      0.06      2.00
  memory.cache.reset                         0.03       0.03      0.03      0.03      0.03      0.03      1.00
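
The aggregation behind `trace annotate` can be sketched as below: group raw frame events by name, compute the per-name statistics shown in the columns above, and sort by total time. The `(name, ms)` event shape is assumed for illustration, not the actual dnet wire format.

```python
import statistics

def annotate(events):
    """Aggregate (name, duration_ms) events into per-name rows sorted by total time."""
    by_name = {}
    for name, ms in events:
        by_name.setdefault(name, []).append(ms)
    rows = []
    for name, samples in by_name.items():
        if len(samples) > 1:
            # 99 cut points; index 49 is p50, 89 is p90, 98 is p99
            qs = statistics.quantiles(samples, n=100, method="inclusive")
        else:
            qs = [samples[0]] * 99  # degenerate case: one sample
        rows.append({
            "name": name,
            "total": sum(samples),
            "max": max(samples),
            "mean": sum(samples) / len(samples),
            "p50": qs[49], "p90": qs[89], "p99": qs[98],
            "samples": len(samples),
        })
    return sorted(rows, key=lambda r: r["total"], reverse=True)

rows = annotate([("compute.forward", 1.0), ("compute.forward", 3.0),
                 ("grpc.ingress.get", 10.0)])
print(rows[0]["name"])  # grpc.ingress.get — largest total first
```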

@GandalfTea GandalfTea force-pushed the oto/perf branch 2 times, most recently from 7c3934b to 9456d61 on October 22, 2025 07:40
@GandalfTea GandalfTea changed the title from "Trace and improve" to "REPL, tracing and performance counters" Oct 25, 2025
@GandalfTea
Contributor Author

Added a perf command for runtime metrics annotation:

dnet > perf stats
Performance stats for request 'chatcmpl-15f7710c-2204-4524-8699-9d47b1bf63c3':

                  51            prompt_tokens
                1051            generated_tokens
                1102            total_tokens

              167.80 ms         ttft
               55.86 tok/s      tokens_per_second       # 0.018 s/tok
              0.0179 ms         mean_inter_token_latency (ITL)
                                   0.019 (p90),  0.021 (p99)
                                   0.016 (min), 0.186 (max)

             UNKNOWN GFLOPs     estimated_flops

    node0 [shard-d06eff0e-dnet1-48494.local]
            17967.79ms          compute_time   # [25.10%]
            53622.37ms          network_time   # [74.90%]
                0.00ms          memory_time    # [0.00%]

For now, frames are aggregated into fixed groups like compute and network, but later on we should allow exact symbols, e.g. perf add [name], to create custom groups. I'll fix some async counter-tracking bugs on multiple nodes, clean up the formatting to get a better diff, and then finish this PR. To add runtime memory stats I'll look into using sysctl and maybe get some OS-level metrics annotated to specific parts of the code.

@erhant, @andthattoo let me know if you have any other default metrics to add here. I have more info I don't surface yet like network statistics that I would like to add.
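
The compute/network/memory grouping described above could be sketched as a prefix bucketing over frame names, with each group reported as a share of the node's total. The mapping rules here (e.g. counting grpc frames as network time) are assumptions for illustration, not the actual dnet grouping.

```python
def group_stats(frames):
    """Bucket (name, ms) frames by top-level prefix and report (ms, percent) per group."""
    groups = {"compute": 0.0, "network": 0.0, "memory": 0.0}
    for name, ms in frames:
        prefix = name.split(".", 1)[0]
        if prefix == "grpc":  # assumed: grpc frames count toward network time
            prefix = "network"
        if prefix in groups:
            groups[prefix] += ms
    total = sum(groups.values()) or 1.0  # avoid division by zero
    return {g: (ms, 100.0 * ms / total) for g, ms in groups.items()}

stats = group_stats([("compute.forward", 25.0),
                     ("grpc.ingress.get", 70.0),
                     ("memory.warmup", 5.0)])
print(stats["network"])  # (70.0, 70.0)
```

A user-defined group (the perf add [name] idea) would then just be another key in the prefix-to-group mapping.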

@GandalfTea
Contributor Author

GandalfTea commented Oct 27, 2025

dnet > perf stats
Performance stats for request 'chatcmpl-288fe360-2578-42ef-95f9-447904d37955':

                  51            prompt_tokens
                1000            generated_tokens
                1051            total_tokens

               33.34 ms         ttft
               29.03 tok/s      tokens_per_second       # 0.032 s/tok
             32.2025 ms         mean_inter_token_latency (ITL)
                                   43.368 (p90),  119.929 (p99)
                                   26.883 (min), 129.225 (max)

             UNKNOWN GFLOPs     estimated_flops

    node0 [shard-5a4f95b7-dnet1-55663.local]
            17425.80 ms          compute_time   # [50.68%]
            16634.20 ms          network_time   # [48.38%]
              322.86 ms          memory_time    # [0.94%]

    node1 [shard-aef510b0-dnet0-13578.local]
            12326.78 ms          compute_time   # [20.78%]
            46792.39 ms          network_time   # [78.89%]
              192.85 ms          memory_time    # [0.33%]

The lower-level frames are now correctly aggregated to get the time distribution per node. There's also a staging buffer for events that arrive before the request.start event that registers the request in the aggregator. Some frames are still dropped, though the numbers should be insignificant for now; I might look at it again later.
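
The staging-buffer idea can be sketched as below: events for a request id that hasn't seen its request.start yet are parked and replayed once the request is registered. Class and method names are illustrative, not the actual dnet aggregator.

```python
class PerfAggregator:
    """Sketch: park events that race ahead of request.start, replay on registration."""
    def __init__(self):
        self.requests = {}  # request_id -> accepted events
        self.staging = {}   # request_id -> events that arrived early

    def on_event(self, request_id, event):
        if event.get("name") == "request.start":
            self.requests[request_id] = []
            # replay anything that arrived before the request was registered
            for early in self.staging.pop(request_id, []):
                self.requests[request_id].append(early)
        elif request_id in self.requests:
            self.requests[request_id].append(event)
        else:
            self.staging.setdefault(request_id, []).append(event)

agg = PerfAggregator()
agg.on_event("r1", {"name": "compute.forward", "ms": 1.2})  # arrives early
agg.on_event("r1", {"name": "request.start"})
agg.on_event("r1", {"name": "network.tx", "ms": 0.4})
print(len(agg.requests["r1"]))  # 2 — the early event was not dropped
```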

@GandalfTea
Contributor Author

Rough metrics for the default config, llama 3.3 70b 4-bit, on 2 Macs (M4, M4 Pro) and a MacBook (M3), 56 GB total RAM, TB4:

Performance stats for request 'chatcmpl-e0b89d06-52b0-4813-ae81-2690911f1413':

                  79            prompt_tokens
                   4            generated_tokens
                  83            total_tokens

           120813.57 ms         ttft
                0.01 tok/s      tokens_per_second       # 122.210 s/tok
         121511.5810 ms         mean_inter_token_latency (ITL)
                                   123186.812 (p90),  123563.739 (p99)
                                   120813.568 (min), 122209.594 (max)

             UNKNOWN GFLOPs     estimated_flops

    node0 [shard-43559743-dnet1-56725.local]
           303454.09ms          compute_time   # [40.00%]
           454759.81ms          network_time   # [59.95%]
              384.58ms          memory_time    # [0.05%]

    node1 [shard-700c9685-dnet0-14290.local]
             1095.01ms          compute_time   # [38.56%]
             1407.14ms          network_time   # [49.55%]
              337.65ms          memory_time    # [11.89%]

    node2 [shard-84ce4446-abriel-41351.local]
             1231.04ms          compute_time   # [29.16%]
             2618.43ms          network_time   # [62.03%]
              371.74ms          memory_time    # [8.81%]

name                                     total     max       mean      p50       p90       p99       samples
.......................................  ........  ........  ........  ........  ........  ........  .......
  network.rx.wait                        6948628.64  447819.37  25546.43  23790.94  27055.47  247484.08    272.00
  network.tx                             1635264.17  27899.77   6079.05     34.95  24281.77  26332.01    269.00
  network.tx.enque                       700490.09  10021.14   2604.05     35.96  10004.82  10009.83    269.00
  compute.thread.sampling                436511.04  32235.27  25677.12  26545.60  28549.98  32235.27     17.00
  startup.profile.device                 152082.69  64478.76  50694.23  55016.40  64478.76  64478.76      3.00
  network.send_activation.next            6393.57   3010.25     25.37      3.96     36.23    128.80    252.00
  memory.model.load                       1717.23    912.05    286.20      2.61    788.41    912.05      6.00
  memory.model.load_metadata               958.71    356.78    319.57    306.67    356.78    356.78      3.00
  compute.thread.execute                   453.85     45.66      1.67      0.50      3.96     14.48    271.00
  network.rx.prepare                       316.85     16.19      1.17      0.93      2.12      4.39    271.00
  network.rx.prepare_activation.default    199.71     16.10      0.79      0.56      1.38      3.64    252.00
  memory.weights.cache.search              147.01     16.88      0.10      0.01      0.12      1.52   1457.00
  compute.thread.weights.prepare           111.49     22.68      0.41      0.04      0.26     10.92    271.00
  network.token_request                     89.29     49.99      5.25      0.98      6.43     49.99     17.00
  network.rx                                58.66      4.32      0.22      0.14      0.39      1.49    271.00
  network.rx.enque                          49.88      1.79      0.18      0.12      0.42      0.89    271.00
  memory.weights.cache.load                 44.37      5.72      0.55      0.04      1.85      4.28     80.00
  compute.forward                           41.88      3.16      0.16      0.10      0.20      1.27    269.00
  compute.thread.activations.process        31.63      3.10      0.12      0.09      0.22      0.72    271.00
  memory.weights.cache.init                 30.73     21.39     10.24      9.03     21.39     21.39      3.00
  compute.thread.cleanup                    25.52      7.86      0.09      0.05      0.09      0.23    269.00
  memory.warmup                             19.50     14.75      6.50      3.68     14.75     14.75      3.00
  compute.thread.kvcache.init               13.27      3.16      0.05      0.02      0.03      1.06    271.00
  network.send_activation.final             11.94     11.49      0.70      0.03      0.04     11.49     17.00
  network.rx.prepare_activation.tokens       8.21      0.78      0.43      0.37      0.56      0.78     19.00
  network.rx.prefetch                        4.71      0.24      0.02      0.01      0.03      0.12    272.00
  compute.thread.execute.enqueue_prefetch    4.59      1.43      0.02      0.01      0.02      0.02    270.00
  compute.thread.activations.load            2.80      0.26      0.01      0.01      0.01      0.14    271.00
  network.send_activations.default           1.94      0.09      0.01      0.01      0.01      0.02    252.00
  compute.thread.execute.evict_and_unload    1.79      0.17      0.01      0.01      0.01      0.01    271.00
  network.tx.send                            1.20      0.09      0.00      0.00      0.01      0.01    270.00
  network.connect.next_node                  0.87      0.41      0.29      0.25      0.41      0.41      3.00
  memory.cache.reset                         0.66      0.18      0.17      0.17      0.18      0.18      4.00
  memory.prefetch                            0.40      0.19      0.13      0.13      0.19      0.19      3.00
  memory.pools.init                          0.28      0.11      0.09      0.10      0.11      0.11      3.00
  memory.cache.make_cache                    0.08      0.03      0.03      0.03      0.03      0.03      3.00
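
On the estimated_flops field still reporting UNKNOWN: a common back-of-the-envelope estimate for a dense transformer forward pass is ~2 FLOPs per parameter per token. The sketch below uses that rule of thumb; it ignores the quadratic attention term and any sparsity, so treat it as an order-of-magnitude figure, not what dnet will necessarily implement.

```python
def estimate_gflops(n_params, total_tokens):
    """Rough dense-transformer forward-pass cost: ~2 FLOPs per parameter per token."""
    return 2.0 * n_params * total_tokens / 1e9

# llama 3.3 70b over the 83 total tokens in the run above:
print(round(estimate_gflops(70e9, 83)))  # 11620 GFLOPs
```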

@GandalfTea GandalfTea marked this pull request as ready for review October 27, 2025 08:24
@GandalfTea GandalfTea marked this pull request as draft November 23, 2025 19:38
Successfully merging this pull request may close these issues.

feat: REPL with tracing and isolated benchmarking