Add OTLP metrics-only survey plugin for crowd-sourced model performance telemetry #372


Summary

Add a new built-in survey plugin that exports metrics only (no logs or traces) over OTLP/HTTP to a configurable telemetry endpoint.

The purpose of this plugin is to crowd-source real-world model performance data from live mesh-llm nodes without using the mesh itself for reporting. We want to learn which models are being launched, on what hardware, with what context sizes, and how well they actually perform in the wild.

This should use the standard Rust OpenTelemetry crates and exporter stack; we should not implement a custom OTLP wire protocol or a custom JSON sink.

Goals

  • Collect real-world model performance data from live nodes
  • Understand:
    • which models are being launched
    • which launches succeed or fail
    • what hardware the model ran on
    • what context length was used
    • how long launches take
    • how long models stay loaded
    • which models exit unexpectedly
  • Build a dataset for comparing model viability across hardware and context settings
  • Keep the implementation metrics-only
  • Make the OTLP endpoint configurable
  • Preserve privacy:
    • no raw prompts
    • no completions
    • no logs
    • no traces

Non-Goals

  • No reporting through mesh gossip, mesh channels, or peer-to-peer relay
  • No OTLP logs
  • No OTLP traces
  • No custom JSON/HTTP telemetry backend
  • No raw prompt capture
  • No prompt hashing or content-derived summaries
  • No mesh protocol changes for this feature

Why This Exists

We need crowd-sourced performance data from real usage, not just local benchmarks.

The main questions we want this to answer are:

  • Which models are people actually trying to run?
  • Which models fail to launch in the wild?
  • Which hardware/model combinations work reliably?
  • What context lengths are actually being used successfully?
  • Which models are unstable after launch?
  • Which models stay loaded and useful versus churn quickly?

This is primarily a product/data feature, not just an operator telemetry feature.

High-Level Design

Implement survey as a built-in mesh-llm plugin with outbound OTLP metric export.

Although it is a plugin from a configuration/product perspective, the telemetry hooks should come from host/runtime lifecycle code, not from polling local APIs and inferring state changes.

Why host-emitted metrics instead of polling

The runtime already has clear source-of-truth transitions for:

  • launch start
  • launch success
  • launch failure
  • runtime load
  • unload
  • unexpected process exit

Those are the events we want to aggregate into crowd-sourced performance metrics. Polling would lose fidelity and make failure attribution weaker.

Transport

Use OTLP/HTTP only.

Use official crates:

  • opentelemetry
  • opentelemetry_sdk
  • opentelemetry-otlp

Do not add:

  • tracing-opentelemetry
  • opentelemetry-appender-tracing

This feature is metrics-only.
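
For reference, a minimal exporter/provider setup might look like the sketch below. It is a sketch only: it assumes the opentelemetry* 0.27-series builder APIs (the surface has shifted across releases), a Tokio async runtime, and the http-proto/rt-tokio cargo features; the endpoint and interval values mirror the configuration section.

use std::time::Duration;

use opentelemetry::KeyValue;
use opentelemetry_otlp::{MetricExporter, Protocol, WithExportConfig};
use opentelemetry_sdk::metrics::{PeriodicReader, SdkMeterProvider};
use opentelemetry_sdk::{runtime, Resource};

fn init_metrics() -> Result<SdkMeterProvider, Box<dyn std::error::Error>> {
    // OTLP/HTTP metrics exporter; no log or trace pipelines are built.
    let exporter = MetricExporter::builder()
        .with_http()
        .with_protocol(Protocol::HttpBinary)
        .with_endpoint("https://otel.example.com/v1/metrics")
        .build()?;

    // Batched periodic export on the Tokio runtime.
    let reader = PeriodicReader::builder(exporter, runtime::Tokio)
        .with_interval(Duration::from_secs(15))
        .build();

    Ok(SdkMeterProvider::builder()
        .with_reader(reader)
        .with_resource(Resource::new(vec![KeyValue::new("service.name", "mesh-llm")]))
        .build())
}

The survey instruments can then be created from provider.meter("survey").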

Configuration

Add mesh-llm config for telemetry while still respecting standard OTel env vars.

Mesh config

[telemetry]
enabled = true
service_name = "mesh-llm"
endpoint = "https://otel.example.com"
headers = { "authorization" = "Bearer TOKEN" }
export_interval_secs = 15
queue_size = 2048
prompt_shape_metrics = false

Optional metrics override

[telemetry.metrics]
endpoint = "https://otel.example.com/v1/metrics"

Standard env vars to support

  • OTEL_EXPORTER_OTLP_ENDPOINT
  • OTEL_EXPORTER_OTLP_METRICS_ENDPOINT

Precedence

  1. explicit mesh-llm config
  2. standard OTEL_* env vars
  3. disabled if no endpoint is configured
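
A sketch of that precedence logic is below. The TelemetryConfig type and its field names are hypothetical, not existing mesh-llm code. Note that per the OTLP spec, the generic OTEL_EXPORTER_OTLP_ENDPOINT is a base URL to which the exporter appends /v1/metrics, while the metrics-specific variable is used verbatim.

// Hypothetical config shape, for illustration only.
struct TelemetryConfig {
    endpoint: Option<String>,
    metrics_endpoint: Option<String>,
}

// Resolve the effective OTLP metrics endpoint; None means telemetry
// stays disabled.
fn resolve_metrics_endpoint(cfg: &TelemetryConfig) -> Option<String> {
    cfg.metrics_endpoint
        .clone()
        .or_else(|| cfg.endpoint.as_ref().map(|e| format!("{e}/v1/metrics")))
        .or_else(|| std::env::var("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT").ok())
        .or_else(|| {
            std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT")
                .ok()
                .map(|e| format!("{e}/v1/metrics"))
        })
}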

Metrics Schema

Use a mesh-llm-specific schema with low-cardinality attributes that are useful for aggregate analysis.

Counters

  • mesh_llm_model_launch_total
  • mesh_llm_model_launch_success_total
  • mesh_llm_model_launch_failure_total
  • mesh_llm_model_unload_total
  • mesh_llm_model_exit_unexpected_total

Gauges

  • mesh_llm_loaded_models
  • mesh_llm_model_loaded
  • mesh_llm_model_context_length

Histograms

  • mesh_llm_model_launch_duration_ms
  • mesh_llm_model_uptime_s
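
Registered through the standard Meter API, the instruments might look like the sketch below. It assumes the 0.27-series API where instrument builders end in .build(); SurveyMetrics is a hypothetical name, and the per-model gauges are omitted for brevity.

use opentelemetry::metrics::{Counter, Gauge, Histogram, Meter};

// Hypothetical holder for the survey instruments.
struct SurveyMetrics {
    launch_total: Counter<u64>,
    launch_success_total: Counter<u64>,
    launch_failure_total: Counter<u64>,
    unload_total: Counter<u64>,
    exit_unexpected_total: Counter<u64>,
    loaded_models: Gauge<u64>,
    launch_duration_ms: Histogram<f64>,
    uptime_s: Histogram<f64>,
}

impl SurveyMetrics {
    fn new(meter: &Meter) -> Self {
        Self {
            launch_total: meter.u64_counter("mesh_llm_model_launch_total").build(),
            launch_success_total: meter.u64_counter("mesh_llm_model_launch_success_total").build(),
            launch_failure_total: meter.u64_counter("mesh_llm_model_launch_failure_total").build(),
            unload_total: meter.u64_counter("mesh_llm_model_unload_total").build(),
            exit_unexpected_total: meter.u64_counter("mesh_llm_model_exit_unexpected_total").build(),
            loaded_models: meter.u64_gauge("mesh_llm_loaded_models").build(),
            launch_duration_ms: meter.f64_histogram("mesh_llm_model_launch_duration_ms").build(),
            uptime_s: meter.f64_histogram("mesh_llm_model_uptime_s").build(),
        }
    }
}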

Attributes

Use only bounded, aggregatable attributes.

Model attributes

  • mesh_llm.model
  • mesh_llm.architecture
  • mesh_llm.quantization
  • mesh_llm.launch_kind

Hardware attributes

  • mesh_llm.gpu_name
  • mesh_llm.gpu_stable_id
  • mesh_llm.backend_device
  • mesh_llm.gpu_count
  • mesh_llm.is_soc

Runtime attributes

  • mesh_llm.backend
  • mesh_llm.context_bucket
  • mesh_llm.service_version

Outcome attributes

  • mesh_llm.failure_reason

Suggested enums

mesh_llm.launch_kind:

  • startup
  • runtime_load
  • multi_model
  • moe_fallback
  • moe_shard

mesh_llm.failure_reason:

  • spawn_failed
  • health_timeout
  • exited_before_healthy
  • backend_proxy_failed
  • capacity_rejected
  • known_kv_cache_crash
  • mmproj_missing
  • other

mesh_llm.context_bucket:

  • <=8k
  • 8k_16k
  • 16k_32k
  • 32k_64k
  • 64k_128k
  • >128k
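
A bucketing helper might look like this; treating each upper bound as inclusive is an assumption this proposal would need to pin down.

// Map an exact context length (tokens) to the buckets above.
fn context_bucket(ctx_len: u64) -> &'static str {
    match ctx_len {
        0..=8_192 => "<=8k",
        8_193..=16_384 => "8k_16k",
        16_385..=32_768 => "16k_32k",
        32_769..=65_536 => "32k_64k",
        65_537..=131_072 => "64k_128k",
        _ => ">128k",
    }
}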

Privacy Rules

The exporter must use a strict allowlist.

Allowed

  • model identifier
  • model architecture / quantization
  • hardware facts
  • launch outcome classification
  • context length bucket, or exact context length if we decide it is safe enough
  • durations
  • counts and gauges

Not allowed

  • raw prompts
  • completions
  • logs
  • file paths
  • URLs from user payloads
  • hostnames from prompt content
  • request IDs with unbounded cardinality
  • raw error strings as metric attributes

If prompt-shape metrics are ever added later, they must remain disabled by default and still avoid content-bearing fields.
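
One way to enforce the allowlist is to filter attributes just before they reach an instrument, as in this sketch (the key list mirrors the schema section above):

use opentelemetry::KeyValue;

// Only attribute keys named in the metrics schema may be exported;
// anything else is dropped rather than passed through.
const ALLOWED_KEYS: &[&str] = &[
    "mesh_llm.model",
    "mesh_llm.architecture",
    "mesh_llm.quantization",
    "mesh_llm.launch_kind",
    "mesh_llm.gpu_name",
    "mesh_llm.gpu_stable_id",
    "mesh_llm.backend_device",
    "mesh_llm.gpu_count",
    "mesh_llm.is_soc",
    "mesh_llm.backend",
    "mesh_llm.context_bucket",
    "mesh_llm.service_version",
    "mesh_llm.failure_reason",
];

fn filter_attributes(attrs: Vec<KeyValue>) -> Vec<KeyValue> {
    attrs
        .into_iter()
        .filter(|kv| ALLOWED_KEYS.contains(&kv.key.as_str()))
        .collect()
}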

Integration Points

Hook metric emission at the runtime/launch source of truth.

Launch path

Emit:

  • launch attempt
  • launch success
  • launch failure
  • launch duration
  • context length
  • hardware used

Runtime model control

Emit:

  • runtime load success/failure
  • unload

Unexpected exit path

Emit:

  • unexpected exit
  • uptime if known

Current code areas to wire

Likely integration points:

  • mesh-llm/src/inference/launch.rs
  • mesh-llm/src/runtime/mod.rs
  • mesh-llm/src/runtime/local.rs
  • built-in plugin registration in mesh-llm/src/plugin/config.rs and mesh-llm/src/plugin/mod.rs
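
For illustration, a failure-path emission at one of these hooks might look like the following sketch (SurveyMetrics and context_bucket are the hypothetical helpers sketched in earlier sections; the attribute keys come from the schema above):

use opentelemetry::KeyValue;

// Called at the launch-failure transition in the launch path.
fn record_launch_failure(
    metrics: &SurveyMetrics,
    model: &str,
    reason: &'static str,
    ctx_len: u64,
) {
    metrics.launch_failure_total.add(
        1,
        &[
            KeyValue::new("mesh_llm.model", model.to_owned()),
            KeyValue::new("mesh_llm.failure_reason", reason),
            KeyValue::new("mesh_llm.context_bucket", context_bucket(ctx_len)),
        ],
    );
}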

Runtime Behavior

  • bounded async export queue
  • batch OTLP export
  • retry with backoff
  • drop oldest metrics when queue is full
  • never block model launch or unload on telemetry export
  • failure to export metrics must not affect inference availability
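
The SDK's periodic reader already batches and exports off the hot path; the remaining rule is that lifecycle code must never await telemetry. A minimal sketch of that decoupling, assuming a bounded Tokio channel (note that try_send drops the newest event on overflow; a strict drop-oldest policy would need a ring buffer instead):

use tokio::sync::mpsc;

// Hypothetical event type carried from lifecycle code to the survey plugin.
enum SurveyEvent {
    LaunchSuccess { model: String, duration_ms: f64 },
    LaunchFailure { model: String, reason: &'static str },
}

// Non-blocking emit: a slow or failing exporter can only cause dropped
// events, never backpressure on launch or unload.
fn emit(tx: &mpsc::Sender<SurveyEvent>, event: SurveyEvent) {
    let _ = tx.try_send(event);
}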

Plugin/Product Behavior

Expose survey as a built-in plugin that can be enabled/disabled in config.

Example:

[[plugin]]
name = "survey"
enabled = true

Telemetry-specific settings should live under [telemetry], not in a separate plugin-specific config tree.

Key Questions This Data Should Answer

  • Which models are most frequently launched?
  • Which models have the highest launch failure rate?
  • Which hardware/model combinations are most reliable?
  • Which context sizes are viable for each model/hardware combination?
  • Which models have poor stability after successful launch?
  • Which model/hardware combinations produce long launch times or short uptimes?

Acceptance Criteria

  • survey can be enabled via config
  • metrics export over OTLP/HTTP to a configurable endpoint
  • no mesh transport is used for telemetry
  • launch success increments success metrics
  • launch failure increments failure metrics with classified reason
  • unload increments unload metrics
  • unexpected process exit increments unexpected-exit metrics
  • loaded-model state is visible through gauges
  • hardware and context metadata are attached
  • exporter failures do not affect serving, startup, or unload flows
  • no logs, traces, raw prompts, or content-bearing fields are exported

Validation

Minimum validation:

  • local smoke test against a test OTLP endpoint
  • verify launch success/failure metrics appear
  • verify unload metrics appear
  • verify unexpected exit metrics appear
  • verify gauges update when models load/unload
  • verify exporter misconfiguration does not break runtime behavior

Open Questions

  • Should mesh_llm.model use the display name, canonical ref, or normalized local model name?
  • Should exact context length be exported, or only a bucket?
  • Is gpu_stable_id acceptable as-is, or should it be hashed before export?
  • Do we want a lightweight “still loaded” gauge only, or also a periodic heartbeat metric?
  • Should telemetry auto-enable when an OTLP endpoint is configured, or require explicit opt-in?
