Add OTLP metrics-only survey plugin for crowd-sourced model performance telemetry #372


Summary

Add a new built-in survey plugin that exports metrics only (no logs or traces) over OTLP/HTTP to a configurable telemetry endpoint.

The purpose of this plugin is to crowd-source real-world model performance data from live mesh-llm nodes without using the mesh itself for reporting. We want to learn which models are being launched, on what hardware, with what context sizes, and how well they actually perform in the wild.

This should use the standard Rust OpenTelemetry crates and exporter stack; we should not implement a custom OTLP wire protocol or a custom JSON sink.

Goals

  • Collect real-world model performance data from live nodes
  • Understand:
    • which models are being launched
    • which launches succeed or fail
    • what hardware the model ran on
    • what context length was used
    • how long launches take
    • how long models stay loaded
    • which models exit unexpectedly
  • Build a dataset for comparing model viability across hardware and context settings
  • Keep the implementation metrics-only
  • Make the OTLP endpoint configurable
  • Preserve privacy:
    • no raw prompts
    • no completions
    • no logs
    • no traces

Non-Goals

  • No reporting through mesh gossip, mesh channels, or peer-to-peer relay
  • No OTLP logs
  • No OTLP traces
  • No custom JSON/HTTP telemetry backend
  • No raw prompt capture
  • No prompt hashing or content-derived summaries
  • No mesh protocol changes for this feature

Why This Exists

We need crowd-sourced performance data from real usage, not just local benchmarks.

The main questions we want this to answer are:

  • Which models are people actually trying to run?
  • Which models fail to launch in the wild?
  • Which hardware/model combinations work reliably?
  • What context lengths are actually being used successfully?
  • Which models are unstable after launch?
  • Which models stay loaded and useful versus churn quickly?

This is primarily a product/data feature, not just an operator telemetry feature.

High-Level Design

Implement survey as a built-in mesh-llm plugin with outbound OTLP metric export.

Although it is a plugin from a configuration/product perspective, the telemetry hooks should come from host/runtime lifecycle code, not from polling local APIs and inferring state changes.

Why host-emitted metrics instead of polling

The runtime already has clear source-of-truth transitions for:

  • launch start
  • launch success
  • launch failure
  • runtime load
  • unload
  • unexpected process exit

Those are the events we want to aggregate into crowd-sourced performance metrics. Polling would lose fidelity and make failure attribution weaker.

Transport

Use OTLP/HTTP only.

Use official crates:

  • opentelemetry
  • opentelemetry_sdk
  • opentelemetry-otlp

Do not add:

  • tracing-opentelemetry
  • opentelemetry-appender-tracing

This feature is metrics-only.
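
For reference, a minimal exporter/provider setup might look like the sketch below. It is a sketch only: it assumes the opentelemetry* 0.27-series builder APIs (the surface has shifted across releases), a Tokio async runtime, and the http-proto/rt-tokio cargo features; the endpoint and interval values mirror the configuration section.

use std::time::Duration;

use opentelemetry::KeyValue;
use opentelemetry_otlp::{MetricExporter, Protocol, WithExportConfig};
use opentelemetry_sdk::metrics::{PeriodicReader, SdkMeterProvider};
use opentelemetry_sdk::{runtime, Resource};

fn init_metrics() -> Result<SdkMeterProvider, Box<dyn std::error::Error>> {
    // OTLP/HTTP metrics exporter; no log or trace pipelines are built.
    let exporter = MetricExporter::builder()
        .with_http()
        .with_protocol(Protocol::HttpBinary)
        .with_endpoint("https://otel.example.com/v1/metrics")
        .build()?;

    // Batched periodic export on the Tokio runtime.
    let reader = PeriodicReader::builder(exporter, runtime::Tokio)
        .with_interval(Duration::from_secs(15))
        .build();

    Ok(SdkMeterProvider::builder()
        .with_reader(reader)
        .with_resource(Resource::new(vec![KeyValue::new("service.name", "mesh-llm")]))
        .build())
}

The survey instruments can then be created from provider.meter("survey").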

Configuration

Add mesh-llm config for telemetry while still respecting standard OTel env vars.

Mesh config

[telemetry]
enabled = true
service_name = "mesh-llm"
endpoint = "https://otel.example.com"
headers = { "authorization" = "Bearer TOKEN" }
export_interval_secs = 15
queue_size = 2048
prompt_shape_metrics = false

Optional metrics override

[telemetry.metrics]
endpoint = "https://otel.example.com/v1/metrics"

Standard env vars to support

  • OTEL_EXPORTER_OTLP_ENDPOINT
  • OTEL_EXPORTER_OTLP_METRICS_ENDPOINT

Precedence

  1. explicit mesh-llm config
  2. standard OTEL_* env vars
  3. disabled if no endpoint is configured
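
A sketch of that precedence logic is below. The TelemetryConfig type and its field names are hypothetical, not existing mesh-llm code. Note that per the OTLP spec, the generic OTEL_EXPORTER_OTLP_ENDPOINT is a base URL to which the exporter appends /v1/metrics, while the metrics-specific variable is used verbatim.

// Hypothetical config shape, for illustration only.
struct TelemetryConfig {
    endpoint: Option<String>,
    metrics_endpoint: Option<String>,
}

// Resolve the effective OTLP metrics endpoint; None means telemetry
// stays disabled.
fn resolve_metrics_endpoint(cfg: &TelemetryConfig) -> Option<String> {
    cfg.metrics_endpoint
        .clone()
        .or_else(|| cfg.endpoint.as_ref().map(|e| format!("{e}/v1/metrics")))
        .or_else(|| std::env::var("OTEL_EXPORTER_OTLP_METRICS_ENDPOINT").ok())
        .or_else(|| {
            std::env::var("OTEL_EXPORTER_OTLP_ENDPOINT")
                .ok()
                .map(|e| format!("{e}/v1/metrics"))
        })
}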

Metrics Schema

Use a mesh-llm-specific schema with low-cardinality attributes that are useful for aggregate analysis.

Counters

  • mesh_llm_model_launch_total
  • mesh_llm_model_launch_success_total
  • mesh_llm_model_launch_failure_total
  • mesh_llm_model_unload_total
  • mesh_llm_model_exit_unexpected_total

Gauges

  • mesh_llm_loaded_models
  • mesh_llm_model_loaded
  • mesh_llm_model_context_length

Histograms

  • mesh_llm_model_launch_duration_ms
  • mesh_llm_model_uptime_s
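
Registered through the standard Meter API, the instruments might look like the sketch below. It assumes the 0.27-series API where instrument builders end in .build(); SurveyMetrics is a hypothetical name, and the per-model gauges are omitted for brevity.

use opentelemetry::metrics::{Counter, Gauge, Histogram, Meter};

// Hypothetical holder for the survey instruments.
struct SurveyMetrics {
    launch_total: Counter<u64>,
    launch_success_total: Counter<u64>,
    launch_failure_total: Counter<u64>,
    unload_total: Counter<u64>,
    exit_unexpected_total: Counter<u64>,
    loaded_models: Gauge<u64>,
    launch_duration_ms: Histogram<f64>,
    uptime_s: Histogram<f64>,
}

impl SurveyMetrics {
    fn new(meter: &Meter) -> Self {
        Self {
            launch_total: meter.u64_counter("mesh_llm_model_launch_total").build(),
            launch_success_total: meter.u64_counter("mesh_llm_model_launch_success_total").build(),
            launch_failure_total: meter.u64_counter("mesh_llm_model_launch_failure_total").build(),
            unload_total: meter.u64_counter("mesh_llm_model_unload_total").build(),
            exit_unexpected_total: meter.u64_counter("mesh_llm_model_exit_unexpected_total").build(),
            loaded_models: meter.u64_gauge("mesh_llm_loaded_models").build(),
            launch_duration_ms: meter.f64_histogram("mesh_llm_model_launch_duration_ms").build(),
            uptime_s: meter.f64_histogram("mesh_llm_model_uptime_s").build(),
        }
    }
}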

Attributes

Use only bounded, aggregatable attributes.

Model attributes

  • mesh_llm.model
  • mesh_llm.architecture
  • mesh_llm.quantization
  • mesh_llm.launch_kind

Hardware attributes

  • mesh_llm.gpu_name
  • mesh_llm.gpu_stable_id
  • mesh_llm.backend_device
  • mesh_llm.gpu_count
  • mesh_llm.is_soc

Runtime attributes

  • mesh_llm.backend
  • mesh_llm.context_bucket
  • mesh_llm.service_version

Outcome attributes

  • mesh_llm.failure_reason

Suggested enums

mesh_llm.launch_kind:

  • startup
  • runtime_load
  • multi_model
  • moe_fallback
  • moe_shard

mesh_llm.failure_reason:

  • spawn_failed
  • health_timeout
  • exited_before_healthy
  • backend_proxy_failed
  • capacity_rejected
  • known_kv_cache_crash
  • mmproj_missing
  • other

mesh_llm.context_bucket:

  • <=8k
  • 8k_16k
  • 16k_32k
  • 32k_64k
  • 64k_128k
  • >128k
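
A bucketing helper might look like this; treating each upper bound as inclusive is an assumption this proposal would need to pin down.

// Map an exact context length (tokens) to the buckets above.
fn context_bucket(ctx_len: u64) -> &'static str {
    match ctx_len {
        0..=8_192 => "<=8k",
        8_193..=16_384 => "8k_16k",
        16_385..=32_768 => "16k_32k",
        32_769..=65_536 => "32k_64k",
        65_537..=131_072 => "64k_128k",
        _ => ">128k",
    }
}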

Privacy Rules

The exporter must use a strict allowlist.

Allowed

  • model identifier
  • model architecture / quantization
  • hardware facts
  • launch outcome classification
  • context length bucket, or exact context length if we decide it is safe enough
  • durations
  • counts and gauges

Not allowed

  • raw prompts
  • completions
  • logs
  • file paths
  • URLs from user payloads
  • hostnames from prompt content
  • request IDs with unbounded cardinality
  • raw error strings as metric attributes

If prompt-shape metrics are ever added later, they must remain disabled by default and still avoid content-bearing fields.
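
One way to enforce the allowlist is to filter attributes just before they reach an instrument, as in this sketch (the key list mirrors the schema section above):

use opentelemetry::KeyValue;

// Only attribute keys named in the metrics schema may be exported;
// anything else is dropped rather than passed through.
const ALLOWED_KEYS: &[&str] = &[
    "mesh_llm.model",
    "mesh_llm.architecture",
    "mesh_llm.quantization",
    "mesh_llm.launch_kind",
    "mesh_llm.gpu_name",
    "mesh_llm.gpu_stable_id",
    "mesh_llm.backend_device",
    "mesh_llm.gpu_count",
    "mesh_llm.is_soc",
    "mesh_llm.backend",
    "mesh_llm.context_bucket",
    "mesh_llm.service_version",
    "mesh_llm.failure_reason",
];

fn filter_attributes(attrs: Vec<KeyValue>) -> Vec<KeyValue> {
    attrs
        .into_iter()
        .filter(|kv| ALLOWED_KEYS.contains(&kv.key.as_str()))
        .collect()
}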

Integration Points

Hook metric emission at the runtime/launch source of truth.

Launch path

Emit:

  • launch attempt
  • launch success
  • launch failure
  • launch duration
  • context length
  • hardware used

Runtime model control

Emit:

  • runtime load success/failure
  • unload

Unexpected exit path

Emit:

  • unexpected exit
  • uptime if known

Current code areas to wire

Likely integration points:

  • mesh-llm/src/inference/launch.rs
  • mesh-llm/src/runtime/mod.rs
  • mesh-llm/src/runtime/local.rs
  • built-in plugin registration in mesh-llm/src/plugin/config.rs and mesh-llm/src/plugin/mod.rs
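
For illustration, a failure-path emission at one of these hooks might look like the following sketch (SurveyMetrics and context_bucket are the hypothetical helpers sketched in earlier sections; the attribute keys come from the schema above):

use opentelemetry::KeyValue;

// Called at the launch-failure transition in the launch path.
fn record_launch_failure(
    metrics: &SurveyMetrics,
    model: &str,
    reason: &'static str,
    ctx_len: u64,
) {
    metrics.launch_failure_total.add(
        1,
        &[
            KeyValue::new("mesh_llm.model", model.to_owned()),
            KeyValue::new("mesh_llm.failure_reason", reason),
            KeyValue::new("mesh_llm.context_bucket", context_bucket(ctx_len)),
        ],
    );
}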

Runtime Behavior

  • bounded async export queue
  • batch OTLP export
  • retry with backoff
  • drop oldest metrics when queue is full
  • never block model launch or unload on telemetry export
  • failure to export metrics must not affect inference availability
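
The SDK's periodic reader already batches and exports off the hot path; the remaining rule is that lifecycle code must never await telemetry. A minimal sketch of that decoupling, assuming a bounded Tokio channel (note that try_send drops the newest event on overflow; a strict drop-oldest policy would need a ring buffer instead):

use tokio::sync::mpsc;

// Hypothetical event type carried from lifecycle code to the survey plugin.
enum SurveyEvent {
    LaunchSuccess { model: String, duration_ms: f64 },
    LaunchFailure { model: String, reason: &'static str },
}

// Non-blocking emit: a slow or failing exporter can only cause dropped
// events, never backpressure on launch or unload.
fn emit(tx: &mpsc::Sender<SurveyEvent>, event: SurveyEvent) {
    let _ = tx.try_send(event);
}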

Plugin/Product Behavior

Expose survey as a built-in plugin that can be enabled/disabled in config.

Example:

[[plugin]]
name = "survey"
enabled = true

Telemetry-specific settings should live under [telemetry], not in a separate plugin-specific config tree.

Key Questions This Data Should Answer

  • Which models are most frequently launched?
  • Which models have the highest launch failure rate?
  • Which hardware/model combinations are most reliable?
  • Which context sizes are viable for each model/hardware combination?
  • Which models have poor stability after successful launch?
  • Which model/hardware combinations produce long launch times or short uptimes?

Acceptance Criteria

  • survey can be enabled via config
  • metrics export over OTLP/HTTP to a configurable endpoint
  • no mesh transport is used for telemetry
  • launch success increments success metrics
  • launch failure increments failure metrics with classified reason
  • unload increments unload metrics
  • unexpected process exit increments unexpected-exit metrics
  • loaded-model state is visible through gauges
  • hardware and context metadata are attached
  • exporter failures do not affect serving, startup, or unload flows
  • no logs, traces, raw prompts, or content-bearing fields are exported

Validation

Minimum validation:

  • local smoke test against a test OTLP endpoint
  • verify launch success/failure metrics appear
  • verify unload metrics appear
  • verify unexpected exit metrics appear
  • verify gauges update when models load/unload
  • verify exporter misconfiguration does not break runtime behavior

Open Questions

  • Should mesh_llm.model use the display name, canonical ref, or normalized local model name?
  • Should exact context length be exported, or only a bucket?
  • Is gpu_stable_id acceptable as-is, or should it be hashed before export?
  • Do we want a lightweight “still loaded” gauge only, or also a periodic heartbeat metric?
  • Should telemetry auto-enable when an OTLP endpoint is configured, or require explicit opt-in?
