feat: add observability, memory, resilience, and guidance modules#329

Open
reh3376 wants to merge 2 commits into karpathy:master from reh3376:feat/mdemg-observability-and-memory

Conversation


@reh3376 reh3376 commented Mar 18, 2026

Summary

Adds four optional, non-breaking Python modules that improve the effectiveness of autonomous experiment runs by bringing production-grade observability, learning, and resilience patterns to autoresearch. Inspired by mdemg — a persistent memory system for AI coding agents built on Neo4j, Hebbian learning, and Prometheus metrics.

Key principle: These modules enhance the infrastructure around the experiment loop — they never touch train.py or prepare.py. The core 5-minute experiment loop works exactly as before. All modules use only Python stdlib (json, time, math, dataclasses) — zero new dependencies.

New Modules

monitor.py — Experiment Metrics & Observability

Inspired by mdemg's internal/metrics/ Prometheus pipeline and 10-panel Grafana dashboard

  • ExperimentTracker class with full experiment lifecycle tracking (start_experiment → record_step → end_experiment)
  • Per-step loss curve capture (sampled at configurable intervals to bound memory)
  • Session-level aggregates: keep rate, improvement velocity (BPB/hour), training hours
  • Real-time alerting with configurable thresholds (env vars):
    • Loss spike detection (smoothed EMA vs raw loss ratio)
    • VRAM pressure warnings
    • Consecutive crash streak alerts
    • Improvement plateau detection
  • Prometheus text exposition format export (get_prometheus_text()) — compatible with node_exporter textfile collector or direct scraping
  • JSON export for external dashboard consumption (Grafana, custom UIs)
  • Terminal dashboard (format_dashboard()) for quick-glance session status
  • Crash-resilient: persists session state to disk, recovers on restart
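
The loss-spike alert (smoothed EMA vs raw loss ratio) can be sketched in a few lines. Function names, the smoothing factor, and the default ratio threshold below are illustrative, not the module's actual API:

```python
def ema_update(prev_ema, loss, alpha=0.1):
    """Exponential moving average of the loss; seeds with the first raw value."""
    return loss if prev_ema is None else alpha * loss + (1 - alpha) * prev_ema

def is_loss_spike(raw_loss, ema_loss, ratio_threshold=1.5):
    """Alert when the raw loss jumps well above the smoothed trend."""
    return ema_loss is not None and raw_loss > ratio_threshold * ema_loss

# Walking a short loss curve: spiked stays False until the final 9.0 step,
# where raw loss exceeds 1.5x the smoothed EMA (about 3.92 at that point).
ema = None
for step_loss in [4.0, 3.8, 3.7, 3.6, 9.0]:
    spiked = is_loss_spike(step_loss, ema)
    ema = ema_update(ema, step_loss)
```

Comparing against the EMA rather than the previous raw loss is what keeps a single noisy step from triggering the alert.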

memory.py — Cross-Session Experiment Memory with Hebbian Learning

Inspired by mdemg's Conversation Memory System (internal/conversation/) and Hebbian learning engine (internal/learning/)

  • ExperimentMemory class — persistent knowledge base across research sessions
  • Hebbian association tracking: strengthens connections between change categories and positive/negative outcomes using mdemg's tanh soft-capping formula (w = wmax * tanh((w + eta * signal) / wmax)) — smooth saturation instead of hard clamping, allowing continuous learning near weight limits
  • 15 standardized change categories: architecture, attention, activation, optimizer, learning_rate, schedule, batch_size, initialization, regularization, normalization, embedding, numerical, simplification, combination, radical
  • Auto-tagging: keyword-based classifier automatically assigns categories from experiment descriptions
  • Temporal decay: exponential weight decay with cautious skipping of recently-reinforced associations (mirrors mdemg's cautious decay window)
  • Surprise-weighted storage: unexpected results (contradicting Hebbian expectations) receive higher surprise scores and stronger learning signals — inspired by mdemg's CMS which retains novel observations longer than routine ones
  • Pattern extraction APIs:
    • get_promising_directions() — ranked categories blending Hebbian weight with exploration bonus
    • get_dead_ends() — categories that consistently fail (agent should avoid)
    • get_plateaus() — velocity-based plateau detection comparing recent vs earlier improvement rates
    • get_surprise_highlights() — most unexpected results for investigation
  • Persists to .autoresearch/memory/memory.json
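
The tanh soft-capping rule quoted above is concrete enough to sketch directly; the eta and wmax defaults here are illustrative:

```python
import math

def hebbian_update(w, signal, eta=0.1, wmax=1.0):
    """mdemg-style soft cap: w = wmax * tanh((w + eta * signal) / wmax).

    Positive signals strengthen the association, negative signals weaken it,
    and |w| approaches wmax asymptotically instead of hitting a hard wall.
    """
    return wmax * math.tanh((w + eta * signal) / wmax)
```

Because tanh saturates smoothly, repeated reinforcement still moves the weight slightly even near the cap, which is the "continuous learning near weight limits" behavior the summary describes.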

resilience.py — Circuit Breakers & Anomaly Detection

Inspired by mdemg's internal/circuitbreaker/, internal/anomaly/, and internal/backpressure/

  • CircuitBreaker — prevents wasting GPU time on repeated failures:
    • State machine: CLOSED → OPEN (after N crashes) → HALF_OPEN (probe) → CLOSED/OPEN
    • Exponential backoff on probe failures (configurable multiplier and max cooldown)
    • Mirrors mdemg's per-endpoint circuit breaker with half-open recovery
  • AnomalyDetector — multi-pattern detection across experiment history:
    • Plateau: no improvements in configurable window
    • VRAM creep: monotonically increasing memory usage across experiments
    • Systematic regression: BPB worsening over consecutive experiments
    • Crash clustering: high crash rate in recent experiments
  • BackpressureMonitor — VRAM pressure tracking:
    • Warning/critical thresholds based on GPU VRAM capacity
    • Trend analysis (increasing/stable/decreasing)
    • Actionable suggestions (reduce batch size, reduce model depth)
  • ExperimentGuard — unified pre/post experiment safety wrapper:
    • pre_experiment() → checks circuit breaker, VRAM pressure, anomalies → returns PreExperimentVerdict with allowed, blocked, warnings, suggestions
    • post_experiment() → updates all safety systems
    • Single integration point for all resilience features
  • Persists state to .autoresearch/resilience/state.json
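
A self-contained sketch of the CLOSED → OPEN → HALF_OPEN state machine with exponential backoff; the real CircuitBreaker's constructor arguments and method names may differ:

```python
import time

class CircuitBreakerSketch:
    """Illustrative state machine, not the module's actual class."""

    def __init__(self, failure_threshold=3, cooldown=60.0,
                 backoff=2.0, max_cooldown=3600.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.base_cooldown = cooldown
        self.cooldown = cooldown
        self.backoff = backoff
        self.max_cooldown = max_cooldown
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """Gate an experiment; OPEN blocks until the cooldown elapses."""
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.cooldown:
            self.state = "HALF_OPEN"  # let one probe experiment through
        return self.state != "OPEN"

    def record_success(self):
        self.state = "CLOSED"
        self.failures = 0
        self.cooldown = self.base_cooldown  # reset backoff on recovery

    def record_failure(self):
        if self.state == "HALF_OPEN":
            # Probe failed: reopen with a progressively longer cooldown.
            self.cooldown = min(self.cooldown * self.backoff, self.max_cooldown)
            self._open()
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self._open()

    def _open(self):
        self.state = "OPEN"
        self.failures = 0
        self.opened_at = self.clock()
```

Injecting the clock makes the backoff behavior testable without real waits, which is also what makes the state easy to persist and restore across restarts.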

guidance.py — Proactive Experiment Suggestions

Inspired by mdemg's Jiminy inner voice (internal/jiminy/) and RSIC (internal/ape/)

  • ExperimentAdvisor — synthesizes signals from memory, monitoring, and resilience:
    • Suggestion generation (4 sources, mirroring Jiminy's parallel fan-out):
      1. Hebbian memory associations → promising categories
      2. Plateau detection → radical change recommendations
      3. Surprise analysis → revisit unexpectedly bad results (try the opposite)
      4. Dead end avoidance → categories to skip
    • Contradiction detection: finds pairs where the same change category produced opposite outcomes — suggests context-dependent dynamics worth investigating
    • Strategy assessment (simplified RSIC assess phase):
      • Phase detection: exploring | exploiting | plateaued | recovering
      • Effectiveness score (0-1) blending keep rate with velocity trend
      • High-level strategy recommendations
    • Formatted guidance (get_guidance()["formatted"]): human-readable text block designed for injection into agent context before each experiment decision
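
One way the effectiveness score and phase detection could be composed; the blend weight and phase thresholds below are guesses for illustration, not the module's actual constants:

```python
def effectiveness_score(keep_rate, velocity_trend, w_keep=0.6):
    """Blend keep rate (fraction of experiments kept, in [0, 1]) with a
    velocity trend (recent vs. earlier BPB/hour, clipped to [-1, 1]),
    mapped into [0, 1]."""
    trend01 = 0.5 * (1.0 + max(-1.0, min(1.0, velocity_trend)))
    return w_keep * keep_rate + (1.0 - w_keep) * trend01

def detect_phase(keep_rate, velocity_trend, crash_streak=0):
    """Rough phase classifier: exploring | exploiting | plateaued | recovering."""
    if crash_streak >= 3:
        return "recovering"
    if velocity_trend < -0.25:  # improvement rate falling off sharply
        return "plateaued"
    if keep_rate >= 0.5:        # most changes are sticking: exploit the vein
        return "exploiting"
    return "exploring"
```

The point of blending the two signals is that a high keep rate with collapsing velocity still scores poorly, which is exactly when the advisor should start recommending radical changes.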

Modified Files

program.md

  • Added comprehensive "Observability & Intelligence Modules" section (+205 lines)
  • Full usage documentation for all four modules with code examples
  • Environment variable reference for alert thresholds
  • Recommended integration pattern showing complete experiment loop
  • State directory documentation

analysis.ipynb

  • Change Category Effectiveness cell: builds Hebbian associations from results.tsv, plots horizontal bar charts of category weights and success rates
  • Monitoring Dashboard cell: loads session.json, plots VRAM trends and training durations, displays alert timeline
  • Guidance Report cell: generates and displays formatted guidance with suggestion table and contradictions
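
The Change Category Effectiveness cell could build its associations roughly like this (plotting omitted). The TSV rows, column names, and keyword table below are invented for illustration; the real results.tsv schema and the module's 15-category keyword map will differ:

```python
import csv, io, math

# Made-up rows standing in for results.tsv.
SAMPLE = """description\tdelta_bpb
rotary embedding variant\t-0.004
adamw beta2 sweep\t0.002
learning rate warmup tweak\t-0.001
remove bias terms\t-0.002
"""

KEYWORDS = {  # tiny subset of the 15 categories, for illustration
    "embedding": ("embedding", "rotary"),
    "optimizer": ("adamw", "optimizer"),
    "learning_rate": ("learning rate",),
    "simplification": ("remove", "simplif"),
}

def auto_tag(description):
    """Keyword-based classifier assigning categories from a description."""
    desc = description.lower()
    return [cat for cat, kws in KEYWORDS.items()
            if any(kw in desc for kw in kws)]

def hebbian_update(w, signal, eta=0.1, wmax=1.0):
    return wmax * math.tanh((w + eta * signal) / wmax)

weights = {}
for row in csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"):
    signal = 1.0 if float(row["delta_bpb"]) < 0 else -1.0  # lower BPB is better
    for cat in auto_tag(row["description"]):
        weights[cat] = hebbian_update(weights.get(cat, 0.0), signal)
# weights now maps each tagged category to a signed Hebbian weight,
# ready for a horizontal bar chart.
```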

.gitignore

  • Added .autoresearch/ to exclude module state directories from version control

Design Decisions

  1. Zero new dependencies — all modules use Python stdlib only. No changes to pyproject.toml.
  2. Non-intrusive — modules never modify train.py or prepare.py. They observe and advise.
  3. Opt-in — every module is independently usable. You can use just monitor.py without memory.py, etc.
  4. Crash-resilient — all state persists to disk and gracefully handles corrupted state files.
  5. Hebbian tanh soft-capping over hard clamping — continuous learning without saturation walls.
  6. Circuit breaker with exponential backoff — progressively longer cooldowns prevent thrashing.

Test plan

  • Verify train.py and prepare.py are completely unmodified (zero diff)
  • Verify no new dependencies added to pyproject.toml
  • Import each module independently: python -c "import monitor", python -c "import memory", etc.
  • Run ExperimentTracker lifecycle: start_experiment → record_step → end_experiment
  • Run ExperimentMemory.store_experiment() and verify .autoresearch/memory/memory.json is created
  • Run CircuitBreaker through CLOSED → OPEN → HALF_OPEN → CLOSED transition
  • Run ExperimentAdvisor.get_guidance() and verify formatted output
  • Run existing analysis.ipynb cells (original cells unchanged, new cells gracefully handle missing data)
  • Verify .autoresearch/ is gitignored

Authored-by: reh3376

Inspired by mdemg's production-grade AI memory infrastructure, adds four
optional modules that improve experiment effectiveness without modifying
the core train.py loop:

- monitor.py: Prometheus-compatible metrics, loss curve tracking, alerting
- memory.py: Hebbian association learning across experiment sessions
- resilience.py: circuit breakers, anomaly detection, VRAM backpressure
- guidance.py: proactive experiment suggestions (Jiminy-style inner voice)

Updates program.md with full integration documentation and enhances
analysis.ipynb with category effectiveness charts and guidance reports.

Authored-by: reh3376
… work

Comprehensive handoff document covering completed work (research,
implementation, documentation), suggested future work prioritized by
impact, architecture reference, module dependency graph, and onboarding
reading order.

Authored-by: reh3376