
Add UCB1 Dimension-Aware Search + Experiment Memory to program.md — no code changes required #284

@Insider77Circle


Problem: The Current Loop is a Memoryless Random Walk

The autoresearch agent runs a sequential keep/discard loop, but it has no memory of what it has tried and no principled strategy for what to try next. After 50 overnight experiments, the agent has the same strategic information it started with: none.

Concretely, three structural gaps:

  1. No experiment memory. The agent's only record of history is the current state of train.py — i.e., only what was kept. Zero record of what failed, what was close, or what dimension it already exhausted.

  2. No guided exploration. No mechanism to balance trying a new class of change (exploration) vs. doubling down on a dimension that already showed improvement (exploitation). It is a pure random walk across code-editing decisions.

  3. No early abort. Every run burns the full 5 minutes regardless of whether the first 90 seconds of loss curves indicate a regression. At 12 experiments/hour, roughly half of overnight runs are discards that could have been caught at the 90-second mark.


Current Loop vs. Proposed Loop

```mermaid
flowchart LR
    subgraph NOW ["❌ Current Loop"]
        direction TB
        A1([Read train.py]) --> B1([Guess a change])
        B1 --> C1([Run 5 min])
        C1 --> D1{val_bpb\nimproved?}
        D1 -- yes --> E1([Keep])
        D1 -- no --> F1([Discard])
        E1 --> A1
        F1 --> A1
        style NOW fill:#2d0000,stroke:#ff4444,color:#fff
        style A1 fill:#3d0000,stroke:#ff6666,color:#fff
        style B1 fill:#3d0000,stroke:#ff6666,color:#fff
        style C1 fill:#3d0000,stroke:#ff6666,color:#fff
        style D1 fill:#4d0000,stroke:#ff6666,color:#fff
        style E1 fill:#3d0000,stroke:#ff6666,color:#fff
        style F1 fill:#3d0000,stroke:#ff6666,color:#fff
    end

    subgraph NEW ["✅ DUSE Loop"]
        direction TB
        A2([Read experiments.json]) --> B2([Compute UCB1\nacross 7 dims])
        B2 --> C2([Select highest\nUCB1 dimension])
        C2 --> D2([Propose targeted\nchange])
        D2 --> E2([90s gate:\nloss regressing?])
        E2 -- abort --> A2
        E2 -- continue --> F2([Run full 5 min])
        F2 --> G2{val_bpb\nimproved?}
        G2 -- yes --> H2([Keep + log])
        G2 -- no --> I2([Discard + rescue\npool check])
        H2 --> A2
        I2 --> A2
        style NEW fill:#001a00,stroke:#44ff44,color:#fff
        style A2 fill:#002200,stroke:#66ff66,color:#fff
        style B2 fill:#002200,stroke:#66ff66,color:#fff
        style C2 fill:#002200,stroke:#66ff66,color:#fff
        style D2 fill:#002200,stroke:#66ff66,color:#fff
        style E2 fill:#003300,stroke:#66ff66,color:#fff
        style F2 fill:#002200,stroke:#66ff66,color:#fff
        style G2 fill:#003300,stroke:#66ff66,color:#fff
        style H2 fill:#002200,stroke:#66ff66,color:#fff
        style I2 fill:#002200,stroke:#66ff66,color:#fff
    end
```

How UCB1 Dimension Selection Works

```mermaid
%%{init: {'theme': 'dark', 'themeVariables': {'fontSize': '14px'}}}%%
xychart-beta
    title "UCB1 Scores After 20 Experiments — Agent Selects 'attention' (never tried)"
    x-axis ["optimizer", "architecture", "attention", "normalization", "schedule", "regularization", "batching"]
    y-axis "UCB1 Score" 0 --> 3.0
    bar [0.71, 0.87, 2.45, 0.77, 1.00, 1.73, 1.73]
```

Before every experiment, the agent computes `UCB1(dim) = mean_improvement(dim) + 1.0 × sqrt(ln(N) / n(dim))` for each dimension. High scores emerge either from strong past returns (exploitation) or from a low trial count (exploration). Untried dimensions always receive a large exploration bonus, so no dimension is ever permanently abandoned.
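The proposal has the agent compute these scores in its own reasoning step, but the arithmetic is easy to pin down in code. A minimal sketch, assuming experiment records carry the `dimension` and `improvement` fields of the `experiments.json` schema proposed below (`ucb1_scores` is an illustrative name, not part of the spec):

```python
import math

DIMENSIONS = ["optimizer", "architecture", "attention", "normalization",
              "schedule", "regularization", "batching"]

def ucb1_scores(experiments, c=1.0):
    """Score every dimension from a list of logged experiment records."""
    n_total = max(len(experiments), 1)  # guard ln(0) before the first run
    scores = {}
    for dim in DIMENSIONS:
        gains = [e["improvement"] for e in experiments if e["dimension"] == dim]
        mean = sum(gains) / len(gains) if gains else 0.0
        # An untried dimension counts as 0.5 trials, so it receives the
        # largest exploration bonus and is never permanently abandoned.
        n_dim = len(gains) if gains else 0.5
        scores[dim] = mean + c * math.sqrt(math.log(n_total) / n_dim)
    return scores
```

With a short log, the untried dimensions dominate, which is exactly the behavior the chart above illustrates.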


Proposed Modification: Dimensional UCB1 Search + Experiment Memory (DUSE)

Pure program.md addition. Zero changes to train.py, prepare.py, or any code. No new dependencies.

Section 1 — Dimension Map

Add to program.md:

## Dimension Map

Every experiment belongs to exactly one of these seven architectural dimensions.
Assign a label before proposing any change.

| Dimension       | Covers |
|-----------------|--------|
| `optimizer`     | optimizer algorithm, LR, weight decay, gradient clipping, momentum |
| `architecture`  | n_layer, n_embd, n_head, feedforward ratio, parameter count |
| `attention`     | attention pattern, relative position, sparse or windowed attention |
| `normalization` | RMSNorm vs LayerNorm, pre/post norm, placement |
| `schedule`      | LR warmup, decay shape (cosine/linear), cycle length |
| `regularization`| dropout, weight decay schedule, stochastic depth |
| `batching`      | batch size, gradient accumulation, sequence packing |

One change = one dimension. If a change spans two dimensions, split into two experiments.

Section 2 — Experiment Log

## Experiment Log

After every run, append one record to `experiments.json`:

```json
{
  "id": 1,
  "dimension": "optimizer",
  "delta": "switched AdamW weight_decay 0.1 → 0.01",
  "val_bpb": 0.991,
  "baseline_bpb": 0.998,
  "improvement": 0.007,
  "status": "keep"
}
```

If `experiments.json` does not exist, create it as an empty array (`[]`) before run 1.

Section 3 — UCB1 Dimension Selector

## Choosing What to Experiment On Next

Read experiments.json. Compute UCB1 for each of the seven dimensions:

```
UCB1(dim) = mean_improvement(dim) + 1.0 * sqrt( ln(N) / n(dim) )

where:
  mean_improvement(dim) = average improvement across all experiments in this dim (0.0 if none)
  N                     = total experiments logged
  n(dim)                = experiments in this dim (use 0.5 if none, to avoid divide-by-zero)
```

Select the dimension with the highest score.
Print all seven scores before proposing your change so the reasoning is auditable.

Example (with ln(12) ≈ 2.485):

```
UCB1 scores (N=12):
  optimizer:      (n=3) 0.003 + 0.910 = 0.913
  architecture:   (n=2) 0.001 + 1.115 = 1.116
  attention:      (n=0) 0.000 + 2.229 = 2.229  ← SELECTED (never tried)
  normalization:  (n=2) 0.002 + 1.115 = 1.117
  schedule:       (n=3) 0.005 + 0.910 = 0.915
  regularization: (n=1) 0.000 + 1.576 = 1.576
  batching:       (n=1) 0.001 + 1.576 = 1.577
```

Section 4 — Early Abort Gate

## Early Abort Gate

At the 90-second training checkpoint, compare current val loss to the baseline
val loss at the same step from the last kept experiment.

If current val loss > baseline * 1.05, abort.
Log the run as discarded with "early_abort": true.
Immediately start the next UCB1 selection cycle.

This recovers 3-4 minutes per bad experiment and lifts throughput from ~12 to ~16-18 experiments/hour.
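The gate reduces to a single comparison plus the record to log. A sketch, with `check_abort_gate` and the field names chosen for illustration (only `"early_abort": true` in the log is part of the proposal):

```python
def check_abort_gate(current_val_loss, baseline_val_loss, tolerance=1.05):
    """90-second checkpoint: compare against the last kept experiment's
    val loss at the same step. Returns an abort record to log, or None
    to continue into the full 5-minute run."""
    if current_val_loss > baseline_val_loss * tolerance:
        return {"status": "discard", "early_abort": True,
                "val_loss_90s": current_val_loss,
                "baseline_90s": baseline_val_loss}
    return None
```

Note the strict inequality: a run sitting exactly at 5% worse is allowed to finish, so only clear regressions are cut.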

Section 5 — Crossover Rescue Pool (Optional)

## Rescue Pool

When discarding, note whether any sub-mechanism showed local promise before the regression
(e.g., faster early convergence even though final val_bpb regressed).

Append to rescue_pool.json:

```json
{
  "from_experiment": 7,
  "dimension": "schedule",
  "mechanism": "linear warmup over 200 steps showed faster initial convergence",
  "reuse_signal": "recombine with a lower peak LR"
}
```

Before any new experiment, scan rescue_pool.json for recombination candidates in the same dimension.

Why This Is Novel

The bandit and NAS literature (Hyperband, BOHB, SMAC, PBT) applies UCB1 / Thompson sampling to hyperparameter values within a predefined search space. None apply it to the meta-question of which code-region category an autonomous code-editing agent should touch next. That is the specific gap this fills.

Cross-domain research backing each mechanism:

| Mechanism | Source | Signal |
|-----------|--------|--------|
| UCB1 arm selection | Exploration vs. Exploitation: Comparative Analysis and Practical Implications; In-depth Exploration and Implementation of Multi-Armed Bandit | UCB1 is robust to non-stationary arm rewards, which holds here: improving one dimension shifts the marginal returns on others |
| Dimension mutation strategy | Improving Evolutionary Neural Architecture Search: Flexibility | mutation-strategy choice matters more than the specific mutation; the Dimension Map operationalizes this for a code-editing agent |
| Early abort = PBT restart | Iterated Population Based Training with Task-Agnostic Restarts | abort underperforming workers early and reallocate budget; DUSE applies this to a sequential single-agent setting |
| Rescue pool = partial crossover | A Gradient-Guided Evolutionary Neural Architecture Search | retain promising sub-mechanisms from discarded experiments rather than discarding wholesale |
| Transferable schedules | MLR-SNet: Transferable LR Schedules for Heterogeneous Tasks; Curvature-Adaptive Learning Rate Optimizer | LR schedule structure transfers across tasks; the agent can reuse schedule patterns from successful runs in new contexts |

Expected Impact

| Metric | Current | With DUSE |
|--------|---------|-----------|
| Experiments/hour | ~12 | ~16–18 (early abort) |
| Dimension stagnation | Common; agent re-explores the same territory | Bounded by the UCB1 exploration term |
| Knowledge after 100 runs | Implicit in train.py state only | experiments.json: queryable per-dimension improvement breakdown |
| Agent reasoning transparency | Implicit | UCB1 scores printed before every step, fully auditable |
| Wasted compute on clear regressions | ~50% of runs | Caught at the 90-second gate |

After 100 overnight runs, experiments.json is a research artifact in its own right — a structured record of which architectural dimensions drove improvement, which were explored but unproductive, and whether the search converged or kept diversifying.


Implementation

Changes to train.py: None
Changes to prepare.py: None
New dependencies: None
Changes to program.md: The five sections above

The agent computes UCB1 in its own reasoning step using the JSON log it maintains. The only new files are experiments.json and optionally rescue_pool.json, both created and maintained by the agent itself.

Happy to draft the exact program.md diff if useful.


