Add UCB1 Dimension-Aware Search + Experiment Memory to program.md — no code changes required
Problem: The Current Loop is a Memoryless Random Walk
The autoresearch agent runs a sequential keep/discard loop, but it has no memory of what it has tried and no principled strategy for what to try next. After 50 overnight experiments, the agent has the same strategic information it started with: none.
Concretely, three structural gaps:
No experiment memory. The agent's only record of history is the current state of train.py — i.e., only what was kept. Zero record of what failed, what was close, or what dimension it already exhausted.
No guided exploration. No mechanism to balance trying a new class of change (exploration) vs. doubling down on a dimension that already showed improvement (exploitation). It is a pure random walk across code-editing decisions.
No early abort. Every run burns the full 5 minutes regardless of whether the first 90 seconds of loss curves indicate a regression. At 12 experiments/hour, roughly half of overnight runs are discards that could have been caught at the 90-second mark.
Current Loop vs. Proposed Loop
```mermaid
flowchart LR
    subgraph NOW ["❌ Current Loop"]
        direction TB
        A1([Read train.py]) --> B1([Guess a change])
        B1 --> C1([Run 5 min])
        C1 --> D1{val_bpb\nimproved?}
        D1 -- yes --> E1([Keep])
        D1 -- no --> F1([Discard])
        E1 --> A1
        F1 --> A1
        style NOW fill:#2d0000,stroke:#ff4444,color:#fff
        style A1 fill:#3d0000,stroke:#ff6666,color:#fff
        style B1 fill:#3d0000,stroke:#ff6666,color:#fff
        style C1 fill:#3d0000,stroke:#ff6666,color:#fff
        style D1 fill:#4d0000,stroke:#ff6666,color:#fff
        style E1 fill:#3d0000,stroke:#ff6666,color:#fff
        style F1 fill:#3d0000,stroke:#ff6666,color:#fff
    end
    subgraph NEW ["✅ DUSE Loop"]
        direction TB
        A2([Read experiments.json]) --> B2([Compute UCB1\nacross 7 dims])
        B2 --> C2([Select highest\nUCB1 dimension])
        C2 --> D2([Propose targeted\nchange])
        D2 --> E2([90s gate:\nloss regressing?])
        E2 -- abort --> A2
        E2 -- continue --> F2([Run full 5 min])
        F2 --> G2{val_bpb\nimproved?}
        G2 -- yes --> H2([Keep + log])
        G2 -- no --> I2([Discard + rescue\npool check])
        H2 --> A2
        I2 --> A2
        style NEW fill:#001a00,stroke:#44ff44,color:#fff
        style A2 fill:#002200,stroke:#66ff66,color:#fff
        style B2 fill:#002200,stroke:#66ff66,color:#fff
        style C2 fill:#002200,stroke:#66ff66,color:#fff
        style D2 fill:#002200,stroke:#66ff66,color:#fff
        style E2 fill:#003300,stroke:#66ff66,color:#fff
        style F2 fill:#002200,stroke:#66ff66,color:#fff
        style G2 fill:#003300,stroke:#66ff66,color:#fff
        style H2 fill:#002200,stroke:#66ff66,color:#fff
        style I2 fill:#002200,stroke:#66ff66,color:#fff
    end
```
How UCB1 Dimension Selection Works

The agent computes UCB1(dim) = mean_improvement(dim) + 1.0 × √(ln(N) / n(dim)) for each dimension before every experiment. High scores emerge from either strong past returns (exploitation) or a low trial count (exploration). Untried dimensions always receive a large exploration bonus, so no dimension is ever permanently abandoned.

```mermaid
%%{init: {'theme': 'dark', 'themeVariables': {'fontSize': '14px'}}}%%
xychart-beta
    title "UCB1 Scores After 20 Experiments — Agent Selects 'attention' (never tried)"
    x-axis ["optimizer", "architecture", "attention", "normalization", "schedule", "regularization", "batching"]
    y-axis "UCB1 Score" 0 --> 3
    bar [0.71, 0.87, 2.45, 0.87, 1.00, 1.22, 1.73]
```

Proposed Modification: Dimensional UCB1 Search + Experiment Memory (DUSE)

Pure program.md addition. Zero changes to train.py, prepare.py, or any code. No new dependencies.
Section 1 — Dimension Map
Add to program.md:
## Dimension Map
Every experiment belongs to exactly one of these seven architectural dimensions.
Assign a label before proposing any change.
| Dimension | Covers |
|-----------|--------|
| `optimizer` | optimizer algorithm, LR, weight decay, gradient clipping, momentum |
| `architecture` | n_layer, n_embd, n_head, feedforward ratio, parameter count |
| `attention` | attention pattern, relative position, sparse or windowed attention |
| `normalization` | RMSNorm vs LayerNorm, pre/post norm, placement |
| `schedule` | LR warmup, decay shape (cosine/linear), cycle length |
| `regularization` | dropout, weight decay schedule, stochastic depth |
| `batching` | batch size, gradient accumulation, sequence packing |
One change = one dimension. If a change spans two dimensions, split into two experiments.
Section 2 — Experiment Log
## Experiment Log
After every run, append one record to `experiments.json`:
```json
{
  "id": 1,
  "dimension": "optimizer",
  "delta": "switched AdamW weight_decay 0.1 → 0.01",
  "val_bpb": 0.991,
  "baseline_bpb": 0.998,
  "improvement": 0.007,
  "status": "keep"
}
```
If experiments.json does not exist, create it as an empty array [] before run 1.
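If the agent's harness runs Python, the logging step can be sketched roughly as below; `LOG_PATH` and `log_experiment` are illustrative names, not existing code in this repo:

```python
import json
import os

LOG_PATH = "experiments.json"  # agent-maintained, lives next to train.py

def log_experiment(record: dict) -> None:
    """Append one record, creating the log as an empty array if missing."""
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as f:
            log = json.load(f)
    else:
        log = []  # before run 1 the file does not exist yet
    log.append(record)
    with open(LOG_PATH, "w") as f:
        json.dump(log, f, indent=2)

# Record the example run shown above.
log_experiment({
    "id": 1,
    "dimension": "optimizer",
    "delta": "switched AdamW weight_decay 0.1 → 0.01",
    "val_bpb": 0.991,
    "baseline_bpb": 0.998,
    "improvement": 0.007,
    "status": "keep",
})
```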
Section 3 — UCB1 Dimension Selector
## Choosing What to Experiment On Next
Read experiments.json. Compute UCB1 for each of the seven dimensions:
```
UCB1(dim) = mean_improvement(dim) + 1.0 * sqrt( ln(N) / n(dim) )
```

Where:
- mean_improvement(dim) = average improvement across all experiments in this dimension (0.0 if none)
- N = total experiments logged
- n(dim) = number of experiments in this dimension (use 0.5 if none, to avoid divide-by-zero)
Select the dimension with the highest score.
Print all seven scores before proposing your change so the reasoning is auditable.
Example (trial counts shown for auditability; attention is untried, so its exploration term uses n = 0.5):

```
UCB1 scores (N=12):
  optimizer      (n=4): 0.003 + 0.788 = 0.791
  architecture   (n=2): 0.001 + 1.115 = 1.116
  attention      (n=0): 0.000 + 2.229 = 2.229  ← SELECTED (never tried)
  normalization  (n=2): 0.002 + 1.115 = 1.117
  schedule       (n=2): 0.005 + 1.115 = 1.120
  regularization (n=1): 0.000 + 1.576 = 1.576
  batching       (n=1): 0.001 + 1.576 = 1.577
```
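A minimal Python sketch of this selector, assuming records shaped like the Experiment Log above; the function names are illustrative:

```python
import math

DIMENSIONS = [
    "optimizer", "architecture", "attention",
    "normalization", "schedule", "regularization", "batching",
]

def ucb1_scores(log: list, c: float = 1.0) -> dict:
    """Score each dimension from the experiment log; untried dims use n = 0.5."""
    N = max(len(log), 1)  # guard against ln(0) before the first run
    scores = {}
    for dim in DIMENSIONS:
        improvements = [e["improvement"] for e in log if e["dimension"] == dim]
        n = len(improvements) if improvements else 0.5
        mean = sum(improvements) / len(improvements) if improvements else 0.0
        scores[dim] = mean + c * math.sqrt(math.log(N) / n)
    return scores

def select_dimension(log: list) -> str:
    """Print all seven scores (auditability), then pick the maximum."""
    scores = ucb1_scores(log)
    for dim, score in scores.items():
        print(f"{dim:>14}: {score:.3f}")
    return max(scores, key=scores.get)
```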
Section 4 — Early Abort Gate
## Early Abort Gate
At the 90-second training checkpoint, compare current val loss to the baseline
val loss at the same step from the last kept experiment.
If current val loss > baseline * 1.05, abort.
Log the run as discarded with "early_abort": true.
Immediately start the next UCB1 selection cycle.
This recovers 3-4 minutes per bad experiment and lifts throughput from ~12 to ~16-18 experiments/hour.
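The gate itself reduces to a one-line predicate. A hedged Python sketch (the name `should_abort` and the wiring into the training loop are assumptions, not existing code):

```python
def should_abort(current_val_loss: float, baseline_val_loss: float,
                 tolerance: float = 1.05) -> bool:
    """90-second gate: abort when the current val loss exceeds the baseline
    val loss at the same step (from the last kept experiment) by more than 5%."""
    return current_val_loss > baseline_val_loss * tolerance

# At the 90-second checkpoint the agent would do roughly:
#   if should_abort(val_loss_now, kept_baseline_loss_at_same_step):
#       log the run with "early_abort": true and start the next UCB1 cycle
```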
Section 5 — Crossover Rescue Pool (Optional)
## Rescue Pool
When discarding, note whether any sub-mechanism showed local promise before the regression
(e.g., faster early convergence even though final val_bpb regressed).
Append to rescue_pool.json:
```json
{
  "from_experiment": 7,
  "dimension": "schedule",
  "mechanism": "linear warmup over 200 steps showed faster initial convergence",
  "reuse_signal": "recombine with a lower peak LR"
}
```
Before any new experiment, scan rescue_pool.json for recombination candidates in the same dimension.
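The pre-experiment scan could look like this in Python; `POOL_PATH` and `rescue_candidates` are illustrative names:

```python
import json
import os

POOL_PATH = "rescue_pool.json"  # optional, created by the agent on first rescue

def rescue_candidates(dimension: str) -> list:
    """Return pooled sub-mechanisms logged under the selected dimension."""
    if not os.path.exists(POOL_PATH):
        return []  # no discards have yielded rescuable mechanisms yet
    with open(POOL_PATH) as f:
        pool = json.load(f)
    return [entry for entry in pool if entry["dimension"] == dimension]
```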
Why This Is Novel
The bandit and NAS literature (Hyperband, BOHB, SMAC, PBT) applies UCB1 / Thompson sampling to hyperparameter values within a predefined search space. None apply it to the meta-question of which code-region category an autonomous code-editing agent should touch next. That is the specific gap this fills.
Cross-domain research backing each mechanism:
| Mechanism | Source Signal |
|-----------|---------------|
| UCB1 arm selection | *Exploration vs. Exploitation: Comparative Analysis and Practical Implications*; *In-depth Exploration and Implementation of Multi-Armed Bandit* — UCB1 is robust to non-stationary arm rewards, which holds here since improving one dimension shifts the marginal returns on others |
| Dimension mutation strategy | *Improving Evolutionary Neural Architecture Search: Flexibility* — the choice of mutation strategy matters more than the specific mutation; the Dimension Map operationalizes this for a code-editing agent |
| Early abort = PBT restart | *Iterated Population Based Training with Task-Agnostic Restarts* — abort underperforming workers early and reallocate budget; DUSE applies this to a sequential single-agent setting |
| Rescue pool = partial crossover | *A Gradient-Guided Evolutionary Neural Architecture Search* — retain sub-mechanisms from discarded experiments rather than discarding wholesale |
| Transferable schedules | *MLR-SNet: Transferable LR Schedules for Heterogeneous Tasks*; *Curvature-Adaptive Learning Rate Optimizer* — LR schedule structure transfers; the agent can reuse schedule patterns from successful runs in new contexts |
Expected Impact

- `experiments.json` — queryable per-dimension improvement breakdown

After 100 overnight runs, experiments.json is a research artifact in its own right — a structured record of which architectural dimensions drove improvement, which were explored but unproductive, and whether the search converged or kept diversifying.
Implementation
- Changes to `train.py`: None
- Changes to `prepare.py`: None
- New dependencies: None
- Changes to `program.md`: The five sections above
The agent computes UCB1 in its own reasoning step using the JSON log it maintains. The only new files are experiments.json and optionally rescue_pool.json, both created and maintained by the agent itself.
Happy to draft the exact program.md diff if useful.