nastaran78
Collaborator

@nastaran78 nastaran78 commented Jul 22, 2025

This PR adds support for asynchronous evaluation.

TL;DR
Async: launch a job, return None from evaluate_pipeline, and have the job call neps.save_pipeline_results() when it finishes.


1  Return types

| Allowed return | When to use | Minimal example |
|---|---|---|
| Scalar | simple objective, single fidelity | return loss |
| Dict | need cost/extra metrics | {"objective_to_minimize": loss, "cost": 3} |
| None | you launch the job elsewhere (SLURM, k8s …) | see § 3 Async |

All other values raise a TypeError inside NePS.
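
The three accepted return shapes can be sketched side by side. The function names and the toy loss are illustrative assumptions, not part of NePS; only the return conventions come from the list above.

```python
def evaluate_scalar(learning_rate: float) -> float:
    # Scalar: the returned value is the objective to minimise.
    loss = (learning_rate - 0.01) ** 2  # toy loss, for illustration only
    return loss


def evaluate_dict(learning_rate: float) -> dict:
    # Dict: use this when you also need to report cost or extra metrics.
    loss = (learning_rate - 0.01) ** 2
    return {"objective_to_minimize": loss, "cost": 3}


def evaluate_async(learning_rate: float) -> None:
    # None: the job is launched elsewhere and reports its result later
    # via neps.save_pipeline_results().
    return None
```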

2  Result dictionary keys

| key | purpose | required? |
|---|---|---|
| objective_to_minimize | scalar NePS will minimise | yes |
| cost | wall‑clock, GPU‑hours, …; needed only if you passed max_cost_total to neps.run | yes iff cost budget enabled |
| learning_curve | list/np.array of intermediate objectives | optional |
| extra | any JSON‑serialisable blob | optional |
| exception | any Exception describing the error raised during evaluation | optional |

Tip  Return exactly what you need; extra keys are preserved in the trial’s report.yaml.
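
As a minimal sketch, a result dictionary using these keys could be assembled as follows. The helper name build_result and the sample values are assumptions made for illustration; the early json.dumps call is just one way to check that extra is JSON‑serialisable before returning it.

```python
import json


def build_result(losses: list[float], gpu_hours: float, extra: dict) -> dict:
    """Assemble a result dictionary from per-epoch validation losses."""
    # Fail early (TypeError) if `extra` cannot be serialised to JSON.
    json.dumps(extra)
    return {
        "objective_to_minimize": losses[-1],  # final loss is the objective
        "cost": gpu_hours,                    # only needed with a cost budget
        "learning_curve": losses,             # intermediate objectives
        "extra": extra,
    }


result = build_result([0.9, 0.5, 0.3], gpu_hours=1.5, extra={"epochs": 3})
```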


3  Asynchronous evaluation (advanced)

3.1 Design

  1. The Python side (your evaluate_pipeline function)

    • creates & submits a job script.
    • returns None so the worker thread isn’t blocked.
  2. The submit script or the job must call

    neps.save_pipeline_results(
        user_result=result_dict,
        pipeline_id=pipeline_id,
        root_directory=root_directory,
    )

    when it finishes.
    This writes report.yaml and marks the trial SUCCESS / CRASHED.

3.2 Code walk‑through

submit.py – called by NePS synchronously

from pathlib import Path
import neps

def evaluate_pipeline(
    pipeline_directory: Path,
    pipeline_id: str,          # NePS injects this automatically
    root_directory: Path,      # idem
    learning_rate: float,
    optimizer: str,
):
    # 1) write a Slurm script
    script = f"""#!/bin/bash
#SBATCH --time=0-00:10
#SBATCH --job-name=trial_{pipeline_id}
#SBATCH --partition=bosch_cpu-cascadelake
#SBATCH --output={pipeline_directory}/%j.out
#SBATCH --error={pipeline_directory}/%j.err

python run_pipeline.py \
       --learning_rate {learning_rate} \
       --optimizer {optimizer} \
       --pipeline_id {pipeline_id} \
       --root_dir {root_directory}
"""

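    # 2) submit and RETURN None (async)
    # submit_job is a user-defined helper (not part of NePS) that hands the
    # script to the scheduler.
    submit_job(script)
    return None  # ⟵ signals async mode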
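
The walk-through above hands the script to a submission helper that is not defined in this description. One hypothetical way to write it, assuming a Slurm cluster where sbatch is on the PATH, is sketched below; the submit_cmd parameter is an addition of this sketch so that another scheduler command can be swapped in.

```python
import subprocess
import tempfile
from collections.abc import Sequence


def submit_job(script: str, submit_cmd: Sequence[str] = ("sbatch",)) -> str:
    """Write `script` to a temporary file and hand it to the scheduler.

    Hypothetical helper: `submit_cmd` defaults to Slurm's `sbatch`.
    Returns whatever the submission command printed to stdout.
    """
    # delete=False keeps the file on disk after the `with` block closes it,
    # so the submission command can still read it.
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
        f.write(script)
        script_path = f.name

    # check=True raises CalledProcessError if submission fails, so a broken
    # submission surfaces in evaluate_pipeline instead of passing silently.
    completed = subprocess.run(
        [*submit_cmd, script_path],
        check=True,
        capture_output=True,
        text=True,
    )
    return completed.stdout.strip()
```

Writing the script into pipeline_directory instead of a temporary file would keep it next to the trial's .out and .err logs.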

run_pipeline.py – executed on the compute node

import argparse
from pathlib import Path

import neps

parser = argparse.ArgumentParser()
parser.add_argument("--learning_rate", type=float)
parser.add_argument("--optimizer")
parser.add_argument("--pipeline_id")
parser.add_argument("--root_dir")
args = parser.parse_args()

# Placeholders so the except branch can still build a result if training
# fails before these are assigned (assumes NePS accepts None for a crashed trial).
val_loss = None
wall_clock_cost = None
try:
    # … do heavy training …
    val_loss = 0.1234
    wall_clock_cost = 180  # seconds
    result = {
        "objective_to_minimize": val_loss,
        "cost": wall_clock_cost,
    }
except Exception as e:
    # Report the failure so the trial is marked CRASHED.
    result = {
        "objective_to_minimize": val_loss,
        "cost": wall_clock_cost,
        "exception": e,
    }

neps.save_pipeline_results(
    user_result=result,
    pipeline_id=args.pipeline_id,
    root_directory=Path(args.root_dir),
)

3.3 Why this matters

  • No worker idles while your job is in the queue ➜ better throughput.
  • If the job catches its exception and reports it via neps.save_pipeline_results(), the trial is marked CRASHED instead of hanging.
  • Compatible with Successive‑Halving/ASHA: NePS just waits for report.yaml.

4  Extra injected arguments

| name | provided when | description |
|---|---|---|
| pipeline_directory | always | per‑trial working dir (…/trials/<id>/) |
| previous_pipeline_directory | only for multi‑fidelity | directory of the lower‑fidelity checkpoint. Can be None. |
| pipeline_id | async only | trial id string you pass to save_pipeline_results |
| root_directory | async only | optimisation root folder; pass the same value back to save_pipeline_results |
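
As a rough sketch of how previous_pipeline_directory might be used in a multi‑fidelity run, the function below resumes from a lower‑fidelity checkpoint when one exists. The checkpoint file name, the epochs fidelity parameter, and the toy loss are assumptions for illustration, not NePS conventions.

```python
from pathlib import Path
from typing import Optional

CHECKPOINT_NAME = "checkpoint.txt"  # hypothetical file name


def evaluate_pipeline(
    pipeline_directory: Path,
    previous_pipeline_directory: Optional[Path],
    learning_rate: float,
    epochs: int,
) -> dict:
    # Resume from the lower-fidelity run if one exists; otherwise start fresh.
    start_epoch = 0
    if previous_pipeline_directory is not None:
        checkpoint = previous_pipeline_directory / CHECKPOINT_NAME
        if checkpoint.exists():
            start_epoch = int(checkpoint.read_text())

    # Stand-in for training: the toy loss simply shrinks with more epochs.
    loss = 1.0 / (1 + epochs)

    # Save progress so a higher-fidelity trial can continue from here.
    (pipeline_directory / CHECKPOINT_NAME).write_text(str(epochs))
    return {
        "objective_to_minimize": loss,
        "extra": {"resumed_from_epoch": start_epoch},
    }
```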

@nastaran78 nastaran78 force-pushed the eval_callback branch 4 times, most recently from 195d67b to c605336 Compare July 29, 2025 14:20
@nastaran78 nastaran78 changed the title feat: Callback for saving Evaluation Result feat: Async Saving Evaluation Result Jul 29, 2025
@nastaran78 nastaran78 force-pushed the eval_callback branch 3 times, most recently from 9394b98 to 9bfc463 Compare August 18, 2025 22:57
@nastaran78 nastaran78 requested a review from Neeratyoy August 19, 2025 14:11
@nastaran78
Collaborator Author

1- issue for using config instead of pipeline

@nastaran78 nastaran78 merged commit 64caecb into master Sep 8, 2025
14 checks passed