Conversation

@aridder commented Sep 18, 2025

Proposing a recovery method for failing orders

This is a WIP suggestion for how to solve the issue where orders are marked as failed because of the following error during proving:

Failure during task processing: Prove failed
Caused by:
    0: Failed to deserialize segment data from redis
    1: io error: unexpected end of file

I hope we can fix this behavior. The proposed changes are not working yet, but I hope someone can assist with that.

Specifically, we need a way to surface an error variant (an enum) that we can listen for and use to clean up failing tasks. I'm not sure whether the tasks fail because the segment data is corrupted, or whether they could be retried with the same segment data and proven successfully.

I have managed to reset orders by cleaning up the job and its tasks and then setting the order status for the failed order to PendingProving. This works, but it's not optimal: e.g. when a 51B job fails 90% of the way through proving, we have to start over from scratch.
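
For reference, here is a minimal sketch of just that status flip, assuming the broker's SQLite database lives in the bento_broker-data volume (as in the full reset script further down); `<ORDER_ID>` is a placeholder:

```bash
# Sketch only: flip a failed order back to PendingProving.
# The full reset (taskdb cleanup + clearing proof_id) is in scripts/reset-order.sh below.
docker run --rm -i -v bento_broker-data:/db nouchka/sqlite3 /db/broker.db \
  "UPDATE orders SET data = json_set(data, '\$.status', 'PendingProving') WHERE id LIKE '%<ORDER_ID>%';"
```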

Here is the current workflow I use for this issue, where I automate resetting orders marked as Failed.

Context

On my node I’ve consistently seen that larger orders sometimes fail with:

Proving failed after retries …
Monitoring proof (stark) failed: [B-BON-005] Prover failure: SessionId

Smaller orders (<300 cycles) almost never fail, but larger ones do so frequently.
I’ve investigated this in the Rust code and proposed a fix in this PR, but I also want to document the current workaround I use in production.
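
To check whether a node is affected, grepping the broker logs for the failure message works (assuming the same container name used in the service below):

```bash
# Look for proving failures in the last hour of broker logs
docker logs --since=1h bento-broker-1 2>&1 | grep "Proving failed after retries"
```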


Current Workaround

I run a systemd service that tails the broker logs, detects failing orders, and automatically resets them.
This makes the broker pick up the order again for proving without manual intervention.

Service setup

auto-reset.service:

[Unit]
Description=Auto reset broker orders on proving failure
After=docker.service

[Service]
WorkingDirectory=YOUR_BOUNDLESS_DIRECTORY
ExecStart=/bin/bash -c 'docker logs -f --since=0s bento-broker-1 | python3 scripts/auto-reset.py'
Restart=always
RestartSec=5
StandardOutput=journal
StandardError=journal
User=aridder

[Install]
WantedBy=multi-user.target

Enable and start with:

sudo systemctl enable auto-reset.service
sudo systemctl start auto-reset.service
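
Once enabled, the watcher's activity shows up in the journal:

```bash
# Confirm the watcher is running and follow its output
systemctl status auto-reset.service
journalctl -u auto-reset.service -f
```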


⸻

Reset script

scripts/reset-order.sh:

#!/bin/bash

# This script performs a full reset of a failed order using the ORDER ID. It will:
# 1. Find the corresponding job_id (proof_id) from the SQLite database (broker.db).
# 2. Delete the job, tasks, and task dependencies from the PostgreSQL database (taskdb).
# 3. Reset the order's status to 'PendingProving' in the SQLite database.

set -euo pipefail

# Default to empty so `set -u` does not abort before we can print the usage message
ORDER_ID_FRAGMENT="${1:-}"
if [ -z "$ORDER_ID_FRAGMENT" ]; then
  echo "Usage: $0 <order_id_fragment>"
  echo "You can provide a partial or full order ID."
  exit 1
fi

echo "--- [Step 1/3] Finding Job ID (proof_id) for Order fragment: ${ORDER_ID_FRAGMENT} ---"

# The sqlite3 command to find the proof_id (job_id)
SQLITE_FIND_SQL="SELECT json_extract(data, '\$.proof_id') FROM orders WHERE id LIKE '%${ORDER_ID_FRAGMENT}%';"
JOB_ID=$(docker run --rm -i -v bento_broker-data:/db nouchka/sqlite3 /db/broker.db "${SQLITE_FIND_SQL}")

if [ -z "$JOB_ID" ] || [ "$JOB_ID" == "null" ]; then
  echo "Error: No order found with an ID fragment matching '${ORDER_ID_FRAGMENT}' or the order does not have a proof_id."
  exit 1
fi

echo "Found Job ID: ${JOB_ID}"
echo ""

echo "--- [Step 2/3] Resetting PostgreSQL data for Job ID: ${JOB_ID} ---"
PG_USER="${POSTGRES_USER:-worker}"
PG_DB="${POSTGRES_DB:-taskdb}"
PG_SQL="DELETE FROM public.task_deps WHERE job_id = '${JOB_ID}'; DELETE FROM public.tasks WHERE job_id = '${JOB_ID}'; DELETE FROM public.jobs WHERE id = '${JOB_ID}';"

PG_RESULT=$(docker compose exec -T postgres psql -U "${PG_USER}" -d "${PG_DB}" -c "${PG_SQL}")
echo "PostgreSQL cleanup complete."
echo ""

echo "--- [Step 3/3] Resetting SQLite order status to PendingProving ---"
# Use the original Order ID fragment to update the correct order
SQLITE_RESET_SQL="
UPDATE orders
SET data = json_set(
               json_set(data, '\$.status', 'PendingProving'),
               '\$.proof_id', NULL
           )
WHERE id LIKE '%${ORDER_ID_FRAGMENT}%';
SELECT 'SQLite: Order status reset for ' || changes() || ' order(s).';
"

# Retry logic if SQLite database is locked (error code 5)
n=0
max_retries=5
while true; do
  set +e
  SQLITE_RESULT=$(docker run --rm -i -v bento_broker-data:/db nouchka/sqlite3 /db/broker.db "${SQLITE_RESET_SQL}" 2>&1)
  rc=$?
  set -e
  if [ $rc -eq 0 ]; then
    echo "${SQLITE_RESULT}"
    break
  fi
  if echo "$SQLITE_RESULT" | grep -q "database is locked"; then
    if [ $n -ge $max_retries ]; then
      echo "[ERROR] SQLite remained locked after $max_retries attempts." >&2
      exit 1
    fi
    echo "[WARN] SQLite is locked, retrying in 2s... (attempt $((n+1))/$max_retries)"
    n=$((n+1))
    sleep 2
  else
    echo "[ERROR] SQLite reset failed: $SQLITE_RESULT" >&2
    exit $rc
  fi
done

echo ""
echo "--- Reset complete. The broker should now pick up the order for proving. ---"


⸻

Python log watcher

scripts/auto-reset.py:

#!/usr/bin/env python3
import re
import subprocess
import sys
import time
import os
import datetime

# Regex to capture the order id
ORDER_REGEX = re.compile(r"(0x[a-fA-F0-9]{64})")

# Track already reset orders with timestamp to avoid repeated resets
reset_orders: dict[str, float] = {}
RESET_COOLDOWN_SEC = 300  # don't reset same order more than once every 5 minutes

def reset_order(order_id: str, original_line: str):
    last_reset = reset_orders.get(order_id)
    now = time.time()
    if last_reset and (now - last_reset) < RESET_COOLDOWN_SEC:
        print(f"[auto-reset] Skipping reset for {order_id}, last reset {int(now - last_reset)}s ago", flush=True)
        return

    print(f"[auto-reset] Detected failed proof for order {order_id}, resetting...", flush=True)

    # Save debugging info into logfile
    logdir = "auto-reset-logs"
    os.makedirs(logdir, exist_ok=True)
    timestamp = datetime.datetime.now(datetime.UTC).strftime("%Y%m%dT%H%M%SZ")
    logfile = os.path.join(logdir, f"{order_id}_{timestamp}.log")
    with open(logfile, "w") as f:
        f.write(f"Order ID: {order_id}\n")
        f.write(f"Triggered at: {timestamp} UTC\n")
        f.write("Original log line:\n")
        f.write(original_line + "\n\n")

        # Capture running containers
        try:
            ps_out = subprocess.check_output(["docker", "ps"], text=True)
            f.write("=== docker ps ===\n")
            f.write(ps_out + "\n")
        except Exception as e:
            f.write(f"Failed to run docker ps: {e}\n")

        # Capture logs for selected containers (last 3 min)
        containers = [
            "bento-broker-1",
            "bento-rest_api-1",
            "bento-gpu_prove_agent0-1",
            "bento-gpu_prove_agent1-1",
            "bento-aux_agent-1",
        ] + [f"bento-exec_agent{i}-1" for i in range(0,14)]
        for c in containers:
            try:
                logs_out = subprocess.check_output(
                    ["docker", "logs", "--since=3m", c],
                    text=True, stderr=subprocess.STDOUT
                )
                f.write(f"\n=== docker logs --since=3m {c} ===\n")
                f.write(logs_out + "\n")
            except Exception as e:
                f.write(f"Failed to get logs for {c}: {e}\n")

    try:
        subprocess.run(
            ["./scripts/reset-order.sh", order_id],
            check=True
        )
        print(f"[auto-reset] Reset executed successfully for order {order_id}", flush=True)
        reset_orders[order_id] = now
    except subprocess.CalledProcessError as e:
        print(f"[auto-reset] Failed to reset order {order_id}: {e}", file=sys.stderr, flush=True)

    # Send Telegram notification (optional)
    try:
        import requests
        tg_env = {}
        try:
            with open(os.path.join(os.path.dirname(__file__), ".env.tg")) as envf:
                for line in envf:
                    if "=" in line and not line.strip().startswith("#"):
                        k,v = line.strip().split("=",1)
                        tg_env[k.strip()] = v.strip()
        except Exception as e:
            print(f"[auto-reset] Could not read .env.tg: {e}", file=sys.stderr, flush=True)
            tg_env = {}

        token = tg_env.get("TG_TOKEN")
        chat_id = tg_env.get("TG_CHAT_ID")

        if token and chat_id:
            msg = f"Reset order id {order_id}"
            url = f"https://api.telegram.org/bot{token}/sendMessage"
            resp = requests.post(url, data={"chat_id": chat_id, "text": msg})
            if resp.status_code != 200:
                print(f"[auto-reset] Telegram send failed: {resp.text}", file=sys.stderr, flush=True)
    except Exception as e:
        print(f"[auto-reset] Exception sending Telegram message: {e}", file=sys.stderr, flush=True)

def main():
    for line in sys.stdin:
        if (
            "Proving failed after retries" in line
            and "Monitoring proof (stark) failed: [B-BON-005] Prover failure: SessionId" in line
        ):
            match = ORDER_REGEX.search(line)
            if match:
                order_id = match.group(1)
                reset_order(order_id, line.strip())

if __name__ == "__main__":
    main()
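
To exercise the matching logic without waiting for a real failure, you can pipe a synthetic log line into the watcher. Note that a match goes through the whole reset path (docker ps/logs snapshots plus reset-order.sh), so with the dummy all-zero ID below the reset fails harmlessly at the lookup step and is logged as a failed reset:

```bash
# Feed one fake failure line (all-zero order ID) into the watcher
echo 'Proving failed after retries ... Monitoring proof (stark) failed: [B-BON-005] Prover failure: SessionId for order 0x0000000000000000000000000000000000000000000000000000000000000000' \
  | python3 scripts/auto-reset.py
```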


⸻

Why this PR

This workaround has kept my node stable, but it’s clearly a patch and not a long-term solution.
That’s why I’ve proposed the changes in this PR: to address the underlying Redis/session handling issue in Rust directly, rather than relying on external scripts and resets.
