@simonrosenberg
Collaborator

@simonrosenberg simonrosenberg commented Jan 8, 2026

Description

  • Add benchmarks/utils/conversation.py with a reusable build_event_persistence_callback that persists conversation events: it logs full JSON for small events, logs only trimmed metadata for large ones to stay under log size limits, and tolerates
    logging failures without interrupting the run.
  • Wire the persistence callback into every benchmark run_infer.py so Conversation instances now emit structured events (with run_id, instance_id, and attempt) instead of ad-hoc debug logs; EvalOutput now records the attempt alongside
    instance_id.
  • Track the current attempt on Evaluation (current_attempt), update it per retry, and surface it in logged events.
  • Extend reset_logger_for_multiprocessing to add a dedicated JSON logger/handler for conversation events. The handler passes only the conversation_event and conversation_event_metadata records and bypasses stdout redirection so that the
    records are captured reliably (e.g., by Datadog).
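
For readers who want a picture of the dedicated handler described in the last bullet, here is a minimal sketch using only the standard `logging` module. It is not the code from this PR: the class names, the `event_fields` attribute, and the `add_conversation_event_handler` helper are all hypothetical, and writing to `sys.__stdout__` is just one way to sidestep a redirected `sys.stdout`.

```python
import json
import logging
import sys

# The two record names mentioned in the description.
CONVERSATION_MESSAGES = {"conversation_event", "conversation_event_metadata"}


class ConversationEventFilter(logging.Filter):
    """Let through only records whose message is a conversation event name."""

    def filter(self, record: logging.LogRecord) -> bool:
        return record.getMessage() in CONVERSATION_MESSAGES


class JsonFormatter(logging.Formatter):
    """Render a record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"message": record.getMessage(), "level": record.levelname}
        # "event_fields" is a hypothetical attribute passed via `extra=`.
        payload.update(getattr(record, "event_fields", {}))
        return json.dumps(payload, default=str)


def add_conversation_event_handler(logger: logging.Logger, stream=None) -> logging.Handler:
    # sys.__stdout__ is the interpreter's original stdout, so the handler
    # keeps writing there even if sys.stdout is replaced later.
    target = stream if stream is not None else sys.__stdout__
    handler = logging.StreamHandler(target)
    handler.addFilter(ConversationEventFilter())
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return handler
```

With this in place, `logger.info("conversation_event", extra={"event_fields": {...}})` reaches the JSON handler, while any other message is dropped by the filter.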

- Log events to centralized logging (Datadog) instead of writing to
  container filesystem via bash commands
- Add run_id parameter to distinguish parallel runs
- Log metadata only for large events (>64KB) to avoid size limits
- Events now persist beyond pod lifetime and are queryable by
  run_id + instance_id

This fixes two issues:
1. 'Argument list too long' errors when persisting large events
2. Loss of event data when runtime pods are killed
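
The behaviour listed above (full JSON for small events, metadata only above 64KB, failures tolerated) can be sketched roughly as follows. Only the name `build_event_persistence_callback` and the 64KB threshold come from this description; the signature, the `log` and `max_bytes` parameters, the `event_fields` attribute, and the assumption that an event is a plain mapping with a `type` key are illustrative guesses, not the PR's actual implementation.

```python
import json
import logging
from typing import Any, Callable, Mapping

logger = logging.getLogger("conversation_events")
MAX_EVENT_BYTES = 64 * 1024  # events larger than this log metadata only


def build_event_persistence_callback(
    run_id: str,
    instance_id: str,
    attempt: int,
    log: logging.Logger = logger,      # hypothetical parameter
    max_bytes: int = MAX_EVENT_BYTES,  # hypothetical parameter
) -> Callable[[Mapping[str, Any]], None]:
    context = {"run_id": run_id, "instance_id": instance_id, "attempt": attempt}

    def callback(event: Mapping[str, Any]) -> None:
        try:
            body = json.dumps(event, default=str)
            size = len(body.encode("utf-8"))
            if size <= max_bytes:
                # Small event: log the full JSON payload.
                fields = {**context, "event": body}
                log.info("conversation_event", extra={"event_fields": fields})
            else:
                # Large event: log only its type and size.
                fields = {**context, "event_type": event.get("type"), "event_size": size}
                log.info("conversation_event_metadata", extra={"event_fields": fields})
        except Exception:
            # Persistence is best effort; never break the evaluation run.
            log.warning("failed to persist conversation event", exc_info=True)

    return callback
```

Measuring the size of the UTF-8 encoded body rather than the character count keeps the check aligned with byte-based limits.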
Collaborator Author

Updated to persist events via logging instead of bash commands.

Changes:

  • Events are now logged to centralized logging (Datadog) instead of writing to container filesystem
  • Added run_id parameter to distinguish parallel evaluation runs
  • Large events (>64KB) log metadata only to avoid size limits
  • Events persist beyond pod lifetime and are queryable by run_id + instance_id

Query example:

run_id:eval_out/swebench-lite-test/... instance_id:django__django-11490

Benefits:

  1. Fixes Argument list too long errors when persisting large events via bash
  2. Events survive pod termination (no more lost data when runtimes are killed)
  3. Searchable/queryable in Datadog UI

@simonrosenberg changed the title from "Persist conversation events to runtime" to "Persist conversation events to datadog" on Jan 10, 2026
@openhands-ai

openhands-ai bot commented Jan 10, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Pre-commit checks

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #285 at branch `feature/issue-284-conversation-events`

Feel free to include any additional details that might help me get this PR into a better state.

Contributor

@neubig neubig left a comment

This looks good, but could we make it configurable, so that it is enabled only when an environment variable is set?
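
One possible shape for such a toggle, as a rough sketch only: the variable name `PERSIST_CONVERSATION_EVENTS` and both helper functions are hypothetical and do not appear in this PR.

```python
import os
from typing import Callable, List, Mapping, Optional

ENV_FLAG = "PERSIST_CONVERSATION_EVENTS"  # hypothetical variable name
TRUTHY = {"1", "true", "yes", "on"}


def persistence_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the flag is set to a truthy value."""
    env = os.environ if env is None else env
    return env.get(ENV_FLAG, "").strip().lower() in TRUTHY


def maybe_callbacks(callback: Callable, env: Optional[Mapping[str, str]] = None) -> List[Callable]:
    """Return [callback] when persistence is enabled, otherwise an empty list."""
    return [callback] if persistence_enabled(env) else []
```

Defaulting to off when the variable is unset keeps existing runs unchanged unless someone opts in.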

4 participants