Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(llmobs): add sampling for ragas skeleton code #10719

Merged
merged 60 commits into from
Oct 9, 2024

Conversation

lievan
Copy link
Contributor

@lievan lievan commented Sep 19, 2024

V0 sampling implementation for evaluator runner. The evaluator runner is a period service that stores a list of (evaluations to generate / "evaluators") on finished LLM obs span events. The runner will have a list of sampling rules that it applies to spans before triggering any evaluator on that span.

Evaluator sampler rules are configured by setting _DD_LLMOBS_EVALUATOR_SAMPLING_RULES to a json list of rules

[
 {"sample_rate": ..., "evaluator_label":.., "span_name": ...}, {...
]

Each rule consists of the following:

  • sample_rate (required)
  • evaluator_label (optional, the evaluator name)
  • span_name (optional, the span name). Not that for APM trace sampling rules, span_name is just name. But since we're dealing with both evaluator names/labels and span names, and perhaps more names such as ml app in the future, I think it's better to be more verbose for clarity.

Supporting sampling rules based on evaluator label and span name is key since most evaluators do not apply to all types of spans. For example, a faithfulness evaluation only applies to an LLM generation that uses a ground truth reference context.

Example Usage:

_DD_LLMOBS_EVALUATOR_SAMPLING_RULES=‘[{"sample_rate":0.5, “evaluator_label”: “ragas_faithfulness”, “span_name”: ”augmented_generation"}]' python3 app.py

Code Changes:

  1. The evaluation runner buffer now includes both the span event dict and also a span object. This is because the sampler the span._trace_id_64bits field is used for sampling
  2. We implement a brand new EvaluatorSampler helper class that the EvaluationRunner uses for sampling. The EvaluationRunner stores one instance of EvaluatorSampler, which internally stores a list of EvaluatorSamplingRule(s). The EvaluatorSamplingRule class inherits from SamplingRule so we can re-use some helpful utilities e.g. the sample method.
  3. Rule matching is basic-string-equality only for now.

Follow ups

  • implement more matching capabilities for rules
  • support sampling on more span fields, e.g ml app

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

Copy link
Contributor

github-actions bot commented Sep 19, 2024

CODEOWNERS have been resolved as:

ddtrace/llmobs/_evaluators/sampler.py                                   @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_evaluator_runner.send_score_metric.yaml  @DataDog/ml-observability
ddtrace/llmobs/_evaluators/runner.py                                    @DataDog/ml-observability
ddtrace/llmobs/_trace_processor.py                                      @DataDog/ml-observability
tests/llmobs/_utils.py                                                  @DataDog/ml-observability
tests/llmobs/conftest.py                                                @DataDog/ml-observability
tests/llmobs/test_llmobs_evaluator_runner.py                            @DataDog/ml-observability
tests/llmobs/test_llmobs_service.py                                     @DataDog/ml-observability

tests/llmobs/test_llmobs_evaluator_runner.py Outdated Show resolved Hide resolved
tests/llmobs/test_llmobs_evaluator_runner.py Outdated Show resolved Hide resolved
ddtrace/llmobs/_evaluators/runner.py Outdated Show resolved Hide resolved
ddtrace/llmobs/_evaluators/runner.py Outdated Show resolved Hide resolved
ddtrace/llmobs/_evaluators/runner.py Outdated Show resolved Hide resolved
ddtrace/llmobs/_evaluators/sampler.py Outdated Show resolved Hide resolved
ddtrace/llmobs/_evaluators/sampler.py Outdated Show resolved Hide resolved
ddtrace/llmobs/_evaluators/sampler.py Outdated Show resolved Hide resolved
@datadog-dd-trace-py-rkomorn
Copy link

datadog-dd-trace-py-rkomorn bot commented Sep 19, 2024

Datadog Report

Branch report: evan.li/ragas-skeleton-with-sampling
Commit report: 22cb049
Test service: dd-trace-py

✅ 0 Failed, 170 Passed, 780 Skipped, 1m 18.06s Total duration (13m 12.97s time saved)

@pr-commenter
Copy link

pr-commenter bot commented Sep 19, 2024

Benchmarks

Benchmark execution time: 2024-10-09 14:17:16

Comparing candidate commit 2ae2917 in PR branch evan.li/ragas-skeleton-with-sampling with baseline commit b9708de in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 371 metrics, 53 unstable metrics.

@lievan lievan added the changelog/no-changelog A changelog entry is not required for this PR. label Sep 30, 2024
Base automatically changed from evan.li/ragas-skeleton to main September 30, 2024 19:48
@lievan lievan marked this pull request as ready for review October 1, 2024 15:18
@lievan lievan requested a review from a team as a code owner October 1, 2024 15:18
Copy link
Contributor

@sabrenner sabrenner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic here lgtm! just left one clarifying question but idt it's blocking.

ddtrace/llmobs/_evaluators/sampler.py Outdated Show resolved Hide resolved
ddtrace/llmobs/_evaluators/sampler.py Outdated Show resolved Hide resolved
Copy link
Member

@Kyle-Verhoog Kyle-Verhoog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just did a quick first pass, i'll take a closer look later today

ddtrace/llmobs/_evaluators/sampler.py Outdated Show resolved Hide resolved
tests/llmobs/test_llmobs_evaluator_runner.py Outdated Show resolved Hide resolved
Copy link
Member

@Kyle-Verhoog Kyle-Verhoog left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm otherwise, my comments are optional to address if you'd like

@lievan lievan enabled auto-merge (squash) October 9, 2024 13:35
@lievan lievan merged commit 24389a7 into main Oct 9, 2024
568 of 569 checks passed
@lievan lievan deleted the evan.li/ragas-skeleton-with-sampling branch October 9, 2024 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
changelog/no-changelog A changelog entry is not required for this PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants