chore(llmobs): add sampling for ragas skeleton code #10719

lievan · 2024-09-19T14:36:55Z

V0 sampling implementation for the evaluator runner. The evaluator runner is a periodic service that stores a list of evaluators (the evaluations to generate) to run on finished LLM Obs span events. The runner holds a list of sampling rules that it applies to a span before triggering any evaluator on that span.

Evaluator sampling rules are configured by setting _DD_LLMOBS_EVALUATOR_SAMPLING_RULES to a JSON list of rules:

[
 {"sample_rate": ..., "evaluator_label":.., "span_name": ...}, {...
]

Each rule consists of the following:

sample_rate (required)
evaluator_label (optional, the evaluator name)
span_name (optional, the span name). Note that for APM trace sampling rules, span_name is just name. But since we're dealing with both evaluator names/labels and span names, and perhaps more names such as ml app in the future, I think it's better to be more verbose for clarity.
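
As a rough illustration of the rule format described above, the following sketch parses a JSON list of rules into simple objects. The `Rule` dataclass and `parse_rules` function are hypothetical names used only for this example and are not the classes added in this PR; the check that `sample_rate` lies between 0 and 1 is likewise an assumption.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    """Hypothetical container for one rule (not the PR's EvaluatorSamplingRule)."""

    sample_rate: float
    evaluator_label: Optional[str] = None  # None means "any evaluator"
    span_name: Optional[str] = None  # None means "any span"


def parse_rules(raw: str) -> List[Rule]:
    """Parse a JSON list of rule objects and validate the required field."""
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("sampling rules must be a JSON list")
    rules = []
    for entry in data:
        if not isinstance(entry, dict) or "sample_rate" not in entry:
            raise ValueError("each rule must be an object with a sample_rate")
        rate = float(entry["sample_rate"])
        # Assumption: rates are probabilities in the closed interval [0, 1].
        if not 0.0 <= rate <= 1.0:
            raise ValueError("sample_rate must be between 0 and 1")
        rules.append(Rule(rate, entry.get("evaluator_label"), entry.get("span_name")))
    return rules
```

Leaving `evaluator_label` or `span_name` out of a rule yields `None` for that field, which a matcher can treat as a wildcard.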

Supporting sampling rules based on evaluator label and span name is key since most evaluators do not apply to all types of spans. For example, a faithfulness evaluation only applies to an LLM generation that uses a ground truth reference context.

Example Usage:

_DD_LLMOBS_EVALUATOR_SAMPLING_RULES='[{"sample_rate":0.5, "evaluator_label": "ragas_faithfulness", "span_name": "augmented_generation"}]' python3 app.py
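
Shell quoting of JSON is easy to get wrong, so one option is to build the value in Python with `json.dumps`. This is only a sketch: it assumes the variable is read when LLM Observability starts, so it would need to be set before that point.

```python
import json
import os

# Same rule as in the example usage, expressed as Python data.
rules = [
    {
        "sample_rate": 0.5,
        "evaluator_label": "ragas_faithfulness",
        "span_name": "augmented_generation",
    }
]

# json.dumps always emits straight double quotes, so the value is valid JSON.
encoded = json.dumps(rules)
os.environ["_DD_LLMOBS_EVALUATOR_SAMPLING_RULES"] = encoded
```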

Code Changes:

The evaluation runner buffer now holds both the span event dict and the span object, because the sampler uses the span._trace_id_64bits field to make its sampling decision.
We implement a new EvaluatorSampler helper class that the EvaluationRunner uses for sampling. The EvaluationRunner stores one instance of EvaluatorSampler, which internally stores a list of EvaluatorSamplingRule(s). The EvaluatorSamplingRule class inherits from SamplingRule so we can re-use some helpful utilities, e.g. the sample method.
Rule matching is basic string equality only for now.
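
The matching and sampling flow described in this list can be sketched roughly as follows. This is not the code in ddtrace/llmobs/_evaluators/sampler.py: `SketchRule`, `should_evaluate`, and the hash constants are hypothetical, and both the "first matching rule wins" order and the "evaluate when no rule matches" default are assumptions made for the example.

```python
MAX_ID = 2**64
HASH_FACTOR = 1111111111111111111  # illustrative multiplicative-hash constant


class SketchRule:
    """Hypothetical stand-in for a sampling rule with string-equality matching."""

    def __init__(self, sample_rate, evaluator_label=None, span_name=None):
        self.sample_rate = sample_rate
        self.evaluator_label = evaluator_label
        self.span_name = span_name

    def matches(self, evaluator_label, span_name):
        # A field left as None acts as a wildcard; otherwise exact equality.
        label_ok = self.evaluator_label is None or self.evaluator_label == evaluator_label
        name_ok = self.span_name is None or self.span_name == span_name
        return label_ok and name_ok

    def sample(self, trace_id):
        # Deterministic: the same trace id always yields the same decision.
        return (trace_id * HASH_FACTOR) % MAX_ID <= self.sample_rate * MAX_ID


def should_evaluate(rules, evaluator_label, span_name, trace_id):
    """Apply the first matching rule to decide whether to run the evaluator."""
    for rule in rules:
        if rule.matches(evaluator_label, span_name):
            return rule.sample(trace_id)
    return True  # assumption: with no matching rule, the evaluator runs
```

Hashing the trace id instead of drawing a random number keeps the decision stable for a given trace, which is presumably why the buffer needs the span object in addition to the span event.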

Follow ups

implement more matching capabilities for rules
support sampling on more span fields, e.g. ml app

Checklist

PR author has checked that all the criteria below are met
The PR description includes an overview of the change
The PR description articulates the motivation for the change
The change includes tests OR the PR description describes a testing strategy
The PR description notes risks associated with the change, if any
Newly-added code is easy to change
The change follows the library release note guidelines
The change includes or references documentation updates if necessary
Backport labels are set (if applicable)

Reviewer Checklist

Reviewer has checked that all the criteria below are met
Title is accurate
All changes are related to the pull request's stated goal
Avoids breaking API changes
Testing strategy adequately addresses listed risks
Newly-added code is easy to change
Release note makes sense to a user of the library
If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
Backport labels are set in a manner that is consistent with the release branch maintenance policy

github-actions · 2024-09-19T14:37:30Z

CODEOWNERS have been resolved as:

ddtrace/llmobs/_evaluators/sampler.py                                   @DataDog/ml-observability
tests/llmobs/llmobs_cassettes/tests.llmobs.test_llmobs_evaluator_runner.send_score_metric.yaml  @DataDog/ml-observability
ddtrace/llmobs/_evaluators/runner.py                                    @DataDog/ml-observability
ddtrace/llmobs/_trace_processor.py                                      @DataDog/ml-observability
tests/llmobs/_utils.py                                                  @DataDog/ml-observability
tests/llmobs/conftest.py                                                @DataDog/ml-observability
tests/llmobs/test_llmobs_evaluator_runner.py                            @DataDog/ml-observability
tests/llmobs/test_llmobs_service.py                                     @DataDog/ml-observability

datadog-dd-trace-py-rkomorn · 2024-09-19T14:50:23Z

Datadog Report

Branch report: evan.li/ragas-skeleton-with-sampling
Commit report: 22cb049
Test service: dd-trace-py

✅ 0 Failed, 170 Passed, 780 Skipped, 1m 18.06s Total duration (13m 12.97s time saved)

pr-commenter · 2024-09-19T15:15:33Z

Benchmarks

Benchmark execution time: 2024-10-09 14:17:16

Comparing candidate commit 2ae2917 in PR branch evan.li/ragas-skeleton-with-sampling with baseline commit b9708de in branch main.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 371 metrics, 53 unstable metrics.

ddtrace/llmobs/_evaluators/runner.py

sabrenner

the logic here lgtm! just left one clarifying question but idt it's blocking.

ddtrace/llmobs/_evaluators/sampler.py

…gas-skeleton-with-sampling

…aDog/dd-trace-py into evan.li/ragas-skeleton-with-sampling

Kyle-Verhoog

just did a quick first pass, i'll take a closer look later today

ddtrace/llmobs/_evaluators/sampler.py

tests/llmobs/test_llmobs_evaluator_runner.py

Kyle-Verhoog

lgtm otherwise, my comments are optional to address if you'd like

…gas-skeleton-with-sampling

lievan added 15 commits September 13, 2024 14:51

implement ragas faithfulenss runner with dummy ragas score generator

571d317

remove newline

4b3d840

pydantic v1

7b9c929

refactor into evaluator list

2e883a0

add unit tests

7b31443

fix expectde span event

13229bd

merg conf

b493e20

remove config option, use only env var

d290dcd

address comments

6e49cca

refactor into one evaluator service

fcf9991

dont cancel futures

10b276f

refactor dummy faithfulness into class

b6fa4e0

sampler

60d4e5c

sampler wip

064b770

fix tests'

cd478bf

datadog-datadog-prod-us1 bot reviewed Sep 19, 2024

refactor into seperate file

3910c40

datadog-datadog-prod-us1 bot reviewed Sep 19, 2024

ddtrace/llmobs/_evaluators/sampler.py (outdated, resolved)

ddtrace/llmobs/_evaluators/sampler.py (outdated, resolved)

ddtrace/llmobs/_evaluators/sampler.py (outdated, resolved)

rename field to label

be893a1

lievan and others added 8 commits September 19, 2024 11:28

change sampling rules parsing

7044405

Merge branch 'main' into evan.li/ragas-skeleton

a309330

Merge branch 'main' of github.com:DataDog/dd-trace-py into evan.li/ra…

d849067

…gas-skeleton

Merge branch 'evan.li/ragas-skeleton' of github.com:DataDog/dd-trace-…

fd73621

…py into evan.li/ragas-skeleton

dont use self.choose_matcher

64df97b

think about rate limiting

f5b6518

wip tests

22cb049

Merge branch 'main' of github.com:DataDog/dd-trace-py into evan.li/ra…

2f48461

…gas-skeleton

lievan added 3 commits September 27, 2024 16:00

change parsing error msg

177e7f3

fix test

8e4b376

add comment for default sampling

3ff78f7

lievan added the changelog/no-changelog A changelog entry is not required for this PR. label Sep 30, 2024

lievan added 3 commits September 30, 2024 14:28

fix wrong patch

f43dceb

Merge branch 'evan.li/ragas-skeleton' into evan.li/ragas-skeleton-wit…

0cf3312

…h-sampling

update key names for sampling rules

6b3c301

lievan commented Sep 30, 2024

ddtrace/llmobs/_evaluators/runner.py (resolved)

Base automatically changed from evan.li/ragas-skeleton to main September 30, 2024 19:48

lievan added 3 commits September 30, 2024 19:29

merge conf

b932e96

don't try to handle decoder errors

e9fdec6

rm some extra lines

7a85591

lievan marked this pull request as ready for review October 1, 2024 15:18

lievan requested a review from a team as a code owner October 1, 2024 15:18

sabrenner approved these changes Oct 3, 2024

ddtrace/llmobs/_evaluators/sampler.py (outdated, resolved)

ddtrace/llmobs/_evaluators/sampler.py (outdated, resolved)

lievan and others added 5 commits October 3, 2024 21:59

update type annotations

16c368b

Merge branch 'main' of github.com:DataDog/dd-trace-py into evan.li/ra…

c11d058

…gas-skeleton-with-sampling

Merge branch 'main' into evan.li/ragas-skeleton-with-sampling

c4cdb32

union type

4c699f3

Merge branch 'evan.li/ragas-skeleton-with-sampling' of github.com:Dat…

86b1e27

…aDog/dd-trace-py into evan.li/ragas-skeleton-with-sampling

Kyle-Verhoog reviewed Oct 7, 2024

ddtrace/llmobs/_evaluators/sampler.py (outdated, resolved)

tests/llmobs/test_llmobs_evaluator_runner.py (outdated, resolved)

Kyle-Verhoog approved these changes Oct 7, 2024

lievan and others added 5 commits October 7, 2024 15:39

soft fail for invalid floats

18a7dae

rm newline

8e521cb

Merge branch 'main' of github.com:DataDog/dd-trace-py into evan.li/ra…

d0718a1

…gas-skeleton-with-sampling

add a dummy evaluator and a casette

098d8fa

Merge branch 'main' into evan.li/ragas-skeleton-with-sampling

2ae2917

lievan enabled auto-merge (squash) October 9, 2024 13:35

lievan merged commit 24389a7 into main Oct 9, 2024
568 of 569 checks passed

lievan deleted the evan.li/ragas-skeleton-with-sampling branch October 9, 2024 16:34
