
Conversation


@0xmrree 0xmrree commented Dec 10, 2025

Adds a new telemetry-trace-sample-rate CLI argument for both the VC and the BN to adjust the sampling rate for OpenTelemetry tracing spans. The default is now 1%.

#8554

Proposed Changes

Add the CLI argument, then set the sampler in the OpenTelemetry object constructor using the variant below:
https://docs.rs/opentelemetry_sdk/0.30.0/opentelemetry_sdk/trace/enum.Sampler.html#variant.TraceIdRatioBased

Additional Info

Didn't add unit tests for this, given the low risk and the low ROI of adding a meaningful test in this case.

Testing

Ran cargo nextest run -p lighthouse --release and all tests passed; see below.

...
        PASS [   0.021s] lighthouse::lighthouse_tests validator_manager::validator_import_defaults
        PASS [   0.022s] lighthouse::lighthouse_tests validator_manager::validator_import_misc_flags
        PASS [   0.013s] lighthouse::lighthouse_tests validator_manager::validator_import_missing_both_file_flags
        PASS [   0.017s] lighthouse::lighthouse_tests validator_manager::validator_import_missing_token
        PASS [   0.014s] lighthouse::lighthouse_tests validator_manager::validator_import_using_both_file_flags
        PASS [   0.019s] lighthouse::lighthouse_tests validator_manager::validator_list_defaults
        PASS [   0.019s] lighthouse::lighthouse_tests validator_manager::validator_move_count
        PASS [   0.019s] lighthouse::lighthouse_tests validator_manager::validator_move_defaults
        PASS [   0.021s] lighthouse::lighthouse_tests validator_manager::validator_move_misc_flags_0
        PASS [   0.021s] lighthouse::lighthouse_tests validator_manager::validator_move_misc_flags_1
        PASS [   0.023s] lighthouse::lighthouse_tests validator_manager::validator_move_misc_flags_2
        PASS [  25.886s] lighthouse::lighthouse_tests beacon_node::validator_monitor_file_flag
        PASS [  26.837s] lighthouse::lighthouse_tests beacon_node::validator_monitor_metrics_threshold_custom
        PASS [  26.960s] lighthouse::lighthouse_tests beacon_node::validator_monitor_metrics_threshold_default
        PASS [  23.081s] lighthouse::lighthouse_tests beacon_node::validator_monitor_pubkeys_flag
        PASS [  47.206s] lighthouse::lighthouse_tests beacon_node::test_builder_disable_ssz_flag
        PASS [  20.507s] lighthouse::lighthouse_tests beacon_node::wss_checkpoint_flag
        PASS [  15.336s] lighthouse::lighthouse_tests beacon_node::zero_ports_flag
────────────
     Summary [ 696.662s] 310 tests run: 310 passed, 0 skipped

Member

@eserilev eserilev left a comment


Looks good, just a few minor things and it should be ready for another round of review. Thanks!

Comment on lines 695 to 696
let sample_rate = matches
.get_one::<f64>("telemetry-trace-sample-rate")
Member


i think we should keep this a usize and allow values between 0 - 100. we try to avoid using f64 in our codebase in general and i dont think we need more granularity here than full percentage points

Author

@0xmrree 0xmrree Dec 11, 2025


Hmm ok, I'm down to convert it to u8, but it needs to be converted to an f64 ratio when we set the variant.

see below

TraceIdRatioBased(f64)

Sample a given fraction of traces. Fractions >= 1 will always sample. If the parent span is sampled, then its child spans will automatically be sampled. Fractions < 0 are treated as zero, but spans may still be sampled if their parent is. Note: if this is used, then all spans in a trace will be sampled if the first span is sampled, because the decision is based on the trace_id, not the span_id.
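To make the quoted behaviour concrete, here is a minimal sketch of a ratio-based decision wrapped in a parent-based one. It is an illustration only and not the SDK's code: the Decision enum and both function names are hypothetical, and mapping the low bits of the trace ID onto a 63-bit range is an assumption about how such a sampler can be written.

```rust
/// Hypothetical stand-in for a sampling decision (not the SDK's type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Decision {
    Sample,
    Drop,
}

/// Decide from the trace ID alone, so every span that shares a trace ID
/// gets the same answer.
pub fn ratio_decision(trace_id: u128, ratio: f64) -> Decision {
    // Fractions >= 1 always sample.
    if ratio >= 1.0 {
        return Decision::Sample;
    }
    // Fractions < 0 are treated as zero.
    let ratio = ratio.max(0.0);
    // Scale the ratio onto a 63-bit range (assumed layout for this sketch).
    let upper_bound = (ratio * (1u64 << 63) as f64) as u64;
    // Use the low 64 bits of the trace ID, reduced to 63 bits, as the
    // pseudo-random value to compare against the bound.
    let value = (trace_id as u64) >> 1;
    if value < upper_bound {
        Decision::Sample
    } else {
        Decision::Drop
    }
}

/// Follow the parent's decision when there is a parent; otherwise fall
/// back to the ratio-based decision for root spans.
pub fn parent_based_decision(
    parent_sampled: Option<bool>,
    trace_id: u128,
    ratio: f64,
) -> Decision {
    match parent_sampled {
        Some(true) => Decision::Sample,
        Some(false) => Decision::Drop,
        None => ratio_decision(trace_id, ratio),
    }
}
```

With a 1% ratio, only root spans whose trace ID falls in the lowest 1% of the range are sampled, while a span with a sampled parent is kept regardless of the ratio.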

Author

@0xmrree 0xmrree Dec 11, 2025


Gonna use u8 instead of usize, given ValueParserFactory does not support range for usize, if that's cool hehe
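A rough sketch of the u8 approach discussed here, assuming whole percentage points in 0..=100 that are converted to the f64 ratio the sampler expects. Both helper names are hypothetical, and the hand-written range check is only a standard-library stand-in for the validation the CLI parser would perform in the PR.

```rust
/// Hypothetical helper: parse a whole-number percentage in 0..=100.
/// In the PR this range check is delegated to the CLI argument parser.
pub fn parse_sample_percent(raw: &str) -> Result<u8, String> {
    let value: u8 = raw
        .trim()
        .parse()
        .map_err(|e| format!("invalid percentage {raw:?}: {e}"))?;
    if value > 100 {
        return Err(format!("percentage {value} is outside 0..=100"));
    }
    Ok(value)
}

/// Hypothetical helper: convert a percentage into the ratio form
/// (percentage / 100) that a ratio-based sampler takes.
pub fn percent_to_ratio(percent: u8) -> f64 {
    f64::from(percent) / 100.0
}
```

Keeping the user-facing value an integer avoids f64 in the CLI surface; the float only appears at the point where the ratio is handed to the sampler.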

@eserilev eserilev added UX-and-logs waiting-on-author The reviewer has suggested changes and awaits their implementation. tracing backwards-incompat Backwards-incompatible API change labels Dec 11, 2025
@eserilev
Member

tagging as backwards incompatible because the default trace span sampling is reduced from 100% to 1%

@0xmrree
Author

0xmrree commented Dec 12, 2025

@eserilev should be good to go for another review. I just used the config in the BN to keep the tests simple, following a similar pattern to the other CLI flag tests.

@0xmrree 0xmrree requested a review from eserilev December 12, 2025 21:06
Member

@eserilev eserilev left a comment


Really close! just one small thing and it should be good to go in my opinion

Comment on lines 696 to 702
// Calculate sample percent as a ratio (percentage / 100)
telemetry_sample_ratio = Some(matches
    .get_one::<u8>("telemetry-trace-sample-rate")
    .copied()
    .unwrap_or(1) as f64 / 100.0);
let sampler = opentelemetry_sdk::trace::Sampler::ParentBased(Box::new(
    opentelemetry_sdk::trace::Sampler::TraceIdRatioBased(telemetry_sample_ratio.unwrap()),
Member

@eserilev eserilev Jan 8, 2026


EDIT: safe math is probably overkill here tbh, i think maybe just removing the Some wrapping and unwrapping is fine

I think we can remove the Some wrapping and unwrapping and also introduce safe math here

maybe something like:

Suggested change
- // Calculate sample percent as a ratio (percentage / 100)
- telemetry_sample_ratio = Some(matches
-     .get_one::<u8>("telemetry-trace-sample-rate")
-     .copied()
-     .unwrap_or(1) as f64 / 100.0);
- let sampler = opentelemetry_sdk::trace::Sampler::ParentBased(Box::new(
-     opentelemetry_sdk::trace::Sampler::TraceIdRatioBased(telemetry_sample_ratio.unwrap()),
+ // Calculate sample percent as a ratio (percentage / 100)
+ telemetry_sample_ratio = matches
+     .get_one::<u8>("telemetry-trace-sample-rate")
+     .copied()
+     .unwrap_or(1) as f64.safe_div(100.0).unwrap_or(0.01);
+ let sampler = opentelemetry_sdk::trace::Sampler::ParentBased(Box::new(
+     opentelemetry_sdk::trace::Sampler::TraceIdRatioBased(telemetry_sample_ratio),

you'll also need to add the safe math crate to lighthouse/Cargo.toml

Author


i think safe_arith only implements its traits for integers and not f64, probably because dividing an f64 by 0 yields infinity instead of panicking, so ya i'll leave the / 100.0. i would also assume the sdk handles the infinity case, but we are fine either way since we are hard-coding the 100.0.

but ya I can remove the some on the ratio for sure
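A small sketch of the difference this reply relies on: floating-point division by zero produces infinity or NaN rather than panicking, whereas integer division by zero has to be guarded. Both function names are hypothetical, and the standard library's checked_div stands in here for checked integer arithmetic in general; whether safe_arith covers f64 is taken from the comment above, not verified.

```rust
/// f64 division never panics: a zero divisor gives infinity (or NaN for
/// 0.0 / 0.0) under IEEE 754 rules.
pub fn float_div(numerator: f64, divisor: f64) -> f64 {
    numerator / divisor
}

/// Integer division by zero would panic, so a checked variant returns
/// None instead of a value when the divisor is zero.
pub fn int_div(numerator: u64, divisor: u64) -> Option<u64> {
    numerator.checked_div(divisor)
}
```

With a hard-coded non-zero divisor such as 100.0, neither special case can arise, which is why plain division is sufficient for the ratio.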

@eserilev
Member

eserilev commented Jan 8, 2026

i think you might also need to run cargo fmt --all (see the CI failures for details)

@0xmrree 0xmrree force-pushed the trace_sampling_default branch from 3cf9c1a to 62c8f0d Compare January 11, 2026 22:53
@0xmrree
Author

0xmrree commented Jan 11, 2026

@eserilev changed from Option to f64, removed unused module, formatted, and ran cargo nextest run --release -p lighthouse telemetry_sample_rate - should be good to go

@0xmrree 0xmrree requested a review from eserilev January 11, 2026 22:56

Labels

backwards-incompat Backwards-incompatible API change tracing UX-and-logs waiting-on-author The reviewer has suggested changes and awaits their implementation.
