This document describes the tail-call unwinding implementation that enables profile-bee to capture stack traces deeper than the BPF verifier's instruction limit allows in a single program invocation.
The BPF verifier limits the complexity of eBPF programs to prevent infinite loops and ensure termination. This limit is approximately 4,096 instructions per program. The DWARF unwinding loop in dwarf_copy_stack() has nested loops:
- Outer loop:
MAX_DWARF_STACK_DEPTHiterations (~21) - Mapping lookup loop:
MAX_PROC_MAPSiterations (~8) - Binary search loop:
MAX_BIN_SEARCH_DEPTHiterations (~16)
Total worst-case complexity: ~21 × 8 × 16 = 2,688 instructions, which approaches the verifier limit. Increasing MAX_DWARF_STACK_DEPTH beyond 21-22 causes the verifier to reject the program on kernel 5.10.
Use bpf_tail_call() to split the unwinding loop across multiple program invocations. Each tail call resets the verifier's instruction budget, allowing us to:
- Unwind 5 frames per invocation (keeping verifier happy)
- Support up to 33 tail calls (kernel limit)
- Achieve total depth of 5 × 33 = 165 frames
This is the approach used by production profilers like opentelemetry-ebpf-profiler and parca-agent.
##Implementation
Per-CPU state that persists across tail calls:
pub struct DwarfUnwindState {
pub pointers: [u64; 1024], // Frame array (RIP values)
pub frame_count: usize, // Current depth
pub current_ip: u64, // RIP for next frame
pub sp: u64, // Stack pointer
pub bp: u64, // Base pointer
pub tgid: u32, // Process ID
pub mapping_count: u32, // Number of mappings
}#[map(name = "unwind_state")]
pub static UNWIND_STATE: PerCpuArray<DwarfUnwindState> = PerCpuArray::with_max_entries(1, 0);Stores the unwinding state in a per-CPU map, accessible across tail calls.
#[map(name = "prog_array")]
pub static PROG_ARRAY: ProgramArray = ProgramArray::with_max_entries(4, 0);Holds file descriptors of eBPF programs for tail-call dispatch.
pub const FRAMES_PER_TAIL_CALL: usize = 5;
pub const MAX_TAIL_CALLS: usize = 33;
pub const MAX_DWARF_STACK_DEPTH: usize = FRAMES_PER_TAIL_CALL * MAX_TAIL_CALLS; // 165
pub const LEGACY_MAX_DWARF_STACK_DEPTH: usize = 21;fn collect_trace(ctx: &PerfEventContext) {
// ... existing code ...
if dwarf_enabled() {
dwarf_copy_stack_init(ctx, tgid);
// Tail call handles the rest
}
}fn dwarf_copy_stack_init(ctx: &PerfEventContext, tgid: u32) {
let regs = ctx.as_ptr() as *const pt_regs;
let state = UNWIND_STATE.get_ptr_mut(0).unwrap();
// Initialize state
state.pointers[0] = regs.rip;
state.current_ip = regs.rip;
state.sp = regs.rsp;
state.bp = regs.rbp;
state.tgid = tgid;
state.frame_count = 1;
// Tail-call into step function
unsafe { PROG_ARRAY.tail_call(ctx, 0); }
}#[perf_event]
fn dwarf_unwind_step(ctx: PerfEventContext) {
let state = UNWIND_STATE.get_ptr_mut(0).unwrap();
// Unwind up to FRAMES_PER_TAIL_CALL frames
// (each frame does an LPM trie lookup to find its exec mapping)
for _ in 0..FRAMES_PER_TAIL_CALL {
if !dwarf_unwind_one_frame(state) {
// Done unwinding
return;
}
}
// More frames to unwind - tail-call back into ourselves
unsafe { PROG_ARRAY.tail_call(&ctx, 0); }
}fn dwarf_unwind_one_frame(state: &mut DwarfUnwindState) -> bool {
// LPM trie lookup for exec mapping containing current_ip
// Binary search unwind table
// Compute CFA
// Read return address
// Update state (sp, bp, ip, frame_count)
// Return true if more frames, false if done
}The userspace code must register the step program in the PROG_ARRAY after loading:
pub fn setup_tail_call_unwinding(bpf: &mut Ebpf) -> Result<(), anyhow::Error> {
let step_prog: &Program = bpf.program("dwarf_unwind_step")?;
let step_fd = step_prog.fd()?;
let mut prog_array: ProgramArray = bpf.map_mut("prog_array")?.try_into()?;
prog_array.set(0, step_fd, 0)?;
Ok(())
}DwarfUnwindStatestructure addedUNWIND_STATEandPROG_ARRAYmaps added- Constants updated (
MAX_DWARF_STACK_DEPTH = 165) - Legacy implementation preserved (
LEGACY_MAX_DWARF_STACK_DEPTH = 21)
Intermediate step: replaced nested outer/inner loop with a single flat for _ in 0..LEGACY_MAX_DWARF_STACK_DEPTH loop. This was superseded by true tail-call chaining (see below).
Implemented real PROG_ARRAY.tail_call() dispatch, achieving the full MAX_DWARF_STACK_DEPTH = 165 frames. Verified with 200-level recursion (captures 166 frames, hitting the limit exactly).
Changes made:
-
dwarf_try_tail_call()initializes per-CPUDwarfUnwindStateand tail-calls into step program -
dwarf_unwind_stepas#[perf_event]tail-callable program (unwinds 5 frames per invocation) -
dwarf_finalize_stack()copies results from state to STORAGE and submits to maps/ring buffer -
DwarfUnwindStateextended with finalization context (stack_ids, cmd, cpu, initial regs) -
setup_tail_call_unwinding()in userspace loads step program and registers it inPROG_ARRAY - Legacy 21-frame inline path (
dwarf_copy_stack_regs) preserved as automatic fallback when tail calls are unavailable (kprobe/uprobe contexts, older kernels, registration failure)
- Add
setup_tail_call_unwinding()toebpf.rs - Call registration after eBPF load (in main binary when
--dwarfenabled) - Add runtime detection of tail-call support (currently: graceful fallback if tail call fails)
- Fallback to legacy mode on older kernels (automatic: failed tail call falls through to inline path)
- Add test for 50+ frame depth (
test_dwarf_deepstack— captures full 57 frames from 50-level recursion) - Verify 165-frame ceiling (200-level recursion captures 166 frames)
- Verify performance (should be similar to legacy)
- Test on kernels 5.4, 5.10, 5.15, 6.x
5 frames keeps each invocation well under the verifier limit while maximizing total depth:
- Each frame requires: mapping lookup (8 iterations) + binary search (16 iterations) + CFA computation
- 5 frames ≈ 5 × (8 + 16 + overhead) ≈ 600-800 instructions
- Well under 4,096 limit with safety margin
- eBPF stack is limited (~512 bytes)
DwarfUnwindStateis ~8.2 KB (1024 frames × 8 bytes + metadata)- Per-CPU maps are fast (no locking) and persist across tail calls
- Each CPU gets its own state, no contention
Signal frames (CFA_REG_DEREF_RSP) work the same in tail-call mode:
- Read RIP/RSP/RBP from
ucontext_tat fixed offsets - Continue unwinding from interrupted frame
- No special tail-call handling needed
- Tail call adds ~20-50 CPU cycles vs direct function call
- For 21-frame stack: 21/5 = 4 tail calls ≈ 80-200 cycles
- Negligible compared to
bpf_probe_read_usercost (~500 cycles each)
UNWIND_STATE: 1 entry × 8.2 KB = 8.2 KB per CPU- For 128 CPUs: ~1 MB total
- Acceptable for production use
- Tail calls: Linux 4.2+ (introduced in commit 04fd61ab36ec)
- Program arrays: Linux 4.2+
- Per-CPU arrays: Linux 3.19+
fn dwarf_copy_stack(...) -> ... {
if tail_call_supported() {
dwarf_copy_stack_with_tail_calls(...)
} else {
dwarf_copy_stack_legacy(...) // 21-frame limit
}
}Runtime detection:
fn tail_call_supported() -> bool {
// Check kernel version or try to load a test tail-call program
kernel_version() >= Version::new(4, 2, 0)
}- opentelemetry-ebpf-profiler native_stack_trace.ebpf.c
- eBPF tail calls documentation
- BPF verifier documentation
- Parca Agent DWARF unwinding
- Elastic blog on profiling without frame pointers
- Adaptive frame count: Adjust
FRAMES_PER_TAIL_CALLbased on program complexity - Stack truncation marker: Add sentinel value when hitting 165-frame limit
- Metrics: Track actual tail-call depth distribution
- Hybrid mode: Use FP unwinding for known-FP binaries, DWARF for others