Skip to content

Conversation

@gabotechs
Copy link
Collaborator

@gabotechs gabotechs commented Feb 2, 2026

I think I messed something up in #306 and I see some leaf stages are missing their metrics.

This PR solves it and adds a new test that fails on main but succeeds in this PR.

@gabotechs gabotechs force-pushed the gabrielmusat/fix-metrics-display-on-leaf-nodes branch from a2311e6 to a33ebd5 Compare February 2, 2026 19:06
Comment on lines -69 to -71
inner: self.inner.clone(),
inner: Arc::clone(&self.inner).with_new_children(children.clone())?,
metrics: self.metrics.clone(),
children: Some(children),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem I found was that, when this method was called, the self.children where being updated, but not the actual children from self.inner.

This is fine as long as you access MetricsWrapperExec with the children() method, as then, the ones just inject will be returned, but it's not fine if you access the children through input_stage(), which is what was happening in display_plan_ascii.

I think it should be fine with just removing this self.children in favor of just applying the new children to self.inner directly, but I may be missing something.

WDYT @jayshrivastava?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sounds okay.

I forget why with_new_children is even called on MetricsWrapperExec. It means we're transforming a node after its wrapped...

I made this change intentionally though - we used to error in this method and I remember updating it so we don't error. Might be in the git blame.

assert_not_contains!(display, "metrics=[]");

let display = display_plan_ascii(d_physical.as_ref(), true);
assert_not_contains!(display, "metrics=[]");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might as well use the node_metrics helper. You can also assert the exact output rows since that's very deterministic (or at least, assert they are equal).


fn metrics(&self) -> Option<MetricsSet> {
Some(self.worker_connections.metrics.clone_inner())
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the other network boundary nodes too? Like shuffle, coalese.

self.input
.execute(*actual_partition_number, context)?
.inspect_ok(move |v| metric.add(v.num_rows())),
)))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's some assertions in #314 which are made weaker to skip over PartitionIsolatorExec nodes. Do you think you can revisit those assertions?

Ok((s_physical, d_physical))
}

fn node_metrics<T: ExecutionPlan + 'static>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds like we missed this bug because we only call this for DataSourceExec. Want to update the tests in this file to call this for Network*Exec?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants