-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Add trace of consumers to OOM error messages #17943
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
d703964
to
cf6f807
Compare
…eservations to make child reservations
cf6f807
to
c7e869f
Compare
"Resources exhausted: Additional allocation failed for ParquetSink(ArrowColumnWriter(col=1)) with top memory consumers (across reservations) as: | ||
ParquetSink(ArrowColumnWriter(col=8))#ID(can spill: false) consumed x KB, peak x KB: | ||
stack backtrace: | ||
0: ParquetSink(ArrowColumnWriter(col=8))#ID(can spill: false) consumed x KB, peak x KB | ||
1: ParquetSink(ParallelColumnWriters)#ID(can spill: false) consumed x B, peak x B | ||
2: ParquetSink(ParallelWriter)#ID(can spill: false) consumed x B, peak x B | ||
, | ||
ParquetSink(ArrowColumnWriter(col=14))#ID(can spill: false) consumed x KB, peak x KB: | ||
stack backtrace: | ||
0: ParquetSink(ArrowColumnWriter(col=14))#ID(can spill: false) consumed x KB, peak x KB | ||
1: ParquetSink(ParallelColumnWriters)#ID(can spill: false) consumed x B, peak x B | ||
2: ParquetSink(ParallelWriter)#ID(can spill: false) consumed x B, peak x B | ||
, | ||
ParquetSink(ArrowColumnWriter(col=0))#ID(can spill: false) consumed x KB, peak x KB: | ||
stack backtrace: | ||
0: ParquetSink(ArrowColumnWriter(col=0))#ID(can spill: false) consumed x KB, peak x KB | ||
1: ParquetSink(ParallelColumnWriters)#ID(can spill: false) consumed x B, peak x B | ||
2: ParquetSink(ParallelWriter)#ID(can spill: false) consumed x B, peak x B | ||
, | ||
ParquetSink(ArrowColumnWriter(col=2))#ID(can spill: false) consumed x KB, peak x KB: | ||
stack backtrace: | ||
0: ParquetSink(ArrowColumnWriter(col=2))#ID(can spill: false) consumed x KB, peak x KB | ||
1: ParquetSink(ParallelColumnWriters)#ID(can spill: false) consumed x B, peak x B | ||
2: ParquetSink(ParallelWriter)#ID(can spill: false) consumed x B, peak x B | ||
, | ||
ParquetSink(ArrowColumnWriter(col=1))#ID(can spill: false) consumed x KB, peak x KB: | ||
stack backtrace: | ||
0: ParquetSink(ArrowColumnWriter(col=1))#ID(can spill: false) consumed x KB, peak x KB | ||
1: ParquetSink(ParallelColumnWriters)#ID(can spill: false) consumed x B, peak x B | ||
2: ParquetSink(ParallelWriter)#ID(can spill: false) consumed x B, peak x B | ||
. | ||
Error: Failed to allocate additional x KB for ParquetSink(ArrowColumnWriter(col=1)) with x KB already allocated for this reservation - x KB remain available for the total pool", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example of using the parent/child relationship to build a trace of consumers.
Currently, this approach is limited to the current way that memory reservations work. Meaning, the parent's bytes (consumed & peak) do NOT include cummulative from all the children. If this is desired, we can make this change at report generation time using the snapshot ReportedConsumer
(to not hold the lock).
Thank you for this awesome work! Making memory consumers hierarchical makes sense to me. There is one consideration: the existing memory pool strategies (e.g. Regarding the error message format, IIUC now it's displaying the lineage of the consumer that triggered OOM, for instance OOM error happened in
I think this lineage information might not be the most straightforward for troubleshooting. How about just printing the whole picture instead? It would be showing top memory consumers, along with all its child consumers recursively down to the leaf. Also, here is a related issue, I think we want to make sure the changes are compatible (otherwise choose one to proceed): #17901 |
@andygrove FYI |
Thanks @wiedld my question similar to @2010YOUY01 how to use the lineage, it sounds very interesting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @wiedld
pub fn new_with_parent(name: impl Into<String>, parent_id: usize) -> Self { | ||
Self { | ||
name: name.into(), | ||
can_spill: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be always false? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's meant to be an alternative to MemoryConsumer::new
, and it's expected that the caller would switch from:
MemoryConsumer::new("foo").with_can_spill(true)
To have lineage with:
MemoryConsumer::new_wth_parent("foo", parent_id).with_can_spill(true)
Although, this does point out that the added MemoryReservation::new_child_reservation
should provide a spill config option. It's currently:
datafusion/datafusion/execution/src/memory_pool/mod.rs
Lines 485 to 492 in 7f45ee8
/// Create a new [`MemoryReservation`] with a new [`MemoryConsumer] that | |
/// is a child of this reservation's consumer. | |
/// | |
/// This is useful for creating memory consumers with lineage tracking. | |
pub fn new_child_reservation(&self, name: impl Into<String>) -> MemoryReservation { | |
MemoryConsumer::new_with_parent(name, self.consumer().id()) | |
.register(&self.registration.pool) | |
} |
I'll go add that now. Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added: a77e316
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than add a bunch of new APIs for creating new with parent id, how about just adding a method like
fn with_parent_id(mut self, parent_id: ..) -> Self {
..
}
That mirrors the other fields?
That might also make the defaults less confusing
The lineage is created by using child reservations. I made an example commit here, using the parallel parquet writing: That is why in the following commit, when I switch to using the reservation stacktrace (based on lineage) for the OOM error reporting (a.k.a. If I started using the child reservations for other physical plan nodes, I would also expect to see a similar change in the OOM error messaging -- to include these traces. |
I should also clarify. The request (in the issue) was to help better debug OOM ing, but also ways to see how the different memory consumers are related (at least in this example the lineage was apparent to me 🤷🏼 ). I choose specific abstractions in hopes that we could repurpose the bits. We have a lineage of reservations, and a new snapshot construct (to perhaps grab snapshots of the locked state during a running query). I then created the stack trace, and used it for the OOM error message, as one specific example. Although we could also use it for other memory analysis/visualization tooling. 🤔 |
@2010YOUY01 - Should I make this change to the OOM messages? Or should we use the snapshots & lineage trees with another debug tool which shows all consumers?
The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks like a nice change -- thank you @wiedld
@2010YOUY01 and @rluvaton perhaps you might also be interested in this one
I left a few API comments, but the overall idea of introducing parent ids to memory reservations seems like a great first step towards more fine grained control
let parallel_options_clone = parallel_options.clone(); | ||
let pool = Arc::clone(context.memory_pool()); | ||
// Create a reservation for the parallel parquet writing | ||
let reservation = MemoryConsumer::new("ParquetSink(ParallelWriter)") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be a new child reservation of the reservation
above ? Or maybe I am misreading the diff and it already is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the MemoryConsumer::new(format("ParquetSink[path={path}]"))
on line 1273? That one is for the non-parallel writing.
Whereas this reservation here is the root of the parallel path writing. Perhaps I should change naming?
name: String, | ||
can_spill: bool, | ||
id: usize, | ||
parent_id: Option<usize>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be nice to expand out the memory consumer documentation to mention how the parent/child relationship was used
pub fn new_with_parent(name: impl Into<String>, parent_id: usize) -> Self { | ||
Self { | ||
name: name.into(), | ||
can_spill: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than add a bunch of new APIs for creating new with parent id, how about just adding a method like
fn with_parent_id(mut self, parent_id: ..) -> Self {
..
}
That mirrors the other fields?
That might also make the defaults less confusing
pub fn new_child_reservation( | ||
&self, | ||
name: impl Into<String>, | ||
can_spill: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about always passing can_spill
here -- I think we can use the existing with_can_spill
API instead of forcing a bunch of bool
parameters
/// | ||
/// This is useful for creating memory consumers with lineage tracking, | ||
/// while dealing with multithreaded scenarios. | ||
pub fn cloned_reservation(&self) -> MemoryReservation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is strange to me that clone_with_new_id
doesn't also register the reservation 🤔
As now users have two APIs to think about and figure out which to use
Could we unify them somehow (e.g. maybe deprecate clone_with_new_id
(as a follow on PR)?)
Which issue does this PR close?
Drafted/proposed solution for #16904
Rationale for this change
Various changes to make the OOM error messages more readable.
If we agree with the basic approach, then I'll breakup this draft into smaller PRs for code review.
What changes are included in this PR?
General changes, NOT having to do with the OOM error stack:
Changes for the OOM consumer stack:
ReportedConsumer
which represents a snapshot: 0bc7630ConsumerStackTrace
: ab62fcbExample usage:
TrackConsumersPool::report_top
: c7e869fAre these changes tested?
Yes.
Are there any user-facing changes?
Only nicer error messages.