Add close() to producer #2039
Conversation
Pull Request Overview

Adds a close() method to the Kafka Producer class to enable explicit cleanup of producer resources. This change supports proper resource management and live credential-rotation use cases, and prevents memory leaks by allowing users to explicitly destroy producer instances; a usage sketch follows the summary bullets below.
- Implements Producer_close() function with proper resource cleanup
- Adds method definition to Producer_methods array with documentation
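As noted in the overview, the main motivations are explicit resource management and live credential rotation. The following is a minimal sketch of the intended pattern, not confirmed final API: the helper class is hypothetical, and the close() call reflects what this PR proposes.

```python
from confluent_kafka import Producer


class RotatingProducer:
    """Hypothetical helper illustrating the credential-rotation use case."""

    def __init__(self, conf):
        self.conf = conf
        self.producer = Producer(conf)

    def rotate(self, new_credentials):
        old = self.producer
        old.flush(10)    # drain outstanding messages first
        old.close()      # proposed in this PR: release librdkafka resources now
        self.conf.update(new_credentials)
        self.producer = Producer(self.conf)
```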
Hi @peter-lucia, thanks for the PR and the checks. Going through the comments:

librdkafka internally keeps a reference count for some C objects, so even if you call destroy in a different place and references are left, it would hang; any such problem would exist before this API change.
I tried reproducing it with your example but couldn't; this is the code I tried:

```python
import gc
import sys
import json
from time import sleep

from confluent_kafka import Producer


class Kafka:
    def __init__(self, conf):
        def error_cb(*args, **kwargs):
            print(args, kwargs)

        self.conf = {
            'socket.timeout.ms': 10,
            'error_cb': error_cb,
            'message.timeout.ms': 10,
            **conf
        }
        self.producer = Producer(self.conf)

    def recreate_producer(self):
        print('Create new producer')
        prior_producer = self.producer
        prior_producer.flush(10)
        # prior_producer.close()  # This would be preferred
        del prior_producer
        gc.collect()  # technically redundant, since garbage collection will happen after the method returns
        self.producer = Producer(self.conf)


while True:
    handler = Kafka({"bootstrap.servers": sys.argv[1]})
    msg = {"test": "test"}
    handler.producer.produce(json.dumps(msg))
    handler.recreate_producer()
    sleep(0.5)
```

A simple and reliable way to see if there are active librdkafka instances is to check the process threads and whether they're growing; after many iterations they stay the same:
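For reference, one minimal way to watch that thread count (not from the original thread; this assumes Linux and the /proc filesystem, since librdkafka's threads are native and invisible to Python's threading module):

```python
import os


def thread_count() -> int:
    # Count OS-level threads of the current process by listing /proc/<pid>/task.
    # Each live librdkafka instance keeps several native threads running, so a
    # steadily growing count suggests producers are not being destroyed.
    return len(os.listdir(f"/proc/{os.getpid()}/task"))


print("threads:", thread_count())
```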
If you find changes to the reproducer code that make it happen, or if it needs to be run for a longer time, please tell us.
Hi @emasab, thank you for taking a look at this PR and going through the investigation details. The early investigatory notes around librdkafka internals were included to give background on what's happening under the hood in librdkafka and how that relates to the Python package as we dug in. They were intended to document possible reasons why producer resources could live longer than they should, how memory leaks could happen, and how Python garbage collection of the producer object is affected. Your explanations and comments on our investigation notes generally match our latest understanding. Going through them:
Yes, initially it was more of a concern, but given that it doesn't appear to be used by the dealloc, we're in agreement: while it leaks memory, as long as it is not used now or in the future, it should be okay.
Yes, we didn't mean that it was disabling the GC entirely (if so, we'd have bigger issues!). We're aligned on this: the garbage collector is disabled for the Producer object.

Following up on this: the dealloc method doesn't confirm that the Kafka objects are destroyed. The call that waits until they are confirmed destroyed can block up to the provided timeout, so we can see why that might not have been done there; deallocs won't generally have timeouts provided, and a default would have to be chosen that might not fit all use cases. For the most certainty during app-initiated cleanup where timing matters, a confirmation that the cleanup actually succeeded is helpful, which is why we added it to Producer.close().
Yes, makes sense.
The newly added unit test

Can you reproduce this by running the unit tests but commenting out the p.close() calls we've added in all other unit tests except for

Expanding on this: one of the additions to producer.close() is a call to
Please let us know if you can reproduce our findings on your end through the unit tests, or if you see any discrepancies. We will look into the additional mechanisms for detecting lingering librdkafka instances, but believe the
Some observations:
I checked the failing tests and they're failing because the new code is calling

There's no need to call

So if you don't call

That doesn't mean that for other reasons the

About the telemetry push in
Let's dig into this a bit more. The close() call returns once all underlying Kafka objects are destroyed, or when the timeout (default 5 s) is reached, whichever comes first. In the two failing tests the timeout was reached first, indicating either that there were lingering producers still running or that the current number of active producers was greater than 0 (not cleaned up). GC runs right after the other test functions complete, and it should handle the cleanup of any lingering consumers or producers left over from other tests before new tests start. The fact that these two tests fail fairly consistently when we don't call

Relying on the GC alone may be allowing Python to move on and asynchronously create new producers before the previous ones have been cleared. This may explain why GC alone doesn't clean up all underlying producers.
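To make the timing concrete, here is a minimal sketch of the flow under discussion. The optional-timeout behavior (default 5 seconds) is taken from this comment and is an assumption about the proposed API, not confirmed final behavior.

```python
from confluent_kafka import Producer

p = Producer({"bootstrap.servers": "localhost:9092"})
p.produce("test-topic", b"payload")
p.flush(10)

# Proposed in this PR: block until the underlying librdkafka objects are
# confirmed destroyed, or until the (assumed) default 5-second timeout
# elapses. Hitting the timeout would mean lingering instances are still alive.
p.close()
```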
That isn't happening for sure: the cyclic garbage collector isn't triggered automatically there, and it isn't triggered at specific time intervals either; it happens when there is a certain number of
But there's one additional thing: the cyclic GC is only needed because the producer isn't automatically destroyed when its reference count reaches zero. I'll check if there's some cycle that can prevent destruction by refcount, or some dangling reference that prevents the count from reaching zero and calling the destructor immediately after the previous tests return.
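For anyone following along, a small sketch (not from the thread; the producer configuration is illustrative) of how such a threshold and refcount check can be done from Python:

```python
import gc
import sys

from confluent_kafka import Producer

p = Producer({"bootstrap.servers": "localhost:9092"})

# The cyclic collector runs when per-generation allocation counters cross
# these thresholds, not on a timer.
print("gc thresholds:", gc.get_threshold())
print("gc counts:", gc.get_count())

# sys.getrefcount() includes the temporary reference created by the call
# itself, so a value of 2 means only the local variable holds the object.
print("refcount:", sys.getrefcount(p))

# Anything still referring to the producer (other than this frame) would keep
# its refcount above zero and delay the destructor.
for ref in gc.get_referrers(p):
    print(type(ref))
```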
Yes, I think our understandings align here: the GC isn't happening, and as you say it isn't triggered automatically after the completion of the previous tests.
Yes
Did your checks reveal anything, or is this still an open question? I have updated the PR to no longer include the
@peter-lucia I finished my checks and found that in the failing tests what was preventing the immediate garbage collection was the exceptions thrown in

I understand there's a need for a more predictable producer disposal, just this way, after calling

```c
if (!self->rk) {
        PyErr_SetString(PyExc_RuntimeError,
                        "Consumer closed");
        return NULL;
}
```

If you prefer, we can continue and finish these changes.
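As a hedged illustration of the behavior being suggested (the guard above is the existing Consumer check; the Producer equivalent and its error message are assumptions, not confirmed API), a call into a closed producer would then raise instead of hanging:

```python
from confluent_kafka import Producer

p = Producer({"bootstrap.servers": "localhost:9092"})
p.close()  # proposed in this PR

try:
    # After close(), subsequent calls would hit a guard like the one quoted
    # above and raise rather than operate on destroyed librdkafka handles.
    p.produce("test-topic", b"payload")
except RuntimeError as exc:
    print(exc)
```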
That is fine by me, go ahead!
What

Before this change:
- No producer.close() method existed for producers.

After this change:
- Producers expose producer.close() to enable the producer to be shut down on demand.

Checklist

Test & Review

Other Information