Address goroutine leak with dynamically determined log destinations #1848
Conversation
```go
}

func (cd *cwDest) AddEvent(e logs.LogEvent) {
	if cd.stopped {
```
questionable nit: possible race condition here? If this reads a stale `false`, it'll try to push to a stopped pusher.
Though the consequences probably aren't too high (?). I haven't dug into the code so I'm not entirely sure.
That's a good point. We could accidentally write to a closed channel which will panic. We'll need to lock the cwDest for the duration of this call so it doesn't close in the middle.
```go
if cd.stopped {
	return fmt.Errorf("cannot publish events: destination has been stopped")
}
```
Just noticed there's already a `cd.stopped` check at the end of the function. I don't know why the check is at the end; I think we want to avoid handling any events if it's already stopped.
@@ -168,14 +166,17 @@ func (l *LogAgent) runSrcToDest(src LogSrc, dest LogDest) {

```go
	for e := range eventsCh {
		err := dest.Publish([]LogEvent{e})
		if err == ErrOutputStopped {
			log.Printf("I! [logagent] Log destination %v has stopped, finalizing %v/%v", l.destNames[dest], src.Group(), src.Stream())
```
There was actually a concurrent map write condition here. This non-sync map is written to in the `Run` routine and is potentially read from the `runSrcToDest` goroutine.
```go
destination := fileconfig.Destination
if destination == "" {
	destination = t.Destination
}
```
This variable is unused
```go
// start is the main loop for processing events and managing the queue.
func (q *queue) start() {
	defer q.wg.Done()
	mergeChan := make(chan logs.LogEvent)

	// Merge events from both blocking and non-blocking channel
	go func() {
```
I moved this to a named function just to make the pprof output a little nicer. Since this is an unnamed function, the name of the function in the output is a bit obfuscated: `start.func1+0xcf`:

```
github.com/aws/amazon-cloudwatch-agent/plugins/outputs/cloudwatchlogs/internal/pusher.(*queue).start.func1+0xcf github.com/aws/amazon-cloudwatch-agent/plugins/outputs/cloudwatchlogs/internal/pusher/queue.go:117
```

Still findable from the line number, just not super obvious.
```diff
 func (c *CloudWatchLogs) Write(metrics []telegraf.Metric) error {
-	for _, m := range metrics {
-		c.writeMetricAsStructuredLog(m)
-	}
-	return nil
+	// we no longer expect this to be used. We now use the OTel awsemfexporter for sending EMF metrics to CloudWatch Logs
+	return fmt.Errorf("unexpected call to Write")
```
To make cwDest thread-safe more easily, I removed this functionality. This function is here to adhere to the Telegraf interface, but it is no longer used. This functionality was for pushing EMF metrics to CloudWatch Logs using this telegraf-based output plugin. However, the agent no longer uses this plugin to push EMF metrics. EMF metrics now go through the OTel awsemfexporter plugin.
@@ -186,166 +199,107 @@ func (c *CloudWatchLogs) createClient(retryer aws.RequestRetryer) *cloudwatchlog

```go
	return client
}

func (c *CloudWatchLogs) writeMetricAsStructuredLog(m telegraf.Metric) {
```
Only used by `Write`, which is no longer used.
```go
}

func (c *CloudWatchLogs) getLogEventFromMetric(metric telegraf.Metric) *structuredLogEvent {
```
Only used by `Write`, which is no longer used.
```go
type structuredLogEvent struct {
```
Only used by `Write`, which is no longer used.
@@ -366,44 +318,10 @@ func (cd *cwDest) switchToEMF() {

```go
	}
}

// Description returns a one-sentence description on the Output
func (c *CloudWatchLogs) Description() string {
```
I moved these two CloudWatchLogs functions to the CloudWatchLogs block (above cwDest). Didn't make much sense to define these functions separately from the rest.
```go
func (cd *cwDest) NotifySourceStopped() {
	cd.Lock()
	defer cd.Unlock()
	cd.refCount--
	if cd.refCount <= 0 {
		cd.stop()
	}
}
```
refCount going negative is probably unexpected; we could add a debug/warning message here.
Good idea
Also, when possible, can you run the integ tests?

Was having trouble with the log tailer unit test on Windows again... Eventually worked after a couple of retries. Added the tag and kicked off integ tests.
Description of the issue
There are two primary issues causing leaking goroutines to drive up CloudWatch Agent memory usage over time.
Everliving Destinations
The CloudWatch Agent can publish logs from log files to CloudWatch Log Streams determined by the file name. Each log file the agent is reading from creates a "source" object (specifically the `tailerSrc` type) with several running goroutines, and each log stream that the CloudWatch Agent is pushing logs to creates a "destination" object (specifically the `cwDest` type) with several running goroutines. The source objects' goroutines are closed when the associated log file is closed, but the destination objects are never subsequently cleaned up. This causes a memory leak as the goroutines are never closed and keep piling up.

Dynamically generated log stream names can occur when using the `publish_multi_logs` flag. Here is a sample entry in the collect list:

For this config, the agent will periodically look for log files that match the `/tmp/test_logs/publish_multi_logs_*.log` glob pattern. For each one that it finds, it will write the contents of that file to the `test-log-rotation/rotiatest-test-stream-<logfilename>` log group/log stream. So each file it finds will create one new source object and one new destination object.

It's important to note that there may be several source objects referencing one destination. For example, a customer could use the following in their collect list to collect logs from different files but push to the same destination:
Duplicate cloudwatch logs clients
Each time a destination object is created, a new cloudwatch logs client is created. Request/response handlers are injected into the client to collect agent health metrics. These handlers have more underlying clients which have their own goroutines and caches. The underlying cache is a way to associate request data with response data, like Payload size and latency, which are eventually forwarded to the agent health metrics extension. The handlers are created by a middleware configurer. Each time a new cloudwatch logs client is created, a new middleware configurer is created, which creates new request/response handlers and spins up more goroutines. These clients have no way to be closed so the goroutines are never closed and keep piling up.
Description of changes
There are two sets of changes to address the two issues noted above.
Reference Counting Destinations
Add reference counting to the destination objects and a notification system for the source objects to tell the destinations they are no longer being used. When a destination object is no longer used, it stops itself and tells the CloudWatchLogs plugin it's no longer usable.
This implementation assumes that nothing other than the source objects are using the destination objects. There is some vestigial code left over from when the CloudWatchLogs telegraf plugin was used to process EMF metrics which could use the destination objects, but that functionality has been moved to the OTel awsemfexporter plugin. So that code path is more or less unreachable. I removed all functionality in the unused code path to make it easier to make cwDest thread-safe.
As mentioned previously, it is important to note that it is possible for multiple source objects to reference one destination. This possibility of sharing means source objects cannot close the destinations directly and a signaling mechanism is necessary.
Using the above JSON as an example, the agent will create two sources (one for each of the log files) and one destination object for the `test-log-rotation/shared-destination-stream` log group/log stream. When the file `shared_destination_to_close.txt` is closed, the source object will notify the destination object that the source has closed and is no longer referencing it. The destination will remain open since it knows `shared_destination.txt` is still using it. If `shared_destination.txt` is closed, then the destination will know it's no longer being used and will close itself, terminating the associated goroutines.

Single Middleware Configurer
Create one middleware configurer (per CloudWatchLogs instance, but it's a singleton, so there's only ever one configurer). The request/response handlers are all identical, so there's no need to create new ones, nor to create a new middleware configurer each time.

The one side effect is that destination objects will share request/response handlers, which means there is only one cache to support all requests/responses. There are no concurrency concerns here, as the request/response handlers already support concurrent requests/responses.
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
Configure agent with the following configuration:
Use the following python script to generate log files that match the configuration:
Requirements
Before committing your code, please do the following steps:

1. Run `make fmt` and `make fmt-sh`
2. Run `make lint`
Integration Tests
To run integration tests against this PR, add the `ready for testing` label.