
Conversation

@dricross dricross commented Sep 5, 2025

Description of the issue

There are two primary issues that leak goroutines and drive up CloudWatch Agent memory usage over time.

Everliving Destinations

The CloudWatch Agent can publish logs from log files to CloudWatch Log Streams determined by the file name. Each log file the agent is reading from creates a "source" object (specifically tailerSrc type) with several running goroutines, and each log stream that the CloudWatch Agent is pushing logs to creates a "destination" object (specifically cwDest type) with several running goroutines. The source objects' goroutines are closed when the associated log file is closed, but the destination objects are never subsequently cleaned up. This causes a memory leak as the goroutines are never closed and keep piling up.

Log stream names can be generated dynamically by using the publish_multi_logs flag. Here is a sample entry in the collect list:

          {
            "publish_multi_logs": true,
            "file_path": "/tmp/test_logs/publish_multi_logs_*.log",
            "log_group_name": "test-log-rotation",
            "log_stream_name": "rotation-test-stream",
            "timezone": "UTC",
            "timestamp_format": "%Y-%m-%dT%H:%M:%S",
            "multi_line_start_pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}",
            "retention_in_days": 1
          },

For this config, the agent will periodically look for log files that match the /tmp/test_logs/publish_multi_logs_*.log glob pattern. For each one it finds, it will write the contents of that file to the test-log-rotation/rotation-test-stream-<logfilename> loggroup/logstream. So each file it finds creates one new source object and one new destination object.

It's important to note that there may be several source objects referencing one destination. For example, a customer could use the following in their collect list to collect logs from different files but push to the same destination:

          {
            "file_path": "/tmp/test_logs/shared_destination.txt",
            "log_group_name": "test-log-rotation",
            "log_stream_name": "shared-destination-stream",
            "timezone": "UTC",
            "timestamp_format": "%Y-%m-%dT%H:%M:%S",
            "multi_line_start_pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}",
            "retention_in_days": 1
          },
          {
            "file_path": "/tmp/test_logs/shared_destination_to_close.txt",
            "log_group_name": "test-log-rotation",
            "log_stream_name": "shared-destination-stream",
            "timezone": "UTC",
            "timestamp_format": "%Y-%m-%dT%H:%M:%S",
            "multi_line_start_pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}",
            "retention_in_days": 1
          }

Duplicate CloudWatch Logs Clients

Each time a destination object is created, a new CloudWatch Logs client is created. Request/response handlers are injected into the client to collect agent health metrics. These handlers hold further underlying clients with their own goroutines and caches. The underlying cache associates request data with response data, such as payload size and latency, which are eventually forwarded to the agent health metrics extension. The handlers are created by a middleware configurer, and each new CloudWatch Logs client creates a new middleware configurer, which creates new request/response handlers and spins up more goroutines. These clients have no way to be closed, so the goroutines are never stopped and keep piling up.
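
For illustration, here is a minimal, self-contained sketch of the leak pattern; every name below is a hypothetical stand-in for the agent's internals, not its actual API:

package main

import (
	"fmt"
	"runtime"
	"time"
)

// configurer stands in for the middleware configurer: it starts a handler
// goroutine that nothing ever stops.
type configurer struct{ done chan struct{} }

func newConfigurer() *configurer {
	c := &configurer{done: make(chan struct{})}
	go func() {
		<-c.done // the channel is never closed, so this goroutine parks forever
	}()
	return c
}

func main() {
	// One destination per discovered log file, each building its own
	// configurer: every iteration leaks one goroutine.
	for i := 0; i < 1000; i++ {
		_ = newConfigurer()
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines:", runtime.NumGoroutine()) // ~1001, and never shrinking
}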

Description of changes

There are two sets of changes to address the two issues noted above.

Reference Counting Destinations

Add reference counting to the destination objects and a notification mechanism for the source objects to tell a destination when it is no longer being used. When a destination object is no longer referenced, it stops itself and tells the CloudWatchLogs plugin it is no longer usable.
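
A minimal sketch of the scheme (AddSource and the field names are assumptions; the NotifySourceStopped side matches the diff further down):

import "sync"

type cwDest struct {
	sync.Mutex
	refCount int
	stopped  bool
}

// AddSource is a hypothetical registration hook: each source object that
// starts using this destination bumps the count.
func (cd *cwDest) AddSource() {
	cd.Lock()
	defer cd.Unlock()
	cd.refCount++
}

// NotifySourceStopped is called by a source when its log file closes; the
// last source to leave stops the destination.
func (cd *cwDest) NotifySourceStopped() {
	cd.Lock()
	defer cd.Unlock()
	cd.refCount--
	if cd.refCount <= 0 {
		cd.stop()
	}
}

func (cd *cwDest) stop() {
	cd.stopped = true // also: stop pusher goroutines and deregister from the plugin
}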

This implementation assumes that nothing other than the source objects uses the destination objects. There is some vestigial code left over from when the CloudWatchLogs telegraf plugin was used to process EMF metrics, which could also use the destination objects, but that functionality has moved to the OTel awsemfexporter plugin, so the code path is effectively unreachable. I removed all functionality in the unused code path to make it easier to make cwDest thread-safe.

As noted above, multiple source objects may reference one destination. Because destinations can be shared, source objects cannot close them directly, and a signaling mechanism is necessary.

Using the above JSON as an example, the agent creates one source for each of the two log files and one destination object for the test-log-rotation/shared-destination-stream loggroup/logstream. When the file shared_destination_to_close.txt is closed, its source object notifies the destination that it is no longer referencing it. The destination remains open, since it knows shared_destination.txt is still using it. When shared_destination.txt is later closed, the destination knows it is no longer in use and closes itself, terminating the associated goroutines.

Single Middleware Configurer

Create one middleware configurer per CloudWatchLogs instance; since the plugin is a singleton, there is only ever one configurer. The request/response handlers are all identical, so there is no need to create new handlers, nor a new middleware configurer, for each client.

The one side effect is that destination objects share request/response handlers, which means a single cache backs all requests/responses. There are no concurrency concerns here, as the handlers already support concurrent requests/responses.
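
A sketch of the singleton, assuming sync.Once (the identifiers are illustrative, not the agent's exact names):

import "sync"

type middlewareConfigurer struct {
	// request/response handlers plus the shared cache that correlates
	// request data (payload size) with response data (latency)
}

var (
	configurerOnce   sync.Once
	sharedConfigurer *middlewareConfigurer
)

// getConfigurer returns the one process-wide configurer; every new
// CloudWatch Logs client reuses the same handlers and cache.
func getConfigurer() *middlewareConfigurer {
	configurerOnce.Do(func() {
		sharedConfigurer = &middlewareConfigurer{}
	})
	return sharedConfigurer
}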

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Configure the agent with the following configuration:

{
  "agent": {"debug":true},
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "publish_multi_logs": true,
            "file_path": "/tmp/test_logs/publish_multi_logs_*.log",
            "log_group_name": "test-log-rotation",
            "log_stream_name": "rotation-test-stream",
            "timezone": "UTC",
            "timestamp_format": "%Y-%m-%dT%H:%M:%S",
            "multi_line_start_pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}",
            "retention_in_days": 1
          },
          {
            "file_path": "/tmp/test_logs/shared_destination.txt",
            "log_group_name": "test-log-rotation",
            "log_stream_name": "shared-destination-stream",
            "timezone": "UTC",
            "timestamp_format": "%Y-%m-%dT%H:%M:%S",
            "multi_line_start_pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}",
            "retention_in_days": 1
          },
          {
            "file_path": "/tmp/test_logs/shared_destination_to_close.txt",
            "log_group_name": "test-log-rotation",
            "log_stream_name": "shared-destination-stream",
            "timezone": "UTC",
            "timestamp_format": "%Y-%m-%dT%H:%M:%S",
            "multi_line_start_pattern": "^\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}",
            "retention_in_days": 1
          }
        ]
      }
    }
  }
}

Use the following Python script to generate log files that match the configuration:

#!/usr/bin/env python3

import time
import os
from datetime import datetime
from threading import Thread

log_dir = "/tmp/test_logs"

def generate_shared_destination_logs():
    os.makedirs(log_dir, exist_ok=True)
    
    shared_file = open(f"{log_dir}/shared_destination.txt", 'w')
    shared_to_close_file = open(f"{log_dir}/shared_destination_to_close.txt", 'w')
    
    try:
        start_time = time.time()
        line_counter = 1
        
        while True:
            current_time = time.time()
            elapsed = current_time - start_time
            
            log_line = f"{datetime.now().isoformat()} INFO Processing request {line_counter}\n"
            
            # Write to shared_destination throughout
            shared_file.write(log_line)
            shared_file.flush()
            
            # Write to shared_destination_to_close only for first 20 seconds
            if elapsed < 20:
                shared_to_close_file.write(log_line)
                shared_to_close_file.flush()
            elif elapsed >= 20 and shared_to_close_file:
                shared_to_close_file.close()
                shared_to_close_file = None
                print("Closed shared_destination_to_close.txt after 20 seconds")
            
            line_counter += 1
            time.sleep(0.5)
            
    except KeyboardInterrupt:
        print("\nStopping log generation...")
    finally:
        shared_file.close()
        if shared_to_close_file:
            shared_to_close_file.close()


def generate_publish_multi_logs():
    log_dir = "/tmp/test_logs"
    os.makedirs(log_dir, exist_ok=True)

    file_counter = 1
    current_file = None

    try:
        while True:
            # Close previous file and open new one
            if current_file:
                current_file.close()
                os.remove(current_file.name)

            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"{log_dir}/publish_multi_logs_{timestamp}_{file_counter:03d}.log"
            current_file = open(filename, 'w')
            print(f"Created new log file: {filename}")

            # Write logs for 10 seconds
            start_time = time.time()
            line_counter = 1

            while time.time() - start_time < 10:
                log_line = f"{datetime.now().isoformat()} INFO [app_{file_counter:03d}] Processing request {line_counter}\n"
                current_file.write(log_line)
                current_file.flush()
                line_counter += 1
                time.sleep(0.5)  # Write every 500ms

            file_counter += 1

    except KeyboardInterrupt:
        print("\nStopping log generation...")
    finally:
        if current_file:
            current_file.close()

if __name__ == "__main__":
    multi_logs_thread = Thread(target = generate_publish_multi_logs, args = ())
    shared_destination_thread = Thread(target = generate_shared_destination_logs, args = ())
    multi_logs_thread.start()
    shared_destination_thread.start()

    multi_logs_thread.join()
    shared_destination_thread.join()

Requirements

Before committing your code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

Integration Tests

To run integration tests against this PR, add the ready for testing label.

@dricross dricross marked this pull request as ready for review September 5, 2025 14:40
@dricross dricross requested a review from a team as a code owner September 5, 2025 14:40
}

func (cd *cwDest) AddEvent(e logs.LogEvent) {
	if cd.stopped {
@duhminick duhminick commented Sep 5, 2025

questionable nit: possible race condition here? Let's say this reads a stale false; it'll then try to push to a stopped pusher.

Though the consequences probably aren't too high (?). I haven't dug into the code, so I'm not entirely sure.

@dricross (Contributor Author):

That's a good point. We could accidentally write to a closed channel which will panic. We'll need to lock the cwDest for the duration of this call so it doesn't close in the middle.
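
A sketch of the fix discussed here (only cwDest, AddEvent, and stopped appear in the diff; the pusher field is an assumption):

func (cd *cwDest) AddEvent(e logs.LogEvent) {
	cd.Lock()         // hold the lock for the whole call so stop() cannot
	defer cd.Unlock() // close the underlying channel mid-publish
	if cd.stopped {
		return
	}
	cd.pusher.AddEvent(e) // safe: stop() takes the same lock before closing
}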

Comment on lines 352 to 262
	if cd.stopped {
		return fmt.Errorf("cannot publish events: destination has been stopped")
	}
@dricross (Contributor Author):

Just noticed there's already a cd.stopped check at the end of the function. I don't know why the check is at the end; I think we want to avoid handling any events if it's already stopped.


@@ -168,14 +166,17 @@ func (l *LogAgent) runSrcToDest(src LogSrc, dest LogDest) {
	for e := range eventsCh {
		err := dest.Publish([]LogEvent{e})
		if err == ErrOutputStopped {
			log.Printf("I! [logagent] Log destination %v has stopped, finalizing %v/%v", l.destNames[dest], src.Group(), src.Stream())
@dricross (Contributor Author):

There was actually a concurrent map access race here. This unsynchronized map is written to in the Run routine and potentially read in the runSrcToDest goroutine.
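
One way to guard it (a sketch, not necessarily how this PR resolves it; requires "sync", and LogDest is the agent's destination interface):

type LogAgent struct {
	destNamesMu sync.Mutex
	destNames   map[LogDest]string
}

func (l *LogAgent) setDestName(d LogDest, name string) {
	l.destNamesMu.Lock()
	defer l.destNamesMu.Unlock()
	l.destNames[d] = name
}

func (l *LogAgent) destName(d LogDest) string {
	l.destNamesMu.Lock()
	defer l.destNamesMu.Unlock()
	return l.destNames[d]
}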

Comment on lines -251 to -255
	destination := fileconfig.Destination
	if destination == "" {
		destination = t.Destination
	}

@dricross (Contributor Author):

This variable is unused

// start is the main loop for processing events and managing the queue.
func (q *queue) start() {
	defer q.wg.Done()
	mergeChan := make(chan logs.LogEvent)

	// Merge events from both blocking and non-blocking channel
	go func() {
@dricross dricross commented Sep 5, 2025

I moved this to a named function just to make the pprof output a little nicer. Since this was an anonymous function, its name in the output is a bit obfuscated (start.func1+0xcf):

github.com/aws/amazon-cloudwatch-agent/plugins/outputs/cloudwatchlogs/internal/pusher.(*queue).start.func1+0xcf github.com/aws/amazon-cloudwatch-agent/plugins/outputs/cloudwatchlogs/internal/pusher/queue.go:117

Still findable from the line number, just not super obvious
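
The change is roughly the following (the method name merge is an assumption):

func (q *queue) start() {
	defer q.wg.Done()
	mergeChan := make(chan logs.LogEvent)
	go q.merge(mergeChan) // named method: pprof reports (*queue).merge instead of start.func1
	// ...
}

// merge funnels events from both the blocking and non-blocking channels
// into mergeChan.
func (q *queue) merge(mergeChan chan logs.LogEvent) {
	// ...
}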

@dricross dricross force-pushed the dricross/logfiletailerdebug branch from 52f58b5 to 2114e27 September 8, 2025 11:56
@dricross dricross force-pushed the dricross/logfiletailerdebug branch from 2114e27 to a2c0747 September 8, 2025 12:06
Comment on lines 107 to +109
func (c *CloudWatchLogs) Write(metrics []telegraf.Metric) error {
-	for _, m := range metrics {
-		c.writeMetricAsStructuredLog(m)
-	}
-	return nil
+	// we no longer expect this to be used. We now use the OTel awsemfexporter for sending EMF metrics to CloudWatch Logs
+	return fmt.Errorf("unexpected call to Write")
@dricross (Contributor Author):

To make cwDest thread-safe more easily, I removed this functionality. This function is here to adhere to the Telegraf interface, but it is no longer used. This functionality was for pushing EMF metrics to CloudWatch Logs using this telegraf-based output plugin. However, the agent no longer uses this plugin to push EMF metrics. EMF metrics now go through the OTel awsemfexporter plugin.

@@ -186,166 +199,107 @@ func (c *CloudWatchLogs) createClient(retryer aws.RequestRetryer) *cloudwatchlog
	return client
}

func (c *CloudWatchLogs) writeMetricAsStructuredLog(m telegraf.Metric) {
@dricross dricross commented Sep 8, 2025

Only used by Write, which is no longer used.

}

func (c *CloudWatchLogs) getLogEventFromMetric(metric telegraf.Metric) *structuredLogEvent {
@dricross (Contributor Author):

Only used by Write, which is no longer used.


type structuredLogEvent struct {
@dricross (Contributor Author):

Only used by Write, which is no longer used.

@@ -366,44 +318,10 @@ func (cd *cwDest) switchToEMF() {
	}
}

// Description returns a one-sentence description on the Output
func (c *CloudWatchLogs) Description() string {
@dricross (Contributor Author):

I moved these two CloudWatchLogs functions to the CloudWatchLogs block (above cwDest). It didn't make much sense to define these functions separately from the rest.

Comment on lines +275 to +282
func (cd *cwDest) NotifySourceStopped() {
	cd.Lock()
	defer cd.Unlock()
	cd.refCount--
	if cd.refCount <= 0 {
		cd.stop()
	}
}
@duhminick (Contributor):

refCount being negative is probably unexpected; we could add a debug/warning message here.

@dricross (Contributor Author):

Good idea
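
Something along these lines inside NotifySourceStopped would do it (the message format is illustrative):

	if cd.refCount < 0 {
		log.Printf("W! [cloudwatchlogs] Unexpected negative refCount (%d) for destination", cd.refCount)
	}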

@duhminick duhminick commented:

Also, when possible, can you run the integ tests?

@dricross dricross added the ready for testing Indicates this PR is ready for integration tests to run label Sep 8, 2025
@dricross dricross commented Sep 8, 2025

Also, when possible, can you run the integ tests?

Was having trouble with the log tailer unit test on Windows again... Eventually worked after a couple of retries. Added the tag and kicked off integ tests.
