@the-mann the-mann commented Nov 12, 2025

Blocked on amazon-contributing/opentelemetry-collector-contrib#385. I will update go.mod once that PR has been merged.

Description of the issue

Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 instances that enables customers to run applications requiring high levels of inter-node communications at scale on AWS. EFA provides lower and more consistent latency and higher throughput than traditional TCP transport, making it ideal for high-performance computing (HPC), machine learning (ML) training, and other distributed workloads.

Currently, the CloudWatch Agent does not collect several of the newer EFA metrics, which are critical for:

  • Detecting connectivity issues between nodes
  • Identifying packet retransmission patterns that may impact training performance

Description of changes

This PR adds collection of five key EFA metrics at the node, container, and pod levels:

  1. unresponsive_remote_events - Number of times remote endpoints became unresponsive
  2. impaired_remote_conn_events - Number of impaired remote connection events
  3. retrans_timeout_events - Number of retransmission timeout events
  4. retrans_pkts - Total number of retransmitted packets
  5. retrans_bytes - Total bytes retransmitted

These metrics are collected from the Linux sysfs interface exposed by the EFA driver and are aggregated at:

  • Node level: Overall EFA performance for the instance
  • Pod level: EFA metrics for specific workloads
  • Container level: Granular metrics per container

The implementation reads hardware counters from the sysfs filesystem and exposes them through the CloudWatch Agent's container insights metrics pipeline.
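
As a rough illustration of the sysfs read described above, here is a minimal Go sketch. It is not the code in this PR. The directory layout (`<root>/<device>/ports/<port>/hw_counters/<counter>`, with a root such as `/sys/class/infiniband`), the helper names `parseCounter`, `readDeviceCounters`, and `metricName`, and the choice to skip counters that an older driver does not expose are all assumptions made for the example. The `<level>_efa_<counter>` naming is inferred from `container_efa_retrans_bytes` in the diff.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// efaCounters lists the five counters named in this PR description.
var efaCounters = []string{
	"unresponsive_remote_events",
	"impaired_remote_conn_events",
	"retrans_timeout_events",
	"retrans_pkts",
	"retrans_bytes",
}

// parseCounter converts the raw contents of a counter file
// (a decimal number, usually followed by a newline) into a uint64.
func parseCounter(raw string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
}

// readDeviceCounters reads each counter in efaCounters from
// <root>/<device>/ports/<port>/hw_counters/<counter>.
// ASSUMPTION: this path layout is illustrative, not taken from the PR.
// Counter files that do not exist are skipped rather than treated as errors.
func readDeviceCounters(root, device, port string) (map[string]uint64, error) {
	out := make(map[string]uint64, len(efaCounters))
	for _, name := range efaCounters {
		path := filepath.Join(root, device, "ports", port, "hw_counters", name)
		raw, err := os.ReadFile(path)
		if os.IsNotExist(err) {
			continue // counter not exposed by this driver version
		}
		if err != nil {
			return nil, fmt.Errorf("read %s: %w", path, err)
		}
		v, err := parseCounter(string(raw))
		if err != nil {
			return nil, fmt.Errorf("parse %s: %w", path, err)
		}
		out[name] = v
	}
	return out, nil
}

// metricName builds a metric name such as "container_efa_retrans_bytes".
// ASSUMPTION: the "pod" and "node" levels follow the same pattern.
func metricName(level, counter string) string {
	return level + "_efa_" + counter
}
```

A caller would invoke something like `readDeviceCounters("/sys/class/infiniband", device, port)` for each EFA device and then emit each value under `metricName("node", name)`, `metricName("pod", name)`, or `metricName("container", name)`. Taking the root as a parameter lets a unit test point the reader at a temporary directory.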

Tests

Manual Testing:

  1. Created an EKS 1.33 cluster with an EFA-enabled node group (c6in.32xlarge instances)
  2. Installed the CloudWatch Observability EKS add-on
  3. Deployed a test DaemonSet requesting EFA devices (vpc.amazonaws.com/efa: 1)
  4. Built and deployed a development version of the CloudWatch Agent with these changes
  5. Verified that all five EFA metrics are collected and visible in the CloudWatch Metrics console (screenshot provided)

Testing Configuration:

  • Instance Type: c6in.32xlarge (EFA-enabled)
  • Region: us-west-2
  • Availability Zones: us-west-2a, us-west-2c
  • EKS Version: 1.33

The metrics were successfully published to CloudWatch and are queryable with appropriate dimensions (node, pod, container).

@the-mann the-mann requested a review from a team as a code owner November 12, 2025 18:04
@github-actions

This PR was marked stale due to lack of activity.

"container_efa_rdma_read_bytes",
"container_efa_rdma_write_bytes",
"container_efa_rdma_write_recv_bytes",
"container_efa_retrans_bytes",

The translation test fixture aws/amazon-cloudwatch-agent/translator/tocwconfig/sampleConfig/emf_and_kubernetes_with_gpu_config.yaml also needs to be updated.

Does emf_and_kubernetes_with_gpu_high_frequency_config.yaml need to be updated too?

@github-actions github-actions bot removed the Stale label Nov 21, 2025

github-actions bot commented Dec 1, 2025

This PR was marked stale due to lack of activity.

@github-actions github-actions bot added the Stale label Dec 1, 2025

Labels

ready for testing (indicates this PR is ready for integration tests to run), Stale
