Add new awsebsnvmereceiver #1603

Open
wants to merge 7 commits into
base: main

Conversation

duhminick
Contributor

@duhminick duhminick commented Mar 18, 2025

Note to Reviewers

This PR does not enable this new receiver. This is strictly just adding a new receiver that will be enabled later.

Changes 2:

  1. Use fmt.Sscanf instead of the custom parsing logic for the device file name
  2. Enable only one metric by default. This simplifies some of the logic for the upcoming translation changes. At least one metric must stay enabled by default, or else the tests generated by mdatagen fail.
  3. Rename Resources to Devices in the receiver config

Description of the issue

Elastic Block Store (EBS) exposes performance statistics for EBS volumes attached to EC2 instances as NVMe devices in a vendor-unique log page. The log page can be retrieved by making a system call to the NVMe device. CloudWatch Agent (CWA) will collect the retrieved metrics and emit them to CloudWatch.

Description of changes

  1. Main Scraper Implementation (scraper.go):

    • Implements the core scraping logic through the nvmeScraper struct
    • Uses a MetricsBuilder to construct standardized metrics
      • MetricsBuilder is generated using mdatagen with the schema defined in metadata.yaml
    • Handles device discovery and metric collection for EBS NVMe devices
      • Includes the ability to scrape all devices, or specific devices.
    • Implements safety checks for integer overflow protection
  2. NVMe Metrics Collection (internal/nvme):

    • Implements low-level NVMe device interaction through Linux ioctl calls
    • Provides structured metric collection through EBSMetrics struct
    • Collects key metrics including:
      • Read/Write operations and bytes
      • Total read/write times
      • IOPS and throughput exceeded counters
      • Queue length
      • Read/Write latency histograms (though these are not being collected at the moment)
  3. Generated components (internal/metadata):

    • All generated by mdatagen using the schema defined in metadata.yaml

The receiver collects the following metrics from EBS NVMe devices:

  • diskio_ebs_total_read_ops
  • diskio_ebs_total_write_ops
  • diskio_ebs_total_read_bytes
  • diskio_ebs_total_write_bytes
  • diskio_ebs_total_read_time
  • diskio_ebs_total_write_time
  • diskio_ebs_volume_performance_exceeded_iops
  • diskio_ebs_volume_performance_exceeded_tp
  • diskio_ebs_ec2_instance_performance_exceeded_iops
  • diskio_ebs_ec2_instance_performance_exceeded_tp
  • diskio_ebs_volume_queue_length

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

  1. Manually updated the YAML config to add a new instance of the EBS receiver and updated the existing host delta metrics pipeline to include the new receiver.
  • Attached an EBS volume to the instance
image

The EC2 instance that the manual tests were run on has two EBS volumes attached (nvme0 and nvme1)

Resource is explicitly empty

2025-03-18T19:35:03Z D! {"caller":"awsebsnvmereceiver/scraper.go:54","msg":"Began scraping for NVMe metrics","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics"}
2025-03-18T19:35:03Z D! {"caller":"awsebsnvmereceiver/scraper.go:133","msg":"skipping un-allowed device","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics","device":"nvme0"}
2025-03-18T19:35:03Z D! {"caller":"awsebsnvmereceiver/scraper.go:133","msg":"skipping un-allowed device","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics","device":"nvme0n1"}

One device (nvme0) is in resources

2025-03-18T19:36:24Z D! {"caller":"awsebsnvmereceiver/scraper.go:54","msg":"Began scraping for NVMe metrics","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics"}
2025-03-18T19:36:24Z D! {"caller":"awsebsnvmereceiver/scraper.go:133","msg":"skipping un-allowed device","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics","device":"nvme0n1"}
2025-03-18T19:36:24Z D! {"caller":"awsebsnvmereceiver/scraper.go:133","msg":"skipping un-allowed device","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics","device":"nvme0n1p1"}
2025-03-18T19:36:24Z D! {"caller":"awsebsnvmereceiver/scraper.go:101","msg":"emitted metrics for nvme device with controller id","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics","controllerID":0}

Wildcard (*) for resources

2025-03-18T19:38:30Z D! {"caller":"awsebsnvmereceiver/scraper.go:54","msg":"Began scraping for NVMe metrics","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics"}
2025-03-18T19:38:30Z D! {"caller":"awsebsnvmereceiver/scraper.go:101","msg":"emitted metrics for nvme device with controller id","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics","controllerID":0}
2025-03-18T19:38:30Z D! {"caller":"awsebsnvmereceiver/scraper.go:101","msg":"emitted metrics for nvme device with controller id","kind":"receiver","name":"awsebsnvmereceiver","data_type":"metrics","controllerID":1}

Sample Config

receivers:
    awsebsnvmereceiver:
        collection_interval: 1m0s
        initial_delay: 1s
        metrics:
            diskio_ebs_ec2_instance_performance_exceeded_iops:
                enabled: false
            diskio_ebs_ec2_instance_performance_exceeded_tp:
                enabled: false
            diskio_ebs_total_read_bytes:
                enabled: true
            diskio_ebs_total_read_ops:
                enabled: false
            diskio_ebs_total_read_time:
                enabled: false
            diskio_ebs_total_write_bytes:
                enabled: true
            diskio_ebs_total_write_ops:
                enabled: false
            diskio_ebs_total_write_time:
                enabled: false
            diskio_ebs_volume_performance_exceeded_iops:
                enabled: false
            diskio_ebs_volume_performance_exceeded_tp:
                enabled: false
            diskio_ebs_volume_queue_length:
                enabled: false
        resource_attributes:
            VolumeId:
                enabled: true
        devices:
            - nvme0n1
        timeout: 0s

Requirements

Before committing the code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

@duhminick duhminick force-pushed the dominic-nvme-receiver branch from 66fb288 to 546fb3e Compare March 18, 2025 15:04
@duhminick duhminick changed the base branch from main to ebs March 18, 2025 15:19
@duhminick duhminick force-pushed the dominic-nvme-receiver branch from 546fb3e to be8e93b Compare March 18, 2025 15:23
@duhminick duhminick marked this pull request as ready for review March 18, 2025 19:39
@duhminick duhminick requested a review from a team as a code owner March 18, 2025 19:39
@duhminick duhminick changed the base branch from ebs to main March 18, 2025 19:40
@lisguo
Contributor

lisguo commented Mar 25, 2025

"This PR should be merged into the ebs branch"

Is this still true? We are just adding the receiver not enabling it so putting it to main seems fine

@duhminick
Contributor Author

Is this still true? We are just adding the receiver not enabling it so putting it to main seems fine

Oh, got it. Then I'll leave the destination branch as is

@duhminick
Contributor Author

agent.json
output.txt

Attaching the agent config and translated configs

Contributor

@lisguo lisguo left a comment

Can you also provide a sample yaml config of the receiver?

namespace := -1
partition := -1

fmt.Sscanf(device, "nvme%dn%dp%d", &controller, &namespace, &partition)
Contributor

can we check if this returns an error? if so then we are unable to parse the device

Contributor Author

@duhminick duhminick Mar 26, 2025

I opted to ignore the error and chose to throw an error if we couldn't at least get the controller ID. The reason is that this function parses the three different device file name patterns (nvme{id}, nvme{id}n{namespace}, nvme{id}n{namespace}p{partition}). If the input isn't the third pattern, then there's an "EOF" error.

I could instead check for nvme%dn%dp%d, then nvme%dn%d, and lastly nvme%d. I guess that would give a better error message, but it seems like a lot of excess work. In that case, I kind of prefer the implementation I had before.

Or, we could do a regular expression like: ^nvme(\d+)(?:n(\d+))?(?:p(\d+))?$. https://regex101.com/r/5TjfEE/1
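
To make the regular-expression option concrete, here is a minimal sketch that applies the pattern quoted above to the three device file name forms. The deviceName type and parseDeviceName function are hypothetical names for illustration, not the code in this PR. Missing components are reported as -1, following the defaults in the snippet under review.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// deviceNamePattern is the expression proposed in the comment above.
var deviceNamePattern = regexp.MustCompile(`^nvme(\d+)(?:n(\d+))?(?:p(\d+))?$`)

// deviceName is a hypothetical result type for this sketch.
type deviceName struct {
	Controller int
	Namespace  int // -1 when the name has no namespace component
	Partition  int // -1 when the name has no partition component
}

// parseDeviceName returns an error for any name the pattern does not match,
// so a failed parse is never silently ignored.
func parseDeviceName(name string) (deviceName, error) {
	m := deviceNamePattern.FindStringSubmatch(name)
	if m == nil {
		return deviceName{}, fmt.Errorf("unrecognized nvme device name %q", name)
	}
	parsed := deviceName{Namespace: -1, Partition: -1}
	fields := []*int{&parsed.Controller, &parsed.Namespace, &parsed.Partition}
	for i, field := range fields {
		if m[i+1] == "" {
			continue // optional group did not participate in the match
		}
		n, err := strconv.Atoi(m[i+1])
		if err != nil {
			return deviceName{}, fmt.Errorf("parsing %q: %w", name, err)
		}
		*field = n
	}
	return parsed, nil
}

func main() {
	for _, name := range []string{"nvme0", "nvme0n1", "nvme0n1p1", "sda"} {
		parsed, err := parseDeviceName(name)
		fmt.Printf("%s: %+v (err: %v)\n", name, parsed, err)
	}
}
```

One caveat: as written, the pattern also accepts a partition without a namespace (for example nvme0p1), because both groups are independently optional. Nesting the partition group inside the namespace group would rule that out.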

Contributor

nit: Make it clear that it can throw an error and we're explicitly ignoring it.

Suggested change
- fmt.Sscanf(device, "nvme%dn%dp%d", &controller, &namespace, &partition)
+ _, _ = fmt.Sscanf(device, "nvme%dn%dp%d", &controller, &namespace, &partition)

}

// Check if all devices should be collected. Otherwise check if defined by user
_, hasAsterisk := s.allowedDevices["*"]
Contributor

do we have plans on filtering by devices?

Contributor Author

yep, it's possible to filter by devices already. it's just below the line you highlighted 😄

if _, isAllowed := s.allowedDevices[deviceName]; !isAllowed {
	s.logger.Debug("skipping un-allowed device", zap.String("device", deviceName))
	continue
}
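
For readers following the thread, the allow-list behaviour discussed here (an asterisk allows everything, otherwise only listed devices pass) can be sketched as below. The arrayToSet signature matches the helper quoted later in this review, but its body here is an assumption, and isAllowed is a hypothetical function that combines the two checks rather than the scraper's actual code.

```go
package main

import "fmt"

// arrayToSet builds a set from a slice. The signature matches the helper
// shown in this PR; the body is an assumed implementation.
func arrayToSet(arr []string) map[string]struct{} {
	set := make(map[string]struct{}, len(arr))
	for _, v := range arr {
		set[v] = struct{}{}
	}
	return set
}

// isAllowed is a hypothetical helper: "*" allows every device, otherwise
// the device name must be listed explicitly.
func isAllowed(allowed map[string]struct{}, deviceName string) bool {
	if _, hasAsterisk := allowed["*"]; hasAsterisk {
		return true
	}
	_, ok := allowed[deviceName]
	return ok
}

func main() {
	allowed := arrayToSet([]string{"nvme0"})
	for _, name := range []string{"nvme0", "nvme0n1", "nvme0n1p1"} {
		fmt.Println(name, isAllowed(allowed, name))
	}
}
```

With an empty list nothing is allowed, which is consistent with the "skipping un-allowed device" lines in the test logs above.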

Contributor

nit: Typically, in Go convention, the file would be named something like util_notunix.go like https://github.com/aws/amazon-cloudwatch-agent/blob/main/plugins/inputs/windows_event_log/windows_event_log_notwindows.go

Contributor Author

Oh. I was following what I found in the Go src. Example: https://github.com/golang/go/blob/master/src/net/cgo_stub.go. I can change it to what we do for the agent to keep it consistent though.

IsEbsDevice(device *DeviceFileAttributes) (bool, error)
}

type Util struct {
Contributor

nit: I know we use Util as a name all over the place, but it doesn't help me understand what this is. Would rather it be called like DeviceInfoProvider or something more descriptive.

Contributor

nit: Why is this a struct if it doesn't have a state? Why not just have the functions exported as package level functions?


package nvme

type UtilInterface interface {
Contributor

nit: In Go, Interface typically isn't included in interface names. The main purpose of an interface is to define a set of functions and use the interface to expose it. One common pattern is for the interface to be exported and the struct to not be.

type Util interface {
}

type util struct {
}
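
A slightly fuller sketch of that pattern, with an exported interface, an unexported implementation, and a constructor that returns the interface. The names DeviceInfoProvider (borrowed from the naming suggestion earlier in this review), unsupportedProvider, and NewDeviceInfoProvider are illustrative assumptions, and the method returns []string here instead of the PR's []DeviceFileAttributes to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// DeviceInfoProvider is the exported contract that callers depend on.
type DeviceInfoProvider interface {
	GetAllDevices() ([]string, error)
}

// unsupportedProvider is an unexported implementation, standing in for the
// stub used on platforms without NVMe support.
type unsupportedProvider struct{}

func (unsupportedProvider) GetAllDevices() ([]string, error) {
	return nil, errors.New("nvme not supported")
}

// NewDeviceInfoProvider hides the concrete type behind the interface.
func NewDeviceInfoProvider() DeviceInfoProvider {
	return unsupportedProvider{}
}

func main() {
	provider := NewDeviceInfoProvider()
	_, err := provider.GetAllDevices()
	fmt.Println(err)
}
```

Because callers only see the interface, a test can substitute a mock without touching the concrete type.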

import "errors"

func (u *Util) GetAllDevices() ([]DeviceFileAttributes, error) {
	return nil, errors.New("nvme stub: nvme not supported")
Contributor

nit: Don't include the "file name" in the error.

Suggested change
- return nil, errors.New("nvme stub: nvme not supported")
+ return nil, errors.New("nvme not supported")

)
}

func arrayToSet(arr []string) map[string]struct{} {
Contributor

if foundWorkingDevice {
	s.logger.Debug("emitted metrics for nvme device with controller id", zap.Int("controllerID", id))
} else {
	s.logger.Info("unable to get metrics for nvme device with controller id", zap.Int("controllerID", id))
Contributor

nit: Why is this an info log?

Comment on lines +7 to +11
NvmeDevicePrefix = "nvme"
DevDirectoryPath = "/dev"
NvmeSysDirectoryPath = "/sys/class/nvme"

EbsNvmeModelName = "Amazon Elastic Block Store"
Contributor

nit: Most of these are only used within the package and should not be exported.

Suggested change
- NvmeDevicePrefix = "nvme"
- DevDirectoryPath = "/dev"
- NvmeSysDirectoryPath = "/sys/class/nvme"
- EbsNvmeModelName = "Amazon Elastic Block Store"
+ devicePrefix = "nvme"
+ devDirectoryPath = "/dev"
+ sysDirectoryPath = "/sys/class/nvme"
+ ebsModelName = "Amazon Elastic Block Store"

Even DevDirectoryPath could avoid being exported if you had a simple function like

func DevPath(device string) string {
  return filepath.Join(devDirectoryPath, device)
}


devices := []DeviceFileAttributes{}
for _, entry := range entries {
	if strings.HasPrefix(entry.Name(), NvmeDevicePrefix) {
Contributor

nit: Likely not an issue in this directory, but you could check entry.IsDir() first and ignore those.

Comment on lines +57 to +60
var (
	ErrInvalidEBSMagic = errors.New("invalid EBS magic number")
	ErrParseLogPage    = errors.New("failed to parse log page")
)
Contributor

nit: If they aren't used outside of the package, don't export them.

Comment on lines +88 to +94
func nvmeReadLogPage(fd uintptr, logID uint8) ([]byte, error) {
	data := make([]byte, 4096) // 4096 bytes is the length of the log page.
	bufferLen := len(data)

	if bufferLen > math.MaxUint32 {
		return nil, errors.New("nvmeReadLogPage: bufferLen exceeds MaxUint32")
	}
Contributor

I don't get this check. If we define the length of the slice, how would this ever return an error?

3 participants