feat(baremetal_validator): Add Validator for Process, Container Metrics #1878

Closed
wants to merge 2,604 commits into from

Conversation

KaiyiLiu1234
Collaborator

This commit adds validation setups for RAPL-related bare-metal power metrics at the node, process, and container levels.

vprashar2929 and others added 30 commits August 6, 2024 11:39
…r-refactor

chore(validator): store test result in an object
…puting-io/dependabot/go_modules/go-dependencies-258b52e974

build(deps): bump the go-dependencies group with 2 updates
Add a workflow that runs the pre-commit jobs to
cover the case where a developer hasn't opted into
pre-commit. Also add the conventional commit message
hook and enable more golang linters. The commit also
tidies up the markdown files that had line length
issues.

Signed-off-by: Maryam Tahhan <[email protected]>
check metrics before and after e2e tests
This reverts commit cde7833.

Signed-off-by: Dave Tucker <[email protected]>
This reverts commit ca4cdf6.

Signed-off-by: Dave Tucker <[email protected]>
…ert-ringbuf

Revert "Merge pull request sustainable-computing-io#1628 from dave-tucker/new-ebpf"
…dator-node-info

chore(validator): fix missing node-info in md
By default the name chosen to define the Virtual Machine ID is the one
found in /proc/<pid>/cgroup/.

Many IaaS platforms like OpenStack use metadata in libvirt to customise,
among other things, the instance's ID.

Allow the user to use the metadata in libvirt to set the name of the
VM metrics, which gives a better user experience.

Set "LIBVIRT_METADATA_URI" with the uri of the XML namespace identifier.
Set "LIBVIRT_METADATA_TOKEN" to choose the string used for VM ID. By
default, the token "name" is used.

Signed-off-by: Anthony Harivel <[email protected]>
…move-model-globals

chore: cleanup globals in pkg/model
…gister-new-process

fix(bpf): use prev_tgid to register process
Update the register counters one step before the metrics sample is taken
  instead of updating the registers every time, which increases overhead

Signed-off-by: Vimal Kumar <[email protected]>
…1 updates

Bumps the go-dependencies group with 8 updates in the / directory:

| Package | From | To |
| --- | --- | --- |
| [github.com/beevik/etree](https://github.com/beevik/etree) | `1.4.0` | `1.4.1` |
| [github.com/cilium/ebpf](https://github.com/cilium/ebpf) | `0.15.0` | `0.16.0` |
| [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) | `2.19.1` | `2.20.0` |
| [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) | `1.19.1` | `1.20.0` |
| [github.com/prometheus/prometheus](https://github.com/prometheus/prometheus) | `0.53.1` | `0.54.0` |
| [golang.org/x/time](https://github.com/golang/time) | `0.5.0` | `0.6.0` |
| [k8s.io/api](https://github.com/kubernetes/api) | `0.29.7` | `0.29.8` |
| [k8s.io/client-go](https://github.com/kubernetes/client-go) | `0.29.7` | `0.29.8` |



Updates `github.com/beevik/etree` from 1.4.0 to 1.4.1
- [Release notes](https://github.com/beevik/etree/releases)
- [Changelog](https://github.com/beevik/etree/blob/main/RELEASE_NOTES.md)
- [Commits](beevik/etree@v1.4.0...v1.4.1)

Updates `github.com/cilium/ebpf` from 0.15.0 to 0.16.0
- [Release notes](https://github.com/cilium/ebpf/releases)
- [Commits](cilium/ebpf@v0.15.0...v0.16.0)

Updates `github.com/onsi/ginkgo/v2` from 2.19.1 to 2.20.0
- [Release notes](https://github.com/onsi/ginkgo/releases)
- [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md)
- [Commits](onsi/ginkgo@v2.19.1...v2.20.0)

Updates `github.com/prometheus/client_golang` from 1.19.1 to 1.20.0
- [Release notes](https://github.com/prometheus/client_golang/releases)
- [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md)
- [Commits](prometheus/client_golang@v1.19.1...v1.20.0)

Updates `github.com/prometheus/prometheus` from 0.53.1 to 0.54.0
- [Release notes](https://github.com/prometheus/prometheus/releases)
- [Changelog](https://github.com/prometheus/prometheus/blob/main/CHANGELOG.md)
- [Commits](prometheus/prometheus@v0.53.1...v0.54.0)

Updates `golang.org/x/sys` from 0.22.0 to 0.23.0
- [Commits](golang/sys@v0.22.0...v0.23.0)

Updates `golang.org/x/time` from 0.5.0 to 0.6.0
- [Commits](golang/time@v0.5.0...v0.6.0)

Updates `k8s.io/api` from 0.29.7 to 0.29.8
- [Commits](kubernetes/api@v0.29.7...v0.29.8)

Updates `k8s.io/apimachinery` from 0.29.7 to 0.29.8
- [Commits](kubernetes/apimachinery@v0.29.7...v0.29.8)

Updates `k8s.io/client-go` from 0.29.7 to 0.29.8
- [Changelog](https://github.com/kubernetes/client-go/blob/master/CHANGELOG.md)
- [Commits](kubernetes/client-go@v0.29.7...v0.29.8)

Updates `k8s.io/klog/v2` from 2.120.1 to 2.130.1
- [Release notes](https://github.com/kubernetes/klog/releases)
- [Changelog](https://github.com/kubernetes/klog/blob/main/RELEASE.md)
- [Commits](kubernetes/klog@v2.120.1...v2.130.1)

---
updated-dependencies:
- dependency-name: github.com/beevik/etree
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: go-dependencies
- dependency-name: github.com/cilium/ebpf
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: github.com/onsi/ginkgo/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: github.com/prometheus/client_golang
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: github.com/prometheus/prometheus
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: golang.org/x/sys
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: golang.org/x/time
  dependency-type: indirect
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
- dependency-name: k8s.io/api
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: go-dependencies
- dependency-name: k8s.io/apimachinery
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: go-dependencies
- dependency-name: k8s.io/client-go
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: go-dependencies
- dependency-name: k8s.io/klog/v2
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
...

Signed-off-by: dependabot[bot] <[email protected]>
…puting-io/dependabot/go_modules/go-dependencies-bb1f50d887

build(deps): bump the go-dependencies group across 1 directory with 11 updates
…updates

Bumps the github-actions group with 5 updates in the / directory:

| Package | From | To |
| --- | --- | --- |
| [actions/checkout](https://github.com/actions/checkout) | `3` | `4` |
| [anchore/sbom-action](https://github.com/anchore/sbom-action) | `0.16.1` | `0.17.1` |
| [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4.3.4` | `4.3.6` |
| [actions/setup-python](https://github.com/actions/setup-python) | `3` | `5` |
| [ossf/scorecard-action](https://github.com/ossf/scorecard-action) | `2.3.3` | `2.4.0` |



Updates `actions/checkout` from 3 to 4
- [Release notes](https://github.com/actions/checkout/releases)
- [Commits](actions/checkout@v3...v4)

Updates `anchore/sbom-action` from 0.16.1 to 0.17.1
- [Release notes](https://github.com/anchore/sbom-action/releases)
- [Commits](anchore/sbom-action@v0.16.1...v0.17.1)

Updates `actions/upload-artifact` from 4.3.4 to 4.3.6
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v4.3.4...v4.3.6)

Updates `actions/setup-python` from 3 to 5
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@v3...v5)

Updates `ossf/scorecard-action` from 2.3.3 to 2.4.0
- [Release notes](https://github.com/ossf/scorecard-action/releases)
- [Changelog](https://github.com/ossf/scorecard-action/blob/main/RELEASE.md)
- [Commits](ossf/scorecard-action@dc50aa9...62b2cac)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: anchore/sbom-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: github-actions
- dependency-name: actions/upload-artifact
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: github-actions
- dependency-name: actions/setup-python
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: ossf/scorecard-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: github-actions
...

Signed-off-by: dependabot[bot] <[email protected]>
…puting-io/dependabot/github_actions/github-actions-5a7b011f50

build(deps): bump the github-actions group across 1 directory with 5 updates
…server-patch-1

feat: add model_name attribute to ComponentModelWeights
sthaha and others added 12 commits November 16, 2024 08:14
…ore-prom-v

chore(compose): use fixed version of prometheus
Run pre-commit fixes

Signed-off-by: Kaiyi <[email protected]>
…computing-io#1850)

This commit now allows the default hard-coded config directory to be passed
as an argument. This allows different configurations to be stored
separately and switched between quickly (especially during development).

The commit also
* simplifies global config initialization by ensuring it is initialised
  at the time kepler's main function is executed, failing with an error if
  that step fails.

* cleans up use of the config object to read CGroup info by creating
  a `realSystem` struct that handles this functionality.

Signed-off-by: Sunil Thaha <[email protected]>
…d-process-exporter

feat(process-exporter): Add process-exporter to dev and metal
…updates (sustainable-computing-io#1853)

Bumps the github-actions group with 2 updates in the / directory: [anchore/sbom-action](https://github.com/anchore/sbom-action) and [codecov/codecov-action](https://github.com/codecov/codecov-action).


Updates `anchore/sbom-action` from 0.17.6 to 0.17.7
- [Release notes](https://github.com/anchore/sbom-action/releases)
- [Changelog](https://github.com/anchore/sbom-action/blob/main/RELEASE.md)
- [Commits](anchore/sbom-action@v0.17.6...v0.17.7)

Updates `codecov/codecov-action` from 4.6.0 to 5.0.2
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@v4.6.0...v5.0.2)

---
updated-dependencies:
- dependency-name: anchore/sbom-action
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: github-actions
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…tainable-computing-io#1855)

* [fix]: update ginkgo version in CI process keep same with go mod

Signed-off-by: Sam Yuan <[email protected]>

* [fix]: Test Panicked with nil pointer

Signed-off-by: Sam Yuan <[email protected]>

---------

Signed-off-by: Sam Yuan <[email protected]>
* [fix]: base image elfutils version issue

Signed-off-by: Sam Yuan <[email protected]>

* [fix]: missing base image check in CI

Signed-off-by: Sam Yuan <[email protected]>

* [fix]: libbpf version upgrade in base image

Signed-off-by: Sam Yuan <[email protected]>

* [fix]: update libbpf version

Signed-off-by: Sam Yuan <[email protected]>

---------

Signed-off-by: Sam Yuan <[email protected]>
Run pre-commit autoupdate and add a monthly workflow
to update hooks.

Signed-off-by: Maryam Tahhan <[email protected]>
…etal Metrics

This commit includes validation setups for Process and Container RAPL related power metrics
(node, process, container).

Signed-off-by: Kaiyi <[email protected]>
@KaiyiLiu1234 KaiyiLiu1234 marked this pull request as draft December 3, 2024 00:46
Contributor

github-actions bot commented Dec 3, 2024

🤖 SeineSailor

Here's a concise summary of the pull request changes:

Summary: This pull request introduces a baremetal validator for process and container RAPL-related power metrics, adding a new Bash script and several Python classes for stress testing. Key modifications include:

  • New baremetal_validator script and Python classes (ProcessOutput, ContainerOutput, StresserError, Local, Process, and Container) for process and container stress testing.
  • Introduction of new named tuples (BMValidator, Local, LocalProcess, LocalContainer, LocalPrometheus) for configuration data.
  • Changes to the validate_acpi command to accept a duration argument and addition of a new regression command to the bm_validator group.
  • Updates to the bload function to parse and initialize new fields from the YAML configuration file, with an updated configuration file format for baremetal validations.

Impact: These changes may affect the external interface and behavior of the code, particularly in the areas of stress testing and configuration data handling.

Observations/Suggestions:

  • It would be beneficial to include unit tests for the new Python classes and Bash script to ensure their correctness and robustness.
  • Consider adding documentation for the new regression command and its usage.
  • The updated configuration file format may require changes to existing integrations or scripts that rely on the old format; ensure that these changes are properly communicated and addressed.

Overall, this pull request introduces significant new functionality for baremetal validations, but it's essential to thoroughly review and test these changes to ensure they do not introduce unintended consequences or break existing functionality.

Added bload to load configuration classes for process, container,
and node level baremetal validation.

Signed-off-by: Kaiyi <[email protected]>
Collaborator

Couldn't we modify the existing stressor script instead of adding new ones?

Collaborator Author

I did not want to modify the other scripts because doing so might break the metal CI. I think this script is the best candidate to replace the other stressor scripts, since it supports a custom load curve. A future PR should consolidate the validator onto a single go-to stressor script.

node = Local(
    load_curve=node_config.get("load_curve", default_config["load_curve"]),
    iterations=node_config.get("iterations", default_config["iterations"]),
    mount_dir=os.path.expanduser(node_config.get("mount_dir", default_config["mount_dir"]))
Collaborator

@vprashar2929 vprashar2929 Dec 5, 2024

Why do we need to mount the directory? Also how are you planning to deploy Kepler on BM?

Collaborator Author

Currently, Kepler is deployed with Docker Compose for simplicity. However, this should work the same way in a Kubernetes environment. The mount directory exists specifically to retrieve the stressor script's start and end times, written as a log file, from the container. When not running in a container, the log file is saved in the mount directory.
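
For illustration, reading the start and end times back from the mounted log file might look like the sketch below. It assumes lines of the form `Stress Start Time: <epoch seconds>`, as written by the stressor script excerpt quoted later in this review, plus a matching `Stress End Time:` line. The end-line format and the name `parse_stress_interval` are assumptions, not taken from the PR.

```python
import re

# Assumed log line format: "Stress Start Time: 1733200000" / "Stress End Time: 1733200220".
# Only the start line is confirmed by the script excerpt; the end line is an assumption.
_TIME_LINE = re.compile(r"^Stress (Start|End) Time: (\d+)$")


def parse_stress_interval(log_text):
    """Return (start, end) epoch seconds found in the stressor log text."""
    times = {}
    for line in log_text.splitlines():
        match = _TIME_LINE.match(line.strip())
        if match:
            times[match.group(1).lower()] = int(match.group(2))
    if "start" not in times or "end" not in times:
        raise ValueError("log is missing a start or end timestamp")
    return times["start"], times["end"]


sample_log = "Stress Start Time: 1733200000\nStress End Time: 1733200220\n"
start, end = parse_stress_interval(sample_log)
```

The same function would apply whether the log was written inside a container and read through the mount, or written directly to the mount directory by a local process.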

Collaborator

Shouldn't we target a single environment for now and expand later? We currently use Docker Compose for development and validation. Let's stick with that for now and add Kubernetes support afterwards.
Wdyt?

process_config = {}
process = LocalProcess(
    isolated_cpu=process_config.get("isolated_cpu", default_config["isolated_cpu"]),
    load_curve=process_config.get("load_curve", default_config["load_curve"]),
Collaborator

What is the difference in the load curve between the process and the container?

Collaborator Author

No difference. The logic is the same.

Collaborator

Then why do we require separate configs for both?
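
One way to read the pattern in the quoted excerpt is that each section overrides only what differs and otherwise falls back to the shared defaults. The sketch below illustrates that fallback. `StressTarget` and `build_target` are hypothetical names, and the default values are placeholders rather than the PR's actual configuration.

```python
from collections import namedtuple

# Hypothetical container for per-target settings (not the PR's actual named tuple).
StressTarget = namedtuple("StressTarget", ["isolated_cpu", "load_curve"])


def build_target(section, defaults):
    """Build a target from an optional config section, falling back to defaults."""
    section = section or {}
    return StressTarget(
        isolated_cpu=section.get("isolated_cpu", defaults["isolated_cpu"]),
        load_curve=section.get("load_curve", defaults["load_curve"]),
    )


# Placeholder defaults for the demonstration.
defaults = {"isolated_cpu": "15", "load_curve": "0:15,100:30,0:15"}

# An empty section inherits everything; a partial section overrides one field.
process = build_target({}, defaults)
container = build_target({"load_curve": "50:60"}, defaults)
```

With this shape, separate process and container sections are only needed when one of them should deviate from the defaults.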

> "$TIME_INTERVAL_LOG"

start_time=$(date +%s)
echo "Stress Start Time: $start_time" >> "$TIME_INTERVAL_LOG"
Collaborator

AFAIK, we do something similar in our existing validator source by checking the start and end times of the stressor.sh script. I think we can do the same here instead of computing the start and end times in the bash script itself.

Collaborator Author

This is necessary primarily for container validation, and for pods if we attempt them in the future (pods should still fall under containers anyway). Container setup requires installing an image (in this case, Fedora and stress-ng need to be installed). If we take the start and end times from outside the script, they will include the installation time, which will skew the results. The easiest solution I could think of was to compute the times in the bash script itself so that the installation time is excluded.
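
The effect described here can be sketched in a few lines. The example uses a fake clock so it runs instantly: an externally measured window includes setup time, while a window recorded right around the stress step does not. The function name `timed_stress_run` and the 100 s and 30 s figures are illustrative assumptions only.

```python
import time


def timed_stress_run(setup, stress, clock=time.time):
    """Run setup then stress; report the window with and without setup time."""
    outer_start = clock()   # what an external caller would record as the start
    setup()                 # e.g. image pull and package installation
    inner_start = clock()   # what the script itself records as the start
    stress()
    end = clock()
    return {
        "outer_seconds": end - outer_start,
        "inner_seconds": end - inner_start,
    }


# Deterministic demonstration: the fake clock makes setup "take" 100 s
# and the stress step "take" 30 s.
fake_times = iter([0, 100, 130])
result = timed_stress_run(lambda: None, lambda: None, clock=lambda: next(fake_times))
```

In this made-up case the outer window is more than four times longer than the stress phase, so any power average computed over it would be diluted by the installation period.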

KaiyiLiu1234 and others added 3 commits December 9, 2024 22:11
Added process, container, node/local validation to the cli.

Signed-off-by: Kaiyi <[email protected]>
Added relevant config files for baremetal validation including
formatted prom validation metrics and baremetal configuration.

Signed-off-by: Kaiyi <[email protected]>
Added most important component rapl metrics to validate on bm

Signed-off-by: Kaiyi Liu <[email protected]>
@KaiyiLiu1234 KaiyiLiu1234 marked this pull request as ready for review December 16, 2024 04:04

config: # default settings
  isolated_cpu: "15" # Logical processor that is fully isolated from scheduler (ex. isolcpus)
  load_curve: "0:15,10:20,25:20,50:20,75:20,100:30,75:20,50:20,25:20,10:20,0:15"
Collaborator

Why can't we add a new load curve to the existing stressor instead of making it part of the config? It overcomplicates the config. How is this different from what the default load curve gives us?
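
For context on what the config string encodes, here is a minimal parsing sketch. It assumes each comma-separated `load:duration` pair is a CPU load percentage followed by a duration in seconds; that interpretation is not confirmed anywhere in this thread, and `parse_load_curve` is a hypothetical helper.

```python
def parse_load_curve(curve):
    """Parse "load:duration,..." into a list of (load_percent, seconds) tuples.

    Assumes load is a percentage in 0..100 and duration is in seconds.
    """
    steps = []
    for item in curve.split(","):
        load_text, duration_text = item.strip().split(":")
        load, duration = int(load_text), int(duration_text)
        if not 0 <= load <= 100:
            raise ValueError(f"load out of range: {load}")
        if duration <= 0:
            raise ValueError(f"duration must be positive: {duration}")
        steps.append((load, duration))
    return steps


# The default curve from the config excerpt quoted in this thread.
DEFAULT_CURVE = "0:15,10:20,25:20,50:20,75:20,100:30,75:20,50:20,25:20,10:20,0:15"
steps = parse_load_curve(DEFAULT_CURVE)
total_seconds = sum(duration for _, duration in steps)
```

Under that reading, the default curve ramps up to 100% and back down in eleven steps.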

@KaiyiLiu1234
Collaborator Author

@vprashar2929 Since we are now refocusing on baremetal validation, I will open new PRs that break this problem into smaller pieces for you to review. Once that is done, I will close this PR.

@vprashar2929
Collaborator

@KaiyiLiu1234 Can you take a look at this? I believe it tries to achieve roughly what we want from BM validations. Please give it a try as well.

@KaiyiLiu1234
Collaborator Author

Sure, I can take a look. We should probably organize tasks and merge the work between our PRs (I have no issue with merging your work directly and then adding what I did here for container validation on top of it). In the meantime, I am going to see if I can get some results from the metal CI as soon as possible.
