feat(baremetal_validator): Add Validator for Process, Container Metrics #1878
…r-refactor chore(validator): store test result in an object
…puting-io/dependabot/go_modules/go-dependencies-258b52e974 build(deps): bump the go-dependencies group with 2 updates
Signed-off-by: Maryam Tahhan <[email protected]>
Add a workflow that runs the pre-commit jobs to cover the case where a developer hasn't opted into pre-commit. Also add the conventional commit message hook and enable more golang linters. The commit also tidies up the markdown files that had line length issues. Signed-off-by: Maryam Tahhan <[email protected]>
Signed-off-by: Huamin Chen <[email protected]>
…ing js Signed-off-by: Parul Singh <[email protected]>
check metrics before and after e2e tests
…uting-io#1671)" This reverts commit c99e399. Signed-off-by: Dave Tucker <[email protected]>
This reverts commit cde7833. Signed-off-by: Dave Tucker <[email protected]>
This reverts commit ca4cdf6. Signed-off-by: Dave Tucker <[email protected]>
Signed-off-by: Dave Tucker <[email protected]>
…ert-ringbuf Revert "Merge pull request sustainable-computing-io#1628 from dave-tucker/new-ebpf"
Signed-off-by: Sunil Thaha <[email protected]>
…dator-node-info chore(validator): fix missing node-info in md
Signed-off-by: Maryam Tahhan <[email protected]>
Signed-off-by: Anthony Harivel <[email protected]>
By default, the name used as the Virtual Machine ID is the one found in /proc/<pid>/cgroup/. Many IaaS platforms, such as OpenStack, use libvirt metadata to customise, among other things, the instance's ID. Allow the user to use the libvirt metadata to set the name of the VM metrics, which gives a better user experience. Set "LIBVIRT_METADATA_URI" to the URI of the XML namespace identifier. Set "LIBVIRT_METADATA_TOKEN" to choose the string used for the VM ID. By default, the token "name" is used. Signed-off-by: Anthony Harivel <[email protected]>
…t-metadata Libvirt metadata vm_id usage
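The lookup described in the libvirt metadata commit can be illustrated with a minimal Python sketch. This is not Kepler's implementation (Kepler is written in Go); the function name `vm_id_from_metadata`, the sample domain XML, and the `http://example.org/xmlns/nova` namespace are assumptions made for illustration. Only the two settings, a namespace URI and a token that defaults to "name", come from the commit message.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def vm_id_from_metadata(domain_xml: str, metadata_uri: str, token: str = "name") -> Optional[str]:
    """Return the text of the first <token> element in the given XML
    namespace under <metadata>, or None when nothing matches.

    Hypothetical helper: it mirrors the idea of LIBVIRT_METADATA_URI and
    LIBVIRT_METADATA_TOKEN, not Kepler's actual code.
    """
    root = ET.fromstring(domain_xml)
    metadata = root.find("metadata")
    if metadata is None:
        return None
    # ElementTree addresses namespaced tags as "{namespace-uri}localname".
    for elem in metadata.iter(f"{{{metadata_uri}}}{token}"):
        if elem.text and elem.text.strip():
            return elem.text.strip()
    return None


# Illustrative domain XML; the namespace URI below is a placeholder.
SAMPLE_DOMAIN_XML = """<domain type="kvm">
  <name>instance-0000002a</name>
  <metadata>
    <nova:instance xmlns:nova="http://example.org/xmlns/nova">
      <nova:name>web-frontend</nova:name>
    </nova:instance>
  </metadata>
</domain>"""

vm_id = vm_id_from_metadata(SAMPLE_DOMAIN_XML, "http://example.org/xmlns/nova")
missing = vm_id_from_metadata(SAMPLE_DOMAIN_XML, "http://example.org/xmlns/other")
```

In this sketch a missing namespace or token yields `None`, at which point a caller could fall back to the cgroup-derived name.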
Signed-off-by: Maryam Tahhan <[email protected]>
…move-model-globals chore: cleanup globals in pkg/model
Signed-off-by: Vimal Kumar <[email protected]>
…gister-new-process fix(bpf): use prev_tgid to register process
Update the register counters one step before the metrics sample is taken, instead of updating the registers every time, which increases overhead. Signed-off-by: Vimal Kumar <[email protected]>
Fixes sustainable-computing-io#1659 Signed-off-by: Maryam Tahhan <[email protected]>
…1 updates Bumps the go-dependencies group with 8 updates in the / directory: | Package | From | To | | --- | --- | --- | | [github.com/beevik/etree](https://github.com/beevik/etree) | `1.4.0` | `1.4.1` | | [github.com/cilium/ebpf](https://github.com/cilium/ebpf) | `0.15.0` | `0.16.0` | | [github.com/onsi/ginkgo/v2](https://github.com/onsi/ginkgo) | `2.19.1` | `2.20.0` | | [github.com/prometheus/client_golang](https://github.com/prometheus/client_golang) | `1.19.1` | `1.20.0` | | [github.com/prometheus/prometheus](https://github.com/prometheus/prometheus) | `0.53.1` | `0.54.0` | | [golang.org/x/time](https://github.com/golang/time) | `0.5.0` | `0.6.0` | | [k8s.io/api](https://github.com/kubernetes/api) | `0.29.7` | `0.29.8` | | [k8s.io/client-go](https://github.com/kubernetes/client-go) | `0.29.7` | `0.29.8` | Updates `github.com/beevik/etree` from 1.4.0 to 1.4.1 - [Release notes](https://github.com/beevik/etree/releases) - [Changelog](https://github.com/beevik/etree/blob/main/RELEASE_NOTES.md) - [Commits](beevik/etree@v1.4.0...v1.4.1) Updates `github.com/cilium/ebpf` from 0.15.0 to 0.16.0 - [Release notes](https://github.com/cilium/ebpf/releases) - [Commits](cilium/ebpf@v0.15.0...v0.16.0) Updates `github.com/onsi/ginkgo/v2` from 2.19.1 to 2.20.0 - [Release notes](https://github.com/onsi/ginkgo/releases) - [Changelog](https://github.com/onsi/ginkgo/blob/master/CHANGELOG.md) - [Commits](onsi/ginkgo@v2.19.1...v2.20.0) Updates `github.com/prometheus/client_golang` from 1.19.1 to 1.20.0 - [Release notes](https://github.com/prometheus/client_golang/releases) - [Changelog](https://github.com/prometheus/client_golang/blob/main/CHANGELOG.md) - [Commits](prometheus/client_golang@v1.19.1...v1.20.0) Updates `github.com/prometheus/prometheus` from 0.53.1 to 0.54.0 - [Release notes](https://github.com/prometheus/prometheus/releases) - [Changelog](https://github.com/prometheus/prometheus/blob/main/CHANGELOG.md) - [Commits](prometheus/prometheus@v0.53.1...v0.54.0) Updates 
`golang.org/x/sys` from 0.22.0 to 0.23.0 - [Commits](golang/sys@v0.22.0...v0.23.0) Updates `golang.org/x/time` from 0.5.0 to 0.6.0 - [Commits](golang/time@v0.5.0...v0.6.0) Updates `k8s.io/api` from 0.29.7 to 0.29.8 - [Commits](kubernetes/api@v0.29.7...v0.29.8) Updates `k8s.io/apimachinery` from 0.29.7 to 0.29.8 - [Commits](kubernetes/apimachinery@v0.29.7...v0.29.8) Updates `k8s.io/client-go` from 0.29.7 to 0.29.8 - [Changelog](https://github.com/kubernetes/client-go/blob/master/CHANGELOG.md) - [Commits](kubernetes/client-go@v0.29.7...v0.29.8) Updates `k8s.io/klog/v2` from 2.120.1 to 2.130.1 - [Release notes](https://github.com/kubernetes/klog/releases) - [Changelog](https://github.com/kubernetes/klog/blob/main/RELEASE.md) - [Commits](kubernetes/klog@v2.120.1...v2.130.1) --- updated-dependencies: - dependency-name: github.com/beevik/etree dependency-type: direct:production update-type: version-update:semver-patch dependency-group: go-dependencies - dependency-name: github.com/cilium/ebpf dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/onsi/ginkgo/v2 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/prometheus/client_golang dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: github.com/prometheus/prometheus dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: golang.org/x/sys dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: golang.org/x/time dependency-type: indirect update-type: version-update:semver-minor dependency-group: go-dependencies - dependency-name: k8s.io/api dependency-type: direct:production update-type: version-update:semver-patch 
dependency-group: go-dependencies - dependency-name: k8s.io/apimachinery dependency-type: direct:production update-type: version-update:semver-patch dependency-group: go-dependencies - dependency-name: k8s.io/client-go dependency-type: direct:production update-type: version-update:semver-patch dependency-group: go-dependencies - dependency-name: k8s.io/klog/v2 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: go-dependencies ... Signed-off-by: dependabot[bot] <[email protected]>
…puting-io/dependabot/go_modules/go-dependencies-bb1f50d887 build(deps): bump the go-dependencies group across 1 directory with 11 updates
…updates Bumps the github-actions group with 5 updates in the / directory: | Package | From | To | | --- | --- | --- | | [actions/checkout](https://github.com/actions/checkout) | `3` | `4` | | [anchore/sbom-action](https://github.com/anchore/sbom-action) | `0.16.1` | `0.17.1` | | [actions/upload-artifact](https://github.com/actions/upload-artifact) | `4.3.4` | `4.3.6` | | [actions/setup-python](https://github.com/actions/setup-python) | `3` | `5` | | [ossf/scorecard-action](https://github.com/ossf/scorecard-action) | `2.3.3` | `2.4.0` | Updates `actions/checkout` from 3 to 4 - [Release notes](https://github.com/actions/checkout/releases) - [Commits](actions/checkout@v3...v4) Updates `anchore/sbom-action` from 0.16.1 to 0.17.1 - [Release notes](https://github.com/anchore/sbom-action/releases) - [Commits](anchore/sbom-action@v0.16.1...v0.17.1) Updates `actions/upload-artifact` from 4.3.4 to 4.3.6 - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@v4.3.4...v4.3.6) Updates `actions/setup-python` from 3 to 5 - [Release notes](https://github.com/actions/setup-python/releases) - [Commits](actions/setup-python@v3...v5) Updates `ossf/scorecard-action` from 2.3.3 to 2.4.0 - [Release notes](https://github.com/ossf/scorecard-action/releases) - [Changelog](https://github.com/ossf/scorecard-action/blob/main/RELEASE.md) - [Commits](ossf/scorecard-action@dc50aa9...62b2cac) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions - dependency-name: anchore/sbom-action dependency-type: direct:production update-type: version-update:semver-minor dependency-group: github-actions - dependency-name: actions/upload-artifact dependency-type: direct:production update-type: version-update:semver-patch dependency-group: github-actions - dependency-name: actions/setup-python dependency-type: direct:production update-type: 
version-update:semver-major dependency-group: github-actions - dependency-name: ossf/scorecard-action dependency-type: direct:production update-type: version-update:semver-minor dependency-group: github-actions ... Signed-off-by: dependabot[bot] <[email protected]>
…puting-io/dependabot/github_actions/github-actions-5a7b011f50 build(deps): bump the github-actions group across 1 directory with 5 updates
Signed-off-by: Sunyanan Choochotkaew <[email protected]>
…server-patch-1 feat: add model_name attribute to ComponentModelWeights
…ore-prom-v chore(compose): use fixed version of prometheus
Run pre-commit fixes Signed-off-by: Kaiyi <[email protected]>
…computing-io#1850) This commit allows the default hard-coded config directory to be overridden by an argument. This makes it possible to store different configurations separately and switch between them quickly (especially during development). The commit also * simplifies global config initialization by ensuring it is initialised when kepler's main function is executed, failing with an error if that step fails. * cleans up use of the config object to read CGroup info by creating a `realSystem` struct that handles this functionality. Signed-off-by: Sunil Thaha <[email protected]>
…d-process-exporter feat(process-exporter): Add process-exporter to dev and metal
…updates (sustainable-computing-io#1853) Bumps the github-actions group with 2 updates in the / directory: [anchore/sbom-action](https://github.com/anchore/sbom-action) and [codecov/codecov-action](https://github.com/codecov/codecov-action). Updates `anchore/sbom-action` from 0.17.6 to 0.17.7 - [Release notes](https://github.com/anchore/sbom-action/releases) - [Changelog](https://github.com/anchore/sbom-action/blob/main/RELEASE.md) - [Commits](anchore/sbom-action@v0.17.6...v0.17.7) Updates `codecov/codecov-action` from 4.6.0 to 5.0.2 - [Release notes](https://github.com/codecov/codecov-action/releases) - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md) - [Commits](codecov/codecov-action@v4.6.0...v5.0.2) --- updated-dependencies: - dependency-name: anchore/sbom-action dependency-type: direct:production update-type: version-update:semver-patch dependency-group: github-actions - dependency-name: codecov/codecov-action dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…tainable-computing-io#1855) * [fix]: update ginkgo version in CI process keep same with go mod Signed-off-by: Sam Yuan <[email protected]> * [fix]: Test Panicked with nil pointer Signed-off-by: Sam Yuan <[email protected]> --------- Signed-off-by: Sam Yuan <[email protected]>
Signed-off-by: Vimal Kumar <[email protected]>
fix(cpu): fixed reading cpus.yaml
* [fix]: base image elfutils version issue Signed-off-by: Sam Yuan <[email protected]> * [fix]: missing base image check in CI Signed-off-by: Sam Yuan <[email protected]> * [fix]: libbpf version upgrade in base image Signed-off-by: Sam Yuan <[email protected]> * [fix]: update libbpf version Signed-off-by: Sam Yuan <[email protected]> --------- Signed-off-by: Sam Yuan <[email protected]>
Signed-off-by: Sam Yuan <[email protected]>
Run pre-commit autoupdate and add a monthly workflow to update hooks. Signed-off-by: Maryam Tahhan <[email protected]>
…etal Metrics This commit includes validation setups for Process and Container RAPL related power metrics (node, process, container). Signed-off-by: Kaiyi <[email protected]>
🤖 SeineSailor
Here's a concise summary of the pull request changes:
Summary: This pull request introduces a baremetal validator for process and container RAPL-related power metrics, adding a new Bash script and several Python classes for stress testing. Key modifications include:
Impact: These changes may affect the external interface and behavior of the code, particularly in the areas of stress testing and configuration data handling. Observations/Suggestions:
Overall, this pull request introduces significant new functionality for baremetal validations, but it's essential to thoroughly review and test these changes to ensure they do not introduce unintended consequences or break existing functionality.
Added bload to load configuration classes for process, container, and node level baremetal validation. Signed-off-by: Kaiyi <[email protected]>
Couldn't we modify the existing stressor script instead of adding new ones?
I did not want to modify the other scripts because that might break the metal CI. I think this script is the best candidate to replace the other stressor scripts, as it enables a custom load curve. A future PR should modify the validator so that it has just one go-to stressor script.
node = Local(
    load_curve=node_config.get("load_curve", default_config["load_curve"]),
    iterations=node_config.get("iterations", default_config["iterations"]),
    mount_dir=os.path.expanduser(node_config.get("mount_dir", default_config["mount_dir"]))
Why do we need to mount the directory? Also how are you planning to deploy Kepler on BM?
Currently, Kepler is deployed with docker compose for simplicity. However, this should work the same way in a Kubernetes environment. The mount directory exists specifically to retrieve the start and end times of the stressor script, written as a log file, from the container. In a non-container setup, the log file is saved in the mount directory.
Shouldn't we target a single environment for now and expand later? We currently use docker compose for our development and validations. Let's stick with that for now and add support for k8s afterwards.
Wdyt?
process_config = {}
process = LocalProcess(
    isolated_cpu=process_config.get("isolated_cpu", default_config["isolated_cpu"]),
    load_curve=process_config.get("load_curve", default_config["load_curve"]),
What is the difference in the load curve between the process and the container?
No difference. The logic is the same.
Then why do we require separate configs for both?
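For readers following this thread, the fallback pattern in the excerpt (`section.get(key, default_config[key])`) can be sketched as a single merge step. This is a minimal illustration, not the PR's code: the helper name `resolve_settings`, the shortened load curve, and the `iterations` value are assumptions; only the key names and the `"15"` isolated CPU default appear in the diff.

```python
from typing import Any, Dict, Mapping, Optional


def resolve_settings(section: Optional[Mapping[str, Any]],
                     defaults: Mapping[str, Any]) -> Dict[str, Any]:
    """Overlay a per-target section on top of the shared defaults.

    Hypothetical helper: equivalent to calling
    section.get(key, defaults[key]) for every key in defaults.
    """
    merged = dict(defaults)
    merged.update(section or {})
    return merged


# Shared defaults; load_curve is shortened and iterations is illustrative.
default_config = {
    "isolated_cpu": "15",
    "load_curve": "0:15,100:30,0:15",
    "iterations": "1",
}

# A process section that overrides one key, and an empty container section.
process_config = {"isolated_cpu": "3"}
container_config: Dict[str, Any] = {}

process_settings = resolve_settings(process_config, default_config)
container_settings = resolve_settings(container_config, default_config)
```

With this shape, separate process and container sections are only needed when a target actually overrides a default; an empty section resolves to the shared defaults.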
> "$TIME_INTERVAL_LOG"

start_time=$(date +%s)
echo "Stress Start Time: $start_time" >> "$TIME_INTERVAL_LOG"
AFAIK we do something similar in our existing validator source by checking the start and end time of the stressor.sh script. I think we can do the same here instead of computing the start and end time in the bash script itself.
This is necessary primarily for container validation, and in future for pods if we attempt them (which should still fall under containers anyway). The container setup requires installing an image (in this case, Fedora and stress-ng need to be installed). If we measure the start and end time from outside, we include the installation time, which skews the results. The easiest solution I could think of was to compute the times in the bash script itself so that the installation time is excluded.
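On the validator side, reading those timestamps back could look like the sketch below. It is a minimal illustration under stated assumptions: the `Stress Start Time: <epoch>` line format comes from the script excerpt above, while the matching `Stress End Time:` line and the function name `parse_interval_log` are assumed, since the end of the script is not shown in this thread.

```python
import re
from typing import Dict

# "Stress Start Time: <epoch>" is taken from the script excerpt;
# the "End" variant is an assumption about the rest of the script.
LINE_RE = re.compile(r"^Stress (Start|End) Time:\s*(\d+)\s*$")


def parse_interval_log(text: str) -> Dict[str, int]:
    """Extract start and end epoch seconds from the stressor's interval log."""
    times: Dict[str, int] = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            times[match.group(1).lower()] = int(match.group(2))
    if "start" not in times or "end" not in times:
        raise ValueError("log must contain both a start and an end time")
    if times["end"] < times["start"]:
        raise ValueError("end time precedes start time")
    return times


# Illustrative log contents; in practice this would be read from the
# file written to the mounted directory.
sample_log = "Stress Start Time: 1700000000\nStress End Time: 1700000220\n"
interval = parse_interval_log(sample_log)
duration_seconds = interval["end"] - interval["start"]
```

Because the timestamps are written by the script itself, the resulting window covers only the stress run and excludes any image or package installation that happens before it.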
Added process, container, node/local validation to the cli. Signed-off-by: Kaiyi <[email protected]>
Added relevant config files for baremetal validation including formatted prom validation metrics and baremetal configuration. Signed-off-by: Kaiyi <[email protected]>
Added most important component rapl metrics to validate on bm Signed-off-by: Kaiyi Liu <[email protected]>
config: # default settings
  isolated_cpu: "15" # Logical processor that is fully isolated from scheduler (ex. isolcpus)
  load_curve: "0:15,10:20,25:20,50:20,75:20,100:30,75:20,50:20,25:20,10:20,0:15"
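The `load_curve` string can be read as a comma-separated list of `load:duration` pairs. The sketch below parses it under the assumption that each pair means "CPU load in percent : seconds to hold it"; the diff excerpt does not spell out the units, and `parse_load_curve` is a hypothetical name, not a function from the PR.

```python
from typing import List, Tuple


def parse_load_curve(curve: str) -> List[Tuple[int, int]]:
    """Parse "load:duration,load:duration,..." into (load_pct, seconds) steps.

    Assumes load is a CPU percentage (0-100) and duration is in seconds.
    """
    steps: List[Tuple[int, int]] = []
    for item in curve.split(","):
        load_text, duration_text = item.strip().split(":")
        load_pct, seconds = int(load_text), int(duration_text)
        if not 0 <= load_pct <= 100:
            raise ValueError(f"load out of range in step {item!r}")
        if seconds <= 0:
            raise ValueError(f"duration must be positive in step {item!r}")
        steps.append((load_pct, seconds))
    return steps


# The default curve from the config excerpt above.
default_curve = "0:15,10:20,25:20,50:20,75:20,100:30,75:20,50:20,25:20,10:20,0:15"
steps = parse_load_curve(default_curve)
total_seconds = sum(seconds for _, seconds in steps)
peak_load = max(load for load, _ in steps)
```

Under that reading, the default curve ramps from idle to full load and back down in 11 steps lasting 220 seconds in total.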
Why can't we add a new load curve to the existing stressor instead of making it part of the config? It overcomplicates the config. How is this different from what the default load curve gives us?
@vprashar2929 Since we are refocusing on baremetal validation, I will create new PRs that break this problem down into smaller pieces for you to review. When that is complete, I will close this PR.
@KaiyiLiu1234 Can you take a look at this? I believe this somewhat tries to achieve what we want from BM validations. Please give it a go as well.
Sure, I can take a look. We should probably organize tasks and merge the work between our PRs (I have no issue with merging your work directly and then adding what I did here on top of yours, for container validation). In the meantime, I am going to see if I can get some results ASAP from the metal CI.
This commit includes validation setups for process and container RAPL-related baremetal power metrics (node, process, container).