Conversation

@KaiyiLiu1234
Collaborator

Introduces an enhancement proposal for adding machine learning models to estimate Kepler power metrics in a virtual machine environment when hardware power measurement interfaces like RAPL are not available.

Signed-off-by: Kaiyi Liu <[email protected]>
@github-actions github-actions bot added the docs Documentation changes label Aug 25, 2025
@github-actions
Contributor

🔆🔆🔆 Validating 🔆🔆🔆
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📊 Profiling reports are ready to be viewed

⚠️ Variability in pprof CPU and Memory profiles
When comparing pprof profiles of Kepler versions, expect variability in CPU and memory. Focus only on significant, consistent differences.

💻 CPU Comparison with base Kepler
File: kepler
Type: cpu
Time: 2025-08-25 11:29:51 UTC
Duration: 120s, Total samples = 450ms ( 0.37%)
Active filters:
   show=github.com/sustainable-computing-io
Showing nodes accounting for -70ms, 15.56% of 450ms total
      flat  flat%   sum%        cum   cum%
         0     0%     0%      -40ms  8.89%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).scheduleNextCollection.func1
     -30ms  6.67%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).collectProcessMetrics
         0     0%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).calculatePower
         0     0%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).refreshSnapshot
         0     0%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh
         0     0%  6.67%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh.func1
     -10ms  2.22%  8.89%      -30ms  6.67%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).Refresh
         0     0%  8.89%      -20ms  4.44%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).Collect
         0     0%  8.89%      -20ms  4.44%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).refreshProcesses
         0     0%  8.89%       10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).Snapshot
         0     0%  8.89%       10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).ensureFreshData
     -10ms  2.22% 11.11%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/monitor.(*TerminatedResourceTracker[go.shape.*uint8]).Add
      10ms  2.22%  8.89%       10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/monitor.newProcess (inline)
     -10ms  2.22% 11.11%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.(*procFSReader).AllProcs
     -10ms  2.22% 13.33%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.(*procWrapper).CPUTime
     -10ms  2.22% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.(*procWrapper).Cgroups
         0     0% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).updateProcessCache
         0     0% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.computeTypeInfoFromProc.func1
         0     0% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.containerInfoFromProc
         0     0% 15.56%      -10ms  2.22%  github.com/sustainable-computing-io/kepler/internal/resource.populateProcessFields
💾 Memory Comparison with base Kepler (Inuse)
File: kepler
Type: inuse_space
Time: 2025-08-25 11:31:51 UTC
Duration: 120.01s, Total samples = 4780.43kB 
Active filters:
   show=github.com/sustainable-computing-io
Showing nodes accounting for -1027.99kB, 21.50% of 4780.43kB total
      flat  flat%   sum%        cum   cum%
         0     0%     0% -1027.99kB 21.50%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).Collect
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).Snapshot
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).calculatePower
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).ensureFreshData
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).refreshSnapshot
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh
         0     0%     0%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh.func1
 -516.01kB 10.79% 10.79%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/resource.(*procFSReader).AllProcs
         0     0% 10.79%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).Refresh
         0     0% 10.79%  -516.01kB 10.79%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).refreshProcesses
 -511.98kB 10.71% 21.50%  -511.98kB 10.71%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).collectProcessMetrics
💾 Memory Comparison with base Kepler (Alloc)
File: kepler
Type: alloc_space
Time: 2025-08-25 11:31:51 UTC
Duration: 120.01s, Total samples = 37784.13kB 
Active filters:
   show=github.com/sustainable-computing-io
Showing nodes accounting for -7187.87kB, 19.02% of 37784.13kB total
Dropped 2 nodes (cum <= 188.92kB)
      flat  flat%   sum%        cum   cum%
         0     0%     0% -3588.92kB  9.50%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).Refresh
         0     0%     0% -3588.92kB  9.50%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).refreshProcesses
         0     0%     0% -3581.06kB  9.48%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).calculatePower
         0     0%     0% -3581.06kB  9.48%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).refreshSnapshot
         0     0%     0% -3581.06kB  9.48%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh
         0     0%     0% -3581.06kB  9.48%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).synchronizedPowerRefresh.func1
         0     0%     0% -3069.68kB  8.12%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).Snapshot
-2571.04kB  6.80%  6.80% -2571.04kB  6.80%  github.com/sustainable-computing-io/kepler/internal/resource.(*procFSReader).CPUUsageRatio
         0     0%  6.80% -2571.04kB  6.80%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).Refresh.func3
         0     0%  6.80% -2571.04kB  6.80%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).refreshNode
-2560.90kB  6.78% 13.58% -2560.90kB  6.78%  github.com/sustainable-computing-io/kepler/internal/resource.(*procWrapper).CPUTime
         0     0% 13.58% -2560.90kB  6.78%  github.com/sustainable-computing-io/kepler/internal/resource.(*resourceInformer).updateProcessCache
         0     0% 13.58% -2560.90kB  6.78%  github.com/sustainable-computing-io/kepler/internal/resource.populateProcessFields
         0     0% 13.58% -2051.55kB  5.43%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).scheduleNextCollection.func1
         0     0% 13.58% -2045.54kB  5.41%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).Collect
 -516.01kB  1.37% 14.95% -1540.17kB  4.08%  github.com/sustainable-computing-io/kepler/internal/monitor.(*Snapshot).Clone
         0     0% 14.95% -1529.51kB  4.05%  github.com/sustainable-computing-io/kepler/internal/monitor.(*PowerMonitor).ensureFreshData
-1028.02kB  2.72% 17.67% -1028.02kB  2.72%  github.com/sustainable-computing-io/kepler/internal/resource.(*procFSReader).AllProcs
         0     0% 17.67%  1026.38kB  2.72%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*cpuInfoCollector).Collect
 1026.38kB  2.72% 14.95%  1026.38kB  2.72%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*realProcFS).CPUInfo
 -512.02kB  1.36% 16.31% -1024.16kB  2.71%  github.com/sustainable-computing-io/kepler/internal/monitor.(*Process).Clone (inline)
-1024.16kB  2.71% 19.02% -1024.16kB  2.71%  github.com/sustainable-computing-io/kepler/internal/monitor.newProcess (inline)
 1024.06kB  2.71% 16.31%  1024.06kB  2.71%  github.com/sustainable-computing-io/kepler/internal/exporter/prometheus/collector.(*PowerCollector).collectNodeMetrics
         0     0% 16.31%     -514kB  1.36%  github.com/sustainable-computing-io/kepler/internal/resource.computeTypeInfoFromProc.func1
    -514kB  1.36% 17.67%     -514kB  1.36%  github.com/sustainable-computing-io/kepler/internal/resource.containerInfoFromCgroupPaths
         0     0% 17.67%     -514kB  1.36%  github.com/sustainable-computing-io/kepler/internal/resource.containerInfoFromProc
 -512.14kB  1.36% 19.02%  -512.14kB  1.36%  maps.Copy[go.shape.map[github.com/sustainable-computing-io/kepler/internal/device.EnergyZone]github.com/sustainable-computing-io/kepler/internal/monitor.Usage,go.shape.map[github.com/sustainable-computing-io/kepler/internal/device.EnergyZone]github.com/sustainable-computing-io/kepler/internal/monitor.Usage,go.shape.interface { Energy ; Index int; MaxEnergy github.com/sustainable-computing-io/kepler/internal/device.Energy; Name string; Path string },go.shape.struct { EnergyTotal github.com/sustainable-computing-io/kepler/internal/device.Energy; Power github.com/sustainable-computing-io/kepler/internal/device.Power }] (inline)

⬇️ Download the Profiling artifacts from the Actions Summary page

📦 Artifact name: profile-artifacts-2291

🔧 Or use GitHub CLI to download artifacts:

gh run download 17207440019 -n profile-artifacts-2291


## Problem Statement

Virtual machines lack direct access to hardware power measurement interfaces (RAPL, IPMI, etc.) that are essential for energy monitoring in cloud and virtualized environments. Current Kepler deployments in VMs cannot provide accurate power consumption estimates because they cannot access the underlying hardware power consumption data. This creates a significant gap in energy monitoring capabilities for the growing virtualized infrastructure landscape.
Collaborator

Can we mention that the PMU is usually disabled for VMs by providers, so VMs do not show any performance counters either?

- **Primary Goal**: Develop zone-specific machine learning models (package, core, DRAM, uncore) for CPU power estimation in VMs
- **Secondary Goal**: Create a production-ready deployment system for VM power models in Go environments
- **Tertiary Goal**: Establish best practices for VM power modeling including CPU pinning and isolation requirements
- **Performance Goal**: Achieve <10% mean absolute percentage error compared to baremetal measurements
Collaborator

Prefer RMS error (RMSE) over MAPE.
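To illustrate why, a minimal sketch with made-up power numbers: MAPE blows up on near-idle samples where the true power is close to zero, while RMSE stays in watts and is not distorted by them.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error; unstable when actual values approach zero."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual) * 100

def rmse(actual, predicted):
    """Root mean squared error; reported in watts, stable even for near-idle samples."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Near-idle package power: every sample is off by 1 W, yet the 0.5 W sample
# alone contributes a 200% term to MAPE, while RMSE stays at 1 W.
actual = [0.5, 40.0, 80.0]
predicted = [1.5, 39.0, 81.0]
print(f"MAPE: {mape(actual, predicted):.1f}%")
print(f"RMSE: {rmse(actual, predicted):.2f} W")
```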

### Functional Requirements

- **FR1**: Train separate ML models for each power zone (package, core, DRAM, uncore)
- **FR2**: Use only VM-accessible OS and memory counters as input features
Collaborator

The code seems to show it, but please add a section listing the features used for model training.
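Until that section lands, a purely hypothetical sketch of the shape such features might take (the counter names `cpu_time_ms`, `ctx_switches`, and `pgfaults` are placeholders, not the proposal's actual feature list): VM-accessible OS counters are cumulative, so the model would consume per-interval rates derived from snapshot deltas rather than raw totals.

```python
def counters_to_features(prev: dict, curr: dict, interval_s: float) -> dict:
    """Turn two snapshots of cumulative counters into per-second rate features."""
    return {f"{k}_rate": (curr[k] - prev[k]) / interval_s for k in curr}

# Two hypothetical snapshots taken 2 seconds apart.
prev = {"cpu_time_ms": 1_000, "ctx_switches": 50_000, "pgfaults": 2_000}
curr = {"cpu_time_ms": 1_800, "ctx_switches": 56_000, "pgfaults": 2_300}
features = counters_to_features(prev, curr, interval_s=2.0)
# e.g. cpu_time_ms_rate == 400.0 ms of CPU time per second of wall clock
```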

```
kepler_vm_last_training_timestamp{zone="package"} 1692984532
```

## Implementation Plan
Collaborator

Typically, enhancement proposals do not contain an implementation plan.

Collaborator

Please add some detail about:

- Model Architecture, mentioning model type (regression, neural network, xgboost, etc.) with the list of input features and the feature engineering approach
- Training Data Requirements, mentioning the training data source, any preprocessing on the data, and training data size requirements
- How to detect/prevent model overfitting
- Hyperparameters

Is there any dependence on the number of vCPUs? If not, why not; if yes, what does this mean for model training/selection?
Consider a typical case of a 128-CPU baremetal machine running multiple VMs, some with 4 vCPUs, some with 8 vCPUs.


> Model Architecture, mentioning model type (regression, neural network, xgboost etc.) with list of input features and feature engineering approach

I think this would be highly data dependent. Dimensionality is a concern of course, but the linearity or non-linearity of the data is also important.

If the data is highly linear, then a neural network or a tree-boosting approach like GBDT or GBM (xgboost wraps these) is overkill for our needs, and some regression model would be sufficient. If the data contains minor non-linearities, then an NN or XGBoost may still be overkill, because an SVM with an RBF kernel could get you there and be more efficient (old school, I know, but there is no school like the old school). If it is highly non-linear, then an NN or XGBoost is totally sensible. Also, if explainability is important, then NNs would struggle there.

The other three questions all relate to the first one and to the data, so the short version is: it depends, and it may be too early to say with any confidence (especially if we don't have the EDA done).
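As a cheap first check before reaching for heavier models, a plain least-squares fit of utilization vs. power with a look at R² tells us whether the data is already essentially linear. A minimal sketch with made-up numbers (real EDA would use the actual training data, cross-validated):

```python
def fit_linear(xs, ys):
    """Ordinary least squares for a single feature: y ≈ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def r_squared(xs, ys, a, b):
    """Fraction of variance explained by the linear fit."""
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - sum(ys) / len(ys)) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# CPU utilisation vs package power (fabricated, roughly linear data):
# if R² is already near 1, SVM-RBF / GBDT / NN buy us very little.
util = [0.1, 0.3, 0.5, 0.7, 0.9]
power = [12.0, 21.0, 30.5, 39.5, 49.0]
a, b = fit_linear(util, power)
print(f"slope = {a:.2f} W per unit utilisation, R^2 = {r_squared(util, power, a, b):.4f}")
```

If residuals of a fit like this show structure (e.g. curvature at high utilisation), that is the signal to step up to the non-linear models discussed above.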
