
Feature Request: GPU Support #833

Open
ZongqiangZhang opened this issue Oct 25, 2023 · 17 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@ZongqiangZhang

This feature request aims to enhance the Node Problem Detector with the ability to monitor GPUs on nodes and detect issues.

Currently, NPD does not have direct visibility into GPUs. However, many workloads are GPU-accelerated, which makes GPU health an important part of node health. For example, GPUs are widely used in machine learning training and inference, especially LLM training, which may use tens of thousands of GPUs. If any one GPU in the cluster goes bad, the entire training job has to be restarted from a previous checkpoint.

This feature request adds the following capabilities:

  • GPU error monitoring: NPD will collect GPU device info periodically and look for crashes or errors via the nvidia-smi/NVML/DCGM tools.

  • GPU hang detection: NPD will check GPU device info periodically to detect whether a GPU is "stuck" (e.g. the nvidia-smi command hangs).

  • TBD: GPU runtime monitoring: NPD will check for crashes or OOM issues reported in NVIDIA logs.

Specifically, this feature request includes:

  • Code for the gpu_monitor plugin (a sketch of a possible plugin configuration and check script follows this list)
  • A Dockerfile to build an NPD image with GPU support
  • Other dependencies
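
For illustration, such a check could plug into NPD's existing custom plugin monitor. Below is a minimal sketch; the source name, condition type, reason strings, and script path are placeholders rather than a settled design:

```json
{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "60s",
    "timeout": "30s",
    "max_output_length": 80,
    "concurrency": 1
  },
  "source": "gpu-custom-plugin-monitor",
  "metricsReporting": true,
  "conditions": [
    {
      "type": "GPUProblem",
      "reason": "GPUIsHealthy",
      "message": "GPU is healthy"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "GPUProblem",
      "reason": "GPUCheckFailed",
      "path": "/config/plugin/check_gpu.sh",
      "timeout": "30s"
    }
  ]
}
```

And a corresponding check script (hypothetical; NPD's custom plugin protocol uses exit code 0 for healthy, 1 for a problem, 2 for unknown):

```bash
#!/bin/bash
# Hypothetical check_gpu.sh invoked by the rule above.
# Exit codes per NPD custom plugin protocol: 0 = OK, 1 = problem, 2 = unknown.

# A wedged GPU or driver often makes nvidia-smi hang, so bound it with a timeout
# shorter than the rule's timeout; a timeout here covers the "stuck GPU" case.
if ! out=$(timeout 25 nvidia-smi --query-gpu=index,name --format=csv,noheader 2>&1); then
  echo "nvidia-smi failed or timed out: ${out}"
  exit 1
fi

echo "GPUs responding: ${out}"
exit 0
```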

Looking forward to your feedback!

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 31, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 1, 2024
@zz913922

zz913922 commented Mar 6, 2024

We are in exactly the same situation. Did you resolve this issue already?

@stmcginnis
Contributor

/remove-lifecycle stale

@stmcginnis
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Mar 25, 2024
@AllenXu93

I think NPD should support different configs for different devices or runtimes.
For example, I have both GPU and non-GPU worker nodes in one cluster, or containerd and Docker nodes in one cluster; currently we need to deploy two NPD DaemonSets for the different node types (see the sketch below).
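
For reference, the two-DaemonSet workaround mentioned above typically pins each DaemonSet to a node type with a nodeSelector. A minimal sketch, assuming GPU nodes carry an illustrative label:

```yaml
# Excerpt from a hypothetical GPU-only NPD DaemonSet; the label name is
# an assumption and depends on how GPU nodes are labeled in the cluster.
spec:
  template:
    spec:
      nodeSelector:
        example.com/gpu-node: "true"   # schedule only on GPU nodes
```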

@AllenXu93

BTW, for GPU support we don't need to install any extra dependencies; it is enough to add the env vars NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES and use the nvidia-smi command to check GPU state, as sketched below.
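
A minimal sketch of what that could look like in the NPD container spec, assuming the nodes run the NVIDIA container runtime (values are illustrative):

```yaml
# Sketch: with the NVIDIA container runtime, these env vars make the runtime
# mount the driver libraries and nvidia-smi into the NPD container.
env:
  - name: NVIDIA_VISIBLE_DEVICES
    value: "all"            # expose all GPUs on the node
  - name: NVIDIA_DRIVER_CAPABILITIES
    value: "utility"        # "utility" covers nvidia-smi/NVML
```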

@wangzhen127
Member

Thanks for filing the feature request! I think this totally makes sense. Do you have a more concrete proposal?

/cc @SergeyKanzhelev

@SergeyKanzhelev
Member

Yes, accelerator health is important functionality and it would be great to have it in NPD.

We need to design it carefully, though. There is already some health checking in the device plugin (like https://github.com/NVIDIA/k8s-device-plugin/blob/bf58cc405af03d864b1502f147815d4c2271ab9a/internal/rm/health.go#L39) that we need to work nicely with. Even simple detection of device plugin health is a good starting point here.

@AllenXu93 @ZongqiangZhang do you want to work on a more detailed design? I will definitely be interested in joining the effort.

@wangzhen127
Member

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 5, 2024
@AllenXu93

> Yes, accelerator health is important functionality and it would be great to have it in NPD.
>
> We need to design it carefully, though. There is already some health checking in the device plugin (like https://github.com/NVIDIA/k8s-device-plugin/blob/bf58cc405af03d864b1502f147815d4c2271ab9a/internal/rm/health.go#L39) that we need to work nicely with. Even simple detection of device plugin health is a good starting point here.
>
> @AllenXu93 @ZongqiangZhang do you want to work on a more detailed design? I will definitely be interested in joining the effort.

Of course.
In our case, we use nvidia-smi to check for pending and failed GPU row remappings (according to https://docs.nvidia.com/deploy/a100-gpu-mem-error-mgmt/#error-recovery-and-response-flags) and mark a condition on the node. When one occurs, we create a job that drains the node and executes a GPU reset. So we need NPD to check GPU health and set the node condition (see the example query below).
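
For context, the kind of nvidia-smi query we use looks roughly like the sketch below (A100-class GPUs; field support and the exact output format depend on the driver version, so treat the parsing as an assumption):

```bash
#!/bin/bash
# Illustrative row-remapping check based on NVIDIA's memory error management
# guidance; verify available fields with `nvidia-smi --help-query-remapped-rows`.
out=$(timeout 25 nvidia-smi \
  --query-remapped-rows=remapped_rows.pending,remapped_rows.failure \
  --format=csv,noheader) || { echo "nvidia-smi failed or hung"; exit 1; }

# Flag the node if any GPU reports a pending remap or a remapping failure;
# matching "yes" or a nonzero count is an assumption about the output format.
if echo "${out}" | grep -qiE 'yes|[1-9]'; then
  echo "GPU row remap pending/failed; drain node and reset GPU"
  exit 1
fi
echo "No pending or failed row remaps"
exit 0
```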

@xuchenCN

LGTM + 1

@AllenXu93

AllenXu93 commented May 31, 2024 via email

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 28, 2024
@SergeyKanzhelev
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Sep 28, 2024
@SergeyKanzhelev
Member

The important question here is what NPD will collect compared to the device plugin. Some design work is needed here.
