Describe the bug
We have a Kubernetes cluster with 2 nodes, each with 8 Intel Data Center GPU Max 1100 GPUs. We installed the GPU plugin v0.34.0 to expose the GPUs to workloads in the cluster. However, when we run the following command in a pod, it fails with the error below:

python -c "import torch; print(torch.xpu.mem_get_info(torch.xpu.current_device())[1])"
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/opt/vllm/lib64/python3.12/site-packages/torch/xpu/memory.py", line 194, in mem_get_info
return torch._C._xpu_getMemoryInfo(device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The device (Intel(R) Data Center GPU Max 1100) doesn't support querying the available free memory. You can file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize its implementation.
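A quick probe can narrow this down: if only the free-memory query is unsupported, other device queries should still succeed. The snippet below is an illustrative sketch, not part of the failing workload; it assumes the torch.xpu device-properties object exposes a total_memory attribute, which may differ across PyTorch builds.

import torch

dev = torch.xpu.current_device()
print("device:", torch.xpu.get_device_name(dev))
try:
    # This is the call that fails inside the pod.
    free, total = torch.xpu.mem_get_info(dev)
    print("free/total bytes:", free, total)
except RuntimeError as err:
    print("mem_get_info unsupported:", err)
    # Fallback: read total memory from device properties (assumed attribute).
    props = torch.xpu.get_device_properties(dev)
    print("total bytes (properties):", props.total_memory)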
However, the same mem_get_info command works in a container started directly on the same node. The container command and the pod definition are below.
sudo nerdctl run -it \
  --name xpu \
  -e VLLM_LOGGING_LEVEL=DEBUG \
  -e http_proxy=http://proxy-dmz.intel.com:912 \
  -e https_proxy=http://proxy-dmz.intel.com:912 \
  -e no_proxy="localhost,127.0.0.1,.maas,10.0.0.0/8,172.16.0.0/16,192.168.0.0/16,134.134.0.0/16,.maas-internal,.svc" \
  --device /dev/dri \
  --entrypoint="bash" \
  --privileged \
  ghcr.io/llm-d/llm-d-xpu:v0.3.0

To Reproduce
apiVersion: v1
kind: Pod
metadata:
  name: xpu
  namespace: default
spec:
  containers:
  - name: xpu
    image: ghcr.io/llm-d/llm-d-xpu:v0.3.0
    imagePullPolicy: Always
    command:
    - bash
    stdin: true
    tty: true
    env:
    - name: VLLM_LOGGING_LEVEL
      value: DEBUG
    - name: http_proxy
      value: http://proxy-dmz.intel.com:912
    - name: https_proxy
      value: http://proxy-dmz.intel.com:912
    - name: no_proxy
      value: "localhost,127.0.0.1,.maas,10.0.0.0/8,172.16.0.0/16,192.168.0.0/16,134.134.0.0/16,.maas-internal,.svc"
    resources:
      limits:
        gpu.intel.com/i915: "8"
      requests:
        gpu.intel.com/i915: "8"
  restartPolicy: Never

The same pod works when privileged: true is set and the gpu.intel.com/i915 requests and limits are removed; a sketch of that variant follows.
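For reference, a minimal sketch of that working variant (the pod name is illustrative; image and settings are the same as above, with the proxy variables trimmed for brevity):

apiVersion: v1
kind: Pod
metadata:
  name: xpu-privileged   # hypothetical name, for illustration
  namespace: default
spec:
  containers:
  - name: xpu
    image: ghcr.io/llm-d/llm-d-xpu:v0.3.0
    command:
    - bash
    stdin: true
    tty: true
    securityContext:
      privileged: true   # grants the container access to host devices, incl. /dev/dri
    # note: no gpu.intel.com/i915 requests or limits here
  restartPolicy: Never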
Expected behavior
Querying GPU memory from the pod should work just as it does in a container run directly on the node.
System (please complete the following information):
- OS version: [e.g. Ubuntu 22.04]
- Kernel version: [e.g. Linux 5.15]
- Device plugins version: v0.34.0
- Hardware info: Intel Data Center GPU Max 1100 (8 per node, 2 nodes)