GPU Max 1100 doesn't support querying the available free memory #2158

@sharvil10

Description

Describe the bug
We have a K8s cluster with 2 nodes and 8 GPU Max 1100s each. We installed the GPU plugin v0.34.0 to access the GPU in the cluster. However, when we run this command in the pod python -c "import torch; print(torch.xpu.mem_get_info(torch.xpu.current_device())[1])" it fails with the following error

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/vllm/lib64/python3.12/site-packages/torch/xpu/memory.py", line 194, in mem_get_info
    return torch._C._xpu_getMemoryInfo(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The device (Intel(R) Data Center GPU Max 1100) doesn't support querying the available free memory. You can file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize its implementation.
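A possible check (an assumption on our part, not verified on this cluster): torch.xpu.mem_get_info reports free memory through Level Zero's Sysman interface, which can be enabled explicitly with the standard ZES_ENABLE_SYSMAN variable. Comparing the two runs inside the pod would show whether Sysman initialization is what differs:

# Run the same query with and without Sysman explicitly enabled.
python -c "import torch; print(torch.xpu.mem_get_info(torch.xpu.current_device()))"
ZES_ENABLE_SYSMAN=1 python -c "import torch; print(torch.xpu.mem_get_info(torch.xpu.current_device()))"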

The same command, however, works in a plain container on the same node. The container command is below; the pod definition is under "To Reproduce".

sudo nerdctl run -it \
  --name xpu \
  -e VLLM_LOGGING_LEVEL=DEBUG \
  -e http_proxy=http://proxy-dmz.intel.com:912 \
  -e https_proxy=http://proxy-dmz.intel.com:912 \
  -e no_proxy="localhost,127.0.0.1,.maas,10.0.0.0/8,172.16.0.0/16,192.168.0.0/16,134.134.0.0/16,.maas-internal,.svc" \
  --device /dev/dri \
  --entrypoint="bash" \
  --privileged \
  ghcr.io/llm-d/llm-d-xpu:v0.3.0
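One difference worth capturing is device visibility: --privileged together with --device /dev/dri exposes every DRM node on the host, while the device plugin presumably mounts only the nodes it schedules (an assumption about the plugin's behavior, not something we have confirmed). Running the following in both the pod and the container would make any difference visible:

# Compare the DRM device nodes and group memberships seen by each environment.
ls -l /dev/dri
id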

To Reproduce

apiVersion: v1
kind: Pod
metadata:
  name: xpu
  namespace: default
spec:
  containers:
  - name: xpu
    image: ghcr.io/llm-d/llm-d-xpu:v0.3.0
    imagePullPolicy: Always
    command:
    - bash
    stdin: true
    tty: true
    env:
    - name: VLLM_LOGGING_LEVEL
      value: DEBUG
    - name: http_proxy
      value: http://proxy-dmz.intel.com:912
    - name: https_proxy
      value: http://proxy-dmz.intel.com:912
    - name: no_proxy
      value: "localhost,127.0.0.1,.maas,10.0.0.0/8,172.16.0.0/16,192.168.0.0/16,134.134.0.0/16,.maas-internal,.svc"
    resources:
      limits:
        gpu.intel.com/i915: "8"
      requests:
        gpu.intel.com/i915: "8"
  restartPolicy: Never

The same pod works when privileged: true is set and the gpu.intel.com/i915 requests and limits are removed; a sketch of that variant follows.
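For reference, a minimal sketch of the working privileged variant (the pod name is illustrative, and we assume privileged goes under the container securityContext):

apiVersion: v1
kind: Pod
metadata:
  name: xpu-privileged   # illustrative name
  namespace: default
spec:
  containers:
  - name: xpu
    image: ghcr.io/llm-d/llm-d-xpu:v0.3.0
    imagePullPolicy: Always
    command:
    - bash
    stdin: true
    tty: true
    securityContext:
      privileged: true   # replaces the gpu.intel.com/i915 requests and limits
  restartPolicy: Never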

Expected behavior
The free-memory query should work in a pod that receives its GPUs through the device plugin, just as it does in a directly launched container on the same node.

System (please complete the following information):

  • OS version: [e.g. Ubuntu 22.04]
  • Kernel version: [e.g. Linux 5.15]
  • Device plugins version: v0.34.0
  • Hardware info: Intel Data Center GPU Max 1100 (8 per node)

Labels: bug (Something isn't working), gpu (GPU device plugin related issue)
