GPU Max 1100 doesn't support querying the available free memory #2158

@sharvil10

Description

Describe the bug
We have a K8s cluster with 2 nodes and 8 GPU Max 1100s each. We installed the GPU plugin v0.34.0 to access the GPU in the cluster. However, when we run this command in the pod python -c "import torch; print(torch.xpu.mem_get_info(torch.xpu.current_device())[1])" it fails with the following error

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/vllm/lib64/python3.12/site-packages/torch/xpu/memory.py", line 194, in mem_get_info
    return torch._C._xpu_getMemoryInfo(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The device (Intel(R) Data Center GPU Max 1100) doesn't support querying the available free memory. You can file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize its implementation.
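A possible check (an assumption on our part, not verified on this cluster): torch.xpu.mem_get_info reports free memory through Level Zero's Sysman interface, which can be enabled explicitly with the standard ZES_ENABLE_SYSMAN variable. Comparing the two runs inside the pod would show whether Sysman initialization is what differs:

# Run the same query with and without Sysman explicitly enabled.
python -c "import torch; print(torch.xpu.mem_get_info(torch.xpu.current_device()))"
ZES_ENABLE_SYSMAN=1 python -c "import torch; print(torch.xpu.mem_get_info(torch.xpu.current_device()))"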

The same command, however, works in a plain container on the same node. The container command is below; the pod definition is under "To Reproduce".

sudo nerdctl run -it \
  --name xpu \
  -e VLLM_LOGGING_LEVEL=DEBUG \
  -e http_proxy=http://proxy-dmz.intel.com:912 \
  -e https_proxy=http://proxy-dmz.intel.com:912 \
  -e no_proxy="localhost,127.0.0.1,.maas,10.0.0.0/8,172.16.0.0/16,192.168.0.0/16,134.134.0.0/16,.maas-internal,.svc" \
  --device /dev/dri \
  --entrypoint="bash" \
  --privileged \
  ghcr.io/llm-d/llm-d-xpu:v0.3.0
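One difference worth capturing is device visibility: --privileged together with --device /dev/dri exposes every DRM node on the host, while the device plugin presumably mounts only the nodes it schedules (an assumption about the plugin's behavior, not something we have confirmed). Running the following in both the pod and the container would make any difference visible:

# Compare the DRM device nodes and group memberships seen by each environment.
ls -l /dev/dri
id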

To Reproduce

apiVersion: v1
kind: Pod
metadata:
  name: xpu
  namespace: default
spec:
  containers:
  - name: xpu
    image: ghcr.io/llm-d/llm-d-xpu:v0.3.0
    imagePullPolicy: Always
    command:
    - bash
    stdin: true
    tty: true
    env:
    - name: VLLM_LOGGING_LEVEL
      value: DEBUG
    - name: http_proxy
      value: http://proxy-dmz.intel.com:912
    - name: https_proxy
      value: http://proxy-dmz.intel.com:912
    - name: no_proxy
      value: "localhost,127.0.0.1,.maas,10.0.0.0/8,172.16.0.0/16,192.168.0.0/16,134.134.0.0/16,.maas-internal,.svc"
    resources:
      limits:
        gpu.intel.com/i915: "8"
      requests:
        gpu.intel.com/i915: "8"
  restartPolicy: Never

The same pod works when privileged: true is set and the gpu.intel.com/i915 requests and limits are removed; a sketch of that variant follows.
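For reference, a minimal sketch of the working privileged variant (the pod name is illustrative, and we assume privileged goes under the container securityContext):

apiVersion: v1
kind: Pod
metadata:
  name: xpu-privileged   # illustrative name
  namespace: default
spec:
  containers:
  - name: xpu
    image: ghcr.io/llm-d/llm-d-xpu:v0.3.0
    imagePullPolicy: Always
    command:
    - bash
    stdin: true
    tty: true
    securityContext:
      privileged: true   # replaces the gpu.intel.com/i915 requests and limits
  restartPolicy: Never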

Expected behavior
The free-memory query should work in a pod that receives its GPUs through the device plugin, just as it does in a directly launched container on the same node.

System (please complete the following information):

  • OS version: [e.g. Ubuntu 22.04]
  • Kernel version: [e.g. Linux 5.15]
  • Device plugins version: v0.34.0
  • Hardware info: Intel Data Center GPU Max 1100 (8 per node)

Labels: bug (Something isn't working), gpu (GPU device plugin related issue)
