Summary
AirLLM currently hardcodes CUDA (NVIDIA) throughout its codebase, making it unusable on machines with only integrated GPUs (Intel UHD / Iris Xe / Arc, AMD Radeon integrated). This issue proposes adding support for Intel and AMD integrated GPUs on Windows via torch-directml.
Problem
Every CUDA-specific call in the codebase prevents non-NVIDIA users from running AirLLM:
- `torch.cuda.empty_cache()` in `utils.py`
- `v.cuda()` in `compress_layer_state_dict` / `uncompress_layer_state_dict`
- `device.startswith("cuda")` gating the prefetch stream in `airllm_base.py`
- `torch.cuda.is_available()` gating `pin_memory` in `airllm_base.py`
- `torch.cuda.mem_get_info()` in `profiler.py`
- Default `device="cuda:0"` with no fallback
This affects a large number of users — laptops and budget desktops with no discrete GPU are extremely common.
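For illustration, the first hardcoded call a non-NVIDIA user hits fails like this (assumed repro with a CPU-only PyTorch wheel; the exact traceback depends on the installed build):

```python
# Assumed repro on a machine without CUDA (CPU-only PyTorch wheel).
import torch

t = torch.randn(4)
t.cuda()  # AssertionError: Torch not compiled with CUDA enabled
```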
Proposed Solution
- New `device_utils.py` module with device-agnostic helpers (see the sketch after this list):
  - `get_device_type(device)` → `"cuda"` | `"directml"` | `"mps"` | `"cpu"`
  - `empty_cache(device)`: safe on any device
  - `can_pin_memory(device)`: true only for CUDA
  - `supports_bitsandbytes(device)`: true only for CUDA (guards compression)
  - `is_directml_available()`: detects a torch-directml install
- Replace all CUDA-specific calls in `utils.py`, `airllm_base.py`, and `profiler.py` with device-agnostic equivalents.
- Raise a clear error when the user passes `compression=` on a non-CUDA device (bitsandbytes is NVIDIA-only), rather than a confusing crash.
- Optional dependency: `torch-directml` is added as `pip install airllm[directml]`.
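A minimal sketch of what `device_utils.py` could look like. The helper names come from this proposal; the bodies below are illustrative assumptions, not the final implementation:

```python
# Sketch of the proposed device_utils.py; implementation details are assumptions.
import torch


def is_directml_available() -> bool:
    # torch-directml is optional; a missing package just means "not available".
    try:
        import torch_directml  # noqa: F401
        return True
    except ImportError:
        return False


def get_device_type(device) -> str:
    # torch-directml exposes its devices under PyTorch's "privateuseone" backend;
    # "dml" is accepted here as a convenience alias (an assumption of this sketch).
    device = str(device)
    if device.startswith("cuda"):
        return "cuda"
    if device.startswith("privateuseone") or device.startswith("dml"):
        return "directml"
    if device.startswith("mps"):
        return "mps"
    return "cpu"


def empty_cache(device) -> None:
    # Only CUDA and MPS expose an allocator cache to flush; elsewhere this is a
    # no-op, which is what makes clean_memory() safe on DirectML.
    device_type = get_device_type(device)
    if device_type == "cuda" and torch.cuda.is_available():
        torch.cuda.empty_cache()
    elif device_type == "mps":
        torch.mps.empty_cache()


def can_pin_memory(device) -> bool:
    # Pinned (page-locked) host memory only benefits CUDA host-to-device copies.
    return get_device_type(device) == "cuda" and torch.cuda.is_available()


def supports_bitsandbytes(device) -> bool:
    # bitsandbytes 4-bit/8-bit kernels are NVIDIA-only.
    return get_device_type(device) == "cuda"
```

Call sites then swap `torch.cuda.empty_cache()` for `empty_cache(device)` and gate `pin_memory` on `can_pin_memory(device)` instead of `torch.cuda.is_available()`.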
Hardware Tested
- GPU: Intel Iris Xe Graphics (integrated)
- OS: Windows 11
- Python: 3.12.0, PyTorch 2.4.1, torch-directml 0.2.5
Verified:
- ✅ Device detection works correctly
- ✅ Tensor operations (1024×1024 matmul) run on the iGPU via DirectML (see the smoke test below)
- ✅ Layer load → move to iGPU → unload cycle works
- ✅ `clean_memory()` / `empty_cache()` safe on DirectML
- ✅ Compression correctly blocked with a clear error on non-CUDA devices
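For reference, the matmul check above amounts to something like this (assumes torch-directml is installed; `torch_directml.device()` returns the DirectML handle, which PyTorch reports as `privateuseone:0`):

```python
# Smoke test: 1024×1024 matmul on the integrated GPU via DirectML.
import torch
import torch_directml

dml = torch_directml.device()  # first DirectML adapter, e.g. the Intel iGPU
a = torch.randn(1024, 1024, device=dml)
b = torch.randn(1024, 1024, device=dml)
c = a @ b  # executes on the iGPU
print(c.device)  # privateuseone:0
```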
What Still Requires CUDA
`bitsandbytes` (used for 4-bit/8-bit compression) only supports NVIDIA GPUs. Compression is disabled for DirectML/MPS devices with a clear `ValueError`. This is an acceptable limitation and is documented.
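A sketch of that guard (hypothetical function name and error wording; `supports_bitsandbytes()` is from the `device_utils.py` sketch above):

```python
# Hypothetical guard run during model init; exact wording/placement is up to the PR.
from device_utils import supports_bitsandbytes  # see the sketch above


def check_compression_supported(compression, device: str) -> None:
    if compression is not None and not supports_bitsandbytes(device):
        raise ValueError(
            f"compression={compression!r} requires an NVIDIA GPU "
            f"(bitsandbytes is CUDA-only), but device={device!r}. "
            "Remove compression= to run on DirectML/MPS/CPU."
        )
```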
Usage After This Change
```bash
# Install
pip install torch-directml
pip install airllm[directml]
```

```python
# Run on an Intel/AMD integrated GPU
from airllm import AutoModel

model = AutoModel.from_pretrained(
    "your-model-id",
    device="privateuseone:0",  # Intel/AMD iGPU via DirectML
    # compression= must not be used on an iGPU
)
```
Are you open to a PR?
I have a working implementation ready. Happy to open a PR if this direction is acceptable to the maintainers. I can also extend this to support ROCm (AMD discrete) and Intel IPEX if useful.