KeepGPU keeps shared GPUs from being reclaimed while you prep data, debug, or coordinate multi-stage pipelines. It allocates just enough VRAM and issues lightweight CUDA work so schedulers observe an “active” device, without running a full training job.
- 🧾 License: MIT
- 📚 Docs: https://keepgpu.readthedocs.io
On many clusters, idle GPUs are reaped or silently shared after a short grace period. The cost of losing your reservation (or discovering another job has taken your card) can dwarf the cost of a tiny keep-alive loop. KeepGPU is a minimal, auditable guardrail:
- Predictable – Single-purpose controller with explicit resource knobs (VRAM size, interval, utilization backoff).
- Polite – Uses NVML to read utilization and backs off when the GPU is busy.
- Portable – Typer/Rich CLI for humans; Python API for orchestrators and notebooks.
- Observable – Structured logging and optional file logs for auditing what kept the GPU alive.
- Power-aware – Uses interval-paced elementwise ops instead of heavy matmul floods to present “busy” utilization while keeping power and thermals lower (see `CudaGPUController._run_mat_batch` for the loop; a simplified sketch of the pattern follows this list).
- NVML-backed – GPU telemetry comes from `nvidia-ml-py` (the `pynvml` module), with optional `rocm-smi` support when you install the `rocm` extra.
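For intuition, here is a minimal sketch of that pattern written directly against `pynvml` and `torch`. It is an illustration of the idea, not KeepGPU's internal `_run_mat_batch` loop, and the buffer size, threshold, and interval are placeholder values:

```python
import time

import pynvml
import torch

GPU_ID = 0
BUSY_THRESHOLD = 25   # percent; skip the burst above this utilization
INTERVAL = 60         # seconds between keep-alive bursts

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(GPU_ID)

# Pin a modest buffer so the allocation is visible to the scheduler / nvidia-smi.
buf = torch.empty(256 * 1024 * 1024 // 4, dtype=torch.float32, device=f"cuda:{GPU_ID}")

try:
    while True:
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        if util < BUSY_THRESHOLD:
            buf.mul_(1.0)                   # cheap elementwise kernel, not a matmul flood
            torch.cuda.synchronize(GPU_ID)  # make sure the kernel actually executed
        time.sleep(INTERVAL)
finally:
    pynvml.nvmlShutdown()
```

The real controllers layer the lifecycle, structured logging, and backoff behaviour described above on top of this kind of loop.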
```bash
pip install keep-gpu
```
```bash
# Hold GPU 0 with 1 GiB VRAM and throttle if utilization exceeds 25%
keep-gpu --gpu-ids 0 --vram 1GiB --busy-threshold 25 --interval 60
```

- CUDA (example: cu121)
```bash
pip install --index-url https://download.pytorch.org/whl/cu121 torch
pip install keep-gpu
```
- ROCm (example: rocm6.1)
```bash
pip install --index-url https://download.pytorch.org/whl/rocm6.1 torch
pip install keep-gpu[rocm]
```
- CPU-only
```bash
pip install torch
pip install keep-gpu
```
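For the CUDA or ROCm variants, a quick sanity check with plain `torch` (nothing KeepGPU-specific) confirms the device is visible before you rely on the keep-alive; ROCm builds of PyTorch report through the same `torch.cuda` namespace:

```python
import torch

# The matching torch build should see at least one accelerator.
print("Accelerator available:", torch.cuda.is_available())
print("Visible devices:", torch.cuda.device_count())
```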
Flags that matter:
- `--vram` (`1GiB`, `750MB`, or bytes): how much memory to pin.
- `--interval` (seconds): sleep between keep-alive bursts.
- `--busy-threshold`: skip work when NVML reports higher utilization.
- `--gpu-ids`: target a subset; otherwise all visible GPUs are guarded.
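If you launch the CLI from an orchestration script rather than an interactive shell, a thin `subprocess` wrapper over the same flags is enough. A sketch, assuming the CLI keeps running (and the GPUs guarded) until it is stopped; `prepare_inputs` is a placeholder for your own code:

```python
import subprocess

# Guard GPU 0 while CPU-side preparation runs, then release it.
proc = subprocess.Popen([
    "keep-gpu",
    "--gpu-ids", "0",
    "--vram", "1GiB",
    "--interval", "60",
    "--busy-threshold", "25",
])

try:
    prepare_inputs()   # placeholder: your CPU-heavy preparation step
finally:
    proc.terminate()   # stop guarding before the real GPU job starts
    proc.wait()
```

For in-process use, the Python API below is the more direct route.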
```python
from keep_gpu.single_gpu_controller.cuda_gpu_controller import CudaGPUController

with CudaGPUController(rank=0, interval=0.5, vram_to_keep="1GiB", busy_threshold=20):
    preprocess_dataset()  # GPU is marked busy while you run CPU-heavy code

train_model()  # GPU freed after exiting the context
```

Need multiple devices?
```python
from keep_gpu.global_gpu_controller.global_gpu_controller import GlobalGPUController

with GlobalGPUController(gpu_ids=[0, 1], vram_to_keep="750MB", interval=90, busy_threshold=30):
    run_pipeline_stage()
```

- Battle-tested keep-alive loop built on PyTorch.
- NVML-based utilization monitoring (by way of `nvidia-ml-py`) to avoid hogging busy GPUs; optional ROCm SMI support by way of `pip install keep-gpu[rocm]`.
- CLI + API parity: same controllers power both code paths.
- Continuous docs + CI: mkdocs + mkdocstrings build in CI to keep guidance up to date.
- Install dev extras: `pip install -e ".[dev]"` (add `.[rocm]` if you need ROCm SMI).
- Fast CUDA checks: `pytest tests/cuda_controller tests/global_controller tests/utilities/test_platform_manager.py tests/test_cli_thresholds.py`
- ROCm-only tests carry `@pytest.mark.rocm`; run them with `pytest --run-rocm tests/rocm_controller`.
- Markers: `rocm` (needs ROCm stack) and `large_memory` (opt-in locally).
- Start a simple JSON-RPC server on stdin/stdout (default):

  ```bash
  keep-gpu-mcp-server
  ```

- Or expose it over HTTP (JSON-RPC 2.0 by way of POST):

  ```bash
  keep-gpu-mcp-server --mode http --host 0.0.0.0 --port 8765
  ```

- Example request (one per line):

  ```json
  {"id": 1, "method": "start_keep", "params": {"gpu_ids": [0], "vram": "512MB", "interval": 60, "busy_threshold": 20}}
  ```

- Methods: `start_keep`, `stop_keep` (optional `job_id`, default stops all), `status` (optional `job_id`), `list_gpus` (basic info). A Python client sketch follows this list.
- Minimal client config (stdio MCP):
  ```yaml
  servers:
    keepgpu:
      command: ["keep-gpu-mcp-server"]
      adapter: stdio
  ```
- Minimal client config (HTTP MCP):
  ```yaml
  servers:
    keepgpu:
      url: http://127.0.0.1:8765/
      adapter: http
  ```
- Remote/SSH tunnel example (HTTP):

  Start the server on the GPU host:

  ```bash
  keep-gpu-mcp-server --mode http --host 0.0.0.0 --port 8765
  ```

  Client config (replace hostname/tunnel as needed):

  ```yaml
  servers:
    keepgpu:
      url: http://gpu-box.example.com:8765/
      adapter: http
  ```

  For untrusted networks, put the server behind your own auth/reverse-proxy or tunnel by way of SSH (for example, `ssh -L 8765:localhost:8765 gpu-box`).
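To script the stdio server from Python, a small newline-delimited JSON-RPC client is enough. A sketch under assumptions: requests and responses are one JSON object per line (as in the example request above), and the response shape is whatever the server returns:

```python
import itertools
import json
import subprocess

# Drive the stdio JSON-RPC server: one JSON object per line in, one per line out (assumed).
proc = subprocess.Popen(
    ["keep-gpu-mcp-server"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    text=True,
)
_ids = itertools.count(1)

def call(method, params=None):
    request = {"id": next(_ids), "method": method, "params": params or {}}
    proc.stdin.write(json.dumps(request) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())

print(call("start_keep", {"gpu_ids": [0], "vram": "512MB", "interval": 60, "busy_threshold": 20}))
print(call("status"))     # optionally pass {"job_id": ...}, per the method list above
print(call("stop_keep"))  # no job_id: stops all jobs

proc.stdin.close()
proc.wait()
```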
Contributions are welcome—especially around ROCm support, platform fallbacks, and scheduler-specific recipes. Open an issue or PR if you hit edge cases on your cluster. See docs/contributing.md for dev setup, test commands, and PR tips.
This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.
If you find KeepGPU useful in your research or work, please cite it as:
```bibtex
@software{Wangmerlyn_KeepGPU_2025,
  author    = {Wang, Siyuan and Shi, Yaorui and Liu, Yida and Yin, Yuqi},
  title     = {KeepGPU: a simple CLI app that keeps your GPUs running},
  year      = {2025},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17129114},
  url       = {https://github.com/Wangmerlyn/KeepGPU},
  note      = {GitHub repository},
  keywords  = {ai, hpc, gpu, cluster, cuda, torch, debug}
}
```