vLLM is a high-performance library for LLM inference and serving with an OpenAI-compatible API.
To start the vLLM container, set your Hugging Face token and run:
export HF_TOKEN=<your_huggingface_token_here>
podman compose up --detach
Once the container is running, you can access the OpenAI-compatible API at http://localhost:8000.
You can view the OpenAPI documentation at http://localhost:8000/docs.
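As a quick check that the server is answering, something like the following sketch may help. It assumes the API is reachable at the address given above and that the server exposes the usual OpenAI-style `/v1/models` route; `VLLM_BASE_URL` and the three helper function names are hypothetical, introduced only for this example. Model loading can take some time after the container starts, so the sketch polls rather than failing on the first attempt.

```shell
# Base URL from the text above. VLLM_BASE_URL is a hypothetical override
# for setups where the port is mapped differently.
VLLM_BASE_URL="${VLLM_BASE_URL:-http://localhost:8000}"

# Build the full URL for an API path such as /v1/models.
# A trailing slash on the base URL is stripped to avoid "//".
vllm_url() {
  printf '%s%s\n' "${VLLM_BASE_URL%/}" "$1"
}

# List the models the server reports (assumes the OpenAI-style /v1/models route).
vllm_list_models() {
  curl --silent --show-error --fail "$(vllm_url /v1/models)"
}

# Poll until the server responds, giving up after N attempts (default 30),
# with a 2-second pause between attempts.
vllm_wait_ready() {
  attempts="${1:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if vllm_list_models >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  return 1
}
```

With these functions loaded in the current shell, `vllm_wait_ready && vllm_list_models` waits for the server and then prints the model list as JSON.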
To stop and remove the containers, along with the volumes declared in the compose file, run:
podman compose down --volumes