Cog is an open-source tool that lets you package machine learning models in a standard, production-ready container. vLLM is a fast and easy-to-use library for LLM inference and serving.
You can deploy your packaged model to your own infrastructure, or to Replicate.
- 🚀 Run vLLM in the cloud with an API. Deploy any vLLM-supported language model at scale on Replicate.
- 🏭 Support multiple concurrent requests. Continuous batching works out of the box.
- 🐢 Open Source, all the way down. Look inside, take it apart, make it do exactly what you need.
Go to [replicate.com/replicate/vllm](https://replicate.com/replicate/vllm) and create a new vLLM model from a supported Hugging Face repo, such as `google/gemma-2b`.
> **Important:** Gated models require a Hugging Face API token, which you can set in the `hf_token` field of the model creation form.
Replicate downloads the model files, packages them into a `.tar` archive, and pushes a new version of your model that's ready to use.
From here, you can either use your model as-is, or customize it and push up your changes.
If you're on a machine or VM with a GPU, you can try out changes before pushing them to Replicate.
Start by installing or upgrading Cog. You'll need Cog v0.10.0-alpha11:
```console
$ sudo curl -o /usr/local/bin/cog -L "https://github.com/replicate/cog/releases/download/v0.10.0-alpha11/cog_$(uname -s)_$(uname -m)"
$ sudo chmod +x /usr/local/bin/cog
```
Then clone this repository:
```console
$ git clone https://github.com/replicate/cog-vllm
$ cd cog-vllm
```
Go to the Replicate dashboard and navigate to the training for your vLLM model. From that page, copy the weights URL from the **Download weights** button.
Set the `COG_WEIGHTS` environment variable to that copied value:
```console
$ export COG_WEIGHTS="..."
```
Now, make your first prediction against the model locally:
```console
$ cog predict -e "COG_WEIGHTS=$COG_WEIGHTS" \
    -i prompt="Hello!"
```
The first time you run this command, Cog downloads the model weights and saves them to the `models` subdirectory.
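That caching step is worth understanding if you plan to customize the predictor: roughly, if the `models` directory is missing, the archive is fetched from the `COG_WEIGHTS` URL and unpacked, and later runs reuse the local copy. Here's a minimal sketch of that behavior; the directory name comes from this README, but `ensure_weights` and the exact layout are illustrative, not the actual code in `predict.py`:

```python
import os
import tarfile
import urllib.request

MODELS_DIR = "models"  # local cache directory named in this README

def ensure_weights(weights_url: str) -> str:
    """Download and unpack the weights archive on first use; reuse it after."""
    if not os.path.exists(MODELS_DIR):
        archive_path = "weights.tar"
        # First run: fetch the .tar archive of model files
        urllib.request.urlretrieve(weights_url, archive_path)
        os.makedirs(MODELS_DIR, exist_ok=True)
        with tarfile.open(archive_path) as tar:
            tar.extractall(MODELS_DIR)
        os.remove(archive_path)
    return MODELS_DIR

# Example: ensure_weights(os.environ["COG_WEIGHTS"])
```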
To make multiple predictions, start up the HTTP server and send it `POST /predictions` requests.
```console
# Start the HTTP server
$ cog run -p 5000 -e "COG_WEIGHTS=$COG_WEIGHTS" python -m cog.server.http

# In a different terminal session, send requests to the server
$ curl http://localhost:5000/predictions -X POST \
    -H 'Content-Type: application/json' \
    -d '{"input": {"prompt": "Hello!"}}'
```
When you're finished working, you can push your changes to Replicate.
Grab your token from [replicate.com/account](https://replicate.com/account) and set it as an environment variable:

```console
$ export REPLICATE_API_TOKEN=<your token>
```

Then authenticate and push your model:

```console
$ echo $REPLICATE_API_TOKEN | cog login --token-stdin
$ cog push r8.im/<your-username>/<your-model-name>
--> ...
--> Pushing image 'r8.im/...'
```
After you push your model, you can try running it on Replicate.
Install the Replicate Python SDK:
```console
$ pip install replicate
```
Create a prediction and stream its output:
```python
import replicate

model = replicate.models.get("<your-username>/<your-model-name>")
prediction = replicate.predictions.create(
    version=model.latest_version,
    input={"prompt": "Hello"},
    stream=True,
)

for event in prediction.stream():
    print(str(event), end="")
```
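If you don't need token-by-token streaming, `replicate.run` is a shorter path: it creates the prediction and blocks until the output is ready. A minimal sketch, assuming a recent `replicate` client (which resolves the model's latest version when you don't pin one) and a model whose output is a list of text chunks:

```python
import replicate

# Blocks until the prediction finishes; the model name is a placeholder
output = replicate.run(
    "<your-username>/<your-model-name>",
    input={"prompt": "Hello"},
)
print("".join(output))
```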