Cog is an open-source tool that lets you package machine learning models in a standard, production-ready container. vLLM is a fast and easy-to-use library for LLM inference and serving.
You can deploy your packaged model to your own infrastructure, or to Replicate.
- 🚀 Run vLLM in the cloud with an API. Deploy any vLLM-supported language model at scale on Replicate.
- 🏭 Support multiple concurrent requests. Continuous batching works out of the box.
- 🐢 Open Source, all the way down. Look inside, take it apart, make it do exactly what you need.
Go to [replicate.com/replicate/vllm](https://replicate.com/replicate/vllm) and create a new vLLM model from a supported Hugging Face repo, such as `google/gemma-2b`.
> **Important:** Gated models require a Hugging Face API token, which you can set in the `hf_token` field of the model creation form.
Replicate downloads the model files, packages them into a `.tar` archive, and pushes a new version of your model that's ready to use.
From here, you can either use your model as-is, or customize it and push up your changes.
If you're on a machine or VM with a GPU, you can try out changes before pushing them to Replicate.
Start by installing or upgrading Cog. You'll need Cog v0.10.0-alpha11:
```console
$ sudo curl -o /usr/local/bin/cog -L "https://github.com/replicate/cog/releases/download/v0.10.0-alpha11/cog_$(uname -s)_$(uname -m)"
$ sudo chmod +x /usr/local/bin/cog
```
Then clone this repository:
```console
$ git clone https://github.com/replicate/cog-vllm
$ cd cog-vllm
```
Go to the Replicate dashboard and navigate to the training for your vLLM model. From that page, copy the weights URL from the **Download weights** button.
Set the `COG_WEIGHTS` environment variable to that copied value:
```console
$ export COG_WEIGHTS="..."
```
Now, make your first prediction against the model locally:
```console
$ cog predict -e "COG_WEIGHTS=$COG_WEIGHTS" \
    -i prompt="Hello!"
```
The first time you run this command, Cog downloads the model weights and saves them to the `models` subdirectory.
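That caching step is worth understanding if you plan to customize the predictor: roughly, if the `models` directory is missing, the archive is fetched from the `COG_WEIGHTS` URL and unpacked, and later runs reuse the local copy. Here's a minimal sketch of that behavior; the directory name comes from this README, but `ensure_weights` and the exact layout are illustrative, not the actual code in `predict.py`:

```python
import os
import tarfile
import urllib.request

MODELS_DIR = "models"  # local cache directory named in this README

def ensure_weights(weights_url: str) -> str:
    """Download and unpack the weights archive on first use; reuse it after."""
    if not os.path.exists(MODELS_DIR):
        archive_path = "weights.tar"
        # First run: fetch the .tar archive of model files
        urllib.request.urlretrieve(weights_url, archive_path)
        os.makedirs(MODELS_DIR, exist_ok=True)
        with tarfile.open(archive_path) as tar:
            tar.extractall(MODELS_DIR)
        os.remove(archive_path)
    return MODELS_DIR

# Example: ensure_weights(os.environ["COG_WEIGHTS"])
```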
To make multiple predictions, start up the HTTP server and send it `POST /predictions` requests.
```console
# Start the HTTP server
$ cog run -p 5000 -e "COG_WEIGHTS=$COG_WEIGHTS" python -m cog.server.http

# In a different terminal session, send requests to the server
$ curl http://localhost:5000/predictions -X POST \
    -H 'Content-Type: application/json' \
    -d '{"input": {"prompt": "Hello!"}}'
```
When you're finished working, you can push your changes to Replicate.
Grab your token from [replicate.com/account](https://replicate.com/account) and set it as an environment variable:

```console
$ export REPLICATE_API_TOKEN=<your token>
```

Then authenticate and push your model:

```console
$ echo $REPLICATE_API_TOKEN | cog login --token-stdin
$ cog push r8.im/<your-username>/<your-model-name>
--> ...
--> Pushing image 'r8.im/...'
```
After you push your model, you can try running it on Replicate.
Install the Replicate Python SDK:
```console
$ pip install replicate
```
Create a prediction and stream its output:
```python
import replicate

model = replicate.models.get("<your-username>/<your-model-name>")
prediction = replicate.predictions.create(
    version=model.latest_version,
    input={"prompt": "Hello"},
    stream=True,
)

for event in prediction.stream():
    print(str(event), end="")
```
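If you don't need token-by-token streaming, `replicate.run` is a shorter path: it creates the prediction and blocks until the output is ready. A minimal sketch, assuming a recent `replicate` client (which resolves the model's latest version when you don't pin one) and a model whose output is a list of text chunks:

```python
import replicate

# Blocks until the prediction finishes; the model name is a placeholder
output = replicate.run(
    "<your-username>/<your-model-name>",
    input={"prompt": "Hello"},
)
print("".join(output))
```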