Google's release of Gemma has made a big wave in the AI community, opening the opportunity for the open-source community to serve and fine-tune its own private "Gemini".
Serving Gemma on any cloud is easy with SkyPilot: with the serve.yaml in this directory, you can host the model with a single command.
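As a rough sketch of what such a task file looks like (illustrative only, not the actual serve.yaml; this assumes vLLM's OpenAI-compatible server, which matches the /v1/... routes used below):

```yaml
# Illustrative sketch only; see the actual serve.yaml in this directory.
envs:
  HF_TOKEN: ""  # injected via --env HF_TOKEN; lets huggingface_hub pull the gated weights

resources:
  accelerators: A100:1  # example GPU; the real file may request different accelerators
  ports: 8000

setup: |
  pip install vllm

run: |
  python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-7b-it --host 0.0.0.0 --port 8000
```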
- Apply for access to the Gemma model
Go to the application page and click Acknowledge license to apply for access to the model weights.
- Get the access token from Hugging Face
Generate a read-only access token on Hugging Face here, and make sure your Hugging Face account can access the Gemma models here.
- Install SkyPilot
pip install "skypilot-nightly[all]"
For detailed installation instructions, please refer to the installation guide.
We can host the model with a single instance:
HF_TOKEN="xxx" sky launch -c gemma serve.yaml --env HF_TOKEN
After the cluster is launched, we can access the model with the following command:
IP=$(sky status --ip gemma)
curl http://$IP:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-7b-it",
"prompt": "My favourite condiment is",
"max_tokens": 25
}' | jq .
Chat API is also supported:
IP=$(sky status --ip gemma)
curl http://$IP:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-7b-it",
"messages": [
{
"role": "user",
"content": "Hello! What is your name?"
}
],
"max_tokens": 25
}'
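The two curl calls above can also be issued from Python. Below is a minimal stdlib sketch; the helper names (`completion_payload`, `chat_payload`, `_post`) are hypothetical, but the request bodies mirror the curl examples exactly:

```python
import json
import urllib.request

def completion_payload(prompt: str, max_tokens: int = 25) -> dict:
    # Body for POST /v1/completions, same fields as the curl example.
    return {"model": "google/gemma-7b-it", "prompt": prompt, "max_tokens": max_tokens}

def chat_payload(content: str, max_tokens: int = 25) -> dict:
    # Body for POST /v1/chat/completions, same fields as the curl example.
    return {
        "model": "google/gemma-7b-it",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }

def _post(url: str, payload: dict) -> dict:
    # Send the JSON body and decode the server's JSON reply.
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Example usage (requires a running cluster):
# ip = ...  # from: sky status --ip gemma
# reply = _post(f"http://{ip}:8000/v1/chat/completions", chat_payload("Hello!"))
```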
Using the same YAML, we can easily scale the model serving across multiple instances, regions and clouds with SkyServe:
HF_TOKEN="xxx" sky serve up -n gemma serve.yaml --env HF_TOKEN
Notice that the only change is from `sky launch` to `sky serve up`. The same YAML can be used without changes.
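The same YAML works for both commands because a SkyServe task file can include a `service` section that `sky serve up` reads and `sky launch` ignores. A sketch of what that section might contain (field names per recent SkyPilot versions; check the SkyServe docs for yours):

```yaml
# Sketch of a service section SkyServe could read from serve.yaml.
service:
  readiness_probe: /v1/models  # vLLM's OpenAI-compatible server exposes this route
  replicas: 2                  # SkyServe load-balances requests across replicas
```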
After the service is up, we can access the model with the following command:
ENDPOINT=$(sky serve status --endpoint gemma)
curl http://$ENDPOINT/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-7b-it",
"prompt": "My favourite condiment is",
"max_tokens": 25
}' | jq .
Chat API is also supported:
curl http://$ENDPOINT/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "google/gemma-7b-it",
"messages": [
{
"role": "user",
"content": "Hello! What is your name?"
}
],
"max_tokens": 25
}'