Add examples for all the supported tasks for huggingface runtime (#377)
* Add examples for all the supported tasks for huggingface runtime

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Update docs/modelserving/explainer/alibi/cifar10/README.md

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>
sivanantha321 and yuzisun authored Jun 23, 2024
1 parent 02b3d6d commit 43d43aa
Showing 10 changed files with 890 additions and 146 deletions.
5 changes: 5 additions & 0 deletions docs/modelserving/explainer/alibi/cifar10/README.md
@@ -20,6 +20,11 @@ spec:
limits:
memory: 10Gi
explainer:
containers:
- name: kserve-container
image: kserve/alibi-explainer:v0.12.1
args:
- --model_name=cifar10
alibi:
type: AnchorImages
storageUri: "gs://kfserving-examples/models/tensorflow/cifar/explainer-0.9.1"
177 changes: 32 additions & 145 deletions docs/modelserving/v1beta1/llm/huggingface/README.md
@@ -1,160 +1,46 @@
# Deploy the Llama3 model with Hugging Face LLM Serving Runtime
The Hugging Face serving runtime implements a runtime that can serve Hugging Face models out of the box.
The preprocess and post-process handlers are implemented based on different ML tasks, for example text classification,
# Hugging Face LLM Serving Runtime
The Hugging Face serving runtime implements two backends, namely `Hugging Face` and `vLLM`, that can serve Hugging Face models out of the box.
The preprocess and post-process handlers are already implemented based on different ML tasks, for example text classification,
token-classification, text-generation, text2text-generation, fill-mask.

Based on the performance requirement for large language models(LLM), KServe chooses to run the optimized inference engine [vLLM](https://github.com/vllm-project/vllm) for text generation tasks by default considering its ease-of-use and high performance.
KServe Hugging Face runtime by default uses the [`vLLM`](https://github.com/vllm-project/vllm) backend to serve `text generation` and `text2text generation` LLM models for faster time-to-first-token (TTFT) and higher token generation throughput than the Hugging Face API.
vLLM is implemented with common inference optimization techniques, such as paged attention, continuous batching and an optimized CUDA kernel.
If the model is not supported by the vLLM engine, KServe falls back to the Hugging Face backend as a failsafe.
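As a minimal sketch, the backend can also be pinned explicitly with the `--backend` argument instead of relying on the automatic fallback; the service name, model name, and model id below are placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-example          # hypothetical service name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=example-model         # placeholder model name
        - --model_id=<hf-org>/<hf-model>     # placeholder Hugging Face model id
        - --backend=huggingface              # force the Hugging Face backend instead of vLLM
```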

In this example, we deploy a Llama3 model from Hugging Face by deploying the `InferenceService` with [Hugging Face Serving runtime](https://github.com/kserve/kserve/tree/master/python/huggingfaceserver).
## Supported ML Tasks
The Hugging Face runtime supports the following ML tasks:

### Serve the Hugging Face LLM model using vLLM backend
- Text Generation
- Text2Text Generation
- Fill Mask
- Token Classification
- Sequence Classification (Text Classification)

KServe Hugging Face runtime by default uses vLLM to serve the LLM models for faster time-to-first-token(TTFT) and higher token generation throughput than the Hugging Face API. vLLM is implemented with common inference optimization techniques, such as paged attention, continuous batching and an optimized CUDA kernel. If the model is not supported by vLLM, KServe falls back to HuggingFace backend as a failsafe.
For models supported by the `vLLM` backend, please visit the [vLLM Supported Models page](https://docs.vllm.ai/en/latest/models/index.html).


=== "Yaml"
## API Endpoints
Both backends support serving generative models (text generation and text2text generation) using [OpenAI's Completion](https://platform.openai.com/docs/api-reference/completions) and [Chat Completion](https://platform.openai.com/docs/api-reference/chat) APIs.
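For example, a chat completion request against a generative model can be sent as in the sketch below, assuming an `InferenceService` serving a model named `llama3` (as in the text generation example linked later) and that `INGRESS_HOST`, `INGRESS_PORT`, and `SERVICE_HOSTNAME` have already been set:

```bash
# Sketch of an OpenAI chat completion request; the model name "llama3" is assumed
# from the text generation example referenced below.
curl -H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
  http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello!"}], "stream": false}'
```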

```yaml
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-llama3
spec:
predictor:
model:
modelFormat:
name: huggingface
args:
- --model_name=llama3
- --model_id=meta-llama/meta-llama-3-8b-instruct
resources:
limits:
cpu: "6"
memory: 24Gi
nvidia.com/gpu: "1"
requests:
cpu: "6"
memory: 24Gi
nvidia.com/gpu: "1"
EOF
```
!!! note
1. `SAFETENSORS_FAST_GPU` is set by default to improve the model loading performance.
2. `HF_HUB_DISABLE_TELEMETRY` is set by default to disable the telemetry.

#### Perform Model Inference

The first step is to [determine the ingress IP and ports](../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`.

```bash
MODEL_NAME=llama3
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

KServe Hugging Face vLLM runtime supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for inference

Sample OpenAI Completions request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions -d '{"model": "${MODEL_NAME}", "prompt": "<prompt>", "stream":false, "max_tokens": 30 }'

```
!!! success "Expected Output"

```{ .json .no-copy }
{"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"<generated_text>"}],"created":1715353182,"model":"llama3","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}}
```

Sample OpenAI Chat request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions -d '{"model": "${MODEL_NAME}", "messages": [{"role": "user","content": "<message>"}], "stream":false }'

```
!!! success "Expected Output"

```{ .json .no-copy }
{"id":"cmpl-87ee252062934e2f8f918dce011e8484","choices":[{"finish_reason":"length","index":0,"message":{"content":"<generated_response>","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1715353461,"model":"llama3","system_fingerprint":null,"object":"chat.completion","usage":{"completion_tokens":30,"prompt_tokens":3,"total_tokens":33}}
```

### Serve the Hugging Face LLM model using HuggingFace Backend
You can use `--backend=huggingface` argument to perform the inference using Hugging Face API. KServe Hugging Face backend runtime also
supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for inference.

=== "Yaml"
Other tasks such as token classification, sequence classification, and fill mask are served using KServe's [Open Inference Protocol](../../../data_plane/v2_protocol.md) or [V1 API](../../../data_plane/v1_protocol.md).
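For instance, a V1 `:predict` request for one of these tasks follows the pattern sketched below, assuming a fill mask `InferenceService` whose model name is `bert` (as in the fill mask example linked later):

```bash
# Sketch of a V1 inference request; the model name "bert" follows the fill mask example.
curl -H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/bert:predict \
  -d '{"instances": ["The capital of France is [MASK]."]}'
```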

```yaml
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-t5
spec:
predictor:
model:
modelFormat:
name: huggingface
args:
- --model_name=t5
- --model_id=google-t5/t5-small
- --backend=huggingface
resources:
limits:
cpu: "1"
memory: 2Gi
nvidia.com/gpu: "1"
requests:
cpu: "1"
memory: 2Gi
nvidia.com/gpu: "1"
EOF
```
## Examples
The following examples demonstrate how to deploy and perform inference using the Hugging Face runtime with different ML tasks:

#### Perform Model Inference

The first step is to [determine the ingress IP and ports](../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`.

```bash
MODEL_NAME=t5
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-t5 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

Sample OpenAI Completions request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions -d '{"model": "${MODEL_NAME}", "prompt": "translate English to German: The house is wonderful.", "stream":false, "max_tokens": 30 }'

```
!!! success "Expected Output"

```{ .json .no-copy }
{"id":"de53f527-9cb9-47a5-9673-43d180b704f2","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Das Haus ist wunderbar."}],"created":1717998661,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":7,"prompt_tokens":11,"total_tokens":18}}
```

Sample OpenAI Completions streaming request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions -d '{"model": "${MODEL_NAME}", "prompt": "translate English to German: The house is wonderful.", "stream":true, "max_tokens": 30 }'

```
!!! success "Expected Output"

```{ .json .no-copy }
data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Das "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}

data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Haus "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}

data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"ist "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}

data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"wunderbar.</s>"}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}

data: [DONE]
```
- [Text Generation using LLama3](text_generation/README.md)
- [Text2Text Generation using T5](text2text_generation/README.md)
- [Token Classification using BERT](token_classification/README.md)
- [Sequence Classification (Text Classification) using distilBERT](text_classification/README.md)
- [Fill Mask using BERT](fill_mask/README.md)

!!! note
The Hugging Face runtime image has the following environment variables set by default:

1. `SAFETENSORS_FAST_GPU` is set to improve the model loading performance.
2. `HF_HUB_DISABLE_TELEMETRY` is set to disable telemetry.


### Hugging Face Runtime Arguments
## Hugging Face Runtime Arguments

Below you can find an explanation of the command line arguments supported by the Hugging Face runtime. [vLLM backend engine arguments](https://docs.vllm.ai/en/latest/models/engine_args.html) can also be specified on the command line and will be parsed by the Hugging Face runtime.
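For example, a vLLM engine argument can be appended to the predictor args alongside the runtime arguments; the sketch below is illustrative, and the exact flag names (such as `--max_model_len`) should be verified against the vLLM engine arguments page for the vLLM version in use:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3             # name reused from the text generation example
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llama3
        - --model_id=meta-llama/meta-llama-3-8b-instruct
        - --max_model_len=2048           # vLLM engine argument; verify against the vLLM docs
```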

@@ -166,7 +52,8 @@
- `--dtype`: Data type to load the weights in. One of 'auto', 'float16', 'float32', 'bfloat16', 'float', 'half'.
Defaults to float16 for GPU and float32 for CPU systems. 'auto' uses float16 if GPU is available and uses float32 otherwise to ensure consistency between vLLM and HuggingFace backends.
Encoder models default to 'float32'. 'float' is shorthand for 'float32' and 'half' for 'float16'. The rest are as the name reads.
- `--task`: The ML task name. Can be one of 'text_generation', 'text2text_generation', 'fill_mask', 'token_classification', 'sequence_classification'.
If not provided, the model server will try to infer the task from the model architecture.
- `--backend`: The backend to use to load the model. Can be one of 'auto', 'huggingface', 'vllm'.
- `--max_length`: Max sequence length for the tokenizer.
- `--disable_lower_case`: Disable lower case for the tokenizer.
146 changes: 146 additions & 0 deletions docs/modelserving/v1beta1/llm/huggingface/fill_mask/README.md
@@ -0,0 +1,146 @@
# Deploy the BERT model for fill mask task with Hugging Face LLM Serving Runtime
In this example, we demonstrate how to deploy the `BERT` model for the fill mask task from Hugging Face by deploying an `InferenceService` with the [Hugging Face Serving runtime](https://github.com/kserve/kserve/tree/master/python/huggingfaceserver).

## Serve the Hugging Face LLM model using V1 Protocol
First, we will deploy the `BERT` model using the Hugging Face backend with the V1 protocol.

=== "Yaml"

```yaml
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-bert
spec:
predictor:
model:
modelFormat:
name: huggingface
args:
- --model_name=bert
- --model_id=google-bert/bert-base-uncased
resources:
limits:
cpu: "1"
memory: 2Gi
nvidia.com/gpu: "1"
requests:
cpu: "1"
memory: 2Gi
nvidia.com/gpu: "1"
EOF
```

### Check `InferenceService` status.

```bash
kubectl get inferenceservices huggingface-bert
```

!!! success "Expected Output"
```{ .bash .no-copy }
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
huggingface-bert http://huggingface-bert.default.example.com True 100 huggingface-bert-predictor-default-47q2g 7d23h
```

### Perform Model Inference

The first step is to [determine the ingress IP and ports](../../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`.

```bash
MODEL_NAME=bert
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-bert -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

```bash
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"instances": ["The capital of France is [MASK].", "The capital of [MASK] is paris."]}'
```

!!! success "Expected Output"
```{ .json .no-copy .select }
{"predictions":["paris","france"]}
```

## Serve the Hugging Face LLM model using Open Inference Protocol (V2 Protocol)

First, we will deploy the `BERT` model using the Hugging Face backend with the Open Inference Protocol (V2 Protocol).
For this, we need to set the **`protocolVersion` field to `v2`**.

=== "Yaml"

```yaml
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-bert
spec:
predictor:
model:
modelFormat:
name: huggingface
protocolVersion: v2
args:
- --model_name=bert
- --model_id=google-bert/bert-base-uncased
resources:
limits:
cpu: "1"
memory: 2Gi
nvidia.com/gpu: "1"
requests:
cpu: "1"
memory: 2Gi
nvidia.com/gpu: "1"
EOF
```

### Check `InferenceService` status.

```bash
kubectl get inferenceservices huggingface-bert
```

!!! success "Expected Output"
```{ .bash .no-copy }
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
huggingface-bert http://huggingface-bert.default.example.com True 100 huggingface-bert-predictor-default-47q2g 7d23h
```

### Perform Model Inference

The first step is to [determine the ingress IP and ports](../../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`.

```bash
MODEL_NAME=bert
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-bert -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

```bash
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/${MODEL_NAME}/infer \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"inputs": [{"name": "input-0", "shape": [2], "datatype": "BYTES", "data": ["The capital of France is [MASK].", "The capital of [MASK] is paris."]}]}'
```

!!! success "Expected Output"

```{ .json .no-copy }
{
"model_name": "bert",
"model_version": null,
"id": "fd206443-f58c-4c5f-a04b-e6babcf6c854",
"parameters": null,
"outputs": [
{
"name": "output-0",
"shape": [2],
"datatype": "BYTES",
"parameters": null,
"data": ["paris", "france"]
}
]
}
```
