Add examples for all the supported tasks for huggingface runtime (#377)
* Add examples for all the supported tasks for huggingface runtime

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Update docs/modelserving/explainer/alibi/cifar10/README.md

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>
sivanantha321 and yuzisun authored Jun 23, 2024
1 parent 02b3d6d commit 43d43aa
Showing 10 changed files with 890 additions and 146 deletions.
5 changes: 5 additions & 0 deletions docs/modelserving/explainer/alibi/cifar10/README.md
@@ -20,6 +20,11 @@ spec:
limits:
memory: 10Gi
explainer:
containers:
- name: kserve-container
image: kserve/alibi-explainer:v0.12.1
args:
- --model_name=cifar10
alibi:
type: AnchorImages
storageUri: "gs://kfserving-examples/models/tensorflow/cifar/explainer-0.9.1"
177 changes: 32 additions & 145 deletions docs/modelserving/v1beta1/llm/huggingface/README.md
@@ -1,160 +1,46 @@
# Deploy the Llama3 model with Hugging Face LLM Serving Runtime
The Hugging Face serving runtime implements a runtime that can serve Hugging Face models out of the box.
The preprocess and post-process handlers are implemented based on different ML tasks, for example text classification,
# Hugging Face LLM Serving Runtime
The Hugging Face serving runtime implements two backends, namely `Hugging Face` and `vLLM`, that can serve Hugging Face models out of the box.
The preprocess and post-process handlers are already implemented based on different ML tasks, for example text classification,
token-classification, text-generation, text2text-generation, fill-mask.

Based on the performance requirement for large language models(LLM), KServe chooses to run the optimized inference engine [vLLM](https://github.com/vllm-project/vllm) for text generation tasks by default considering its ease-of-use and high performance.
KServe Hugging Face runtime by default uses the [`vLLM`](https://github.com/vllm-project/vllm) backend to serve `text generation` and `text2text generation` LLM models for faster time-to-first-token (TTFT) and higher token generation throughput than the Hugging Face API.
vLLM is implemented with common inference optimization techniques, such as paged attention, continuous batching and an optimized CUDA kernel.
If the model is not supported by the vLLM engine, KServe falls back to the Hugging Face backend as a failsafe.
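As a minimal sketch, the backend can also be pinned explicitly with the `--backend` argument instead of relying on the automatic fallback; the service name, model name, and model id below are placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-example          # hypothetical service name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=example-model         # placeholder model name
        - --model_id=<hf-org>/<hf-model>     # placeholder Hugging Face model id
        - --backend=huggingface              # force the Hugging Face backend instead of vLLM
```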

In this example, we deploy a Llama3 model from Hugging Face by deploying the `InferenceService` with [Hugging Face Serving runtime](https://github.com/kserve/kserve/tree/master/python/huggingfaceserver).
## Supported ML Tasks
The Hugging Face runtime supports the following ML tasks:

### Serve the Hugging Face LLM model using vLLM backend
- Text Generation
- Text2Text Generation
- Fill Mask
- Token Classification
- Sequence Classification (Text Classification)

KServe Hugging Face runtime by default uses vLLM to serve the LLM models for faster time-to-first-token(TTFT) and higher token generation throughput than the Hugging Face API. vLLM is implemented with common inference optimization techniques, such as paged attention, continuous batching and an optimized CUDA kernel. If the model is not supported by vLLM, KServe falls back to HuggingFace backend as a failsafe.
For models supported by the `vLLM` backend, please visit the [vLLM Supported Models page](https://docs.vllm.ai/en/latest/models/index.html).


=== "Yaml"
## API Endpoints
Both backends support serving generative models (text generation and text2text generation) using [OpenAI's Completion](https://platform.openai.com/docs/api-reference/completions) and [Chat Completion](https://platform.openai.com/docs/api-reference/chat) APIs.
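For example, a chat completion request against a generative model can be sent as in the sketch below, assuming an `InferenceService` serving a model named `llama3` (as in the text generation example linked later) and that `INGRESS_HOST`, `INGRESS_PORT`, and `SERVICE_HOSTNAME` have already been set:

```bash
# Sketch of an OpenAI chat completion request; the model name "llama3" is assumed
# from the text generation example referenced below.
curl -H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
  http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions \
  -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello!"}], "stream": false}'
```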

```yaml
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-llama3
spec:
predictor:
model:
modelFormat:
name: huggingface
args:
- --model_name=llama3
- --model_id=meta-llama/meta-llama-3-8b-instruct
resources:
limits:
cpu: "6"
memory: 24Gi
nvidia.com/gpu: "1"
requests:
cpu: "6"
memory: 24Gi
nvidia.com/gpu: "1"
EOF
```
!!! note
1. `SAFETENSORS_FAST_GPU` is set by default to improve the model loading performance.
2. `HF_HUB_DISABLE_TELEMETRY` is set by default to disable the telemetry.

#### Perform Model Inference

The first step is to [determine the ingress IP and ports](../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`.

```bash
MODEL_NAME=llama3
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-llama3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

KServe Hugging Face vLLM runtime supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for inference

Sample OpenAI Completions request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions -d '{"model": "${MODEL_NAME}", "prompt": "<prompt>", "stream":false, "max_tokens": 30 }'

```
!!! success "Expected Output"

```{ .json .no-copy }
{"id":"cmpl-7c654258ab4d4f18b31f47b553439d96","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"<generated_text>"}],"created":1715353182,"model":"llama3","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":26,"prompt_tokens":4,"total_tokens":30}}
```

Sample OpenAI Chat request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/chat/completions -d '{"model": "${MODEL_NAME}", "messages": [{"role": "user","content": "<message>"}], "stream":false }'

```
!!! success "Expected Output"

```{ .json .no-copy }
{"id":"cmpl-87ee252062934e2f8f918dce011e8484","choices":[{"finish_reason":"length","index":0,"message":{"content":"<generated_response>","tool_calls":null,"role":"assistant","function_call":null},"logprobs":null}],"created":1715353461,"model":"llama3","system_fingerprint":null,"object":"chat.completion","usage":{"completion_tokens":30,"prompt_tokens":3,"total_tokens":33}}
```

### Serve the Hugging Face LLM model using HuggingFace Backend
You can use `--backend=huggingface` argument to perform the inference using Hugging Face API. KServe Hugging Face backend runtime also
supports the OpenAI `/v1/completions` and `/v1/chat/completions` endpoints for inference.

=== "Yaml"
Other tasks such as token classification, sequence classification, and fill mask are served using KServe's [Open Inference Protocol](../../../data_plane/v2_protocol.md) or [V1 API](../../../data_plane/v1_protocol.md).
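For instance, a V1 `:predict` request for one of these tasks follows the pattern sketched below, assuming a fill mask `InferenceService` whose model name is `bert` (as in the fill mask example linked later):

```bash
# Sketch of a V1 inference request; the model name "bert" follows the fill mask example.
curl -H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
  http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/bert:predict \
  -d '{"instances": ["The capital of France is [MASK]."]}'
```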

```yaml
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-t5
spec:
predictor:
model:
modelFormat:
name: huggingface
args:
- --model_name=t5
- --model_id=google-t5/t5-small
- --backend=huggingface
resources:
limits:
cpu: "1"
memory: 2Gi
nvidia.com/gpu: "1"
requests:
cpu: "1"
memory: 2Gi
nvidia.com/gpu: "1"
EOF
```
## Examples
The following examples demonstrate how to deploy and perform inference using the Hugging Face runtime with different ML tasks:

#### Perform Model Inference

The first step is to [determine the ingress IP and ports](../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`.

```bash
MODEL_NAME=t5
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-t5 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

Sample OpenAI Completions request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions -d '{"model": "${MODEL_NAME}", "prompt": "translate English to German: The house is wonderful.", "stream":false, "max_tokens": 30 }'

```
!!! success "Expected Output"

```{ .json .no-copy }
{"id":"de53f527-9cb9-47a5-9673-43d180b704f2","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Das Haus ist wunderbar."}],"created":1717998661,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":{"completion_tokens":7,"prompt_tokens":11,"total_tokens":18}}
```

Sample OpenAI Completions streaming request:

```bash
curl -H "content-type:application/json" -H "Host: ${SERVICE_HOSTNAME}" -v http://${INGRESS_HOST}:${INGRESS_PORT}/openai/v1/completions -d '{"model": "${MODEL_NAME}", "prompt": "translate English to German: The house is wonderful.", "stream":true, "max_tokens": 30 }'

```
!!! success "Expected Output"

```{ .json .no-copy }
data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Das "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}

data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"Haus "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}

data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"ist "}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}

data: {"id":"70bb8bea-57d5-4b34-aade-da38970c917c","choices":[{"finish_reason":"length","index":0,"logprobs":null,"text":"wunderbar.</s>"}],"created":1717998767,"model":"t5","system_fingerprint":null,"object":"text_completion","usage":null}

data: [DONE]
```
- [Text Generation using LLama3](text_generation/README.md)
- [Text2Text Generation using T5](text2text_generation/README.md)
- [Token Classification using BERT](token_classification/README.md)
- [Sequence Classification (Text Classification) using distilBERT](text_classification/README.md)
- [Fill Mask using BERT](fill_mask/README.md)

!!! note
The Hugging Face runtime image has the following environment variables set by default:

1. `SAFETENSORS_FAST_GPU` is set to improve the model loading performance.
2. `HF_HUB_DISABLE_TELEMETRY` is set to disable telemetry.


### Hugging Face Runtime Arguments
## Hugging Face Runtime Arguments

Below you can find an explanation of the command line arguments supported by the Hugging Face runtime. [vLLM backend engine arguments](https://docs.vllm.ai/en/latest/models/engine_args.html) can also be specified on the command line and will be parsed by the Hugging Face runtime.
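For example, a vLLM engine argument can be appended to the predictor args alongside the runtime arguments; the sketch below is illustrative, and the exact flag names (such as `--max_model_len`) should be verified against the vLLM engine arguments page for the vLLM version in use:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-llama3             # name reused from the text generation example
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
        - --model_name=llama3
        - --model_id=meta-llama/meta-llama-3-8b-instruct
        - --max_model_len=2048           # vLLM engine argument; verify against the vLLM docs
```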

@@ -166,7 +52,8 @@
- `--dtype`: Data type to load the weights in. One of 'auto', 'float16', 'float32', 'bfloat16', 'float', 'half'.
Defaults to float16 for GPU and float32 for CPU systems. 'auto' uses float16 if GPU is available and uses float32 otherwise to ensure consistency between vLLM and HuggingFace backends.
Encoder models default to 'float32'. 'float' is shorthand for 'float32' and 'half' for 'float16'. The rest are as the name reads.
- `--task`: The ML task name. Can be one of 'text_generation', 'text2text_generation', 'fill_mask', 'token_classification', 'sequence_classification'.
If not provided, the model server will try to infer the task from the model architecture.
- `--backend`: The backend to use to load the model. Can be one of 'auto', 'huggingface', 'vllm'.
- `--max_length`: Max sequence length for the tokenizer.
- `--disable_lower_case`: Disable lower case for the tokenizer.
146 changes: 146 additions & 0 deletions docs/modelserving/v1beta1/llm/huggingface/fill_mask/README.md
@@ -0,0 +1,146 @@
# Deploy the BERT model for fill mask task with Hugging Face LLM Serving Runtime
In this example, we demonstrate how to deploy the `BERT` model for the fill mask task from Hugging Face by deploying an `InferenceService` with the [Hugging Face Serving runtime](https://github.com/kserve/kserve/tree/master/python/huggingfaceserver).

## Serve the Hugging Face LLM model using V1 Protocol
First, we will deploy the `BERT` model using the Hugging Face backend with the V1 protocol.

=== "Yaml"

```yaml
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-bert
spec:
predictor:
model:
modelFormat:
name: huggingface
args:
- --model_name=bert
- --model_id=google-bert/bert-base-uncased
resources:
limits:
cpu: "1"
memory: 2Gi
nvidia.com/gpu: "1"
requests:
cpu: "1"
memory: 2Gi
nvidia.com/gpu: "1"
EOF
```

### Check `InferenceService` status.

```bash
kubectl get inferenceservices huggingface-bert
```

!!! success "Expected Output"
```{ .bash .no-copy }
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
huggingface-bert http://huggingface-bert.default.example.com True 100 huggingface-bert-predictor-default-47q2g 7d23h
```

### Perform Model Inference

The first step is to [determine the ingress IP and ports](../../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`.

```bash
MODEL_NAME=bert
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-bert -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

```bash
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/${MODEL_NAME}:predict \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"instances": ["The capital of France is [MASK].", "The capital of [MASK] is paris."]}'
```

!!! success "Expected Output"
```{ .json .no-copy .select }
{"predictions":["paris","france"]}
```

## Serve the Hugging Face LLM model using Open Inference Protocol (V2 Protocol)

First, we will deploy the `BERT` model using the Hugging Face backend with the Open Inference Protocol (V2 Protocol).
For this, we need to set the **`protocolVersion` field to `v2`**.

=== "Yaml"

```yaml
kubectl apply -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-bert
spec:
predictor:
model:
modelFormat:
name: huggingface
protocolVersion: v2
args:
- --model_name=bert
- --model_id=google-bert/bert-base-uncased
resources:
limits:
cpu: "1"
memory: 2Gi
nvidia.com/gpu: "1"
requests:
cpu: "1"
memory: 2Gi
nvidia.com/gpu: "1"
EOF
```

### Check `InferenceService` status.

```bash
kubectl get inferenceservices huggingface-bert
```

!!! success "Expected Output"
```{ .bash .no-copy }
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
huggingface-bert http://huggingface-bert.default.example.com True 100 huggingface-bert-predictor-default-47q2g 7d23h
```

### Perform Model Inference

The first step is to [determine the ingress IP and ports](../../../../../get_started/first_isvc.md#4-determine-the-ingress-ip-and-ports) and set `INGRESS_HOST` and `INGRESS_PORT`.

```bash
MODEL_NAME=bert
SERVICE_HOSTNAME=$(kubectl get inferenceservice huggingface-bert -o jsonpath='{.status.url}' | cut -d "/" -f 3)
```

```bash
curl -v http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/${MODEL_NAME}/infer \
-H "content-type: application/json" -H "Host: ${SERVICE_HOSTNAME}" \
-d '{"inputs": [{"name": "input-0", "shape": [2], "datatype": "BYTES", "data": ["The capital of France is [MASK].", "The capital of [MASK] is paris."]}]}'
```

!!! success "Expected Output"

```{ .json .no-copy }
{
"model_name": "bert",
"model_version": null,
"id": "fd206443-f58c-4c5f-a04b-e6babcf6c854",
"parameters": null,
"outputs": [
{
"name": "output-0",
"shape": [2],
"datatype": "BYTES",
"parameters": null,
"data": ["paris", "france"]
}
]
}
```
