diff --git a/docs/admin/serverless/kourier_networking/README.md b/docs/admin/serverless/kourier_networking/README.md
index da140c5d4..195a993bc 100644
--- a/docs/admin/serverless/kourier_networking/README.md
+++ b/docs/admin/serverless/kourier_networking/README.md
@@ -1,5 +1,5 @@
 # Deploy InferenceService with Alternative Networking Layer
-KServe v0.9 and prior versions create the top level `Istio Virtual Service` for routing to `InferenceService` components based on the virtual host or path based routing.
+KServe creates the top level `Istio Virtual Service` for routing to `InferenceService` components based on virtual-host or path-based routing.
 Now KServe provides an option for disabling the top level virtual service to allow configuring other networking layers Knative supports.
 For example, [Kourier](https://developers.redhat.com/blog/2020/06/30/kourier-a-lightweight-knative-serving-ingress) is an alternative networking layer and the following steps show how you can deploy KServe with `Kourier`.
 
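For reviewers' context, the "option for disabling the top level virtual service" referenced in this hunk is exposed through KServe's `inferenceservice-config` ConfigMap. A minimal sketch of the relevant ingress setting (the exact field layout is an assumption here and should be verified against the target KServe release):

```yaml
# Sketch only: in real deployments the "ingress" data key holds a JSON blob;
# "disableIstioVirtualHost" is the flag that turns off the top level
# Istio Virtual Service.
apiVersion: v1
kind: ConfigMap
metadata:
  name: inferenceservice-config
  namespace: kserve
data:
  ingress: |-
    {
      "disableIstioVirtualHost": true
    }
```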
diff --git a/docs/get_started/README.md b/docs/get_started/README.md
index 5d88b5138..eb1b4ce19 100644
--- a/docs/get_started/README.md
+++ b/docs/get_started/README.md
@@ -19,6 +19,6 @@ The [Kubernetes CLI (`kubectl`)](https://kubernetes.io/docs/tasks/tools/install-
 You can get started with a local deployment of KServe by using _KServe Quick installation script on Kind_:
 
 ```bash
-curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.9/hack/quick_install.sh" | bash
+curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.10/hack/quick_install.sh" | bash
 ```
diff --git a/docs/modelserving/data_plane/data_plane.md b/docs/modelserving/data_plane/data_plane.md
index 5de85ad5e..6e6624e11 100644
--- a/docs/modelserving/data_plane/data_plane.md
+++ b/docs/modelserving/data_plane/data_plane.md
@@ -8,8 +8,8 @@ By implementing this protocol both inference clients and servers will increase t
 portability by operating seamlessly on platforms that have standardized around this API.
 Kserve's inference protocol is endorsed by NVIDIA Triton Inference Server, TensorFlow Serving, and ONNX Runtime Server.
 
-![Data Plane](../images/dataplane.jpg)
-Note: Protocol V2 uses /infer instead of :predict
+![Data Plane](../../images/dataplane.jpg)
+
 Note: Protocol V2 uses /infer instead of :predict
 
 ### Concepts
 
 **Component**: Each endpoint is composed of multiple components: "predictor", "explainer", and "transformer". The only required component is the predictor, which is the core of the system. As KServe evolves, we plan to increase the number of supported components to enable use cases like Outlier Detection.
 
@@ -25,7 +25,7 @@ Note: Protocol V2 uses /infer instead of :predict
 
 Kserve supports two versions of its data plane, V1 and V2. V1 protocol offers a standard prediction workflow with HTTP/REST. The second version of the data-plane protocol addresses several issues found with the V1 data-plane protocol, including performance and generality across a large number of model frameworks and servers. Protocol V2 expands the capabilities of V1 by adding gRPC APIs.
 
-### Main changes between V1 & V2 dataplane
+### Main changes
 
 * V2 does not currently support the explain endpoint
 * V2 added Server Readiness/Liveness/Metadata endpoints
@@ -39,23 +39,22 @@ Kserve supports two versions of its data plane,
 V1 protocol offers a
 
 | API | Verb | Path |
 | ------------- | ------------- | ------------- |
 | List Models | GET | /v1/models |
-| Model Ready | GET | /v1/models/<model_name> |
-| Predict | POST | /v1/models/<model_name>:predict |
-| Explain | POST | /v1/models/<model_name>:explain |
+| Model Ready | GET | /v1/models/\<model_name\> |
+| Predict | POST | /v1/models/\<model_name\>:predict |
+| Explain | POST | /v1/models/\<model_name\>:explain |
 
 ### V2 APIs
 
 | API | Verb | Path |
 | ------------- | ------------- | ------------- |
-| Inference | POST | v2/models/<model_name>[/versions/<model_version>]/infer |
-
-| Model Metadata | GET | v2/models/<model_name>[/versions/<model_version>] |
+| Inference | POST | v2/models/\<model_name\>[/versions/\<model_version\>]/infer |
+| Model Metadata | GET | v2/models/\<model_name\>[/versions/\<model_version\>] |
 | Server Readiness | GET | v2/health/ready |
 | Server Liveness | GET | v2/health/live |
 | Server Metadata | GET | v2 |
-| Inference | POST | v2/models/<model_name>[/versions/<model_version>]/infer |
+
 
 ** path contents in `[]` are optional
 
-Please see [V1 Protocol](/docs/modelserving/data_plane/v1_protocol.md) and [V2 Protocol](/docs/modelserving/data_plane/v2_protocol.md) documentation for more information.
+Please see [V1 Protocol](./v1_protocol.md) and [V2 Protocol](./v2_protocol.md) documentation for more information.
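The two API tables above differ mainly in path shape and payload envelope. A small Python sketch contrasting how a client would assemble V1 and V2 request paths and bodies (the model name, input name, and tensor values here are hypothetical placeholders):

```python
import json

# Hypothetical model name, used only for illustration.
model_name = "mnist"

# V1 uses a ":predict" verb suffix; V2 uses an "/infer" sub-path.
v1_predict_path = f"/v1/models/{model_name}:predict"
v2_infer_path = f"/v2/models/{model_name}/infer"

# V1 wraps inputs in "instances"; V2 sends named, typed, shaped tensors.
v1_body = json.dumps({"instances": [[1.0, 2.0]]})
v2_body = json.dumps({
    "inputs": [
        {"name": "input-0", "shape": [1, 2], "datatype": "FP32",
         "data": [1.0, 2.0]}
    ]
})

print(v1_predict_path)  # /v1/models/mnist:predict
print(v2_infer_path)    # /v2/models/mnist/infer
```

Either body would then be POSTed to the corresponding path on the serving host.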
diff --git a/docs/modelserving/data_plane/v1_protocol.md b/docs/modelserving/data_plane/v1_protocol.md
index 7d3e07508..7296b3abb 100644
--- a/docs/modelserving/data_plane/v1_protocol.md
+++ b/docs/modelserving/data_plane/v1_protocol.md
@@ -3,10 +3,10 @@ KServe's V1 protocol offers a standardized prediction workflow across all model
 
 | API | Verb | Path | Request Payload | Response Payload |
 | ------------- | ------------- | ------------- | ------------- | ------------- |
-| List Models | GET | /v1/models | | {"models": []} |
-| Model Ready| GET | /v1/models/<model_name> | | {"name": <model_name>,"ready": $bool} |
-| Predict | POST | /v1/models/<model_name>:predict | {"instances": []} ** | {"predictions": []} |
-| Explain | POST | /v1/models/<model_name>:explain | {"instances": []} **| {"predictions": [], "explanations": []} | |
+| List Models | GET | /v1/models | | {"models": \[\<model_name\>\]} |
+| Model Ready| GET | /v1/models/\<model_name\> | | {"name": \<model_name\>, "ready": $bool} |
+| Predict | POST | /v1/models/\<model_name\>:predict | {"instances": []} ** | {"predictions": []} |
+| Explain | POST | /v1/models/\<model_name\>:explain | {"instances": []} ** | {"predictions": [], "explanations": []} |
 
 ** = payload is optional
 
@@ -21,6 +21,6 @@ TODO: make sure list models/model ready is correct.
 | Predict | The "predict" API performs inference on a model. The response is the prediction result. All InferenceServices speak the [Tensorflow V1 HTTP API](https://www.tensorflow.org/tfx/serving/api_rest#predict_api). |
 | Explain | The "explain" API is an optional component that provides model explanations in addition to predictions. The standardized explainer interface is identical to the Tensorflow V1 HTTP API with the addition of an ":explain" verb.|
 | Model Ready | The “model ready” health API indicates if a specific model is ready for inferencing. If the model(s) is downloaded and ready to serve requests, the model ready endpoint returns the list of accessible \<model_name\>(s). |
+| List Models | The "models" API exposes a list of models in the model registry. |
-## Examples
-TODO
+
diff --git a/docs/modelserving/data_plane/v2_protocol.md b/docs/modelserving/data_plane/v2_protocol.md
index a3bff2445..c76dd8672 100644
--- a/docs/modelserving/data_plane/v2_protocol.md
+++ b/docs/modelserving/data_plane/v2_protocol.md
@@ -1,10 +1,10 @@
 ## Open Inference Protocol / V2 Protocol
 
-**For an inference server to be compliant with this protocol the server must implement the health, metadata, and inference V2 APIs**. Optional features that are explicitly noted are not required. A compliant inference server may choose to implement the HTTP/REST API and/or the GRPC API.
+**For an inference server to be compliant with this protocol the server must implement the health, metadata, and inference V2 APIs**. Optional features that are explicitly noted are not required. A compliant inference server may choose to implement the [HTTP/REST API](#httprest) and/or the [GRPC API](#grpc).
 
 The V2 protocol supports an extension mechanism as a required part of the API, but this document does not propose any specific extensions. Any specific extensions will be proposed separately.
 
-Note: For the below API descriptions, all strings in all contexts are case-sensitive.
+Note: For all API descriptions on this page, all strings in all contexts are case-sensitive.
 
@@ -18,15 +18,16 @@ language independent.
 
 In all JSON schemas shown in this document $number, $string, $boolean, $object and $array refer to the fundamental JSON types. #optional indicates an optional JSON field.
 
+See also: The HTTP/REST endpoints are defined in [rest_predict_v2.yaml](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/rest_predict_v2.yaml)
 
 | API | Verb | Path | Request Payload | Response Payload |
 | ------------- | ------------- | ------------- | ------------- | ------------- |
-| Inference | POST | v2/models/<model_name>[/versions/<model_version>]/infer | [$inference_request](#inference-request-json-object) | [$inference_response](#inference-response-json-object) |
-
-| Model Metadata | GET | v2/models/<model_name>[/versions/<model_version>] | | [$metadata_model_response](#model-metadata-response-json-object) |
-| Server Ready | GET | v2/health/ready | [$ready_server_response](#server-ready-response-json-object) |
-| Server Live | GET | v2/health/live | [$live_server_response](#server-live-response-json-objet)|
+| Inference | POST | v2/models/\<model_name\>[/versions/\<model_version\>]/infer | [$inference_request](#inference-request-json-object) | [$inference_response](#inference-response-json-object) |
+| Model Metadata | GET | v2/models/\<model_name\>[/versions/\<model_version\>] | | [$metadata_model_response](#model-metadata-response-json-object) |
+| Server Ready | GET | v2/health/ready | | [$ready_server_response](#server-ready-response-json-object) |
+| Server Live | GET | v2/health/live | | [$live_server_response](#server-live-response-json-object)|
 | Server Metadata | GET | v2 | | [$metadata_server_response](#server-metadata-response-json-object) |
+
 
 ** path contents in `[]` are optional
 
@@ -37,17 +38,16 @@ For example, if a model does not implement a version, the Model Metadata request
 
-
- API Definitions
+### **API Definitions**
 
 | API | Definition |
 | --- | --- |
 | Inference | The `/infer` endpoint performs inference on a model. The response is the prediction result.|
-
 | Model Metadata | The "model metadata" API is a per-model endpoint that returns details about the model passed in the path. |
 | Server Ready | The “server ready” health API indicates if all the models are ready for inferencing. The “server ready” health API can be used directly to implement the Kubernetes readinessProbe. |
 | Server Live | The “server live” health API indicates if the inference server is able to receive and respond to metadata and inference requests. The “server live” API can be used directly to implement the Kubernetes livenessProbe. |
 | Server Metadata | The "server metadata" API returns details describing the server. |
+
 
 ### Health/Readiness/Liveness Probes
 
@@ -56,11 +56,7 @@ The Model Readiness probe answers the question "Did the model download and is it able to
 To read more about liveness and readiness probe concepts, visit the [Configure Liveness, Readiness and Startup Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) Kubernetes documentation.
 
-
-
- Payload Contents
-
+### **Payload Contents**
 
-### **Server Ready**
+### Server Ready
 
 The server ready endpoint returns the readiness probe response for the server.
 
@@ -86,7 +82,7 @@ The server ready endpoint returns the readiness probe response for the server.
 
 ---
 
-### **Server Live**
+### Server Live
 
 The server live endpoint returns the liveness probe response for the server.
 
@@ -99,7 +95,7 @@ The server live endpoint returns the liveness probe response for the server.
 
 ---
 
-### **Server Metadata**
+### Server Metadata
 
 The server metadata endpoint provides information about the server. A server metadata request is made with an HTTP GET to a server metadata
@@ -152,7 +148,7 @@ based on its own policies or return an error.
 
 ---
 
-### **Model Metadata**
+### Model Metadata
 
 The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata
@@ -221,7 +217,7 @@ status (typically 400). The HTTP body must contain the
 
 ---
 
-### **Inference**
+### Inference
 
 An inference request is made with an HTTP POST to an inference endpoint. In the request the HTTP body contains the
@@ -332,11 +328,10 @@ code. The inference response object, identified as
 
 ---
 
-
-
- Inference Request Examples
-
-## Inference Request Examples
+### **Inference Request Examples**
 
 The following example shows an inference request to a model with two inputs and one output. The HTTP Content-Length header gives the size
@@ -388,8 +383,6 @@ type FP32 the following response would be returned.
 
     ]
 }
 
-
-
 
 ## gRPC
 
@@ -397,13 +390,11 @@ The GRPC API closely follows the concepts defined in the
 [HTTP/REST](#httprest) API. A compliant server must implement the health, metadata, and inference APIs described in this section.
 
-All strings in all contexts are case-sensitive.
-
 | API | rpc Endpoint | Request Message | Response Message |
 | --- | --- | --- | ---|
 | Inference | [ModelInfer](#inference) | ModelInferRequest | ModelInferResponse |
-| Model Ready | [ModelReady](#model-ready) | ModelReadyRequest | ModelReadyResponse |
+| Model Ready | [ModelReady](#model-ready) | ModelReadyRequest | ModelReadyResponse |
 | Model Metadata | [ModelMetadata](#model-metadata)| ModelMetadataRequest | ModelMetadataResponse |
 | Server Ready | [ServerReady](#server-ready) | ServerReadyRequest | ServerReadyResponse |
 | Server Live | [ServerLive](#server-live) | ServerLiveRequest | ServerLiveResponse |
@@ -413,8 +404,7 @@ For more detailed information on each endpoint and its contents, see `API Defini
 
 See also: The gRPC endpoints, request/response messages and contents are defined in [grpc_predict_v2.proto](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/grpc_predict_v2.proto)
 
-
- API Definitions
+### **API Definitions**
 
 The GRPC definition of the service is:
 
@@ -443,12 +433,9 @@ The GRPC definition of the service is:
       rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse) {}
     }
 
-
+### **Message Contents**
 
-
- Message Contents
-
-### **Health**
+### Health
 
 A health request is made using the ServerLive, ServerReady, or ModelReady endpoint. For each of these endpoints errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure.
@@ -501,7 +488,11 @@ inferencing. The request and response messages for ModelReady are:
       bool ready = 1;
     }
 
-#### Server Metadata
+---
+
+### Metadata
+
+#### Server Metadata
 
 The ServerMetadata API provides information about the server. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ServerMetadata are:
@@ -568,7 +559,27 @@ request and response messages for ModelMetadata are:
       repeated TensorMetadata outputs = 5;
     }
 
-### **Inference**
+#### Platforms
+
+A platform is a string indicating a DL/ML framework or backend. Platform is returned as part of the response to a [Model Metadata](#model_metadata) request but is information only. The proposed inference APIs are generic relative to the DL/ML framework used by a model and so a client does not need to know the platform of a given model to use the API. Platform names use the format “\<project\>_\<format\>”. The following platform names are allowed:
+
+* tensorrt_plan : A TensorRT model encoded as a serialized engine or “plan”.
+* tensorflow_graphdef : A TensorFlow model encoded as a GraphDef.
+* tensorflow_savedmodel : A TensorFlow model encoded as a SavedModel.
+* onnx_onnxv1 : An ONNX model encoded for ONNX Runtime.
+* pytorch_torchscript : A PyTorch model encoded as TorchScript.
+* mxnet_mxnet : An MXNet model.
+* caffe2_netdef : A Caffe2 model encoded as a NetDef.
+
+---
+
+### Inference
 
 The ModelInfer API performs inference using the specified model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate
@@ -700,7 +711,7 @@ failure. The request and response messages for ModelInfer are:
       repeated bytes raw_output_contents = 6;
     }
 
-### **Parameters**
+#### Parameters
 
 The Parameters message describes a “name”/”value” pair, where the “name” is the name of the parameter and the “value” is a boolean,
@@ -728,8 +739,9 @@ Currently no parameters are defined. As required a future proposal may define on
       }
     }
 
+---
 
-### **Tensor Data**
+### Tensor Data
 
 In all representations tensor data must be flattened to a one-dimensional, row-major order of the tensor elements. Element
@@ -818,25 +830,4 @@ of each type, in bytes.
 
 | FP64 | 8 |
 | BYTES | Variable (max 2<sup>32</sup>) |
 
-
-
-### **Platforms**
-
-A platform is a string indicating a DL/ML framework or backend. Platform is returned as part of the response to a [Model Metadata](#model_metadata) request but is information only. The proposed inference APIs are generic relative to the DL/ML framework used by a model and so a client does not need to know the platform of a given model to use the API. Platform names use the format “<project>_<format>”. The following platform names are allowed:
-
-* tensorrt_plan : A TensorRT model encoded as a serialized engine or “plan”.
-* tensorflow_graphdef : A TensorFlow model encoded as a GraphDef.
-* tensorflow_savedmodel : A TensorFlow model encoded as a SavedModel.
-* onnx_onnxv1 : A ONNX model encoded for ONNX Runtime.
-* pytorch_torchscript : A PyTorch model encoded as TorchScript.
-* mxnet_mxnet: An MXNet model
-* caffe2_netdef : A Caffe2 model encoded as a NetDef.
-
-
+---
diff --git a/docs/modelserving/v1beta1/triton/torchscript/README.md b/docs/modelserving/v1beta1/triton/torchscript/README.md
index 2a35318c7..43b394117 100644
--- a/docs/modelserving/v1beta1/triton/torchscript/README.md
+++ b/docs/modelserving/v1beta1/triton/torchscript/README.md
@@ -376,7 +376,7 @@ class ImageTransformerV2(kserve.Model):
         return {output["name"]: np.array(output["data"]).reshape(output["shape"]).tolist() for output in results["outputs"]}
 ```
 
-Please find [the code example](https://github.com/kserve/kserve/tree/release-0.9/docs/samples/v1beta1/triton/torchscript/image_transformer_v2) and [Dockerfile](https://github.com/kserve/kserve/blob/release-0.9/docs/samples/v1beta1/triton/torchscript/transformer.Dockerfile).
+Please find [the code example](https://github.com/kserve/kserve/tree/release-0.10/docs/samples/v1beta1/triton/torchscript/image_transformer_v2) and [Dockerfile](https://github.com/kserve/kserve/blob/release-0.10/docs/samples/v1beta1/triton/torchscript/transformer.Dockerfile).
 
 ### Build Transformer docker image
 ```
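The `ImageTransformerV2` postprocess line shown in the last hunk relies on the V2 tensor-data rule described earlier in this patch: tensor contents travel flattened in row-major order alongside their shape. A self-contained sketch of that decoding step (the response dict below is fabricated for illustration):

```python
import numpy as np

# Fabricated V2-style response: "data" holds the flattened, row-major tensor.
results = {
    "outputs": [
        {"name": "OUTPUT__0", "shape": [2, 3], "datatype": "FP32",
         "data": [0, 1, 2, 3, 4, 5]}
    ]
}

# Same reshape idiom as the transformer's postprocess: restore the nested
# structure from the flat buffer using the advertised shape.
decoded = {
    output["name"]: np.array(output["data"]).reshape(output["shape"]).tolist()
    for output in results["outputs"]
}

print(decoded)  # {'OUTPUT__0': [[0, 1, 2], [3, 4, 5]]}
```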