diff --git a/docs/admin/serverless/kourier_networking/README.md b/docs/admin/serverless/kourier_networking/README.md index da140c5d4..195a993bc 100644 --- a/docs/admin/serverless/kourier_networking/README.md +++ b/docs/admin/serverless/kourier_networking/README.md @@ -1,5 +1,5 @@ # Deploy InferenceService with Alternative Networking Layer -KServe v0.9 and prior versions create the top level `Istio Virtual Service` for routing to `InferenceService` components based on the virtual host or path based routing. +KServe creates the top level `Istio Virtual Service` for routing to `InferenceService` components based on the virtual host or path based routing. Now KServe provides an option for disabling the top level virtual service to allow configuring other networking layers Knative supports. For example, [Kourier](https://developers.redhat.com/blog/2020/06/30/kourier-a-lightweight-knative-serving-ingress) is an alternative networking layer and the following steps show how you can deploy KServe with `Kourier`. diff --git a/docs/get_started/README.md b/docs/get_started/README.md index 5d88b5138..eb1b4ce19 100644 --- a/docs/get_started/README.md +++ b/docs/get_started/README.md @@ -19,6 +19,6 @@ The [Kubernetes CLI (`kubectl`)](https://kubernetes.io/docs/tasks/tools/install- You can get started with a local deployment of KServe by using _KServe Quick installation script on Kind_: ```bash -curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.9/hack/quick_install.sh" | bash +curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.10/hack/quick_install.sh" | bash ``` diff --git a/docs/modelserving/data_plane.md b/docs/modelserving/data_plane.md deleted file mode 100644 index 5367380a9..000000000 --- a/docs/modelserving/data_plane.md +++ /dev/null @@ -1,36 +0,0 @@ -# Data Plane -The InferenceService Data Plane architecture consists of a static graph of components which coordinate requests for a single model. Advanced features such as Ensembling, A/B testing, and Multi-Arm-Bandits should compose InferenceServices together. - -![Data Plane](../images/dataplane.jpg) - -### Concepts -**Component**: Each endpoint is composed of multiple components: "predictor", "explainer", and "transformer". The only required component is the predictor, which is the core of the system. As KServe evolves, we plan to increase the number of supported components to enable use cases like Outlier Detection. - -**Predictor**: The predictor is the workhorse of the InferenceService. It is simply a model and a model server that makes it available at a network endpoint. - -**Explainer**: The explainer enables an optional alternate data plane that provides model explanations in addition to predictions. Users may define their own explanation container, which configures with relevant environment variables like prediction endpoint. For common use cases, KServe provides out-of-the-box explainers like Alibi. - -**Transformer**: The transformer enables users to define a pre and post processing step before the prediction and explanation workflows. Like the explainer, it is configured with relevant environment variables too. For common use cases, KServe provides out-of-the-box transformers like Feast. - -### Data Plane (V1) -KServe has a standardized prediction workflow across all model frameworks. 
- -| API | Verb | Path | Payload | -| ------------- | ------------- | ------------- | ------------- | -| Readiness| GET | /v1/models/ | Response:{"name": , "ready": true/false} | -| Predict | POST | /v1/models/:predict | Request:{"instances": []} Response:{"predictions": []} | -| Explain | POST | /v1/models/:explain | Request:{"instances": []} Response:{"predictions": [], "explainations": []} || - -#### Predict -All InferenceServices speak the [Tensorflow V1 HTTP API](https://www.tensorflow.org/tfx/serving/api_rest#predict_api). - -Note: Only Tensorflow models support the fields "signature_name" and "inputs". - -#### Explain -All InferenceServices that are deployed with an Explainer support a standardized explanation API. This interface is identical to the Tensorflow V1 HTTP API with the addition of an ":explain" verb. - -### Data Plane (V2) -The second version of the data-plane protocol addresses several issues found with the V1 data-plane protocol, including performance and generality across a large number of model frameworks and servers. - -#### Predict -The V2 protocol proposes both HTTP/REST and GRPC APIs. See the [complete specification](./inference_api.md) for more information. diff --git a/docs/modelserving/data_plane/data_plane.md b/docs/modelserving/data_plane/data_plane.md new file mode 100644 index 000000000..40b79b456 --- /dev/null +++ b/docs/modelserving/data_plane/data_plane.md @@ -0,0 +1,60 @@ +# Data Plane +The InferenceService Data Plane architecture consists of a static graph of components which coordinate requests for a single model. Advanced features such as Ensembling, A/B testing, and Multi-Arm-Bandits should compose InferenceServices together. + +## Introduction +KServe's data plane protocol introduces an inference API that is independent of any specific ML/DL framework and model server. This allows for quick iterations and consistency across Inference Services and supports both easy-to-use and high-performance use cases. + +By implementing this protocol both inference clients and servers will increase their utility and +portability by operating seamlessly on platforms that have standardized around this API. Kserve's inference protocol is endorsed by NVIDIA +Triton Inference Server, TensorFlow Serving, and TorchServe. + +![Data Plane](../../images/dataplane.jpg) +
Note: Protocol V2 uses /infer instead of :predict + +### Concepts +**Component**: Each endpoint is composed of multiple components: "predictor", "explainer", and "transformer". The only required component is the predictor, which is the core of the system. As KServe evolves, we plan to increase the number of supported components to enable use cases like Outlier Detection. + +**Predictor**: The predictor is the workhorse of the InferenceService. It is simply a model and a model server that makes it available at a network endpoint. + +**Explainer**: The explainer enables an optional alternate data plane that provides model explanations in addition to predictions. Users may define their own explanation container, which is configured with relevant environment variables such as the prediction endpoint. For common use cases, KServe provides out-of-the-box explainers like Alibi. + +**Transformer**: The transformer enables users to define pre- and post-processing steps before the prediction and explanation workflows. Like the explainer, it is also configured with relevant environment variables. For common use cases, KServe provides out-of-the-box transformers like Feast. + + +## Data Plane V1 & V2 + +KServe supports two versions of its data plane, V1 and V2. The V1 protocol offers a standard prediction workflow with HTTP/REST. The second version of the data-plane protocol addresses several issues found with the V1 data-plane protocol, including performance and generality across a large number of model frameworks and servers. Protocol V2 expands the capabilities of V1 by adding gRPC APIs. + +### Main changes + +* V2 does not currently support the explain endpoint +* V2 added Server Readiness/Liveness/Metadata endpoints +* V2 endpoint paths contain `/` instead of `:` +* V2 renamed the `:predict` endpoint to `/infer` +* V2 allows for model versions in the request path (optional) + + +### V1 APIs + +| API | Verb | Path | +| ------------- | ------------- | ------------- | +| List Models | GET | /v1/models | +| Model Ready | GET | /v1/models/\<model_name\> | +| Predict | POST | /v1/models/\<model_name\>:predict | +| Explain | POST | /v1/models/\<model_name\>:explain | + +### V2 APIs + +| API | Verb | Path | +| ------------- | ------------- | ------------- | +| Inference | POST | v2/models/\<model_name\>[/versions/\<model_version\>]/infer | +| Model Metadata | GET | v2/models/\<model_name\>[/versions/\<model_version\>] | +| Server Readiness | GET | v2/health/ready | +| Server Liveness | GET | v2/health/live | +| Server Metadata | GET | v2 | +| Model Readiness | GET | v2/models/\<model_name\>[/versions/\<model_version\>]/ready | + +** path contents in `[]` are optional + +Please see [V1 Protocol](./v1_protocol.md) and [V2 Protocol](./v2_protocol.md) documentation for more information. + diff --git a/docs/modelserving/data_plane/v1_protocol.md b/docs/modelserving/data_plane/v1_protocol.md new file mode 100644 index 000000000..a94f7a718 --- /dev/null +++ b/docs/modelserving/data_plane/v1_protocol.md @@ -0,0 +1,25 @@ +# Data Plane (V1) +KServe's V1 protocol offers a standardized prediction workflow across all model frameworks. This protocol version is still supported, but it is recommended that users migrate to the [V2 protocol](./v2_protocol.md) for better performance and standardization among serving runtimes. However, if a use case requires a more flexible schema than the V2 protocol provides, the V1 protocol is still an option.
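For orientation, a minimal V1 prediction call might look like the sketch below; the service name `sklearn-iris`, the host header, the ingress address, and the input values are placeholders for illustration, and the full set of V1 endpoints is listed in the table that follows.

```bash
# Sketch of a V1 predict request routed through the cluster ingress.
# INGRESS_HOST/INGRESS_PORT and the Host header depend on your ingress setup.
curl -v \
  -H "Host: sklearn-iris.default.example.com" \
  -H "Content-Type: application/json" \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}' \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict"
# The server responds with a JSON body of the form {"predictions": [...]}.
```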
+ +| API | Verb | Path | Request Payload | Response Payload | +| ------------- | ------------- | ------------- | ------------- | ------------- | +| List Models | GET | /v1/models | | {"models": \[\<model_name\>\]} | +| Model Ready | GET | /v1/models/\<model_name\> | | {"name": \<model_name\>, "ready": $bool} | +| Predict | POST | /v1/models/\<model_name\>:predict | {"instances": []} ** | {"predictions": []} | +| Explain | POST | /v1/models/\<model_name\>:explain | {"instances": []} ** | {"predictions": [], "explanations": []} | + +** = payload is optional + +Note: The response payload in the V1 protocol is not strictly enforced. A custom server may define and return its own response payload. We encourage using the KServe-defined response payload for consistency. + + +## API Definitions + +| API | Definition | +| --- | --- | +| Predict | The "predict" API performs inference on a model. The response is the prediction result. All InferenceServices speak the [Tensorflow V1 HTTP API](https://www.tensorflow.org/tfx/serving/api_rest#predict_api). | +| Explain | The "explain" API is an optional component that provides model explanations in addition to predictions. The standardized explainer interface is identical to the Tensorflow V1 HTTP API with the addition of an ":explain" verb. | +| Model Ready | The “model ready” health API indicates if a specific model is ready for inferencing. If the model(s) is downloaded and ready to serve requests, the model ready endpoint returns the list of accessible model name(s). | +| List Models | The "models" API exposes a list of models in the model registry. | + + diff --git a/docs/modelserving/inference_api.md b/docs/modelserving/data_plane/v2_protocol.md similarity index 74% rename from docs/modelserving/inference_api.md rename to docs/modelserving/data_plane/v2_protocol.md index 93b8e1850..61c55bd79 100644 --- a/docs/modelserving/inference_api.md +++ b/docs/modelserving/data_plane/v2_protocol.md @@ -1,87 +1,103 @@ -# Predict Protocol - Version 2 - -This document proposes a predict/inference API independent of any -specific ML/DL framework and model server. The proposed APIs are -able to support both easy-to-use and high-performance use cases. -By implementing this protocol both -inference clients and servers will increase their utility and -portability by being able to operate seamlessly on platforms that have -standardized around this API. This protocol is endorsed by NVIDIA -Triton Inference Server, TensorFlow Serving, and ONNX Runtime -Server. - -For an inference server to be compliant with this protocol the server -must implement all APIs described below, except where an optional -feature is explicitly noted. A compliant inference server may choose -to implement either or both of the HTTP/REST API and the GRPC API. - -The protocol supports an extension mechanism as a required part of the -API, but this document does not propose any specific extensions. Any -specific extensions will be proposed separately. +## Open Inference Protocol (V2 Inference Protocol) -## HTTP/REST +**For an inference server to be compliant with this protocol the server must implement the health, metadata, and inference V2 APIs**. +Optional features that are explicitly noted are not required. A compliant inference server may choose to implement the [HTTP/REST API](#httprest) and/or the [GRPC API](#grpc).
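As a rough sketch, the required health and metadata APIs of a running HTTP/REST endpoint can be spot-checked with plain `curl`; the host header and ingress address below are placeholders, not part of the protocol.

```bash
# Liveness and readiness: a 200 status indicates a healthy server
# (see Payload Contents below for the response schemas).
curl -s -H "Host: my-model.default.example.com" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v2/health/live"
curl -s -H "Host: my-model.default.example.com" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v2/health/ready"

# Server metadata: a compliant server returns its name, version, and supported extensions.
curl -s -H "Host: my-model.default.example.com" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v2"
```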
+ +Check the [model serving runtime table](../v1beta1/serving_runtime.md) / the `protocolVersion` field in the [runtime YAML](https://github.com/kserve/kserve/tree/master/config/runtimes) to ensure that the V2 protocol is supported for the model serving runtime that you are using. + +Note: For all API descriptions on this page, all strings in all contexts are case-sensitive. The V2 protocol supports an extension mechanism as a required part of the API, but this document does not propose any specific extensions. Any specific extensions will be proposed separately. + +### Note on changes between V1 & V2 -A compliant server must implement the health, metadata, and inference -APIs described in this section. +The V2 protocol does not currently support the explain endpoint like the V1 protocol does. If this is a feature you wish to have in the V2 protocol, please submit a [github issue](https://github.com/kserve/kserve/issues). + + +## HTTP/REST The HTTP/REST API uses JSON because it is widely supported and language independent. In all JSON schemas shown in this document $number, $string, $boolean, $object and $array refer to the fundamental JSON types. #optional indicates an optional JSON field. -All strings in all contexts are case-sensitive. +See also: The HTTP/REST endpoints are defined in [rest_predict_v2.yaml](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/rest_predict_v2.yaml) -For KFServing the server must recognize the following URLs. The -versions portion of the URL is shown as optional to allow -implementations that don’t support versioning or for cases when the -user does not want to specify a specific model version (in which case -the server will choose a version based on its own policies). +| API | Verb | Path | Request Payload | Response Payload | +| ------------- | ------------- | ------------- | ------------- | ------------- | +| Inference | POST | v2/models/\<model_name\>[/versions/\<model_version\>]/infer | [$inference_request](#inference-request-json-object) | [$inference_response](#inference-response-json-object) | +| Model Metadata | GET | v2/models/\<model_name\>[/versions/\<model_version\>] | | [$metadata_model_response](#model-metadata-response-json-object) | +| Server Ready | GET | v2/health/ready | | [$ready_server_response](#server-ready-response-json-object) | +| Server Live | GET | v2/health/live | | [$live_server_response](#server-live-response-json-object) | +| Server Metadata | GET | v2 | | [$metadata_server_response](#server-metadata-response-json-object) | +| Model Ready | GET | v2/models/\<model_name\>[/versions/\<model_version\>]/ready | | [$ready_model_response](#model-ready-response-json-object) | -**Health:** +** path contents in `[]` are optional - GET v2/health/live - GET v2/health/ready - GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready +For more information regarding payload contents, see `Payload Contents`. -**Server Metadata:** +The versions portion of the `Path` URLs (in `[]`) is shown as **optional** to allow implementations that don’t support versioning or for cases when the user does not want to specify a specific model version (in which case the server will choose a version based on its own policies). +For example, if a model does not implement a version, the Model Metadata request path could look like `v2/models/my_model`. If the model has been configured to implement a version, the request path could look something like `v2/models/my_model/versions/v10`, where the version of the model is v10.
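For example, the two forms of the Model Metadata request from the previous paragraph could be issued as follows (a sketch; `my_model`, `v10`, and the ingress address are placeholders):

```bash
# Without a version: the server chooses a version based on its own policies.
curl -s -H "Host: my-model.default.example.com" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/my_model"

# With an explicit version segment in the path.
curl -s -H "Host: my-model.default.example.com" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v2/models/my_model/versions/v10"
# Either form returns the $metadata_model_response object described below.
```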
- GET v2 + -**Model Metadata:** +### **API Definitions** - GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}] +| API | Definition | +| --- | --- | +| Inference | The `/infer` endpoint performs inference on a model. The response is the prediction result. | +| Model Metadata | The "model metadata" API is a per-model endpoint that returns details about the model passed in the path. | +| Server Ready | The “server ready” health API indicates if all the models are ready for inferencing. The “server ready” health API can be used directly to implement the Kubernetes readinessProbe. | +| Server Live | The “server live” health API indicates if the inference server is able to receive and respond to metadata and inference requests. The “server live” API can be used directly to implement the Kubernetes livenessProbe. | +| Server Metadata | The "server metadata" API returns details describing the server. | +| Model Ready | The “model ready” health API indicates if a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL. | -**Inference:** +### Health/Readiness/Liveness Probes - POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer +The Model Readiness probe answers the question "Did the model download and is it able to serve requests?" and responds with the available model name(s). The Server Readiness/Liveness probes answer the question "Is my service and its infrastructure running, healthy, and able to receive and process requests?" -### Health +To read more about liveness and readiness probe concepts, visit the [Configure Liveness, Readiness and Startup Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/) +Kubernetes documentation. -A health request is made with an HTTP GET to a health endpoint. The -HTTP response status code indicates a boolean result for the health -request. A 200 status code indicates true and a 4xx status code -indicates false. The HTTP response body should be empty. There are -three health APIs. +### **Payload Contents** -#### Server Live +### **Model Ready** -The “server live” API indicates if the inference server is able to -receive and respond to metadata and inference requests. The “server -live” API can be used directly to implement the Kubernetes -livenessProbe. +The model ready endpoint returns the readiness probe response for the requested model along with the model's name. -#### Server Ready +#### Model Ready Response JSON Object -The “server ready” health API indicates if all the models are ready -for inferencing. The “server ready” health API can be used directly to -implement the Kubernetes readinessProbe. -#### Model Ready + $ready_model_response = + { + "name" : $string, + "ready": $bool + } + -The “model ready” health API indicates if a specific model is ready -for inferencing. The model name and (optionally) version must be -available in the URL. If a version is not provided the server may -choose a version based on its own policies. +### Server Ready + +The server ready endpoint returns the readiness probe response for the server. + +#### Server Ready Response JSON Object + + $ready_server_response = + { + "ready" : $bool + } + +--- + +### Server Live + +The server live endpoint returns the liveness probe response for the server. + +#### Server Live Response JSON Object + + $live_server_response = + { + "live" : $bool + } + +--- ### Server Metadata @@ -107,9 +123,8 @@ code.
The server metadata response object, identified as * “name” : A descriptive name for the server. * "version" : The server version. -* “extensions” : The extensions supported by the server. Currently no - standard extensions are defined. Individual inference servers may - define and document their own extensions. +* “extensions” : The extensions supported by the server. Currently no standard extensions are defined. Individual inference servers may define and document their own extensions. + #### Server Metadata Response JSON Error Object @@ -124,6 +139,12 @@ status (typically 400). The HTTP body must contain the * “error” : The descriptive message for the error. + +The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. In the corresponding response the HTTP body contains the [Model Metadata Response JSON Object](#model-metadata-response-json-object) or the [Model Metadata Response JSON Error Object](#model-metadata-response-json-error-object). +The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error. + +--- + ### Model Metadata The per-model metadata endpoint provides information about a model. A @@ -191,6 +212,8 @@ status (typically 400). The HTTP body must contain the * “error” : The descriptive message for the error. +--- + ### Inference An inference request is made with an HTTP POST to an inference @@ -224,20 +247,20 @@ return an error. inference request expressed as key/value pairs. See [Parameters](#parameters) for more information. * "inputs" : The input tensors. Each input is described using the - *$request_input* schema defined in [Request Input](#request-input). + *$request_input* schema defined in [Request Input](#inference_request-input). * "outputs" : The output tensors requested for this inference. Each requested output is described using the *$request_output* schema - defined in [Request Output](#request-output). Optional, if not + defined in [Request Output](#inference_request-output). Optional, if not specified all outputs produced by the model will be returned using default *$request_output* settings. ##### Request Input -The *$request_input* JSON describes an input to the model. If the +The *$inference_request_input* JSON describes an input to the model. If the input is batched, the shape and data must represent the full shape and contents of the entire batch. - $request_input = + $inference_request_input = { "name" : $string, "shape" : [ $number, ... ], @@ -262,7 +285,7 @@ contents of the entire batch. The *$request_output* JSON is used to request which output tensors should be returned from the model. - $request_output = + $inference_request_output = { "name" : $string, "parameters" : $parameters #optional, @@ -300,46 +323,12 @@ code. The inference response object, identified as $response_output schema defined in [Response Output](#response-output). -##### Response Output +--- -The *$response_output* JSON describes an output from the model. If the -output is batched, the shape and data represents the full shape of the -entire batch. - $response_output = - { - "name" : $string, - "shape" : [ $number, ... ], - "datatype" : $string, - "parameters" : $parameters #optional, - "data" : $tensor_data - } +### **Inference Request Examples** -* "name" : The name of the output tensor. -* "shape" : The shape of the output tensor. 
Each dimension must be an - integer representable as an unsigned 64-bit integer value. -* "datatype" : The data-type of the output tensor elements as defined - in [Tensor Data Types](#tensor-data-types). -* "parameters" : An object containing zero or more parameters for this - input expressed as key/value pairs. See [Parameters](#parameters) - for more information. -* “data”: The contents of the tensor. See [Tensor Data](#tensor-data) - for more information. - -#### Inference Response JSON Error Object - -A failed inference request must be indicated by an HTTP error status -(typically 400). The HTTP body must contain the -*$inference_error_response* object. - - $inference_error_response = - { - "error": - } - -* “error” : The descriptive message for the error. - -#### Inference Request Examples +### Inference Request Examples The following example shows an inference request to a model with two inputs and one output. The HTTP Content-Length header gives the size @@ -391,61 +380,29 @@ type FP32 the following response would be returned. ] } -### Parameters - -The *$parameters* JSON describes zero or more “name”/”value” pairs, -where the “name” is the name of the parameter and the “value” is a -$string, $number, or $boolean. - - $parameters = - { - $parameter, ... - } - - $parameter = $string : $string | $number | $boolean - -Currently no parameters are defined. As required a future proposal may -define one or more standard parameters to allow portable functionality -across different inference servers. A server can implement -server-specific parameters to provide non-standard capabilities. - -### Tensor Data - -Tensor data must be presented in row-major order of the tensor -elements. Element values must be given in "linear" order without any -stride or padding between elements. Tensor elements may be presented -in their nature multi-dimensional representation, or as a flattened -one-dimensional representation. - -Tensor data given explicitly is provided in a JSON array. Each element -of the array may be an integer, floating-point number, string or -boolean value. The server can decide to coerce each element to the -required type or return an error if an unexpected value is -received. Note that fp16 is problematic to communicate explicitly -since there is not a standard fp16 representation across backends nor -typically the programmatic support to create the fp16 representation -for a JSON number. -For example, the 2-dimensional matrix: +## gRPC - [ 1 2 - 4 5 ] +The GRPC API closely follows the concepts defined in the +[HTTP/REST](#httprest) API. A compliant server must implement the +health, metadata, and inference APIs described in this section. -Can be represented in its natural format as: - "data" : [ [ 1, 2 ], [ 4, 5 ] ] +| API | rpc Endpoint | Request Message | Response Message | +| --- | --- | --- | ---| +| Inference | [ModelInfer](#inference) | ModelInferRequest | ModelInferResponse | +| Model Ready | [ModelReady](#model-ready) | [ModelReadyRequest] | ModelReadyResponse | +| Model Metadata | [ModelMetadata](#model-metadata)| ModelMetadataRequest | ModelMetadataResponse | +| Server Ready | [ServerReady](#server-ready) | ServerReadyRequest | ServerReadyResponse | +| Server Live | [ServerLive](#server-live) | ServerLiveRequest | ServerLiveResponse | -Or in a flattened one-dimensional representation: +For more detailed information on each endpoint and its contents, see `API Definitions` and `Message Contents`. 
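As a quick sanity check, the health RPCs can be exercised with a generic gRPC client such as `grpcurl`. The sketch below assumes the server exposes gRPC on port 9000 and has server reflection enabled; neither assumption is part of the protocol, and without reflection the proto file linked below must be supplied to the client.

```bash
# ServerReady and ModelReady via grpcurl (the model name "my_model" is a placeholder).
grpcurl -plaintext "${INGRESS_HOST}:9000" inference.GRPCInferenceService/ServerReady
grpcurl -plaintext -d '{"name": "my_model"}' \
  "${INGRESS_HOST}:9000" inference.GRPCInferenceService/ModelReady
```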
- "data" : [ 1, 2, 4, 5 ] +See also: The gRPC endpoints, request/response messages and contents are defined in [grpc_predict_v2.proto](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/grpc_predict_v2.proto) -## GRPC -The GRPC API closely follows the concepts defined in the -[HTTP/REST](#httprest) API. A compliant server must implement the -health, metadata, and inference APIs described in this section. +### **API Definitions** -All strings in all contexts are case-sensitive. The GRPC definition of the service is: @@ -473,12 +430,12 @@ The GRPC definition of the service is: rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse) {} } +### **Message Contents** + ### Health A health request is made using the ServerLive, ServerReady, or -ModelReady endpoint. For each of these endpoints errors are indicated -by the google.rpc.Status returned for the request. The OK code -indicates success and other codes indicate failure. +ModelReady endpoint. For each of these endpoints errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. #### Server Live @@ -528,12 +485,13 @@ inferencing. The request and response messages for ModelReady are: bool ready = 1; } -### Server Metadata +--- -The ServerMetadata API provides information about the server. Errors -are indicated by the google.rpc.Status returned for the request. The -OK code indicates success and other codes indicate failure. The -request and response messages for ServerMetadata are: +### Metadata + +#### Server Metadata + +The ServerMetadata API provides information about the server. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ServerMetadata are: message ServerMetadataRequest {} @@ -549,7 +507,7 @@ request and response messages for ServerMetadata are: repeated string extensions = 3; } -### Model Metadata +#### Model Metadata The per-model metadata API provides information about a model. Errors are indicated by the google.rpc.Status returned for the request. The @@ -598,11 +556,30 @@ request and response messages for ModelMetadata are: repeated TensorMetadata outputs = 5; } +#### Platforms + +A platform is a string indicating a DL/ML framework or +backend. Platform is returned as part of the response to a +[Model Metadata](#model_metadata) request but is information only. The +proposed inference APIs are generic relative to the DL/ML framework +used by a model and so a client does not need to know the platform of +a given model to use the API. Platform names use the format +“_”. The following platform names are allowed: + +* tensorrt_plan : A TensorRT model encoded as a serialized engine or “plan”. +* tensorflow_graphdef : A TensorFlow model encoded as a GraphDef. +* tensorflow_savedmodel : A TensorFlow model encoded as a SavedModel. +* onnx_onnxv1 : A ONNX model encoded for ONNX Runtime. +* pytorch_torchscript : A PyTorch model encoded as TorchScript. +* mxnet_mxnet: An MXNet model +* caffe2_netdef : A Caffe2 model encoded as a NetDef. + +--- + ### Inference The ModelInfer API performs inference using the specified -model. Errors are indicated by the google.rpc.Status returned for the -request. The OK code indicates success and other codes indicate +model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. 
The request and response messages for ModelInfer are: message ModelInferRequest @@ -731,16 +708,13 @@ failure. The request and response messages for ModelInfer are: repeated bytes raw_output_contents = 6; } -### Parameters +#### Parameters The Parameters message describes a “name”/”value” pair, where the “name” is the name of the parameter and the “value” is a boolean, integer, or string corresponding to the parameter. -Currently no parameters are defined. As required a future proposal may -define one or more standard parameters to allow portable functionality -across different inference servers. A server can implement -server-specific parameters to provide non-standard capabilities. +Currently no parameters are defined. As required a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities. // // An inference parameter value. @@ -762,6 +736,8 @@ server-specific parameters to provide non-standard capabilities. } } +--- + ### Tensor Data In all representations tensor data must be flattened to a @@ -829,25 +805,7 @@ matches the tensor's data type. repeated bytes bytes_contents = 8; } -## Platforms - -A platform is a string indicating a DL/ML framework or -backend. Platform is returned as part of the response to a -[Model Metadata](#model_metadata) request but is information only. The -proposed inference APIs are generic relative to the DL/ML framework -used by a model and so a client does not need to know the platform of -a given model to use the API. Platform names use the format -“_”. The following platform names are allowed: - -* tensorrt_plan : A TensorRT model encoded as a serialized engine or “plan”. -* tensorflow_graphdef : A TensorFlow model encoded as a GraphDef. -* tensorflow_savedmodel : A TensorFlow model encoded as a SavedModel. -* onnx_onnxv1 : A ONNX model encoded for ONNX Runtime. -* pytorch_torchscript : A PyTorch model encoded as TorchScript. -* mxnet_mxnet: An MXNet model -* caffe2_netdef : A Caffe2 model encoded as a NetDef. - -## Tensor Data Types +#### Tensor Data Types Tensor data types are shown in the following table along with the size of each type, in bytes. @@ -868,3 +826,5 @@ of each type, in bytes. | FP32 | 4 | | FP64 | 8 | | BYTES | Variable (max 232) | + +--- diff --git a/docs/modelserving/observability/prometheus_metrics.md b/docs/modelserving/observability/prometheus_metrics.md new file mode 100644 index 000000000..ed5a87003 --- /dev/null +++ b/docs/modelserving/observability/prometheus_metrics.md @@ -0,0 +1,64 @@ +# Prometheus Metrics + +## Exposing Prometheus Port + +All supported serving runtimes support exporting prometheus metrics on a specified port in the inference service's pod. The appropriate port for the model server is defined in the [kserve/config/runtimes](https://github.com/kserve/kserve/tree/master/config/runtimes) YAML files. For example, torchserve defines its prometheus port as `8082` in `kserve-torchserve.yaml`. + +```yaml +metadata: + name: kserve-torchserve +spec: + annotations: + prometheus.kserve.io/port: '8082' + prometheus.kserve.io/path: "/metrics" +``` + +If needed, this value can be overridden in the InferenceService YAML. + +To enable prometheus metrics, add the annotation `serving.kserve.io/enable-prometheus-scraping` to the InferenceService YAML. 
+ +```yaml +apiVersion: "serving.kserve.io/v1beta1" +kind: "InferenceService" +metadata: + name: "sklearn-irisv2" + annotations: + serving.kserve.io/enable-prometheus-scraping: "true" +spec: + predictor: + sklearn: + protocolVersion: v2 + storageUri: "gs://seldon-models/sklearn/iris" +``` + +The default values for `serving.kserve.io/enable-prometheus-scraping` can be set in the `inferenceservice-config` configmap. See [the docs](https://github.com/kserve/kserve/blob/master/qpext/README.md#configs) for more info. + +There is not currently a unified set of metrics exported by the model servers. Each model server may implement its own set of metrics to export. + +## Metrics for lgbserver, paddleserver, pmmlserver, sklearnserver, xgbserver, custom transformer/predictor + +Prometheus latency histograms are emitted for each of the steps (pre/postprocessing, explain, predict). +Additionally, the latencies of each step are logged per request. + +| Metric Name | Description | Type | +|-----------------------------------|--------------------------------|-----------| +| request_preprocess_seconds | pre-processing request latency | Histogram | +| request_explain_seconds | explain request latency | Histogram | +| request_predict_seconds | prediction request latency | Histogram | +| request_postprocess_seconds | post-processing request latency | Histogram | + +## Other metrics + +Some model servers define their own metrics. + +* [mlserver](https://docs.seldon.io/projects/seldon-core/en/latest/analytics/analytics.html) +* [torchserve](https://pytorch.org/serve/metrics_api.html) +* [triton](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/metrics.md) +* [tensorflow](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/metrics.cc) (Please see [Github Issue #2462](https://github.com/kserve/kserve/issues/2462)) + + +## Exporting Metrics + +Exporting metrics in serverless mode requires that the queue-proxy extension image is used. + +For more information on how to export metrics, see the [Queue Proxy Extension](https://github.com/kserve/kserve/blob/master/qpext/README.md) documentation. diff --git a/docs/modelserving/v1beta1/serving_runtime.md b/docs/modelserving/v1beta1/serving_runtime.md index bce165a7c..638072cd8 100644 --- a/docs/modelserving/v1beta1/serving_runtime.md +++ b/docs/modelserving/v1beta1/serving_runtime.md @@ -21,18 +21,31 @@ After models are deployed with InferenceService, you get all the following serve - Out-of-the-box metrics - Ingress/Egress control -| Model Serving Runtime | Exported model| Prediction Protocol | HTTP | gRPC | Versions | Examples | + +--- + +The table below identifies each of the model serving runtimes supported by KServe. The HTTP and gRPC columns indicate the prediction protocol version that the serving runtime supports. The KServe prediction protocol is noted as either "v1" or "v2". Some serving runtimes also support their own prediction protocol; these are noted with an `*`. The default serving runtime version column defines the source and version of the serving runtime - MLServer, KServe, or the runtime's own project. These versions can also be found in the [runtime kustomization YAML](https://github.com/kserve/kserve/blob/master/config/runtimes/kustomization.yaml). All KServe native model serving runtimes use the current KServe release version (v0.10). The supported framework version column lists the **major** version of the model framework that is supported.
These can also be found in the respective [runtime YAML](https://github.com/kserve/kserve/tree/master/config/runtimes) under the `supportedModelFormats` field. For model frameworks using the KServe serving runtime, the specific default version can be found in [kserve/python](https://github.com/kserve/kserve/tree/master/python). In a given serving runtime directory the setup.py file contains the exact model framework version used. For example, in [kserve/python/lgbserver](https://github.com/kserve/kserve/tree/master/python/lgbserver) the [setup.py](https://github.com/kserve/kserve/blob/master/python/lgbserver/setup.py) file sets the model framework version to 3.3.2, `lightgbm == 3.3.2`. + +| Model Serving Runtime | Exported model | HTTP | gRPC | Default Serving Runtime Version | Supported Framework (Major) Version(s) | Examples | | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |--------------------------------------| -| [Triton Inference Server](https://github.com/triton-inference-server/server) | [TensorFlow,TorchScript,ONNX](https://github.com/triton-inference-server/server/blob/r21.09/docs/model_repository.md)| v2 | :heavy_check_mark: | :heavy_check_mark: | [Compatibility Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html)| [Torchscript cifar](triton/torchscript) | -| [TFServing](https://www.tensorflow.org/tfx/guide/serving) | [TensorFlow SavedModel](https://www.tensorflow.org/guide/saved_model) | v1 | :heavy_check_mark: | :heavy_check_mark: | [TFServing Versions](https://github.com/tensorflow/serving/releases) | [TensorFlow flower](./tensorflow) | -| [TorchServe](https://pytorch.org/serve/server.html) | [Eager Model/TorchScript](https://pytorch.org/docs/master/generated/torch.save.html) | v1/v2 REST | :heavy_check_mark: | :heavy_check_mark: | 0.5.3 | [TorchServe mnist](./torchserve) | -| [SKLearn MLServer](https://github.com/SeldonIO/MLServer) | [Pickled Model](https://scikit-learn.org/stable/modules/model_persistence.html) | v2 | :heavy_check_mark: | :heavy_check_mark: | 1.0.1 | [SKLearn Iris V2](./sklearn/v2) | -| [XGBoost MLServer](https://github.com/SeldonIO/MLServer) | [Saved Model](https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html) | v2 | :heavy_check_mark: | :heavy_check_mark: | 1.5.0 | [XGBoost Iris V2](./xgboost) | -| [SKLearn ModelServer](https://github.com/kserve/kserve/tree/master/python/sklearnserver) | [Pickled Model](https://scikit-learn.org/stable/modules/model_persistence.html) | v1 | :heavy_check_mark: | -- | 1.0.1 | [SKLearn Iris](./sklearn/v2) | -| [XGBoost ModelServer](https://github.com/kserve/kserve/tree/master/python/xgbserver) | [Saved Model](https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html) | v1 | :heavy_check_mark: | -- | 1.5.0 | [XGBoost Iris](./xgboost) | -| [PMML ModelServer](https://github.com/kserve/kserve/tree/master/python/pmmlserver) | [PMML](http://dmg.org/pmml/v4-4-1/GeneralStructure.html) | v1 | :heavy_check_mark: | -- | [PMML4.4.1](https://github.com/autodeployai/pypmml) | [SKLearn PMML](./pmml) | -| [LightGBM ModelServer](https://github.com/kserve/kserve/tree/master/python/lightgbm) | [Saved LightGBM Model](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.save_model) | v1 | :heavy_check_mark: | -- | 3.2.0 | [LightGBM Iris](./lightgbm) | -| [Custom ModelServer](https://github.com/kserve/kserve/tree/master/python/kserve/kserve) | -- | v1 | :heavy_check_mark: | -- | -- | [Custom 
Model](custom/custom_model) | +| [Custom ModelServer](https://github.com/kserve/kserve/tree/master/python/kserve/kserve) | -- | v1, v2 | v2 | -- | -- | [Custom Model](custom/custom_model) | +| [LightGBM MLServer](https://mlserver.readthedocs.io/en/latest/runtimes/lightgbm.html) | [Saved LightGBM Model](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.save_model) | v2 | v2 | v1.0.0 (MLServer) | 3 | [LightGBM Iris V2](./lightgbm) | +| [LightGBM ModelServer](https://github.com/kserve/kserve/tree/master/python/lgbserver) | [Saved LightGBM Model](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.save_model) | v1 | -- | v0.10 (KServe) | 3 | [LightGBM Iris](./lightgbm) | +| [MLFlow ModelServer](https://docs.seldon.io/projects/seldon-core/en/latest/servers/mlflow.html) | [Saved MLFlow Model](https://www.mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.save_model) | v2 | v2 | v1.0.0 (MLServer) | 1 | [MLFlow wine-classifier](./mlflow) | +| [PMML ModelServer](https://github.com/kserve/kserve/tree/master/python/pmmlserver) | [PMML](http://dmg.org/pmml/v4-4-1/GeneralStructure.html) | v1 | -- | v0.10 (KServe) | 3, 4 ([PMML4.4.1](https://github.com/autodeployai/pypmml)) | [SKLearn PMML](./pmml) | +| [SKLearn MLServer](https://github.com/SeldonIO/MLServer) | [Pickled Model](https://scikit-learn.org/stable/modules/model_persistence.html) | v2 | v2 | v1.0.0 (MLServer) | 1 | [SKLearn Iris V2](./sklearn/v2) | +| [SKLearn ModelServer](https://github.com/kserve/kserve/tree/master/python/sklearnserver) | [Pickled Model](https://scikit-learn.org/stable/modules/model_persistence.html) | v1 | -- | v0.10 (KServe) | 1 | [SKLearn Iris](./sklearn/v2) | +| [TFServing](https://www.tensorflow.org/tfx/guide/serving) | [TensorFlow SavedModel](https://www.tensorflow.org/guide/saved_model) | v1 | *tensorflow | 2.6.2 ([TFServing Versions](https://github.com/tensorflow/serving/releases)) | 2 | [TensorFlow flower](./tensorflow) | +| [TorchServe](https://pytorch.org/serve/server.html) | [Eager Model/TorchScript](https://pytorch.org/docs/master/generated/torch.save.html) | v1, v2, *torchserve | *torchserve | 0.7.0 (TorchServe) | 1 | [TorchServe mnist](./torchserve) | +| [Triton Inference Server](https://github.com/triton-inference-server/server) | [TensorFlow,TorchScript,ONNX](https://github.com/triton-inference-server/server/blob/r21.09/docs/model_repository.md) | v2 | v2 | 21.09-py3 (Triton) | 8 (TensorRT), 1, 2 (TensorFlow), 1 (PyTorch), 2 (Triton) [Compatibility Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html) | [Torchscript cifar](triton/torchscript) | +| [XGBoost MLServer](https://github.com/SeldonIO/MLServer) | [Saved Model](https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html) | v2 | v2 | v1.0.0 (MLServer) | 1 | [XGBoost Iris V2](./xgboost) | +| [XGBoost ModelServer](https://github.com/kserve/kserve/tree/master/python/xgbserver) | [Saved Model](https://xgboost.readthedocs.io/en/latest/tutorials/saving_model.html) | v1 | -- | v0.10 (KServe) | 1 | [XGBoost Iris](./xgboost) | + + + +*tensorflow - Tensorflow implements its own prediction protocol in addition to KServe's. See: [Tensorflow Serving Prediction API](https://github.com/tensorflow/serving/blob/master/tensorflow_serving/apis/prediction_service.proto) documentation + +*torchserve - PyTorch implements its own prediction protocol in addition to KServe's.
See: [Torchserve gRPC API](https://pytorch.org/serve/grpc_api.html#) documentation !!! Note The model serving runtime version can be overwritten with the `runtimeVersion` field on InferenceService yaml and we highly recommend diff --git a/docs/modelserving/v1beta1/transformer/torchserve_image_transformer/README.md b/docs/modelserving/v1beta1/transformer/torchserve_image_transformer/README.md index 4a70f48eb..072e04983 100644 --- a/docs/modelserving/v1beta1/transformer/torchserve_image_transformer/README.md +++ b/docs/modelserving/v1beta1/transformer/torchserve_image_transformer/README.md @@ -42,24 +42,26 @@ def image_transform(instance): # for REST predictor the preprocess handler converts to input dict to the v1 REST protocol dict class ImageTransformer(kserve.Model): - def __init__(self, name: str, predictor_host: str): + def __init__(self, name: str, predictor_host: str, headers: Dict[str, str] = None): super().__init__(name) self.predictor_host = predictor_host + self.ready = True - def preprocess(self, inputs: Dict) -> Dict: + def preprocess(self, inputs: Dict, headers: Dict[str, str] = None) -> Dict: return {'instances': [image_transform(instance) for instance in inputs['instances']]} - def postprocess(self, inputs: Dict) -> Dict: + def postprocess(self, inputs: Dict, headers: Dict[str, str] = None) -> Dict: return inputs # for gRPC predictor the preprocess handler converts the input dict to the v2 gRPC protocol ModelInferRequest class ImageTransformer(kserve.Model): - def __init__(self, name: str, predictor_host: str, protocol: str): + def __init__(self, name: str, predictor_host: str, protocol: str, headers: Dict[str, str] = None): super().__init__(name) self.predictor_host = predictor_host self.protocol = protocol + self.ready = True - def preprocess(self, request: Dict) -> ModelInferRequest: + def preprocess(self, request: Dict, headers: Dict[str, str] = None) -> ModelInferRequest: input_tensors = [image_transform(instance) for instance in request["instances"]] input_tensors = numpy.asarray(input_tensors) request = ModelInferRequest() @@ -71,7 +73,7 @@ class ImageTransformer(kserve.Model): request.raw_input_contents.extend([input_0._get_content()]) return request - def postprocess(self, infer_response: ModelInferResponse) -> Dict: + def postprocess(self, infer_response: ModelInferResponse, headers: Dict[str, str] = None) -> Dict: response = InferResult(infer_response) return {"predictions": response.as_numpy("OUTPUT__0").tolist()} ``` diff --git a/docs/modelserving/v1beta1/triton/torchscript/README.md b/docs/modelserving/v1beta1/triton/torchscript/README.md index 2a35318c7..43b394117 100644 --- a/docs/modelserving/v1beta1/triton/torchscript/README.md +++ b/docs/modelserving/v1beta1/triton/torchscript/README.md @@ -376,7 +376,7 @@ class ImageTransformerV2(kserve.Model): return {output["name"]: np.array(output["data"]).reshape(output["shape"]).tolist() for output in results["outputs"]} ``` -Please find [the code example](https://github.com/kserve/kserve/tree/release-0.9/docs/samples/v1beta1/triton/torchscript/image_transformer_v2) and [Dockerfile](https://github.com/kserve/kserve/blob/release-0.9/docs/samples/v1beta1/triton/torchscript/transformer.Dockerfile). +Please find [the code example](https://github.com/kserve/kserve/tree/release-0.10/docs/samples/v1beta1/triton/torchscript/image_transformer_v2) and [Dockerfile](https://github.com/kserve/kserve/blob/release-0.10/docs/samples/v1beta1/triton/torchscript/transformer.Dockerfile). 
### Build Transformer docker image ``` diff --git a/mkdocs.yml b/mkdocs.yml index 33e80822b..7abd55410 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -18,8 +18,9 @@ nav: - Control Plane: - Model Serving Control Plane: modelserving/control_plane.md - Data Plane: - - Model Serving Data Plane: modelserving/data_plane.md - - V2 Inference Protocol: modelserving/inference_api.md + - Model Serving Data Plane: modelserving/data_plane/data_plane.md + - V1 Inference Protocol: modelserving/data_plane/v1_protocol.md + - Open Inference Protocol (V2 Inference Protocol): modelserving/data_plane/v2_protocol.md - Serving Runtimes: modelserving/servingruntimes.md - Single Model Serving: - Supported Model Frameworks/Formats: @@ -79,6 +80,8 @@ nav: - Rollout Strategies: - Canary: modelserving/v1beta1/rollout/canary.md - Canary Example: modelserving/v1beta1/rollout/canary-example.md + - Inference Observability: + - Prometheus Metrics: modelserving/observability/prometheus_metrics.md - API Reference: - Control Plane API: reference/api.md - Python Client SDK: sdk_docs/sdk_doc.md