Subsections of Kserve
Install Kserve
Preliminary
Installation
Preliminary
1. Kubernetes has been installed; if not, check this link.
2. The Helm binary has been installed; if not, check this link.

1. Install from the script directly
curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.15/hack/quick_install.sh" | bash
Installing Gateway API CRDs …
…
Successfully installed Istio
Successfully installed Cert Manager
Successfully installed Knative
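Once the script finishes, it is worth checking that the components it installed are running before moving on. A minimal sanity check, assuming the default namespaces used by the quick-install script (istio-system, cert-manager, knative-serving, kserve):
kubectl get pods -n istio-system
kubectl get pods -n cert-manager
kubectl get pods -n knative-serving
kubectl get pods -n kserve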
Preliminary
1. If you have only one node in your cluster, you need at least 6 CPUs, 6 GB of memory, and 30 GB of disk storage.
2. If you have multiple nodes in your cluster, for each node you need at least 2 CPUs, 4 GB of memory, and 20 GB of disk storage.

1. Install the Knative Serving CRD resources
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.18.0/serving-crds.yaml
2. Install the Knative Serving components
kubectl apply -f https://github.com/knative/serving/releases/download/knative-v1.18.0/serving-core.yaml
# kubectl apply -f https://raw.githubusercontent.com/AaronYang0628/assets/refs/heads/main/knative/serving/release/download/knative-v1.18.0/serving-core.yaml
3. Install the networking layer (Istio)
kubectl apply -l knative.dev/crd-install=true -f https://github.com/knative/net-istio/releases/download/knative-v1.18.0/istio.yaml
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.18.0/istio.yaml
kubectl apply -f https://github.com/knative/net-istio/releases/download/knative-v1.18.0/net-istio.yaml
4. Install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.2/cert-manager.yaml
5. Install KServe
kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.15.0/kserve.yaml
kubectl apply --server-side -f https://github.com/kserve/kserve/releases/download/v0.15.0/kserve-cluster-resources.yaml
For more information, you can check https://artifacthub.io/packages/helm/prometheus-community/prometheus
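Before deploying anything, you can confirm that the KServe CRDs were registered and the controller started; the commands below assume the default kserve namespace created by the manifests above:
kubectl get crd | grep serving.kserve.io
kubectl -n kserve get pods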
Preliminary
1. Kubernetes has been installed; if not, check this link.
2. ArgoCD has been installed; if not, check this link.
3. The Helm binary has been installed; if not, check this link.

1. Install the Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
2. Install cert-manager
Follow this link to install cert-manager.
3. Install the Istio system
Follow this link to install the three Istio components (istio-base, istiod, istio-ingressgateway).
4. Install the Knative Operator
kubectl -n argocd apply -f - << EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: knative-operator
spec:
syncPolicy:
syncOptions:
- CreateNamespace=true
project: default
source:
repoURL: https://knative.github.io/operator
chart: knative-operator
targetRevision: v1.18.1
helm:
releaseName: knative-operator
values: |
knative_operator:
knative_operator:
image: m.daocloud.io/gcr.io/knative-releases/knative.dev/operator/cmd/operator
tag: v1.18.1
resources:
requests:
cpu: 100m
memory: 100Mi
limits:
cpu: 1000m
memory: 1000Mi
operator_webhook:
image: m.daocloud.io/gcr.io/knative-releases/knative.dev/operator/cmd/webhook
tag: v1.18.1
resources:
requests:
cpu: 100m
memory: 100Mi
limits:
cpu: 500m
memory: 500Mi
destination:
server: https://kubernetes.default.svc
namespace: knative-serving
EOF
5. Sync with ArgoCD
argocd app sync argocd/knative-operator
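After the sync completes, you can check that the operator came up; this sketch assumes the knative-serving destination namespace configured in the Application above:
argocd app get argocd/knative-operator
kubectl -n knative-serving get deployments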
6. Install the KnativeServing custom resource
kubectl apply -f - <<EOF
apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
name: knative-serving
namespace: knative-serving
spec:
version: 1.18.0 # this is knative serving version
config:
domain:
example.com: ""
EOF
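Give the operator a moment to reconcile, then check that the KnativeServing resource reports Ready and its pods are running:
kubectl -n knative-serving get knativeserving knative-serving
kubectl -n knative-serving get pods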
7. Install the KServe CRDs
kubectl -n argocd apply -f - << EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: kserve-crd
annotations:
argocd.argoproj.io/sync-options: ServerSideApply=true
argocd.argoproj.io/compare-options: IgnoreExtraneous
spec:
syncPolicy:
syncOptions:
- CreateNamespace=true
- ServerSideApply=true
project: default
source:
repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
chart: kserve-crd
targetRevision: v0.15.2
helm:
releaseName: kserve-crd
destination:
server: https://kubernetes.default.svc
namespace: kserve
EOF
8. Sync with ArgoCD
argocd app sync argocd/kserve-crd
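A quick check that the KServe CRDs are now present:
kubectl get crd | grep serving.kserve.io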
9. Install the KServe controller
kubectl -n argocd apply -f - << EOF
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: kserve
annotations:
argocd.argoproj.io/sync-options: ServerSideApply=true
argocd.argoproj.io/compare-options: IgnoreExtraneous
spec:
syncPolicy:
syncOptions:
- CreateNamespace=true
- ServerSideApply=true
project: default
source:
repoURL: https://aaronyang0628.github.io/helm-chart-mirror/charts
chart: kserve
targetRevision: v0.15.2
helm:
releaseName: kserve
values: |
kserve:
agent:
image: m.daocloud.io/docker.io/kserve/agent
router:
image: m.daocloud.io/docker.io/kserve/router
storage:
image: m.daocloud.io/docker.io/kserve/storage-initializer
s3:
accessKeyIdName: AWS_ACCESS_KEY_ID
secretAccessKeyName: AWS_SECRET_ACCESS_KEY
endpoint: ""
region: ""
verifySSL: ""
useVirtualBucket: ""
useAnonymousCredential: ""
controller:
deploymentMode: "Serverless"
rbacProxyImage: m.daocloud.io/quay.io/brancz/kube-rbac-proxy:v0.18.0
rbacProxy:
resources:
limits:
cpu: 100m
memory: 300Mi
requests:
cpu: 100m
memory: 300Mi
gateway:
domain: example.com
image: m.daocloud.io/docker.io/kserve/kserve-controller
resources:
limits:
cpu: 100m
memory: 300Mi
requests:
cpu: 100m
memory: 300Mi
servingruntime:
tensorflow:
image: tensorflow/serving
tag: 2.6.2
mlserver:
image: m.daocloud.io/docker.io/seldonio/mlserver
tag: 1.5.0
sklearnserver:
image: m.daocloud.io/docker.io/kserve/sklearnserver
xgbserver:
image: m.daocloud.io/docker.io/kserve/xgbserver
huggingfaceserver:
image: m.daocloud.io/docker.io/kserve/huggingfaceserver
devShm:
enabled: false
sizeLimit: ""
hostIPC:
enabled: false
huggingfaceserver_multinode:
shm:
enabled: true
sizeLimit: "3Gi"
tritonserver:
image: nvcr.io/nvidia/tritonserver
pmmlserver:
image: m.daocloud.io/docker.io/kserve/pmmlserver
paddleserver:
image: m.daocloud.io/docker.io/kserve/paddleserver
lgbserver:
image: m.daocloud.io/docker.io/kserve/lgbserver
torchserve:
image: pytorch/torchserve-kfs
tag: 0.9.0
art:
image: m.daocloud.io/docker.io/kserve/art-explainer
localmodel:
enabled: false
controller:
image: m.daocloud.io/docker.io/kserve/kserve-localmodel-controller
jobNamespace: kserve-localmodel-jobs
agent:
hostPath: /mnt/models
image: m.daocloud.io/docker.io/kserve/kserve-localmodelnode-agent
inferenceservice:
resources:
limits:
cpu: "1"
memory: "2Gi"
requests:
cpu: "1"
memory: "2Gi"
destination:
server: https://kubernetes.default.svc
namespace: kserve
EOF
10. Sync with ArgoCD
argocd app sync argocd/kserve
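Then verify that the KServe controller is running in the kserve namespace defined in the Application above:
kubectl -n kserve get pods
kubectl -n kserve get deployments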
11. Install the Knative Eventing CRDs
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.18.1/eventing-crds.yaml
12. Install the Knative Eventing core components
kubectl apply -f https://github.com/knative/eventing/releases/download/knative-v1.18.1/eventing-core.yaml
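As a final check, the eventing components should come up in the knative-eventing namespace created by these manifests:
kubectl -n knative-eventing get pods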
FAQ
Serving
Subsections of Serving
Inference
Subsections of Inference
First Pytorch ISVC
Mnist Inference
More information about the mnist service can be found at this link.
- create a namespace
kubectl create namespace kserve-test
- deploy a sample mnist service
kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "first-torchserve"
namespace: kserve-test
spec:
predictor:
model:
modelFormat:
name: pytorch
storageUri: gs://kfserving-examples/models/torchserve/image_classifier/v1
resources:
limits:
memory: 4Gi
EOF
- Check
InferenceService
status
kubectl -n kserve-test get inferenceservices first-torchserve
After all pods are ready, you can access the service by using the following command
If the EXTERNAL-IP value is set, your environment has an external load balancer that you can use for the ingress gateway.
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
If the EXTERNAL-IP value is none (or perpetually pending), your environment does not provide an external load balancer for the ingress gateway. In this case, you can access the gateway using the service's node port.
export INGRESS_HOST=$(minikube ip)
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
export INGRESS_HOST=localhost
kubectl port-forward --namespace istio-system svc/istio-ingressgateway 30080:80
export INGRESS_PORT=30080
- Perform a prediction
First, prepare your inference input request inside a file:
wget -O ./mnist-input.json https://raw.githubusercontent.com/kserve/kserve/refs/heads/master/docs/samples/v1beta1/torchserve/v1/imgconv/input.json
- Invoke the service
SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice first-torchserve -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# http://first-torchserve.kserve-test.example.com
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/mnist:predict" -d @./mnist-input.json
First Custom Model
AlexNet Inference
More information about the AlexNet service can be found at this link.
- Implement Custom Model using KServe API
import argparse
import base64
import io
import time

from fastapi.middleware.cors import CORSMiddleware
from torchvision import models, transforms
from typing import Dict
import torch
from PIL import Image

import kserve
from kserve import Model, ModelServer, logging
from kserve.model_server import app
from kserve.utils.utils import generate_uuid


class AlexNetModel(Model):
    def __init__(self, name: str):
        super().__init__(name, return_response_headers=True)
        self.name = name
        self.load()
        self.ready = False

    def load(self):
        self.model = models.alexnet(pretrained=True)
        self.model.eval()
        # The ready flag is used by model ready endpoint for readiness probes,
        # set to True when model is loaded successfully without exceptions.
        self.ready = True

    async def predict(
        self,
        payload: Dict,
        headers: Dict[str, str] = None,
        response_headers: Dict[str, str] = None,
    ) -> Dict:
        start = time.time()
        # Input follows the Tensorflow V1 HTTP API for binary values
        # https://www.tensorflow.org/tfx/serving/api_rest#encoding_binary_values
        img_data = payload["instances"][0]["image"]["b64"]
        raw_img_data = base64.b64decode(img_data)
        input_image = Image.open(io.BytesIO(raw_img_data))
        preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])
        input_tensor = preprocess(input_image).unsqueeze(0)
        output = self.model(input_tensor)
        torch.nn.functional.softmax(output, dim=1)
        values, top_5 = torch.topk(output, 5)
        result = values.flatten().tolist()
        end = time.time()
        response_id = generate_uuid()

        # Custom response headers can be added to the inference response
        if response_headers is not None:
            response_headers.update(
                {"prediction-time-latency": f"{round((end - start) * 1000, 9)}"}
            )

        return {"predictions": result}


parser = argparse.ArgumentParser(parents=[kserve.model_server.parser])
args, _ = parser.parse_known_args()

if __name__ == "__main__":
    # Configure kserve and uvicorn logger
    if args.configure_logging:
        logging.configure_logging(args.log_config_file)
    model = AlexNetModel(args.model_name)
    model.load()
    # Custom middlewares can be added to the model
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["*"],
        allow_credentials=True,
        allow_methods=["*"],
        allow_headers=["*"],
    )
    ModelServer().start([model])
- create
requirements.txt
kserve
torchvision==0.18.0
pillow>=10.3.0,<11.0.0
- create
Dockerfile
FROM m.daocloud.io/docker.io/library/python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.py .
CMD ["python", "model.py", "--model_name=custom-model"]
- build and push custom docker image
docker build -t ay-custom-model .
docker tag ay-custom-model:latest docker-registry.lab.zverse.space/ay/ay-custom-model:latest
docker push docker-registry.lab.zverse.space/ay/ay-custom-model:latest
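Optionally, you can smoke-test the image locally before deploying it. This is a minimal sketch that assumes the image tag built above and KServe's default HTTP port 8080; the model name custom-model matches the --model_name argument baked into the Dockerfile:
docker run -d --name ay-custom-model-test -p 8080:8080 docker-registry.lab.zverse.space/ay/ay-custom-model:latest
# the V1 protocol model endpoint should report the model as ready
curl http://localhost:8080/v1/models/custom-model
docker rm -f ay-custom-model-test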
- create a namespace
kubectl create namespace kserve-test
- deploy a sample
custom-model
service
kubectl apply -n kserve-test -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: ay-custom-model
spec:
predictor:
containers:
- name: kserve-container
image: docker-registry.lab.zverse.space/ay/ay-custom-model:latest
EOF
- Check
InferenceService
status
kubectl -n kserve-test get inferenceservices ay-custom-model
After all pods are ready, you can access the service by using the following command
If the EXTERNAL-IP value is set, your environment has an external load balancer that you can use for the ingress gateway.
export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
If the EXTERNAL-IP value is none (or perpetually pending), your environment does not provide an external load balancer for the ingress gateway. In this case, you can access the gateway using the service's node port.
export INGRESS_HOST=$(minikube ip)
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
export INGRESS_HOST=localhost
kubectl port-forward --namespace istio-system svc/istio-ingressgateway 30080:80
export INGRESS_PORT=30080
- Perform a prediction
First, prepare your inference input request inside a file:
wget -O ./alex-net-input.json https://kserve.github.io/website/0.15/modelserving/v1beta1/custom/custom_model/input.json
- Invoke the service
export SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice ay-custom-model -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# http://ay-custom-model.kserve-test.example.com
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" -X POST "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/custom-model:predict" -d @./alex-net-input.json
First Model In Minio
Inference Model In Minio
More information about deploying an InferenceService with a saved model on S3 can be found at this link.
Create Service Account
=== "yaml"
apiVersion: v1
kind: ServiceAccount
metadata:
name: sa
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/s3access # replace with your IAM role ARN
serving.kserve.io/s3-endpoint: s3.amazonaws.com # replace with your s3 endpoint e.g minio-service.kubeflow:9000
serving.kserve.io/s3-usehttps: "1" # by default 1, if testing with minio you can set to 0
serving.kserve.io/s3-region: "us-east-2"
serving.kserve.io/s3-useanoncredential: "false" # omitting this is the same as false, if true will ignore provided credential and use anonymous credentials
=== "kubectl"
kubectl apply -f create-s3-sa.yaml
Create S3 Secret and attach to Service Account
Create a secret with your S3 user credentials. KServe reads the secret annotations to inject the S3 environment variables into the storage initializer or model agent to download the models from S3 storage.
Create S3 secret
=== "yaml"
apiVersion: v1
kind: Secret
metadata:
name: s3creds
annotations:
serving.kserve.io/s3-endpoint: s3.amazonaws.com # replace with your s3 endpoint e.g minio-service.kubeflow:9000
serving.kserve.io/s3-usehttps: "1" # by default 1, if testing with minio you can set to 0
serving.kserve.io/s3-region: "us-east-2"
serving.kserve.io/s3-useanoncredential: "false" # omitting this is the same as false, if true will ignore provided credential and use anonymous credentials
type: Opaque
stringData: # use `stringData` for raw credential string or `data` for base64 encoded string
AWS_ACCESS_KEY_ID: XXXX
AWS_SECRET_ACCESS_KEY: XXXXXXXX
Attach secret to a service account
=== "yaml"
apiVersion: v1
kind: ServiceAccount
metadata:
name: sa
secrets:
- name: s3creds
=== "kubectl"
kubectl apply -f create-s3-secret.yaml
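You can verify that the secret and service account are wired up as expected (names follow the manifests above):
kubectl get secret s3creds
kubectl get serviceaccount sa -o yaml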
!!! note
If you are running kserve with istio sidecars enabled, there can be a race condition between the istio proxy being ready and the agent pulling models. This will result in a tcp dial connection refused
error when the agent tries to download from s3.
To resolve it, istio allows the blocking of other containers in a pod until the proxy container is ready.
You can enable this by setting `proxy.holdApplicationUntilProxyStarts: true` in the `istio-sidecar-injector` configmap. The `proxy.holdApplicationUntilProxyStarts` flag was introduced in Istio 1.7 as an experimental feature and is turned off by default.
Deploy the model on S3 with InferenceService
Create the InferenceService with the s3 storageUri
and the service account with s3 credential attached.
=== "New Schema"
```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "mnist-s3"
spec:
predictor:
serviceAccountName: sa
model:
modelFormat:
name: tensorflow
storageUri: "s3://kserve-examples/mnist"
```
=== "Old Schema"
```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "mnist-s3"
spec:
predictor:
serviceAccountName: sa
tensorflow:
storageUri: "s3://kserve-examples/mnist"
```
Apply the mnist-s3.yaml.
=== "kubectl"
kubectl apply -f mnist-s3.yaml
Run a prediction
Now, the ingress can be accessed at ${INGRESS_HOST}:${INGRESS_PORT}
or follow this instruction
to find out the ingress IP and port.
SERVICE_HOSTNAME=$(kubectl get inferenceservice mnist-s3 -o jsonpath='{.status.url}' | cut -d "/" -f 3)
MODEL_NAME=mnist-s3
INPUT_PATH=@./input.json
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d $INPUT_PATH
!!! success "Expected Output"
```{ .bash .no-copy }
Note: Unnecessary use of -X or --request, POST is already inferred.
* Trying 35.237.217.209...
* TCP_NODELAY set
* Connected to mnist-s3.default.35.237.217.209.xip.io (35.237.217.209) port 80 (#0)
> POST /v1/models/mnist-s3:predict HTTP/1.1
> Host: mnist-s3.default.35.237.217.209.xip.io
> User-Agent: curl/7.55.1
> Accept: */*
> Content-Length: 2052
> Content-Type: application/x-www-form-urlencoded
> Expect: 100-continue
>
< HTTP/1.1 100 Continue
* We are completely uploaded and fine
< HTTP/1.1 200 OK
< content-length: 251
< content-type: application/json
< date: Sun, 04 Apr 2021 20:06:27 GMT
< x-envoy-upstream-service-time: 5
< server: istio-envoy
<
* Connection #0 to host mnist-s3.default.35.237.217.209.xip.io left intact
{
"predictions": [
{
"predictions": [0.327352405, 2.00153053e-07, 0.0113353515, 0.203903764, 3.62863029e-05, 0.416683704, 0.000281196437, 8.36911859e-05, 0.0403052084, 1.82206513e-05],
"classes": 5
}
]
}
```
Kafka Sink Transformer
AlexNet Inference
More information about the custom transformer service can be found at this link.
- Implement a custom transformer ./model.py using the KServe API
import os
import argparse
import json

from typing import Dict, Union
from kafka import KafkaProducer
from cloudevents.http import CloudEvent
from cloudevents.conversion import to_structured

from kserve import (
    Model,
    ModelServer,
    model_server,
    logging,
    InferRequest,
    InferResponse,
)

from kserve.logging import logger
from kserve.utils.utils import generate_uuid

kafka_producer = KafkaProducer(
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    bootstrap_servers=os.environ.get('KAFKA_BOOTSTRAP_SERVERS', 'localhost:9092')
)

class ImageTransformer(Model):
    def __init__(self, name: str):
        super().__init__(name, return_response_headers=True)
        self.ready = True


    def preprocess(
        self, payload: Union[Dict, InferRequest], headers: Dict[str, str] = None
    ) -> Union[Dict, InferRequest]:
        logger.info("Received inputs %s", payload)
        logger.info("Received headers %s", headers)
        self.request_trace_key = os.environ.get('REQUEST_TRACE_KEY', 'algo.trace.requestId')
        if self.request_trace_key not in payload:
            logger.error("Request trace key '%s' not found in payload, you cannot trace the prediction result", self.request_trace_key)
            if "instances" not in payload:
                raise ValueError(
                    f"Request trace key '{self.request_trace_key}' not found in payload and 'instances' key is missing."
                )
        else:
            headers[self.request_trace_key] = payload.get(self.request_trace_key)

        return {"instances": payload["instances"]}

    def postprocess(
        self,
        infer_response: Union[Dict, InferResponse],
        headers: Dict[str, str] = None,
        response_headers: Dict[str, str] = None,
    ) -> Union[Dict, InferResponse]:
        logger.info("postprocess headers: %s", headers)
        logger.info("postprocess response headers: %s", response_headers)
        logger.info("postprocess response: %s", infer_response)

        attributes = {
            "source": "data-and-computing/kafka-sink-transformer",
            "type": "org.zhejianglab.zverse.data-and-computing.kafka-sink-transformer",
            "request-host": headers.get('host', 'unknown'),
            "kserve-isvc-name": headers.get('kserve-isvc-name', 'unknown'),
            "kserve-isvc-namespace": headers.get('kserve-isvc-namespace', 'unknown'),
            self.request_trace_key: headers.get(self.request_trace_key, 'unknown'),
        }

        _, cloudevent = to_structured(CloudEvent(attributes, infer_response))
        try:
            kafka_producer.send(os.environ.get('KAFKA_TOPIC', 'test-topic'), value=cloudevent.decode('utf-8').replace("'", '"'))
            kafka_producer.flush()
        except Exception as e:
            logger.error("Failed to send message to Kafka: %s", e)
        return infer_response

parser = argparse.ArgumentParser(parents=[model_server.parser])
args, _ = parser.parse_known_args()

if __name__ == "__main__":
    if args.configure_logging:
        logging.configure_logging(args.log_config_file)
    logging.logger.info("available model name: %s", args.model_name)
    logging.logger.info("all args: %s", args.model_name)
    model = ImageTransformer(args.model_name)
    ModelServer().start([model])
- modify
./pyproject.toml
[tool.poetry]
name = "custom_transformer"
version = "0.15.2"
description = "Custom Transformer Examples. Not intended for use outside KServe Frameworks Images."
authors = ["Dan Sun <dsun20@bloomberg.net>"]
license = "Apache-2.0"
packages = [
{ include = "*.py" }
]
[tool.poetry.dependencies]
python = ">=3.9,<3.13"
kserve = {path = "../kserve", develop = true}
pillow = "^10.3.0"
kafka-python = "^2.2.15"
cloudevents = "^1.11.1"
[[tool.poetry.source]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cpu"
priority = "explicit"
[tool.poetry.group.test]
optional = true
[tool.poetry.group.test.dependencies]
pytest = "^7.4.4"
mypy = "^0.991"
[tool.poetry.group.dev]
optional = true
[tool.poetry.group.dev.dependencies]
black = { version = "~24.3.0", extras = ["colorama"] }
[tool.poetry-version-plugin]
source = "file"
file_path = "../VERSION"
[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
- prepare
../custom_transformer.Dockerfile
ARG PYTHON_VERSION=3.11
ARG BASE_IMAGE=python:${PYTHON_VERSION}-slim-bookworm
ARG VENV_PATH=/prod_venv
FROM ${BASE_IMAGE} AS builder
# Install Poetry
ARG POETRY_HOME=/opt/poetry
ARG POETRY_VERSION=1.8.3
RUN python3 -m venv ${POETRY_HOME} && ${POETRY_HOME}/bin/pip install poetry==${POETRY_VERSION}
ENV PATH="$PATH:${POETRY_HOME}/bin"
# Activate virtual env
ARG VENV_PATH
ENV VIRTUAL_ENV=${VENV_PATH}
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
COPY kserve/pyproject.toml kserve/poetry.lock kserve/
RUN cd kserve && poetry install --no-root --no-interaction --no-cache
COPY kserve kserve
RUN cd kserve && poetry install --no-interaction --no-cache
COPY custom_transformer/pyproject.toml custom_transformer/poetry.lock custom_transformer/
RUN cd custom_transformer && poetry install --no-root --no-interaction --no-cache
COPY custom_transformer custom_transformer
RUN cd custom_transformer && poetry install --no-interaction --no-cache
FROM ${BASE_IMAGE} AS prod
COPY third_party third_party
# Activate virtual env
ARG VENV_PATH
ENV VIRTUAL_ENV=${VENV_PATH}
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN useradd kserve -m -u 1000 -d /home/kserve
COPY --from=builder --chown=kserve:kserve $VIRTUAL_ENV $VIRTUAL_ENV
COPY --from=builder kserve kserve
COPY --from=builder custom_transformer custom_transformer
USER 1000
ENTRYPOINT ["python", "-m", "custom_transformer.model"]
- regenerate poetry.lock
poetry lock --no-update
- build and push custom docker image
cd python
podman build -t docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9 -f custom_transformer.Dockerfile .
podman push docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9
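With the image pushed, the transformer can be placed in front of a predictor as the transformer component of an InferenceService. The manifest below is a hedged sketch rather than a verified deployment: the sklearn predictor, the Kafka bootstrap address, and the topic are placeholder assumptions to replace with your own, while the environment variable names mirror the ones read by model.py above (KAFKA_BOOTSTRAP_SERVERS, KAFKA_TOPIC, REQUEST_TRACE_KEY).
kubectl apply -n kserve-test -f - <<EOF
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-with-transformer   # hypothetical name
spec:
  transformer:
    containers:
      - name: kserve-container
        image: docker-registry.lab.zverse.space/data-and-computing/ay-dev/msg-transformer:dev9
        env:
          - name: KAFKA_BOOTSTRAP_SERVERS
            value: "my-kafka.kafka.svc.cluster.local:9092"  # placeholder Kafka address
          - name: KAFKA_TOPIC
            value: "test-topic"
          - name: REQUEST_TRACE_KEY
            value: "algo.trace.requestId"
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
EOF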
Generative
Subsections of Generative
First Generative Service
B --> C[[Knative Serving]] --> D[Autoscaling / canary rollout]
B --> E[[Istio]] --> F[Traffic management / security]
B --> G[[Storage]] --> H[S3/GCS/PVC]
### Deploy an Inference Service with YAML
```yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
  namespace: kserve-test
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      resources: {}
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
```
Check the InferenceService:
kubectl -n kserve-test get inferenceservices sklearn-iris
kubectl -n istio-system get svc istio-ingressgateway
export INGRESS_HOST=$(minikube ip)
export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
SERVICE_HOSTNAME=$(kubectl -n kserve-test get inferenceservice sklearn-iris -o jsonpath='{.status.url}' | cut -d "/" -f 3)
# http://sklearn-iris.kserve-test.example.com
curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/sklearn-iris:predict" -d @./iris-input.json
How to deploy your own ML model
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: huggingface-llama3
namespace: kserve-test
annotations:
serving.kserve.io/deploymentMode: RawDeployment
serving.kserve.io/autoscalerClass: none
spec:
predictor:
model:
modelFormat:
name: huggingface
storageUri: pvc://llama-3-8b-pvc/hf/8b_instruction_tuned
workerSpec:
pipelineParallelSize: 2
tensorParallelSize: 1
containers:
- name: worker-container
resources:
requests:
nvidia.com/gpu: "8"
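After applying a manifest like this, you can watch the InferenceService and its worker pods come up (GPU scheduling and model download may take a while):
kubectl -n kserve-test get inferenceservice huggingface-llama3
kubectl -n kserve-test get pods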
Canary Policy
KServe supports canary rollouts for inference services. Canary rollouts allow a new version of an InferenceService to receive a percentage of traffic. KServe supports a configurable canary rollout strategy with multiple steps. The rollout strategy can also be implemented to roll back to the previous revision if a rollout step fails.
KServe automatically tracks the last good revision that was rolled out with 100% traffic. The canaryTrafficPercent
field in the component’s spec needs to be set with the percentage of traffic that should be routed to the new revision. KServe will then automatically split the traffic between the last good revision and the revision that is currently being rolled out according to the canaryTrafficPercent
value.
When the first revision of an InferenceService
is deployed, it will receive 100% of the traffic. When multiple revisions are deployed, as in step 2, and the canary rollout strategy is configured to route 10% of the traffic to the new revision, 90% of the traffic will go to the LatestRolledoutRevision
. If there is an unhealthy or bad revision applied, traffic will not be routed to that bad revision. In step 3, the rollout strategy promotes the LatestReadyRevision
from step 2 to the LatestRolledoutRevision
. Since it is now promoted, the LatestRolledoutRevision
gets 100% of the traffic and is fully rolled out. If a rollback needs to happen, 100% of the traffic will be pinned to the previous healthy/good revision, the PreviousRolledoutRevision.
Reference
For more information, see Canary Rollout.
Subsections of Canary Policy
Rollout Example
Create the InferenceService
Follow the First Inference Service
tutorial. Set up a namespace kserve-test
and create an InferenceService.
After rolling out the first model, 100% traffic goes to the initial model with service revision 1.
kubectl -n kserve-test get isvc sklearn-iris
Apply Canary Rollout Strategy
- Add the canaryTrafficPercent field to the predictor component
- Update the storageUri to use a new/updated model.
kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "sklearn-iris"
namespace: kserve-test
spec:
predictor:
canaryTrafficPercent: 10
model:
args: ["--enable_docs_url=True"]
modelFormat:
name: sklearn
resources: {}
runtime: kserve-sklearnserver
storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
EOF
After rolling out the canary model, traffic is split between the latest ready revision 2 and the previously rolled out revision 1.
kubectl -n kserve-test get isvc sklearn-iris
Check the running pods. You should now see two pods running, one for the old model and one for the new model, with 10% of the traffic routed to the new model. Notice that revision 1 contains 0002 in its name, while revision 2 contains 0003.
kubectl get pods
NAME READY STATUS RESTARTS AGE
sklearn-iris-predictor-00002-deployment-c7bb6c685-ktk7r 2/2 Running 0 71m
sklearn-iris-predictor-00003-deployment-8498d947-fpzcg 2/2 Running 0 20m
Run a prediction
Follow the next two steps (Determine the ingress IP and ports and Perform inference) in the First Inference Service tutorial.
Send more requests to the InferenceService
to observe the 10% of traffic that routes to the new revision.
Promote the canary model
If the canary model is healthy/passes your tests,
you can promote it by removing the canaryTrafficPercent
field and re-applying the InferenceService
custom resource with the same name sklearn-iris
kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "sklearn-iris"
namespace: kserve-test
spec:
predictor:
model:
args: ["--enable_docs_url=True"]
modelFormat:
name: sklearn
resources: {}
runtime: kserve-sklearnserver
storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
EOF
Now all traffic goes to revision 2 for the new model.
kubectl get isvc sklearn-iris
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
sklearn-iris http://sklearn-iris.kserve-test.example.com True 100 sklearn-iris-predictor-00002 17m
The pods for revision generation 1 automatically scale down to 0 as they no longer receive traffic.
kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris
NAME READY STATUS RESTARTS AGE
sklearn-iris-predictor-00001-deployment-66c5f5b8d5-gmfvj 1/2 Terminating 0 17m
sklearn-iris-predictor-00002-deployment-5bd9ff46f8-shtzd 2/2 Running 0 15m
Rollback and pin the previous model
You can pin the previous model (model v1, for example) by setting the canaryTrafficPercent
to 0 for the current
model (model v2, for example). This rolls back from model v2 to model v1 and decreases model v2’s traffic to zero.
Apply the custom resource to set model v2’s traffic to 0%.
kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "sklearn-iris"
spec:
predictor:
canaryTrafficPercent: 0
model:
modelFormat:
name: sklearn
storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
EOF
Check the traffic split; now 100% of the traffic goes to the previous good model (model v1) for revision generation 1.
kubectl get isvc sklearn-iris
NAME URL READY PREV LATEST PREVROLLEDOUTREVISION LATESTREADYREVISION AGE
sklearn-iris http://sklearn-iris.kserve-test.example.com True 100 0 sklearn-iris-predictor-00002 sklearn-iris-predictor-00003 18m
The previous revision (model v1) now receives 100% of the traffic, while the new model (model v2) receives 0%.
kubectl get pods -l serving.kserve.io/inferenceservice=sklearn-iris
NAME READY STATUS RESTARTS AGE
sklearn-iris-predictor-00002-deployment-66c5f5b8d5-gmfvj 1/2 Running 0 35s
sklearn-iris-predictor-00003-deployment-5bd9ff46f8-shtzd 2/2 Running 0 16m
Route traffic using a tag
You can enable tag based routing by adding the annotation serving.kserve.io/enable-tag-routing
, so traffic can be
explicitly routed to the canary model (model v2) or the old model (model v1) via a tag in the request URL.
Apply model v2 with canaryTrafficPercent: 10
and serving.kserve.io/enable-tag-routing: "true"
.
kubectl apply -n kserve-test -f - <<EOF
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "sklearn-iris"
annotations:
serving.kserve.io/enable-tag-routing: "true"
spec:
predictor:
canaryTrafficPercent: 10
model:
modelFormat:
name: sklearn
storageUri: "gs://kfserving-examples/models/sklearn/1.0/model-2"
EOF
Check the InferenceService status to get the canary and previous model URL.
kubectl get isvc sklearn-iris -ojsonpath="{.status.components.predictor}" | jq
The output lists the predictor's traffic targets with their tagged URLs. Since we updated the annotation on the InferenceService, model v2 now corresponds to sklearn-iris-predictor-00003.
You can now send the request explicitly to the new model or the previous model by using the tag in the request URL. Use
the curl command
from Perform inference and
add latest-
or prev-
to the model name to send a tag based request.
For example, set the model name and use the following commands to send traffic to each service based on the latest
or prev
tag.
curl the latest revision
MODEL_NAME=sklearn-iris
curl -v -H "Host: latest-${MODEL_NAME}-predictor-.kserve-test.example.com" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d @./iris-input.json
or curl the previous revision
curl -v -H "Host: prev-${MODEL_NAME}-predictor-.kserve-test.example.com" -H "Content-Type: application/json" http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict -d @./iris-input.json
Auto Scaling
Soft Limit
You can configure InferenceService with annotation autoscaling.knative.dev/target
for a soft limit. The soft limit is a targeted value rather than a strictly enforced bound; in particular, it can be exceeded if there is a sudden burst of requests.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "sklearn-iris"
namespace: kserve-test
annotations:
autoscaling.knative.dev/target: "5"
spec:
predictor:
model:
args: ["--enable_docs_url=True"]
modelFormat:
name: sklearn
resources: {}
runtime: kserve-sklearnserver
storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
Hard Limit
You can also configure the InferenceService with the field containerConcurrency to set a hard limit. The hard limit is an enforced upper bound: if concurrency reaches the hard limit, surplus requests will be buffered and must wait until enough capacity is free to execute them.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "sklearn-iris"
namespace: kserve-test
spec:
predictor:
containerConcurrency: 5
model:
args: ["--enable_docs_url=True"]
modelFormat:
name: sklearn
resources: {}
runtime: kserve-sklearnserver
storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
Scale with QPS
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "sklearn-iris"
namespace: kserve-test
spec:
predictor:
scaleTarget: 1
scaleMetric: qps
model:
args: ["--enable_docs_url=True"]
modelFormat:
name: sklearn
resources: {}
runtime: kserve-sklearnserver
storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
Scale with GPU
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "flowers-sample-gpu"
namespace: kserve-test
spec:
predictor:
scaleTarget: 1
scaleMetric: concurrency
model:
modelFormat:
name: tensorflow
storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
runtimeVersion: "2.6.2-gpu"
resources:
limits:
nvidia.com/gpu: 1
Enable Scale To Zero
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "sklearn-iris"
namespace: kserve-test
spec:
predictor:
minReplicas: 0
model:
args: ["--enable_docs_url=True"]
modelFormat:
name: sklearn
resources: {}
runtime: kserve-sklearnserver
storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
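After applying this, you can watch the predictor pods drain to zero once traffic stops and come back on the next request, for example:
kubectl -n kserve-test get pods -l serving.kserve.io/inferenceservice=sklearn-iris -w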
Prepare Concurrent Requests Container
# export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
podman run --rm \
-v /root/kserve/iris-input.json:/tmp/iris-input.json \
--privileged \
-e INGRESS_HOST=$(minikube ip) \
-e INGRESS_PORT=32132 \
-e MODEL_NAME=sklearn-iris \
-e INPUT_PATH=/tmp/iris-input.json \
-e SERVICE_HOSTNAME=sklearn-iris.kserve-test.example.com \
-it m.daocloud.io/docker.io/library/golang:1.22 bash -c "go install github.com/rakyll/hey@latest; bash"
Fire
Send traffic for 30 seconds, maintaining 100 concurrent in-flight requests.
hey -z 30s -c 100 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
Reference
For more information, please refer to the KPA documentation.