Auto Scaling
Soft Limit
You can configure an InferenceService with the annotation autoscaling.knative.dev/target to set a soft limit. The soft limit is a targeted value rather than a strictly enforced bound; in particular, during a sudden burst of requests it can be exceeded.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
  namespace: kserve-test
  annotations:
    autoscaling.knative.dev/target: "5"
spec:
  predictor:
    model:
      args: ["--enable_docs_url=True"]
      modelFormat:
        name: sklearn
      resources: {}
      runtime: kserve-sklearnserver
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
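To try it out, save the manifest and wait for the InferenceService to report ready; a minimal sketch, where the file name soft-limit.yaml is just an assumed placeholder for the manifest above:
# apply the manifest and block until the predictor is Ready
kubectl apply -f soft-limit.yaml
kubectl -n kserve-test wait --for=condition=Ready inferenceservice/sklearn-iris --timeout=5m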
Hard Limit
You can also configure an InferenceService with the field containerConcurrency to set a hard limit. The hard limit is an enforced upper bound: if concurrency reaches the hard limit, surplus requests are buffered and must wait until enough capacity is free to execute them.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
  namespace: kserve-test
spec:
  predictor:
    containerConcurrency: 5
    model:
      args: ["--enable_docs_url=True"]
      modelFormat:
        name: sklearn
      resources: {}
      runtime: kserve-sklearnserver
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
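The hard limit is copied into the generated Knative revision's spec.containerConcurrency. A quick sketch to verify it propagated, assuming the serving.kserve.io/inferenceservice label is present on the revision (this may differ across KServe versions):
# print the containerConcurrency of the predictor's revision(s)
kubectl -n kserve-test get revision -l serving.kserve.io/inferenceservice=sklearn-iris -o jsonpath='{.items[*].spec.containerConcurrency}'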
Scale with QPS
You can also scale on request rate rather than concurrency by setting scaleMetric to qps together with a scaleTarget, here one request per second per replica.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
  namespace: kserve-test
spec:
  predictor:
    scaleTarget: 1
    scaleMetric: qps
    model:
      args: ["--enable_docs_url=True"]
      modelFormat:
        name: sklearn
      resources: {}
      runtime: kserve-sklearnserver
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
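Once the load-test container from the sections below is running, hey's -q flag (queries per second per worker) can drive a controlled request rate against this target; a sketch assuming the same environment variables as the Fire section:
# ~50 QPS total: 5 workers, each rate-limited to 10 QPS
hey -z 30s -c 5 -q 10 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict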
Scale with GPU
GPU utilization is hard to autoscale on directly, but concurrency-based scaling works just as well for GPU-backed predictors.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "flowers-sample-gpu"
  namespace: kserve-test
spec:
  predictor:
    scaleTarget: 1
    scaleMetric: concurrency
    model:
      modelFormat:
        name: tensorflow
      storageUri: "gs://kfserving-examples/models/tensorflow/flowers"
      runtimeVersion: "2.6.2-gpu"
      resources:
        limits:
          nvidia.com/gpu: 1
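The predictor requests one GPU, so its pods only schedule onto nodes that advertise nvidia.com/gpu (for example via the NVIDIA device plugin). A quick sketch to confirm which nodes expose GPUs:
# list each node with its allocatable GPU count (empty means none)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'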
Enable Scale To Zero
Setting minReplicas to 0 allows the autoscaler to remove all predictor pods when the service receives no traffic.
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
  namespace: kserve-test
spec:
  predictor:
    minReplicas: 0
    model:
      args: ["--enable_docs_url=True"]
      modelFormat:
        name: sklearn
      resources: {}
      runtime: kserve-sklearnserver
      storageUri: "gs://kfserving-examples/models/sklearn/1.0/model"
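Once applied, the KPA scales the predictor down to zero after its stable window (60 seconds by default) passes with no traffic; the next request then pays a cold-start penalty while a pod is recreated. You can watch the pods drain:
# predictor pods terminate once the idle window elapses
kubectl get pods -n kserve-test -w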
Prepare a Container for Concurrent Requests
The container below installs the hey load generator. The commented export shows how to discover the Istio ingress NodePort instead of hard-coding it.
# export INGRESS_PORT=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.spec.ports[?(@.name=="http2")].nodePort}')
podman run --rm \
  -v /root/kserve/iris-input.json:/tmp/iris-input.json \
  --privileged \
  -e INGRESS_HOST=$(minikube ip) \
  -e INGRESS_PORT=32132 \
  -e MODEL_NAME=sklearn-iris \
  -e INPUT_PATH=/tmp/iris-input.json \
  -e SERVICE_HOSTNAME=sklearn-iris.kserve-test.example.com \
  -it m.daocloud.io/docker.io/library/golang:1.22 bash -c "go install github.com/rakyll/hey@latest; bash"
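The run mounts /root/kserve/iris-input.json into the container. If you have not created it yet, a minimal payload for the sklearn iris model (two four-feature instances) looks like this:
# two rows of sepal/petal measurements in the V1 "instances" format
cat <<EOF > /root/kserve/iris-input.json
{
  "instances": [
    [6.8, 2.8, 4.8, 1.4],
    [6.0, 3.4, 4.5, 1.6]
  ]
}
EOF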
Fire
Send traffic in a 30-second spurt, maintaining 100 in-flight requests. With the soft limit of 5 set above, the autoscaler should scale the predictor out to on the order of 100 / 5 = 20 pods.
hey -z 30s -c 100 -m POST -host ${SERVICE_HOSTNAME} -D $INPUT_PATH http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/$MODEL_NAME:predict
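While hey runs, you can watch the autoscaler's decision from a second terminal; Knative's PodAutoscaler resource reports desired versus actual scale (exact columns vary by Knative version):
# shows DESIREDSCALE / ACTUALSCALE for the predictor revision
kubectl get podautoscaler -n kserve-test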
Reference
For more information, please refer to the Knative Pod Autoscaler (KPA) documentation.