Deploy the TensorFlow Serving inference service
Updated at: 2025-10-27
This document explains how to deploy a TensorFlow Serving inference service, specifying the queue and GPU resources.
Prerequisites
- The CCE GPU Manager and CCE AI Job Scheduler components have been installed; without them, the cloud-native AI features are unavailable.
Operation steps
Using TensorFlow Serving as an example, the following steps demonstrate how to deploy an inference service through a Deployment.
- Deploy the TensorFlow Serving inference service:
- Specify the use of the default queue: scheduling.volcano.sh/queue-name: default
- Request 1 GPU card, with 50% of its computing power and 10 GiB of GPU memory
- The scheduler must be set to volcano
Reference YAML:
YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-demo
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-demo
  template:
    metadata:
      annotations:
        scheduling.volcano.sh/queue-name: default
      labels:
        app: gpu-demo
    spec:
      containers:
      - image: registry.baidubce.com/cce-public/tensorflow-serving:demo-gpu
        imagePullPolicy: Always
        name: gpu-demo
        env:
        - name: MODEL_NAME
          value: half_plus_two
        ports:
        - containerPort: 8501
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
            baidu.com/v100_32g_cgpu: "1"
            baidu.com/v100_32g_cgpu_core: "50"
            baidu.com/v100_32g_cgpu_memory: "10"
          requests:
            cpu: "2"
            memory: 2Gi
            baidu.com/v100_32g_cgpu: "1"
            baidu.com/v100_32g_cgpu_core: "50"
            baidu.com/v100_32g_cgpu_memory: "10"
        # If GPU core isolation is enabled, set the following preStop hook for a graceful shutdown.
        # Replace `tf_serving_entrypoint.sh` with the name of your GPU process.
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "kill -10 `ps -ef | grep tf_serving_entrypoint.sh | grep -v grep | awk '{print $2}'` && sleep 1"]
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: volcano
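Save the manifest to a file and apply it to the cluster; the file name gpu-demo.yaml below is only an example:
Shell
# Create the Deployment from the manifest above
kubectl apply -f gpu-demo.yaml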
- Run the following commands to check the running status of the workload:
Shell
kubectl get deployments
NAME       READY   UP-TO-DATE   AVAILABLE   AGE
gpu-demo   1/1     1            1           30s
kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
gpu-demo-65767d67cc-xhdgg   1/1     Running   0          63s   172.23.1.86   192.168.48.8   <none>           <none>
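Optionally, confirm that the GPU share was allocated as requested by inspecting the pod; the pod name is taken from the output above:
Shell
# Show the container's resource limits as recorded on the pod
kubectl describe pod gpu-demo-65767d67cc-xhdgg | grep -A 6 "Limits:"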
- Verify that the TensorFlow inference service is available:
Shell
# Replace 172.23.1.86 with the actual pod IP
curl -d '{"instances": [1.0, 2.0, 5.0]}' -X POST http://172.23.1.86:8501/v1/models/half_plus_two:predict
# The half_plus_two demo model returns x * 0.5 + 2 for each instance, so the output is similar to:
{
    "predictions": [2.5, 3.0, 4.5]
}
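If the pod IP is not reachable from your machine (for example, when testing from outside the cluster network), a port-forward reaches the same endpoint; a minimal sketch against the Deployment created above:
Shell
# Forward local port 8501 to the serving container, then query via localhost
kubectl port-forward deployment/gpu-demo 8501:8501 &
curl -d '{"instances": [1.0, 2.0, 5.0]}' -X POST http://127.0.0.1:8501/v1/models/half_plus_two:predict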
Queue usage instructions
Specify a queue via annotations
YAML
annotations:
  scheduling.volcano.sh/queue-name: <Queue Name>
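A queue other than default must exist before workloads reference it. The following is a minimal sketch of a custom queue, assuming the cluster exposes the standard Volcano Queue CRD; the name inference-queue and the capacity values are illustrative:
YAML
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: inference-queue      # hypothetical queue name
spec:
  weight: 1                  # relative share when the cluster is oversubscribed
  capability:                # optional hard cap on the queue's total resources
    cpu: "16"
    memory: 32Gi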
Resource request
Exclusive use example (single GPU)
YAML
resources:
  requests:
    baidu.com/v100_32g_cgpu: 1   # 1 GPU card
    cpu: "4"
    memory: 6Gi
  limits:
    baidu.com/v100_32g_cgpu: 1   # limit and request must be the same
    cpu: "4"
    memory: 6Gi
Exclusive use example (multiple GPUs)
YAML
resources:
  requests:
    baidu.com/v100_32g_cgpu: 2   # 2 GPU cards
    cpu: "4"
    memory: 6Gi
  limits:
    baidu.com/v100_32g_cgpu: 2   # limit and request must be the same
    cpu: "4"
    memory: 6Gi
Single-GPU sharing example (memory isolation only, no computing-power isolation):
YAML
resources:
  requests:
    baidu.com/v100_32g_cgpu: 1           # 1 GPU card
    baidu.com/v100_32g_cgpu_memory: 10   # 10 GiB of GPU memory
    cpu: "4"
    memory: 6Gi
  limits:
    baidu.com/v100_32g_cgpu: 1           # limit and request must be the same
    baidu.com/v100_32g_cgpu_memory: 10
    cpu: "4"
    memory: 6Gi
Single-GPU sharing example (isolation for both memory and computing power):
YAML
resources:
  requests:
    baidu.com/v100_32g_cgpu: 1           # 1 GPU card
    baidu.com/v100_32g_cgpu_core: 50     # 50% (i.e., 0.5 of a card's computing power)
    baidu.com/v100_32g_cgpu_memory: 10   # 10 GiB of GPU memory
    cpu: "4"
    memory: 6Gi
  limits:
    baidu.com/v100_32g_cgpu: 1           # limit and request must be the same
    baidu.com/v100_32g_cgpu_core: 50
    baidu.com/v100_32g_cgpu_memory: 10
    cpu: "4"
    memory: 6Gi
Mapping between GPU card types and resource names
Currently, the following types of GPUs support sharing and isolation of memory and computing power:
| GPU model | Resource name |
|---|---|
| Tesla V100-SXM2-16GB | baidu.com/v100_16g_cgpu |
| Tesla V100-SXM2-32GB | baidu.com/v100_32g_cgpu |
| Tesla T4 | baidu.com/t4_16g_cgpu |
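For example, to share a Tesla T4 instead of a V100-32GB, swap in the resource name from the table. The sketch below assumes the _core and _memory suffixes follow the same pattern as the V100 examples above; the 8 GiB value is illustrative:
YAML
resources:
  requests:
    baidu.com/t4_16g_cgpu: 1          # 1 T4 card
    baidu.com/t4_16g_cgpu_memory: 8   # 8 GiB of GPU memory (illustrative)
  limits:
    baidu.com/t4_16g_cgpu: 1          # limit and request must be the same
    baidu.com/t4_16g_cgpu_memory: 8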
