Deploy the TensorFlow Serving inference service
Updated at: 2025-10-27
This document explains how to deploy a TensorFlow Serving inference service, specifying the queue and GPU resources.
Prerequisites
- The CCE GPU Manager and CCE AI Job Scheduler components have been installed; without them, the cloud-native AI features are unavailable.
Operation steps
Using TensorFlow Serving as an example, the following steps demonstrate how to deploy an inference service through a Deployment.
- Deploy the TensorFlow Serving inference service:
- Specify the use of the default queue: scheduling.volcano.sh/queue-name: default
- Request 1 GPU card, with 50% of its computing power and 10 GiB of GPU memory
- The scheduler must be set to volcano
Reference YAML:
YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-demo
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-demo
  template:
    metadata:
      annotations:
        scheduling.volcano.sh/queue-name: default
      labels:
        app: gpu-demo
    spec:
      containers:
      - image: registry.baidubce.com/cce-public/tensorflow-serving:demo-gpu
        imagePullPolicy: Always
        name: gpu-demo
        env:
        - name: MODEL_NAME
          value: half_plus_two
        ports:
        - containerPort: 8501
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
            baidu.com/v100_32g_cgpu: "1"
            baidu.com/v100_32g_cgpu_core: "50"
            baidu.com/v100_32g_cgpu_memory: "10"
          requests:
            cpu: "2"
            memory: 2Gi
            baidu.com/v100_32g_cgpu: "1"
            baidu.com/v100_32g_cgpu_core: "50"
            baidu.com/v100_32g_cgpu_memory: "10"
        # If GPU core isolation is enabled, set the following preStop hook for a graceful shutdown.
        # Replace `tf_serving_entrypoint.sh` with the name of your GPU process.
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "kill -10 `ps -ef | grep tf_serving_entrypoint.sh | grep -v grep | awk '{print $2}'` && sleep 1"]
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: volcano
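Save the manifest to a file and apply it to the cluster; the file name gpu-demo.yaml below is only an example:
Shell
# Create the Deployment from the manifest above
kubectl apply -f gpu-demo.yaml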
- Run the following commands to check the running status of the workload:
Shell
kubectl get deployments
NAME       READY   UP-TO-DATE   AVAILABLE   AGE
gpu-demo   1/1     1            1           30s
kubectl get pod -o wide
NAME                        READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
gpu-demo-65767d67cc-xhdgg   1/1     Running   0          63s   172.23.1.86   192.168.48.8   <none>           <none>
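Optionally, confirm that the GPU share was allocated as requested by inspecting the pod; the pod name is taken from the output above:
Shell
# Show the container's resource limits as recorded on the pod
kubectl describe pod gpu-demo-65767d67cc-xhdgg | grep -A 6 "Limits:"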
- Verify that the TensorFlow inference service is available:
Shell
# Replace 172.23.1.86 with the actual pod IP
curl -d '{"instances": [1.0, 2.0, 5.0]}' -X POST http://172.23.1.86:8501/v1/models/half_plus_two:predict
# The half_plus_two demo model returns x * 0.5 + 2 for each instance, so the output is similar to:
{
    "predictions": [2.5, 3.0, 4.5]
}
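If the pod IP is not reachable from your machine (for example, when testing from outside the cluster network), a port-forward reaches the same endpoint; a minimal sketch against the Deployment created above:
Shell
# Forward local port 8501 to the serving container, then query via localhost
kubectl port-forward deployment/gpu-demo 8501:8501 &
curl -d '{"instances": [1.0, 2.0, 5.0]}' -X POST http://127.0.0.1:8501/v1/models/half_plus_two:predict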
Queue usage instructions
Specify a queue via annotations
YAML
annotations:
  scheduling.volcano.sh/queue-name: <Queue Name>
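A queue other than default must exist before workloads reference it. The following is a minimal sketch of a custom queue, assuming the cluster exposes the standard Volcano Queue CRD; the name inference-queue and the capacity values are illustrative:
YAML
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: inference-queue      # hypothetical queue name
spec:
  weight: 1                  # relative share when the cluster is oversubscribed
  capability:                # optional hard cap on the queue's total resources
    cpu: "16"
    memory: 32Gi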
Resource request
Exclusive use example (single GPU)
YAML
resources:
  requests:
    baidu.com/v100_32g_cgpu: 1   # 1 GPU card
    cpu: "4"
    memory: 6Gi
  limits:
    baidu.com/v100_32g_cgpu: 1   # limit and request must be the same
    cpu: "4"
    memory: 6Gi
Exclusive use example (multiple GPUs)
YAML
resources:
  requests:
    baidu.com/v100_32g_cgpu: 2   # 2 GPU cards
    cpu: "4"
    memory: 6Gi
  limits:
    baidu.com/v100_32g_cgpu: 2   # limit and request must be the same
    cpu: "4"
    memory: 6Gi
Single-GPU sharing example (memory isolation only, no computing-power isolation):
YAML
resources:
  requests:
    baidu.com/v100_32g_cgpu: 1           # 1 GPU card
    baidu.com/v100_32g_cgpu_memory: 10   # 10 GiB of GPU memory
    cpu: "4"
    memory: 6Gi
  limits:
    baidu.com/v100_32g_cgpu: 1           # limit and request must be the same
    baidu.com/v100_32g_cgpu_memory: 10
    cpu: "4"
    memory: 6Gi
Single-GPU sharing example (isolation for both memory and computing power):
YAML
resources:
  requests:
    baidu.com/v100_32g_cgpu: 1           # 1 GPU card
    baidu.com/v100_32g_cgpu_core: 50     # 50% (i.e., 0.5 of a card's computing power)
    baidu.com/v100_32g_cgpu_memory: 10   # 10 GiB of GPU memory
    cpu: "4"
    memory: 6Gi
  limits:
    baidu.com/v100_32g_cgpu: 1           # limit and request must be the same
    baidu.com/v100_32g_cgpu_core: 50
    baidu.com/v100_32g_cgpu_memory: 10
    cpu: "4"
    memory: 6Gi
Mapping between GPU card types and resource names
Currently, the following types of GPUs support sharing and isolation of memory and computing power:
| GPU model | Resource name |
|---|---|
| Tesla V100-SXM2-16GB | baidu.com/v100_16g_cgpu |
| Tesla V100-SXM2-32GB | baidu.com/v100_32g_cgpu |
| Tesla T4 | baidu.com/t4_16g_cgpu |
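For example, to share a Tesla T4 instead of a V100-32GB, swap in the resource name from the table. The sketch below assumes the _core and _memory suffixes follow the same pattern as the V100 examples above; the 8 GiB value is illustrative:
YAML
resources:
  requests:
    baidu.com/t4_16g_cgpu: 1          # 1 T4 card
    baidu.com/t4_16g_cgpu_memory: 8   # 8 GiB of GPU memory (illustrative)
  limits:
    baidu.com/t4_16g_cgpu: 1          # limit and request must be the same
    baidu.com/t4_16g_cgpu_memory: 8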
