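Deploy a TensorFlow Serving Inference Service

Last updated: 2025-08-21

This article describes how to deploy a TensorFlow Serving inference service and how to specify the queue and GPU resources it uses.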

Prerequisites

You have installed the CCE GPU Manager and CCE AI Job Scheduler components. Without them, the cloud-native AI features are unavailable.

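Example procedure

This example uses TensorFlow Serving to show how to deploy an inference service through a Deployment.

Deploy the TensorFlow Serving inference service
- Use the default queue: scheduling.volcano.sh/queue-name: default
- Request 50% of the compute of one GPU card and 10Gi of GPU memory
- Set the scheduler to volcano (required)

Reference YAML: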

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-demo
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-demo
  template:
    metadata:
      annotations:
        scheduling.volcano.sh/queue-name: default
      labels:
        app: gpu-demo
    spec:
      containers:
        - image: registry.baidubce.com/cce-public/tensorflow-serving:demo-gpu
          imagePullPolicy: Always
          name: gpu-demo
          env:
            - name: MODEL_NAME
              value: half_plus_two
          ports:
            - containerPort: 8501
          resources:
            limits:
              cpu: "2"
              memory: 2Gi
              baidu.com/v100_32g_cgpu: "1"
              baidu.com/v100_32g_cgpu_core: "50"
              baidu.com/v100_32g_cgpu_memory: "10"
            requests:
              cpu: "2"
              memory: 2Gi
              baidu.com/v100_32g_cgpu: "1"
              baidu.com/v100_32g_cgpu_core: "50"
              baidu.com/v100_32g_cgpu_memory: "10"
          # If GPU core isolation is enabled, set the following preStop hook for graceful shutdown.
          # Replace `tf_serving_entrypoint.sh` with the name of your GPU process.
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "kill -10 `ps -ef | grep tf_serving_entrypoint.sh | grep -v grep | awk '{print $2}'` && sleep 1"]
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: volcano
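
The requirements listed before the manifest (a queue annotation and the volcano scheduler), together with the rule that GPU limits must match GPU requests, can be checked before a manifest is submitted. The following is a minimal Python sketch, not part of any CCE tooling. It assumes the Deployment's `spec.template` has already been loaded into a plain dict; the function name `check_pod_template` is hypothetical.

```python
QUEUE_ANNOTATION = "scheduling.volcano.sh/queue-name"


def check_pod_template(template: dict) -> list:
    """Return a list of problems found in a pod template dict (empty if none)."""
    problems = []

    # The queue is selected through an annotation on the pod template.
    annotations = template.get("metadata", {}).get("annotations", {})
    if not annotations.get(QUEUE_ANNOTATION):
        problems.append("missing queue annotation " + QUEUE_ANNOTATION)

    # The scheduler must be volcano.
    spec = template.get("spec", {})
    if spec.get("schedulerName") != "volcano":
        problems.append("schedulerName must be volcano")

    for container in spec.get("containers", []):
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        # Compare only the GPU-related keys (those under the baidu.com/ prefix).
        gpu_keys = {k for k in list(requests) + list(limits) if k.startswith("baidu.com/")}
        for key in sorted(gpu_keys):
            if str(requests.get(key)) != str(limits.get(key)):
                problems.append(
                    f"{container.get('name')}: {key} differs between requests and limits"
                )

    return problems
```

An empty list means the template passes these three checks. It says nothing about whether the cluster has enough GPU capacity to schedule the pod.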

Run the following commands to check the status of the workload:

kubectl get deployments
NAME       READY   UP-TO-DATE   AVAILABLE   AGE
gpu-demo   1/1     1            1           30s

kubectl get pod -o wide
NAME                            READY   STATUS      RESTARTS   AGE    IP            NODE           NOMINATED NODE   READINESS GATES
gpu-demo-65767d67cc-xhdgg       1/1     Running     0          63s    172.23.1.86   192.168.48.8   <none>           <none>
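
The pod IP in the IP column is the address used when verifying the service. As a sketch, the same value can be read from the JSON form of the pod list (`kubectl get pod -o json`), where each item carries `status.podIP` and `status.phase`. The function name `running_pod_ips` and the filter on the `app` label are illustrative choices, not part of the original procedure.

```python
import json


def running_pod_ips(pod_list_json: str, app_label: str = "gpu-demo") -> list:
    """Extract the IPs of Running pods whose `app` label equals app_label."""
    pod_list = json.loads(pod_list_json)
    ips = []
    for pod in pod_list.get("items", []):
        labels = pod.get("metadata", {}).get("labels", {})
        status = pod.get("status", {})
        if labels.get("app") == app_label and status.get("phase") == "Running":
            ip = status.get("podIP")
            if ip:  # podIP is absent until the pod has been assigned an address
                ips.append(ip)
    return ips
```

The input string could come from `subprocess.run(["kubectl", "get", "pod", "-o", "json"], capture_output=True, text=True).stdout` on a machine where kubectl is configured for the cluster.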

Verify that the TensorFlow inference service is available:

# Replace 172.23.1.86 with the actual pod IP
curl -d '{"instances": [1.0, 2.0, 5.0]}' -X POST http://172.23.1.86:8501/v1/models/half_plus_two:predict

# The output is similar to the following:
{
    "predictions": [2.5, 3.0, 4.5]
}
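
The same check can be scripted. The sketch below builds the identical POST request with Python's standard library and compares the response against the values the half_plus_two demo model is expected to return (it computes x / 2 + 2). The function names are hypothetical, and the pod IP must be reachable from wherever the script runs.

```python
import json
import urllib.request


def build_predict_request(pod_ip: str, instances: list,
                          model: str = "half_plus_two",
                          port: int = 8501) -> urllib.request.Request:
    """Build the POST request for the TensorFlow Serving REST predict endpoint."""
    url = f"http://{pod_ip}:{port}/v1/models/{model}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


def predict(pod_ip: str, instances: list) -> list:
    """Send the request and return the `predictions` list from the response."""
    request = build_predict_request(pod_ip, instances)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read())["predictions"]


def expected_half_plus_two(instances: list) -> list:
    """Reference values for the demo model, which computes x / 2 + 2."""
    return [x / 2 + 2 for x in instances]
```

For example, `predict("172.23.1.86", [1.0, 2.0, 5.0]) == expected_half_plus_two([1.0, 2.0, 5.0])` should hold when the service is healthy, with the IP replaced by the actual pod IP.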

Queue usage

Specify the queue through annotations:

annotations:
  scheduling.volcano.sh/queue-name: <queue name>

Resource requests

Exclusive use of a single GPU card:

resources:
  requests:
    baidu.com/v100_32g_cgpu: 1  # 1 card
    cpu: "4"
    memory: 6Gi
  limits:
    baidu.com/v100_32g_cgpu: 1  # limits must match requests
    cpu: "4"
    memory: 6Gi

Exclusive use of multiple GPU cards:

resources:
  requests:
    baidu.com/v100_32g_cgpu: 2  # 2 cards
    cpu: "4"
    memory: 6Gi
  limits:
    baidu.com/v100_32g_cgpu: 2  # limits must match requests
    cpu: "4"
    memory: 6Gi

Shared single GPU card (GPU memory isolation only, no compute isolation):

resources:
  requests:
    baidu.com/v100_32g_cgpu: 1  # 1 card
    baidu.com/v100_32g_cgpu_memory: 10  # 10 GB
    cpu: "4"
    memory: 6Gi
  limits:
    baidu.com/v100_32g_cgpu: 1  # limits must match requests
    baidu.com/v100_32g_cgpu_memory: 10
    cpu: "4"
    memory: 6Gi

Shared single GPU card (both GPU memory isolation and compute isolation):

resources:
  requests:
    baidu.com/v100_32g_cgpu: 1  # 1 card
    baidu.com/v100_32g_cgpu_core: 50  # 50%, i.e. half of one card's compute
    baidu.com/v100_32g_cgpu_memory: 10  # 10 GB
    cpu: "4"
    memory: 6Gi
  limits:
    baidu.com/v100_32g_cgpu: 1  # limits must match requests
    baidu.com/v100_32g_cgpu_core: 50
    baidu.com/v100_32g_cgpu_memory: 10
    cpu: "4"
    memory: 6Gi
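
The four request shapes above differ only in the card count and in which optional keys (`_core`, `_memory`) are present, and in every case limits repeat requests. A small helper can generate them so the two sections cannot drift apart. This is a sketch only: `cgpu_resources` is a hypothetical name, and the `_core` / `_memory` suffixes are taken from the v100_32g examples and are assumed to apply to other resource names in the same way.

```python
def cgpu_resources(resource_name: str, cards: int, core_percent=None,
                   gpu_memory=None, cpu: str = "4", memory: str = "6Gi") -> dict:
    """Build a `resources` dict whose limits are identical to its requests.

    resource_name: for example "baidu.com/v100_32g_cgpu"
    core_percent:  optional compute share (50 means 50% of one card)
    gpu_memory:    optional GPU memory amount, as written in the examples
    """
    requests = {resource_name: str(cards), "cpu": cpu, "memory": memory}
    if core_percent is not None:
        requests[resource_name + "_core"] = str(core_percent)
    if gpu_memory is not None:
        requests[resource_name + "_memory"] = str(gpu_memory)
    # Copy so that requests and limits are equal but independent objects.
    return {"requests": requests, "limits": dict(requests)}
```

For instance, `cgpu_resources("baidu.com/v100_32g_cgpu", 1, core_percent=50, gpu_memory=10)` corresponds to the last example, and `cgpu_resources("baidu.com/v100_32g_cgpu", 2)` to the multi-card exclusive one.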

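GPU card models and their resource names

The following GPU models currently support sharing and isolation of GPU memory and compute:

GPU model	Resource name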
Tesla V100-SXM2-16GB	baidu.com/v100_16g_cgpu
Tesla V100-SXM2-32GB	baidu.com/v100_32g_cgpu
Tesla T4	baidu.com/t4_16g_cgpu
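
When manifests are generated by a script, the table above can be kept as a lookup so that an unsupported model fails early instead of producing a resource name the scheduler does not know. A minimal sketch; the names `CGPU_RESOURCE_NAMES` and `resource_name_for` are hypothetical, and the entries are copied from the table.

```python
CGPU_RESOURCE_NAMES = {
    "Tesla V100-SXM2-16GB": "baidu.com/v100_16g_cgpu",
    "Tesla V100-SXM2-32GB": "baidu.com/v100_32g_cgpu",
    "Tesla T4": "baidu.com/t4_16g_cgpu",
}


def resource_name_for(gpu_model: str) -> str:
    """Return the resource name for a GPU model listed in the table."""
    try:
        return CGPU_RESOURCE_NAMES[gpu_model]
    except KeyError:
        # Raise a clearer error than a bare KeyError for models not in the table.
        raise ValueError(f"GPU model not in the supported list: {gpu_model}") from None
```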
