CCE Support for GPUSharing Clusters
Last updated: 2025-08-21
Introduction to K8S GPUSharing
K8S GPU scheduling based on nvidia-device-plugin normally uses a whole GPU card as the minimum granularity, so each Pod is bound to at least one card. This provides good isolation, but it falls short in the following scenarios:
- In AI development and inference, GPU utilization is low; letting multiple Pods mount the same card improves utilization.
- A K8S cluster may mix several types of GPU cards whose compute power differs significantly, so scheduling needs to take the card type into account.
For these reasons, CCE has opened up its internal KongMing GPUSharing solution and provides the GPUSharing feature, which supports both sharing one GPU card among multiple Pods and scheduling by card type.
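As a quick illustration of the resource model, each card type exposes a <type>_cgpu_core / <type>_cgpu_memory resource pair. The snippet below is a hypothetical container resources fragment, assuming core is on the 0-100 scale set by --core=100 and memory is counted in GiB per --memory-unit=GiB in the configuration later in this document:
YAML
resources:
  limits:
    baidu.com/v100_cgpu_core: 50     # roughly half of one V100's compute (0-100 scale)
    baidu.com/v100_cgpu_memory: 8    # 8 GiB of GPU memory on the same card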
Using GPUSharing in CCE
Creating a New Cluster
CCE supports creating a GPUSharing cluster directly. Go through the normal cluster creation flow and select all parameters, then switch to the "custom cluster configuration" mode before submitting:

Change clusterType to gpuShare and submit the creation request:
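For reference, the change boils down to a single field. The fragment below is a minimal sketch: only the clusterType field and its gpuShare value come from this document, and the rest of the generated configuration should be left unchanged:
JSON
{
    "clusterType": "gpuShare"
}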

Note: GPUSharing-type clusters will be supported directly in a later release, which will make this even more convenient.
Existing Clusters
For an existing cluster, you can modify the component configuration yourself as described below; it is recommended to back up the configuration first. All of the following operations are performed on the Master machines, and only custom-type clusters are supported.
Deploy the extender-scheduler
Modify the /etc/kubernetes/scheduler-policy.json configuration
Back up the existing configuration:
Bash
$ cp /etc/kubernetes/scheduler-policy.json /etc/kubernetes/scheduler-policy.json.bak
Edit scheduler-policy.json. The configuration below covers common GPU card types such as V100, K40, P40, and P4; adjust it to your environment as needed (see the note after the listing for adding other card types):
JSON
{
    "kind": "Policy",
    "apiVersion": "v1",
    "predicates": [{"name":"PodFitsHostPorts"},{"name":"PodFitsResources"},{"name":"NoDiskConflict"},{"name":"CheckVolumeBinding"},{"name":"NoVolumeZoneConflict"},{"name":"MatchNodeSelector"},{"name":"HostName"}],
    "priorities": [{"name":"ServiceSpreadingPriority","weight":1},{"name":"EqualPriority","weight":1},{"name":"LeastRequestedPriority","weight":1},{"name":"BalancedResourceAllocation","weight":1}],
    "extenders":[
        {
            "urlPrefix":"http://127.0.0.1:39999/gpushare-scheduler",
            "filterVerb":"filter",
            "bindVerb":"bind",
            "enableHttps":false,
            "nodeCacheCapable":true,
            "ignorable":false,
            "managedResources":[
                {
                    "name":"baidu.com/v100_cgpu_memory",
                    "ignoredByScheduler":false
                },
                {
                    "name":"baidu.com/v100_cgpu_core",
                    "ignoredByScheduler":false
                },
                {
                    "name":"baidu.com/k40_cgpu_memory",
                    "ignoredByScheduler":false
                },
                {
                    "name":"baidu.com/k40_cgpu_core",
                    "ignoredByScheduler":false
                },
                {
                    "name":"baidu.com/p40_cgpu_memory",
                    "ignoredByScheduler":false
                },
                {
                    "name":"baidu.com/p40_cgpu_core",
                    "ignoredByScheduler":false
                },
                {
                    "name":"baidu.com/p4_cgpu_memory",
                    "ignoredByScheduler":false
                },
                {
                    "name":"baidu.com/p4_cgpu_core",
                    "ignoredByScheduler":false
                }
            ]
        }
    ],
    "hardPodAffinitySymmetricWeight": 10
}
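If your nodes carry a card type that is not listed above (for example T4, which appears in the node resource output later in this document), append the matching memory/core pair to managedResources. The entries below are a sketch that simply follows the same naming pattern:
JSON
{
    "name":"baidu.com/t4_cgpu_memory",
    "ignoredByScheduler":false
},
{
    "name":"baidu.com/t4_cgpu_core",
    "ignoredByScheduler":false
}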
Modify the /etc/systemd/system/kube-extender-scheduler.service configuration
Plain Text
[Unit]
Description=Kubernetes Extender Scheduler
After=network.target
After=kube-apiserver.service
After=kube-scheduler.service

[Service]
Environment=KUBECONFIG=/etc/kubernetes/admin.conf

ExecStart=/opt/kube/bin/kube-extender-scheduler \
  --logtostderr \
  --policy-config-file=/etc/kubernetes/scheduler-policy.json \
  --mps=false \
  --core=100 \
  --health-check=true \
  --memory-unit=GiB \
  --mem-quota-env-name=GPU_MEMORY \
  --compute-quota-env-name=GPU_COMPUTATION \
  --v=6
Restart=always
Type=simple
LimitNOFILE=65536

[Install]
WantedBy=multi-user.target
Deploy the extender-scheduler binary
Binary download addresses for each region:
- Beijing: http://baidu-container.bj.bcebos.com/packages/gpu-extender/nvidia-share-extender-scheduler
- Guangzhou: http://baidu-container-gz.gz.bcebos.com/packages/gpu-extender/nvidia-share-extender-scheduler
- Suzhou: http://baidu-container-su.su.bcebos.com/packages/gpu-extender/nvidia-share-extender-scheduler
- Baoding: http://baidu-container-bd.bd.bcebos.com/packages/gpu-extender/nvidia-share-extender-scheduler
- Hong Kong: http://baidu-container-hk.hkg.bcebos.com/packages/gpu-extender/nvidia-share-extender-scheduler
- Wuhan: http://baidu-container-whgg.fwh.bcebos.com/packages/gpu-extender/nvidia-share-extender-scheduler
Download the binary (the Beijing address is used as an example):
Bash
$ wget -q -O /opt/kube/bin/kube-extender-scheduler http://baidu-container.bj.bcebos.com/packages/gpu-extender/nvidia-share-extender-scheduler
Start the extender-scheduler service:
Bash
$ chmod +x /opt/kube/bin/kube-extender-scheduler

$ systemctl daemon-reload

$ systemctl enable kube-extender-scheduler.service

$ systemctl restart kube-extender-scheduler.service
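You can confirm the service is up and watch its logs with standard systemd commands (the extender itself serves the urlPrefix configured in scheduler-policy.json above):
Bash
$ systemctl status kube-extender-scheduler.service

$ journalctl -u kube-extender-scheduler.service -f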
Restart the scheduler
Bash
$ systemctl restart kube-scheduler.service
Masters usually run as 3 replicas; repeat the steps above on each Master in turn.
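On each Master, you can quickly confirm that kube-scheduler came back up after the restart (standard systemd/journal commands):
Bash
$ systemctl status kube-scheduler.service

$ journalctl -u kube-scheduler.service -n 50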
Deploy the device-plugin
Back up the existing nvidia-device-plugin and then delete it, since it should not run alongside the GPUSharing device plugin:
Bash
$ kubectl get ds nvidia-device-plugin-daemonset -n kube-system -o yaml > nvidia-device-plugin.yaml

$ kubectl delete ds nvidia-device-plugin-daemonset -n kube-system
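Before continuing, you can double-check that the old DaemonSet is gone (the command should return nothing):
Bash
$ kubectl get ds -n kube-system | grep nvidia-device-plugin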
Deploy kongming-device-plugin. The all-in-one YAML is as follows (apply it with the command shown after the listing):
YAML
# RBAC authn and authz
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cce-gpushare-device-plugin
  namespace: kube-system
  labels:
    k8s-app: cce-gpushare-device-plugin
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cce-gpushare-device-plugin
  labels:
    k8s-app: cce-gpushare-device-plugin
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
rules:
  - apiGroups:
      - ""
    resources:
      - nodes
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - events
    verbs:
      - create
      - patch
  - apiGroups:
      - ""
    resources:
      - pods
    verbs:
      - update
      - patch
      - get
      - list
      - watch
  - apiGroups:
      - ""
    resources:
      - nodes/status
    verbs:
      - patch
      - update

---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: kube-system
  name: cce-gpushare-device-plugin
  labels:
    k8s-app: cce-gpushare-device-plugin
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
subjects:
  - kind: ServiceAccount
    name: cce-gpushare-device-plugin
    namespace: kube-system
    apiGroup: ""
roleRef:
  kind: ClusterRole
  name: cce-gpushare-device-plugin
  apiGroup: ""

---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  namespace: kube-system
  name: cce-gpushare-device-plugin
  labels:
    app: cce-gpushare-device-plugin
spec:
  updateStrategy:
    type: RollingUpdate
  selector:
    matchLabels:
      app: cce-gpushare-device-plugin
  template:
    metadata:
      labels:
        app: cce-gpushare-device-plugin
    spec:
      serviceAccountName: cce-gpushare-device-plugin
      nodeSelector:
        beta.kubernetes.io/instance-type: GPU
      containers:
        - name: cce-gpushare-device-plugin
          image: hub.baidubce.com/jpaas-public/cce-nvidia-share-device-plugin:v0
          imagePullPolicy: Always
          args:
            - --logtostderr
            - --mps=false
            - --core=100
            - --health-check=true
            - --memory-unit=GiB
            - --mem-quota-env-name=GPU_MEMORY
            - --compute-quota-env-name=GPU_COMPUTATION
            - --gpu-type=baidu.com/gpu_k40_4,baidu.com/gpu_k40_16,baidu.com/gpu_p40_8,baidu.com/gpu_v100_8,baidu.com/gpu_p4_4
            - --v=1
          resources:
            limits:
              memory: "300Mi"
              cpu: "1"
            requests:
              memory: "300Mi"
              cpu: "1"
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
      dnsPolicy: ClusterFirst
      hostNetwork: true
      restartPolicy: Always
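Save the listing above to a file (cce-gpushare-device-plugin.yaml is just an illustrative name) and apply it, then check that the DaemonSet Pods start on the GPU nodes:
Bash
$ kubectl apply -f cce-gpushare-device-plugin.yaml

$ kubectl get pods -n kube-system -l app=cce-gpushare-device-plugin -o wide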
Check Node Resources
Run kubectl get node -o yaml; you should see the new GPU resources on the node (a quicker check is shown after the output):
Plain Text
allocatable:
  baidu.com/gpu-count: "1"
  baidu.com/t4_cgpu_core: "100"
  baidu.com/t4_cgpu_memory: "14"
  cpu: 23870m
  ephemeral-storage: "631750310891"
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: "65813636449"
  pods: "256"
capacity:
  baidu.com/gpu-count: "1"
  baidu.com/t4_cgpu_core: "100"
  baidu.com/t4_cgpu_memory: "14"
  cpu: "24"
  ephemeral-storage: 685492960Ki
  hugepages-1Gi: "0"
  hugepages-2Mi: "0"
  memory: 74232212Ki
  pods: "256"
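For a quicker look at just the GPUSharing resources on a given node (<node-name> is a placeholder), a simple grep over the node description works:
Bash
$ kubectl describe node <node-name> | grep cgpu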
Submit a Test Task
Submit a test task (this example requests one tenth of a T4's compute and 2 GiB of its memory):
YAML
apiVersion: v1
kind: ReplicationController
metadata:
  name: paddlebook
spec:
  replicas: 1
  selector:
    app: paddlebook
  template:
    metadata:
      name: paddlebook
      labels:
        app: paddlebook
    spec:
      containers:
        - name: paddlebook
          image: hub.baidubce.com/cce/tensorflow:gpu-benckmarks
          command: ["/bin/sh", "-c", "sleep 3600"]
          #command: ["/bin/sh", "-c", "python /root/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --num_gpus=1 --batch_size=32 --model=resnet50 --variable_update=parameter_server"]
          resources:
            requests:
              baidu.com/t4_cgpu_core: 10
              baidu.com/t4_cgpu_memory: 2
            limits:
              baidu.com/t4_cgpu_core: 10
              baidu.com/t4_cgpu_memory: 2
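Once the Pod is running, you can check that the share quotas reached the container. Given the --mem-quota-env-name=GPU_MEMORY and --compute-quota-env-name=GPU_COMPUTATION flags configured above, the quotas are expected to show up as those environment variables inside the container (this is an assumption based on the flag names; the Pod name below is a placeholder):
Bash
$ kubectl get pods -l app=paddlebook

$ kubectl exec <paddlebook-pod-name> -- env | grep -E "GPU_MEMORY|GPU_COMPUTATION"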
