CCE GPU Manager Description

Updated at: 2025-10-27

Component introduction

CCE GPU Manager is a bundle of GPU device plugins that, paired with a compatible scheduler, provides GPU resource scheduling for complex scenarios. The component supports isolation-optimized mode, which lets multiple workloads share a GPU while keeping their computing power and memory isolated.

Component function

Topology allocation: Enables a GPU topology allocation function. When more than one GPU card is assigned to a Pod, the system automatically chooses the fastest topology connection mode to allocate GPU devices.
GPU sharing: Provides the option to enable memory sharing for GPU devices on a node and supports allocating GPU cards to multiple Pods based on memory size.
Memory and computing power isolation: Ensures isolation of memory and computing power when multiple Pods share a single GPU card.
Fine-grained scheduling: When enabled, you can select specific GPU models during the creation of queues and tasks. When disabled, only quota input is allowed while creating queues and containers, and specific GPU models cannot be selected.
Encoding/decoding instances: Submit encoding/decoding tasks using the independent encoding/decoding units of GPUs for hardware-based encoding/decoding.
For detailed component usage instructions, please refer to: GPU Exclusive and Shared Instructions
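
As a rough illustration of the GPU sharing function described above, the following sketch places Pods on GPU cards according to the memory each Pod requests, in GiB. It is a simplified first-fit model written only for this page: the `GpuCard` class, the `place_pod` function, the card sizes, and the Pod names are all hypothetical, and the actual scheduler's placement strategy is not documented here and may differ.

```python
from dataclasses import dataclass, field


@dataclass
class GpuCard:
    """One physical GPU card with a fixed amount of memory in GiB (hypothetical model)."""
    index: int
    total_gib: int
    allocations: dict = field(default_factory=dict)  # Pod name -> GiB reserved

    @property
    def free_gib(self) -> int:
        return self.total_gib - sum(self.allocations.values())


def place_pod(cards, pod_name, request_gib):
    """Reserve memory on the first card with enough free memory (first-fit).

    Returns the index of the chosen card, or None if no card can hold the request.
    This is an assumed strategy for illustration, not the component's real algorithm.
    """
    for card in cards:
        if card.free_gib >= request_gib:
            card.allocations[pod_name] = request_gib
            return card.index
    return None


# Two 16 GiB cards shared by four Pods with different memory requests.
cards = [GpuCard(0, 16), GpuCard(1, 16)]
requests = [("pod-a", 10), ("pod-b", 4), ("pod-c", 8), ("pod-d", 12)]
placements = {name: place_pod(cards, name, gib) for name, gib in requests}
print(placements)
```

In this example, pod-a and pod-b share card 0, pod-c lands on card 1, and pod-d cannot be placed because neither card has 12 GiB free. Without sharing, the same four Pods would need four whole cards.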

Application scenarios

When GPU applications run in CCE clusters, dedicating an entire GPU card to each workload wastes resources in scenarios such as AI training. This component lets workloads share cards, which improves resource utilization and reduces costs.

Limitations

Supports only Kubernetes clusters of version v1.18 and later.
This component currently depends on the CCE AI Job Scheduler. Install both components together; otherwise, the functions of this component may be unavailable.
GPU-shared virtualization supports the mainstream GPU CUDA and driver versions listed below. Isolation-optimized mode places additional requirements on the OS kernel version. To request adaptation to other versions, please submit a request. The current support details are as follows.

Configuration	Version
Container runtime	Docker, containerd
GPU CUDA/Driver version	GPU Driver 470.X, 515.X, 525.X
OS kernel version (isolation-optimized mode only)	CentOS: 3.10.0-957.21.3.el7.x86_64, 3.10.0-1160.41.1.el7.x86_64, 3.10.0-1160.42.2.el7.x86_64, 3.10.0-1160.45.1.el7.x86_64, 3.10.0-1160.62.1.el7.x86_64, 3.10.0-1160.71.1.el7.x86_64, 3.10.0-1160.76.1.el7.x86_64, 3.10.0-1160.80.1.el7.x86_64, 3.10.0-1160.81.1.el7.x86_64, 3.10.0-1160.83.1.el7.x86_64, 3.10.0-1160.88.1.el7.x86_64, 3.10.0-1160.90.1.el7.x86_64, 4.17.11-1.el7.elrepo.x86_64, 5.4.123-1.el7.elrepo.x86_64; Ubuntu: 4.4.0-150-generic, 4.15.0-140-generic, 5.4.0-72-generic, 5.4.0-139-generic
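
The limitations above can be turned into a quick pre-installation check. The sketch below is only an illustration: the `check_node` function and its parameters are hypothetical, the input strings are assumed to look like the output of `kubectl version`, `nvidia-smi`, and `uname -r`, and the supported values are copied from the table on this page, so they need updating whenever the table changes.

```python
import re

# Values copied from the limitations on this page.
MIN_K8S = (1, 18)
SUPPORTED_RUNTIMES = {"docker", "containerd"}
SUPPORTED_DRIVER_BRANCHES = {"470", "515", "525"}
SUPPORTED_KERNELS = set("""
3.10.0-957.21.3.el7.x86_64 3.10.0-1160.41.1.el7.x86_64 3.10.0-1160.42.2.el7.x86_64
3.10.0-1160.45.1.el7.x86_64 3.10.0-1160.62.1.el7.x86_64 3.10.0-1160.71.1.el7.x86_64
3.10.0-1160.76.1.el7.x86_64 3.10.0-1160.80.1.el7.x86_64 3.10.0-1160.81.1.el7.x86_64
3.10.0-1160.83.1.el7.x86_64 3.10.0-1160.88.1.el7.x86_64 3.10.0-1160.90.1.el7.x86_64
4.17.11-1.el7.elrepo.x86_64 5.4.123-1.el7.elrepo.x86_64
4.4.0-150-generic 4.15.0-140-generic 5.4.0-72-generic 5.4.0-139-generic
""".split())


def check_node(k8s_version, runtime, driver_version, kernel, isolation_optimized=True):
    """Return a list of human-readable problems; an empty list means no mismatch found."""
    problems = []

    # Kubernetes must be v1.18 or later; only major.minor is compared.
    match = re.match(r"v?(\d+)\.(\d+)", k8s_version)
    if not match or (int(match.group(1)), int(match.group(2))) < MIN_K8S:
        problems.append(f"Kubernetes {k8s_version} is older than v1.18")

    if runtime.lower() not in SUPPORTED_RUNTIMES:
        problems.append(f"container runtime {runtime} is not listed")

    # "470.X" in the table is read as "any driver in the 470 branch".
    if driver_version.split(".")[0] not in SUPPORTED_DRIVER_BRANCHES:
        problems.append(f"GPU driver {driver_version} is not in a listed branch")

    # The kernel list applies to isolation-optimized mode only.
    if isolation_optimized and kernel not in SUPPORTED_KERNELS:
        problems.append(f"kernel {kernel} is not listed for isolation-optimized mode")

    return problems


print(check_node("v1.24.4", "containerd", "525.105.17", "5.4.0-139-generic"))
print(check_node("v1.16.0", "cri-o", "450.80.02", "5.15.0-1-generic"))
```

The first call returns an empty list; the second reports four mismatches. Passing `isolation_optimized=False` skips the kernel check, since the table restricts kernel versions for isolation-optimized mode only.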

Install component

Sign in to the Baidu AI Cloud official website and enter the management console.
Go to Product Services > Cloud Native > Cloud Container Engine (CCE) to access the CCE management console.
Click Cluster Management > Cluster List in the left navigation bar.
On the Cluster List page, click the target cluster name to open the cluster management page.
On the Cluster Management page, click O&M & Management > Component Management.
Select the CCE GPU Manager component from the component management list and click Install.
In the installation confirmation dialog box, review the defaults: isolation-optimized mode is selected, the unit for GPU memory sharing is GiB, and fine-grained scheduling is enabled.
Click OK to complete the component installation.


Version records

Version No.	Cluster version compatibility	Update time	Change content	Impact
1.5.35	CCE v1.18+	2024.07.05	New function: Adjust how Pods request virtualized resources: requests can be made through virtualized resource descriptors alone, and the constraint on the baidu.com/xx_xx_cgpu descriptor is removed. Optimize: Adapt to BCC packages for RDMA models, and support automatic topology awareness of single-machine GPUs and network interface cards by the NCCL communication library; adapt to H-chip scheduler card partitioning, and change the dependency for device information acquisition from CUDA to NVML. Bug fixes: [Service impact] Fix the data backup function of PaddleJob and occasional backup failures.	GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading.
1.5.34	CCE v1.18+	2024.06.24	Optimize: container-runtime adapts to NCCL environment variable injection for host-network containers in VPC-ENI network mode. Bug fixes: Because of the impact on communication performance, dcgm-exporter does not collect FP16/FP32/FP64 metrics by default.	GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading.
1.5.33	CCE v1.18+	2024.05.31	New function: Add a GPU card partitioning information recognition service for multiple schedulers, which identifies GPU card information allocated by the default scheduler and by other schedulers to avoid mixed scheduling of nodes by multiple schedulers; enable adaptive routing by default in IB environments. Optimize: Virtualization webhooks support high availability.	GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading.
1.5.32	CCE v1.18+	2024.05.15	New function: Support coexistence of two virtualization modes in the cluster; support automatic injection of BCC RDMA topology files. Optimize: Improve the handling of residual containers in isolation-optimized GPU virtualization.	GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading.
1.5.31	CCE v1.18+	2024.05.06	New function: Add the GFD module, supporting reporting of GPU node driver, CUDA, and other environment information to nodes; isolation-optimized GPU virtualization supports L20 chips; support hook injection in EKS mode.	GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading.
1.5.30	CCE v1.18+	2024.03.26	New function: Adapt BCC H800 models to identify RDMA topology. Bug fixes: Fix a set of image vulnerabilities.	GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading.
1.5.29	CCE v1.18+	2024.01.19	New function: Add NVLink bandwidth, SM utilization, and FP64/FP32/FP16 computing utilization metrics to dcgm-exporter.	GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading.
1.5.28	CCE v1.18+	2023.12.15	New function: Support adaptation of NCCL environment variables for different GPU card types: A100/A800 use NCCL_IB_QPS_PER_CONNECTION=8 and NCCL_IB_ADAPTIVE_ROUTING=0; H800 uses NCCL_IB_QPS_PER_CONNECTION=1 and NCCL_IB_ADAPTIVE_ROUTING=1. Optimize: Make the dp health check port a modifiable parameter, and disable dp health checks for isolation-optimized GPU virtualization.	GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading.
1.5.27	CCE v1.18+	2023.12.01	Optimize: Add kernel logs for GPU virtualization that print memory statistics when a GPU virtualization OOM occurs; improve cleanup efficiency for residual GPU virtualization containers: the container side handles multiple scenarios such as container creation and residues, and the kernel modules add GPU virtualization cleanup; add metrics that report residual GPU virtualization containers.	GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading.
1.5.26	CCE v1.18+	2023.11.17	New function: Adapt to Kubernetes 1.26; adapt to the Ubuntu 22.04 OS. Optimize: Complete the list of kernel versions supported by GPU virtualization; add health checks for the dp component to handle scenarios such as kubelet restarts and apiserver access failures. Bug fixes: [Service impact] Fix the issue where dcgm-exporter and gpu-exporter failed to report container information when nodes had both Docker and containerd installed; [Service impact] fix invalid GPU card allocation and RDMA configuration by container-runtime caused by systemd path resolution errors; [Service impact] fix the issue where gpu-exporter failed to obtain container information because of systemd path resolution errors.	GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading.
1.5.25	CCE v1.18+	2023.11.03	New function: Support 535 drivers in GPU virtualization kernel mode; support exclusive and shared card modes for 4090 chips. Optimize: Improve initialization of isolation-optimized GPU virtualization by adding pre-checks for the sgpu.ko kernel module on nodes, including verification of residual module versions and deletion/reinstallation of invalid residual modules; add a GC module for GPU virtualization containers to clean up residual GPU virtualization configurations; improve exception handling in the container-runtime-sgpu-hook prestart/poststop processes so that error information is returned when configuration fails. Bug fixes: [Service impact] Fix incorrect card allocation for multi-container Pods caused by container-runtime failing to distinguish the resources assigned to individual containers within a Pod; [Service impact] fix the issue where DCGM Pods remained in Terminating status and could not be deleted because the install.sh process entered kernel mode during component upgrades; [Service impact] fix task startup errors caused by the runtime's default lib-path not taking effect on Ubuntu 20; [Service impact] fix invalid libcuda.so hijacking for shared cards in performance-optimized GPU virtualization under CUDA Driver 525 environments. Limitations: DDP training tasks that use performance-optimized GPU virtualization shared cards and communicate via NCCL are not supported.	GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading.
1.5.24	CCE v1.18+	2023.09.22	New function: Support the use of multiple shared cards by a single container. Bug fixes: Resolve the issue of failed monitoring metric collection in virtualization scenarios. Usage limitations: For versions 1.5.14 and below, if performance-optimized GPU virtualization is enabled, virtualization tasks must be stopped before upgrading the component; for versions 1.5.13 and below, if isolation-optimized GPU virtualization is enabled, virtualization tasks must be stopped before upgrading the component.
1.5.23	CCE v1.18+	2023.08.29	Optimize: Enable the necessary subsystems for NCCL_DEBUG logs by default, changing NCCL_DEBUG_SUBSYS from ENV to INIT,ENV,GRAPH. Usage limitations: For versions 1.5.14 and below, if performance-optimized GPU virtualization is enabled, virtualization tasks must be stopped before upgrading the component; for versions 1.5.13 and below, if isolation-optimized GPU virtualization is enabled, virtualization tasks must be stopped before upgrading the component.
1.5.22	CCE v1.18+	2023.08.10	Bug fixes: Fix occasional errors in virtualized memory and computing power resource allocation when virtualized containers are created concurrently. Usage limitations: For versions 1.5.14 and below, if performance-optimized GPU virtualization is enabled, virtualization tasks must be stopped before upgrading the component; for versions 1.5.13 and below, if isolation-optimized GPU virtualization is enabled, virtualization tasks must be stopped before upgrading the component.
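
The usage limitations repeated in the 1.5.22 to 1.5.24 entries amount to a version comparison that can be checked before an upgrade. The sketch below encodes only those two thresholds; the function name `must_stop_tasks_before_upgrade` and the mode strings are hypothetical labels chosen for this example, not identifiers used by the component.

```python
# Highest installed version for which virtualization tasks must be stopped
# before upgrading, per virtualization mode (taken from the usage limitations above).
STOP_TASKS_AT_OR_BELOW = {
    "performance-optimized": (1, 5, 14),
    "isolation-optimized": (1, 5, 13),
}


def parse_version(version):
    """Turn '1.5.14' into (1, 5, 14) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def must_stop_tasks_before_upgrade(installed_version, mode):
    """Return True if virtualization tasks must be stopped before upgrading."""
    if mode not in STOP_TASKS_AT_OR_BELOW:
        raise ValueError(f"unknown virtualization mode: {mode}")
    return parse_version(installed_version) <= STOP_TASKS_AT_OR_BELOW[mode]


print(must_stop_tasks_before_upgrade("1.5.14", "performance-optimized"))
print(must_stop_tasks_before_upgrade("1.5.14", "isolation-optimized"))
```

Comparing tuples of integers avoids the string-comparison pitfall where "1.5.9" would sort after "1.5.14". This check covers only the stop-tasks limitation; the Impact column for later versions separately requires draining the node before upgrading.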
