CCE GPU Manager Description

Updated at: 2025-10-27

Component introduction

CCE GPU Manager is a bundle of GPU device plugins that, paired with a compatible scheduler, provides GPU resource scheduling for complex scenarios. The component supports an isolation-optimized mode, in which Pods share a GPU while their computing power and memory remain isolated from one another.

Component function

  • Topology allocation: Enables a GPU topology allocation function. When more than one GPU card is assigned to a Pod, the system automatically chooses the fastest topology connection mode to allocate GPU devices.
  • GPU sharing: Provides the option to enable memory sharing for GPU devices on a node and supports allocating GPU cards to multiple Pods based on memory size.
  • Memory and computing power isolation: Ensures isolation of memory and computing power when multiple Pods share a single GPU card.
  • Fine-grained scheduling: When enabled, you can select specific GPU models during the creation of queues and tasks. When disabled, only quota input is allowed while creating queues and containers, and specific GPU models cannot be selected.
  • Encoding/decoding instances: Submit encoding/decoding tasks using the independent encoding/decoding units of GPUs for hardware-based encoding/decoding.
  • For detailed usage instructions, see GPU Exclusive and Shared Usage Instructions.
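
As an illustration of GPU sharing by memory size, the following Python sketch packs Pods onto GPU cards with a simple first-fit rule. It is only a model of the idea, not the component's actual scheduling algorithm; the `GpuCard` class, the `place_pod` function, the Pod names, and the 32 GiB card size are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class GpuCard:
    """One physical GPU card with a fixed amount of memory, in GiB (hypothetical model)."""
    index: int
    total_gib: int
    allocations: dict = field(default_factory=dict)  # Pod name -> GiB reserved

    @property
    def free_gib(self) -> int:
        return self.total_gib - sum(self.allocations.values())


def place_pod(cards, pod_name, request_gib):
    """Place a Pod on the first card with enough free memory (first fit).

    Returns the chosen card index, or None if no card can hold the request.
    First fit is an assumption made for this sketch only.
    """
    for card in cards:
        if card.free_gib >= request_gib:
            card.allocations[pod_name] = request_gib
            return card.index
    return None


# Two 32 GiB cards shared by four Pods with different memory requests.
cards = [GpuCard(0, 32), GpuCard(1, 32)]
placements = {
    name: place_pod(cards, name, gib)
    for name, gib in [("pod-a", 16), ("pod-b", 8), ("pod-c", 16), ("pod-d", 24)]
}
print(placements)
```

In this run, `pod-a` and `pod-b` share card 0, `pod-c` lands on card 1, and `pod-d` is not placed because neither card has 24 GiB free. With exclusive allocation, two cards could serve only two of these Pods.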

Application scenarios

When GPU applications run in CCE clusters, the component reduces the resource waste caused by dedicating an entire GPU card to a single workload in scenarios such as AI training, which improves resource utilization and lowers costs.

Limitations

  • Only supports Kubernetes clusters of version v1.18 and above.
  • Currently, this component depends on the CCE AI Job Scheduler. If required, please install both components together; otherwise, the functions of this component may be unavailable.
  • GPU-shared virtualization supports the mainstream GPU CUDA and driver versions listed below. The isolation-optimized mode places additional requirements on the OS kernel version, among other things. To have other versions adapted, please submit a request. The current support details are as follows.

Configuration: Version
Container runtime: Docker, Containerd
GPU CUDA/Driver version: GPU Driver 470.X, 515.X, 525.X
OS kernel version (isolation-optimized mode only):

    CentOS:
  • 3.10.0-957.21.3.el7.x86_64
  • 3.10.0-1160.41.1.el7.x86_64
  • 3.10.0-1160.42.2.el7.x86_64
  • 3.10.0-1160.45.1.el7.x86_64
  • 3.10.0-1160.62.1.el7.x86_64
  • 3.10.0-1160.71.1.el7.x86_64
  • 3.10.0-1160.76.1.el7.x86_64
  • 3.10.0-1160.80.1.el7.x86_64
  • 3.10.0-1160.81.1.el7.x86_64
  • 3.10.0-1160.83.1.el7.x86_64
  • 3.10.0-1160.88.1.el7.x86_64
  • 3.10.0-1160.90.1.el7.x86_64
  • 4.17.11-1.el7.elrepo.x86_64
  • 5.4.123-1.el7.elrepo.x86_64

    Ubuntu:
  • 4.4.0-150-generic
  • 4.15.0-140-generic
  • 5.4.0-72-generic
  • 5.4.0-139-generic
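
The limits above can be checked before installation with a small script. The sketch below is a minimal example, not an official tool: the `check_node` helper and its parameters are hypothetical, and it assumes that "470.X" means any driver whose version starts with 470 and that kernel strings are compared exactly as reported by `uname -r`.

```python
# Values copied from the support list above.
SUPPORTED_DRIVER_BRANCHES = {470, 515, 525}

SUPPORTED_KERNELS = {
    "centos": {
        "3.10.0-957.21.3.el7.x86_64",
        "3.10.0-1160.41.1.el7.x86_64",
        "3.10.0-1160.42.2.el7.x86_64",
        "3.10.0-1160.45.1.el7.x86_64",
        "3.10.0-1160.62.1.el7.x86_64",
        "3.10.0-1160.71.1.el7.x86_64",
        "3.10.0-1160.76.1.el7.x86_64",
        "3.10.0-1160.80.1.el7.x86_64",
        "3.10.0-1160.81.1.el7.x86_64",
        "3.10.0-1160.83.1.el7.x86_64",
        "3.10.0-1160.88.1.el7.x86_64",
        "3.10.0-1160.90.1.el7.x86_64",
        "4.17.11-1.el7.elrepo.x86_64",
        "5.4.123-1.el7.elrepo.x86_64",
    },
    "ubuntu": {
        "4.4.0-150-generic",
        "4.15.0-140-generic",
        "5.4.0-72-generic",
        "5.4.0-139-generic",
    },
}


def parse_k8s_version(version):
    """Turn 'v1.26.9' or '1.26' into a comparable (major, minor) tuple."""
    major, minor = version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def check_node(k8s_version, driver_version, os_name, kernel, isolation_optimized=True):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if parse_k8s_version(k8s_version) < (1, 18):
        problems.append(f"Kubernetes {k8s_version} is older than v1.18")
    # Assumption: the driver branch is the first dotted component (e.g. 525 in 525.105.17).
    branch = int(driver_version.split(".")[0])
    if branch not in SUPPORTED_DRIVER_BRANCHES:
        problems.append(f"GPU driver branch {branch} is not in the support list")
    # Kernel restrictions apply to isolation-optimized mode only.
    if isolation_optimized:
        if kernel not in SUPPORTED_KERNELS.get(os_name.lower(), set()):
            problems.append(f"kernel {kernel} on {os_name} is not in the support list")
    return problems


print(check_node("v1.26.9", "525.105.17", "Ubuntu", "5.4.0-139-generic"))
print(check_node("v1.16.3", "450.80.02", "CentOS", "3.10.0-862.el7.x86_64"))
```

The first call reports no problems; the second reports three, one for each limit it violates. Passing `isolation_optimized=False` skips the kernel check, since the kernel list applies to that mode only.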

Install component

  1. Sign in to the Baidu AI Cloud official website and enter the management console.
  2. Go to Product Services > Cloud Native > Cloud Container Engine (CCE) to open the CCE management console.
  3. Click Cluster Management > Cluster List in the left navigation bar.
  4. On the Cluster List page, click the target cluster name to open the cluster management page.
  5. On the Cluster Management page, click O&M & Management > Component Management.
  6. Select the CCE GPU Manager component from the component management list and click Install.
  7. In the installation confirmation dialog box, isolation-optimized mode is selected by default.
  8. The default unit for GPU memory sharing is GiB.
  9. Fine-grained scheduling is enabled by default.
  10. Click OK to complete the component installation.

    Version records

    Version No. Cluster version compatibility Update time Change content Impact
    1.5.35 CCE v1.18+ 2024.07.05 New function:
  • Pod requests virtualized resource usage adjustment, supports only requesting through virtualized resource descriptors, and removes the constraint on the baidu.com/xx_xx_cgpu descriptor
    Optimize:
  • Adapt to BCC packages for RDAM models, and support automatic topology awareness of single-machine GPUs and network interface cards by the NCCL communication library
  • Adapt to H-chip scheduler card partitioning, and change the dependency for device information acquisition from cuda to nvml
    Bug Fixes:
  • [Service impact] Fix the data backup function of PaddleJob and occasional backup failures
  • GPU kernel mode virtualization services do not support hotspot upgrades; the upgrade mode is to drain the node upgrade
    1.5.34 CCE v1.18+ 2024.06.24 Optimize:
  • The container-runtime adapts to nccl environment variable injection for host network containers in VPC-ENI network mode
    Bug Fixes:
  • Due to impacts on communication performance, dcgm-exporter does not collect FP16/FP32/FP64 metrics by default;
  • GPU kernel mode virtualization services do not support hotspot upgrades; the upgrade mode is to drain the node upgrade
    1.5.33 CCE v1.18+ 2024.05.31 New function:
  • Add a GPU card partitioning information recognition service for multiple schedulers to identify GPU card information allocated by the default scheduler and other schedulers, to avoid mixed scheduling of nodes by multiple schedulers;
  • Enable adaptive routing by default in IB environments;
    Optimize:
  • Virtualized webhooks support high availability;
  • GPU kernel mode virtualization services do not support hotspot upgrades; the upgrade mode is to drain the node upgrade
    1.5.32 CCE v1.18+ 2024.05.15 New function:
  • Support coexistence of two virtualization modes in the cluster;
  • Support automatic injection of BCC RDMA topology files;
    Optimize:
  • Optimize residual container issues in isolation-optimized GPU virtualization;
  • GPU kernel mode virtualization services do not support hotspot upgrades; the upgrade mode is to drain the node upgrade
    1.5.31 CCE v1.18+ 2024.05.06 New function:
  • Add the GFD module
  • Support reporting GPU node environment information, such as driver and CUDA information, to nodes
  • Isolation-optimized GPU virtualization supports L20 chips
  • Support hook injection in eks mode
    Limit:
  • GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading the component
    1.5.30 CCE v1.18+ 2024.03.26 New function:
  • Adapt to BCC H800 models to identify the RDMA topology
    Fix:
  • Fix a set of image vulnerabilities
    Limit:
  • GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading the component
    1.5.29 CCE v1.18+ 2024.01.19 New function:
  • Add NVLink bandwidth, SM utilization, and FP64/FP32/FP16 compute utilization metrics to dcgm-exporter
    Limit:
  • GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading the component
    1.5.28 CCE v1.18+ 2023.12.15 New function:
  • Set NCCL environment variables according to GPU card type: A100/A800 use NCCL_IB_QPS_PER_CONNECTION=8 and NCCL_IB_ADAPTIVE_ROUTING=0; H800 uses NCCL_IB_QPS_PER_CONNECTION=1 and NCCL_IB_ADAPTIVE_ROUTING=1
    Optimize:
  • Make the dp health check port configurable, and disable dp health checks for isolation-optimized GPU virtualization
    Limit:
  • GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading the component
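The per-card NCCL settings listed for 1.5.28 can be pictured as a lookup from GPU model to environment variables. The following is a minimal sketch, not the component's implementation: the table name, the function name, and the substring match on the model string are illustrative assumptions; only the variable names and values come from the entry above.

```python
# Variable names and values are taken from the 1.5.28 entry.
# The table layout and the matching rule are hypothetical.
NCCL_ENV_BY_CARD = {
    "A100": {"NCCL_IB_QPS_PER_CONNECTION": "8", "NCCL_IB_ADAPTIVE_ROUTING": "0"},
    "A800": {"NCCL_IB_QPS_PER_CONNECTION": "8", "NCCL_IB_ADAPTIVE_ROUTING": "0"},
    "H800": {"NCCL_IB_QPS_PER_CONNECTION": "1", "NCCL_IB_ADAPTIVE_ROUTING": "1"},
}


def nccl_env_for(gpu_model: str) -> dict:
    """Return the NCCL variables for a GPU model string, or {} if the card is not listed."""
    model = gpu_model.upper()
    for card, env in NCCL_ENV_BY_CARD.items():
        if card in model:
            return dict(env)  # copy so callers cannot mutate the table
    return {}


if __name__ == "__main__":
    for name in ("NVIDIA A800-SXM4-80GB", "NVIDIA H800", "NVIDIA L20"):
        print(name, nccl_env_for(name))
```

A card type that is not in the table, such as L20 here, yields an empty mapping, so no NCCL variables are set for it in this sketch.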
    1.5.27 CCE v1.18+ 2023.12.01 Optimize:
  • Add GPU virtualization kernel logs that print memory statistics when a GPU virtualization OOM occurs
  • Improve cleanup of residual GPU virtualization containers. The container side handles multiple scenarios, such as container creation and residual containers, and the kernel module adds GPU virtualization cleanup
  • Add a metric that reports residual GPU virtualization containers
    Limit:
  • GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading the component
    1.5.26 CCE v1.18+ 2023.11.17 New function:
  • Adapt to Kubernetes 1.26
  • Adapt to Ubuntu 22.04 OS
    Optimize:
  • Complete the list of kernel versions supported by GPU virtualization
  • Add health checks for the dp component to handle scenarios such as kubelet restarts and apiserver access failures
    Bug Fixes:
  • [Service impact] Fix the issue where dcgm-exporter and gpu-exporter failed to report container information when both docker and containerd were installed on a node
  • [Service impact] Fix invalid GPU card allocation and RDMA configuration in container-runtime caused by systemd path resolution errors
  • [Service impact] Fix the issue where gpu-exporter failed to obtain container information because of systemd path resolution errors
    Limit:
  • GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading the component
    1.5.25 CCE v1.18+ 2023.11.03 New function:
  • Support driver version 535 in GPU virtualization kernel mode
  • Support exclusive and shared card modes for 4090 chips
    Optimize:
  • Optimize initialization of isolation-optimized GPU virtualization: add pre-checks for the sgpu.ko kernel module on nodes, including verifying the versions of residual modules and deleting and reinstalling invalid residual modules
  • Add a GC module for GPU virtualization containers to clean up residual GPU virtualization configurations
  • Improve exception handling in the container-runtime-sgpu-hook prestart/poststop processes so that an error is returned when configuration fails
    Bug Fixes:
  • [Service impact] Fix incorrect card allocation for multi-container Pods, caused by container-runtime failing to distinguish the resources of individual containers when obtaining Pod information
  • [Service impact] Fix the issue where DCGM Pods remained in Terminating status and could not be deleted because the install.sh process entered kernel mode during component upgrades
  • [Service impact] Fix task startup errors caused by the runtime's default lib-path not taking effect on Ubuntu 20
  • [Service impact] Fix libcuda.so hijacking not taking effect for shared cards in performance-optimized GPU virtualization with CUDA driver 525
    Limit:
  • DDP training tasks that use performance-optimized GPU virtualization shared cards and communicate via NCCL are not supported
  • GPU kernel-mode virtualization services do not support hot upgrades; drain the node before upgrading the component
    1.5.24 CCE v1.18+ 2023.09.22 New function:
  • Support using multiple shared cards in a single container
    Bug Fixes:
  • Fix monitoring metric collection failures in virtualization scenarios
    Usage Limitations:
  • For versions 1.5.14 and earlier, if performance-optimized GPU virtualization is enabled, virtualization tasks must be stopped before upgrading the component
  • For versions 1.5.13 and earlier, if isolation-optimized GPU virtualization is enabled, virtualization tasks must be stopped before upgrading the component
    1.5.23 CCE v1.18+ 2023.08.29 Optimize:
  • Enable the necessary subsystems for NCCL_DEBUG logs by default by changing NCCL_DEBUG_SUBSYS from ENV to INIT,ENV,GRAPH
    Usage Limitations:
  • For versions 1.5.14 and earlier, if performance-optimized GPU virtualization is enabled, virtualization tasks must be stopped before upgrading the component
  • For versions 1.5.13 and earlier, if isolation-optimized GPU virtualization is enabled, virtualization tasks must be stopped before upgrading the component
    1.5.22 CCE v1.18+ 2023.08.10 Bug Fixes:
  • Fix occasional errors in the allocation of virtualized memory and compute resources when virtualized containers are created concurrently
    Usage Limitations:
  • For versions 1.5.14 and earlier, if performance-optimized GPU virtualization is enabled, virtualization tasks must be stopped before upgrading the component
  • For versions 1.5.13 and earlier, if isolation-optimized GPU virtualization is enabled, virtualization tasks must be stopped before upgrading the component
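The version thresholds in the Usage Limitations entries can be checked before an upgrade with a small helper. This is a sketch under stated assumptions: the function names and the mode labels used as dictionary keys are hypothetical, and the installed version is assumed to be a plain dotted string such as 1.5.14. Only the two thresholds come from the entries above.

```python
def parse_version(version: str) -> tuple:
    """Turn '1.5.14' into (1, 5, 14) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


# Thresholds from the Usage Limitations entries; the keys are hypothetical labels.
STOP_TASKS_AT_OR_BELOW = {
    "performance-optimized": (1, 5, 14),
    "isolation-optimized": (1, 5, 13),
}


def must_stop_tasks(installed_version: str, enabled_modes) -> bool:
    """Return True if virtualization tasks must be stopped before upgrading the component."""
    installed = parse_version(installed_version)
    return any(installed <= STOP_TASKS_AT_OR_BELOW[mode] for mode in enabled_modes)


if __name__ == "__main__":
    print(must_stop_tasks("1.5.14", ["performance-optimized"]))
    print(must_stop_tasks("1.5.14", ["isolation-optimized"]))
```

Comparing integer tuples rather than raw strings matters here, because a string comparison would rank 1.5.9 above 1.5.14.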