CCE AI Job Scheduler Description

Table of contents on this page
  • Component introduction
  • Component function
  • Application scenarios
  • Limitations
  • Install component
  • Version records

Updated at: 2025-10-27

Component introduction

CCE AI Job Scheduler is a task scheduling component for organizing and managing diverse AI tasks. Used together with the CCE Deep Learning Frameworks Operator, it lets you train deep learning models directly on CCE.

Component function

  • Supports a variety of scheduling strategies and advanced job management capabilities.
  • Two scheduling strategies are available: binpack and spread. Binpack packs multiple Pods onto the same GPU card so that the card is shared and fully used, which is ideal for improving GPU resource utilization. Spread distributes Pods across different GPU cards, which suits scenarios that require high GPU availability.
  • Two preemption modes are supported: intra-queue priority preemption and inter-queue oversell/preemption. Intra-queue priority preemption: within the same queue, high-priority tasks can preempt the resources of low-priority tasks so that the high-priority tasks keep running. Inter-queue oversell/preemption: when queue A is fully utilized and queue B has idle resources, new tasks submitted to queue A are scheduled onto queue B's idle resources; if queue B later receives new tasks and runs short of resources, the oversold tasks are killed so that queue B's own tasks can run.
    For how to use preemption, refer to the related descriptions in Queue Management and Task Management; a minimal configuration sketch follows below.
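
The following is a minimal sketch of what a queue and a task priority can look like. It is an illustration only: the component is built on the open-source Volcano scheduler (see the version records below), so the sketch uses the upstream Volcano Queue object together with a standard Kubernetes PriorityClass. The names queue-a and training-high, the resource amounts, and the assumption that the upstream Volcano CRDs are exposed unchanged in your cluster are placeholders; the console and the Queue Management documentation remain the authoritative way to manage queues.

```bash
# Hypothetical example: a queue with a resource cap, plus a priority class for tasks in it.
kubectl apply -f - <<'EOF'
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: queue-a                 # placeholder queue name
spec:
  weight: 1
  reclaimable: true             # resources borrowed by this queue may be reclaimed by other queues
  capability:                   # upper bound on what the queue may consume
    cpu: "64"
    memory: 256Gi
    nvidia.com/gpu: "8"
---
# Tasks that reference this PriorityClass can preempt lower-priority tasks in the same
# queue when intra-queue priority preemption is enabled.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: training-high           # placeholder name
value: 1000
globalDefault: false
description: "High-priority AI training tasks"
EOF
```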

Application scenarios

Run deep learning tasks directly on CCE clusters to boost AI engineering efficiency.
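
For example, a small distributed training job could be handed to the scheduler roughly as follows. This is a sketch based on the upstream Volcano Job API (batch.volcano.sh/v1alpha1) that the component builds on; the queue and priority names come from the sketch in the previous section, and the scheduler name, image, and training script are placeholder assumptions. In day-to-day use, tasks are normally created through Task Management or the CCE Deep Learning Frameworks Operator rather than by hand.

```bash
# Hypothetical example: a two-worker training job submitted to queue-a with gang scheduling.
kubectl apply -f - <<'EOF'
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: demo-training
spec:
  schedulerName: volcano              # default upstream scheduler name; custom names are supported
  queue: queue-a                      # queue from the earlier sketch
  priorityClassName: training-high    # priority class from the earlier sketch
  minAvailable: 2                     # gang scheduling: start only when both workers can be placed
  tasks:
    - name: worker
      replicas: 2
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: pytorch/pytorch:latest        # placeholder training image
              command: ["python", "train.py"]      # placeholder training script
              resources:
                limits:
                  nvidia.com/gpu: 1                # one GPU per worker
EOF
```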

Limitations

  • Supports only Kubernetes clusters of version 1.18 or later; a quick way to check an existing cluster's version is shown below.
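
To confirm that an existing cluster meets this requirement, you can check the server version reported by kubectl:

```bash
kubectl version        # the reported Server Version must be v1.18 or later
```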

Install component

  1. Sign in to the Baidu AI Cloud official website and enter the management console.
  2. Go to Product Services - Cloud Native - Cloud Container Engine (CCE) to access the CCE management console.
  3. Click Cluster Management - Cluster List in the left navigation bar.
  4. Click on the target cluster name in the Cluster List page to navigate to the cluster management page.
  5. On the Cluster Management page, click Component Management.
  6. From the component management list, choose the CCE AI Job Scheduler Component and click Install.
  7. On the Component Configuration page, complete the component configuration:


  • Scheduling strategies include two options: spread and binpack. Binpack prioritizes packing multiple Pods onto the same GPU card, which is ideal for improving GPU resource utilization. Spread distributes Pods across different GPU cards to keep GPU resources highly available.
  • Preemption modes include intra-queue priority preemption and inter-queue oversell/preemption. With intra-queue priority preemption, high-priority tasks within the same queue can preempt the resources of lower-priority tasks so that the high-priority tasks keep running. With inter-queue oversell/preemption, tasks submitted to a fully utilized queue A are scheduled onto the idle resources of queue B; if queue B later runs short of resources for new tasks of its own, the oversold tasks are terminated so that queue B's tasks can run.
  8. Click OK to complete the component installation. You can then optionally verify the installation from the command line, as shown below.
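
If you would like to double-check the installation from the command line, a quick look at the component's workloads can help. The namespace and object names below follow the upstream Volcano layout that the component is based on (the volcano-system namespace is also mentioned in the version records) and may differ in your installation:

```bash
kubectl get pods -n volcano-system     # scheduler, controller, and webhook Pods should be Running
kubectl get crd | grep volcano         # Queue/PodGroup/Job CRDs installed with the component
kubectl get queue                      # a default queue is typically created once the component is up
```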

Version records

Each record below lists the version number, compatible cluster versions, release date, change content, and impact.

Version 1.7.25 (compatible with CCE v1.18+, released 2024.11.07)
New Functions:
  • Control plane modules support deployment on specified nodes; webhook components support host-network deployment; taint tolerations can be added
  • Optimized TOR metadata synchronization: when RDMA information is configured through the volcano-node-spec package, EHC fields can also be configured
Optimizations:
  • Improve resource requests in full-machine scenarios by using scalar comparison to reduce data copying and improve performance
  • Add the myriator plugin to support scheduling large-model tasks within a TOR by index sorting, and optimize hotspot functions to improve scheduling performance
Bug Fixes:
  • Fix crashes caused by concurrent map access (conflicts between writing to the map during binding and reading during preemption)
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.
  • For versions below v1.7.13, please contact Baidu AI Cloud for assistance with upgrading.

Version 1.7.24 (compatible with CCE v1.18+, released 2024.09.30)
New Functions:
  • Queues support configuring scheduling strategies (e.g., StrictFIFO)
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.
  • For versions below v1.7.13, please contact Baidu AI Cloud for assistance with upgrading.

Version 1.7.23 (compatible with CCE v1.18+, released 2024.09.27)
New Functions:
  • Queues support independent configuration of priority preemption switches for refined control of priority preemption capabilities
  • Add observable metrics for scheduling phases to support visualization of scheduling phase time consumption
  • In intra-queue priority preemption scenarios, task quota requests take non-preemptable resources within the queue into account
  • Performance optimization of NPU topology-aware scheduling strategies
Bug Fixes:
  • [No impact on service] Fix occasional panic caused by concurrent access to scheduling caches
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.
  • For versions below v1.7.13, please contact Baidu AI Cloud for assistance with upgrading.

Version 1.7.22 (compatible with CCE v1.18+, released 2024.09.03)
New Functions:
  • RDMA TOR topology-aware scheduling adapted to EHC clusters
  • Support a unified scheduler for NPU and GPU
Optimizations:
  • Support intra-queue preemption of oversell (lowest-priority) tasks
  • Support delayed scheduling of oversell (lowest-priority) tasks so that normal tasks with high/medium/low priorities are scheduled first
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.
  • For versions below v1.7.13, please contact Baidu AI Cloud for assistance with upgrading.

Version 1.7.21 (compatible with CCE v1.18+, released 2024.08.14)
Optimizations:
  • Refine certificate creation logic during installation to resolve timeouts caused by inaccessible cluster nodes
  • RDMA information synchronization components adapted to BCC/HPAS, supporting RDMA information configuration via external configuration
  • NPU plugins support preemption in intra-queue/inter-queue NPU scenarios
Bug Fixes:
  • [No impact on service] Fix occasional scheduling failures when multiple PodSpecs exist in a job
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.
  • For versions below v1.7.13, please contact Baidu AI Cloud for assistance with upgrading.

Version 1.7.20 (compatible with CCE v1.18+, released 2024.07.22)
New Functions:
  • Support an NPU chip resource view dashboard
Bug Fixes:
  • [No impact on service] Fix potential scheduling failures of some Pods affecting others when a task has multiple Pod configurations
  • [No impact on service] Handle cases where existing queues have the same name as root, causing root queue update failures
  • [No impact on service] Fix queue information not updating due to initialization failures in some volcano controller functions
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.
  • For versions below v1.7.13, please contact Baidu AI Cloud for assistance with upgrading.

Version 1.7.19 (compatible with CCE v1.18+, released 2024.07.05)
New Functions:
  • Support cluster-level configuration to uniformly assign GPU resource Pods to the volcano scheduler
Optimizations:
  • Optimize the RDMA affinity check strategy in preemption scenarios, and enable the HPN check if preemption is enabled
  • Optimize the RDMA resource request strategy for single tasks to make the binpack effect more pronounced
Bug Fixes:
  • Resolve a scheduler panic caused by RDMA resource views being incompatible with resources released during termination
  • Fix the issue of not setting a default queue when tasks do not specify a queue
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.
  • For versions below v1.7.13, please contact Baidu AI Cloud for assistance with upgrading.

Version 1.7.18 (compatible with CCE v1.18+, released 2024.06.26)
New Functions:
  • Queue metrics support P800 chips; add a P800 resource view dashboard
  • Resource view command-line interfaces are adapted to P800 chips and support task diagnosis in P800 clusters
  • Physical queues support custom resource management node labels to stay compatible with users' existing resource management labels
  • RDMA affinity scheduling strategies support extended custom resource descriptors (e.g., baidu/gpu_hzz1o_8)
Optimizations:
  • MPIJob RDMA TOR strategy optimization: only Pods that request CPU are removed from the constraint of distributing a job's Pods within the same RDMA POD
  • IB scenario adaptation: for IB instances that cannot obtain RDMA TOR information, TOR affinity scheduling strategies no longer need to be disabled
Bug Fixes:
  • Fix inference services not being controlled by physical queues; support adapting multiple workload types to physical queues
  • Fix the weak effect of anti-affinity deployment strategies caused by low pod/node affinity weights
  • Resolve abnormal calculations in volcano view tool dumps
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.
  • For versions below v1.7.13, please contact Baidu AI Cloud for assistance with upgrading.

Version 1.7.17 (compatible with CCE v1.18+, released 2024.06.02)
New Functions:
  • Add a queue resource view dashboard with rich metrics; support elastic/hierarchical queues; support multiple chips (e.g., NVIDIA and Kunlun)
Optimizations:
  • Enhance stability for mixed cluster schedulers, and support identifying GPU cards allocated by other cluster schedulers to avoid mixing schedulers on a single node
  • Add validation of resource requests among queue capability, deserved, and guarantee to avoid creating invalid queues
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.
  • For versions below v1.7.13, please contact Baidu AI Cloud for assistance with upgrading.

Version 1.7.16 (compatible with CCE v1.18+, released 2024.05.23)
New Functions:
  • Add a forced interception switch for the GPU resource scheduler
Optimizations:
  • Fix queues failing to ignore RDMA resources
  • Fix invalid injection of node affinity scheduling
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.
  • For versions below v1.7.13, please contact Baidu AI Cloud for assistance with upgrading.

Version 1.7.15 (compatible with CCE v1.18+, released 2024.05.17)
New Functions:
  • Support new Kunlun chips and topology-aware scheduling
Optimizations:
  • Optimize hierarchical queue scheduling failure messages to expose events for non-leaf queue scheduling failures
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.
  • For versions below v1.7.13, please contact Baidu AI Cloud for assistance with upgrading.

Version 1.7.14 (compatible with CCE v1.18+, released 2024.05.09)
New Functions:
  • Introduce elastic queue capabilities, supporting resource reservation, sharing, and reclamation
  • Launch physical queue capabilities, supporting directed scheduling of queue tasks to designated resource pools
  • Support configuring minimum guaranteed replicas for workloads via task/service labels
Bug Fixes:
  • Fix the resource view failing to self-recover after node resource outOfSync inconsistencies
  • Optimize the preemption strategy: preemption is not triggered if the preemptor task still cannot be scheduled after preempting victims
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.
  • For versions below v1.7.13, please contact Baidu AI Cloud for assistance with upgrading.

Version 1.7.13 (compatible with CCE v1.18+, released 2024.04.15)
New Functions:
  • Release hierarchical queue capabilities and support hierarchical queue quota management
Optimizations:
  • When intra-queue preemption is enabled, calculate preemptable resources during queue enqueue and allow enqueue if scheduling conditions can be met after the expected preemption
  • Add PodGroup events in the RDMA topology-aware strategy
Bug Fixes:
  • Fix scheduler restarts caused by incorrect resource view calculation in preemption scenarios
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.

Version 1.7.12 (compatible with CCE v1.18+, released 2024.03.28)
New Functions:
  • The RDMA affinity strategy supports scheduling based on RDMA POD/TOR topology to improve multi-machine training performance
Optimizations:
  • Default deployment strategy optimization:
    a. Disable online/offline mixed deployment by default
    b. Disable intra-queue/inter-queue preemption by default
    c. Disable VPC TOR affinity scheduling by default
    d. Support SLA policy switches for specific customer scenarios
Bug Fixes:
  • Fix failure to allocate Kunlun card IDs in Kunlun topology-aware scheduling
  • Fix enqueue failures caused by an incorrect deserved quota for inference services
  • Fix crashes caused by concurrent memory access in webhook/controller
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.

Version 1.7.11 (compatible with CCE v1.18+, released 2024.01.31)
Optimizations:
  • Resource view optimization: add a pod_group_uid label to workload metrics and node type labels to node resource metrics
  • View tools support user-defined volcano namespaces
  • Optimize internal card-partitioning protocols in the scheduler to avoid card errors caused by failures to write card-partitioning information to the apiserver
Bug Fixes:
  • Fix the scheduler preferring nodes with terminating resources over idle nodes when multiple nodes meet scheduling requirements
  • Fix controller restarts caused by missing task annotations
  • Fix scheduler restarts caused by unlocked concurrent map access
  • Fix scheduler restarts caused by abnormal queue monitor metric reporting
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.

Version 1.7.10 (compatible with CCE v1.18+, released 2023.12.21)
Optimizations:
  • Priority scheduling strategies support cross-namespace scheduling
Bug Fixes:
  • Fix a panic caused by TOR scheduling failure when no TOR is available for selection
  • Fix a panic caused by the device-affinity plugin and add a switch for the device-affinity strategy
  • Fix the volcano webhook to ignore namespaces with the kubernetes.io/mutate-pod-webhook: unavailable label; by default, this label is added to kube-system and volcano-system during installation
  • Fix owner reference management for Pods
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.

Version 1.7.9 (compatible with CCE v1.18+, released 2023.11.28)
New Functions:
  • Resource views support resource statistics dashboards and node resource dashboards
  • Resource views support workload detail dashboards
  • Add volcano stability dashboard metrics
Optimizations:
  • Support tasks declaring themselves non-preemptable via the preemptable label
Bug Fixes:
  • Fix view errors caused by view synchronization delays after scheduler restarts
  • Add a binpack strategy for nvidia.com/gpu resources in volcano
  • Fix preemption to ensure card type consistency; no preemption occurs if card types differ
  • Fix a null pointer exception in the TOR strategy
  • Fix a panic caused by concurrent access to devices
  • Fix a panic caused by concurrent access during collector metric collection
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.

Version 1.7.8 (compatible with CCE v1.18+, released 2023.10.30)
New Functions:
  • Support PodGroup lifecycle management for standard K8S workloads (Pod/Job/Deployment/StatefulSet)
  • Add a command-line interface to support resource view checks for cluster nodes/queues and independent troubleshooting of unschedulable tasks
Optimizations:
  • Support viewing preempted events for MPIJob
Bug Fixes:
  • Resolve residual queue/cluster quotas for unsupported workloads
  • Resolve incorrect queue quota overuse judgments caused by RDMA resources not being ignored in elastic tasks
  • Resolve a scheduler panic caused by incorrect resource view metric logic in GPU shared-card scenarios
  • Resolve inconsistent webhook certificates during upgrades from versions below v1.7.3 caused by an unreasonable rolling strategy
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.

Version 1.7.7 (compatible with CCE v1.18+, released 2023.10.11)
New Functions:
  • Add NUMA scheduling for new Kunlun r480 chips (depends on GPU-Manager version 1.5.25)
  • Support exclusive card mode for H800 chips (depends on GPU-Manager version 1.5.25)
  • Support exclusive/shared card mode for 4090 chips (depends on GPU-Manager version 1.5.25)
  • Resource views support Grafana monitoring dashboards, displaying cluster resource overviews and node resource details (consistent with the Baige pages)
Optimizations:
  • Support PodGroup management of Deployments
  • The command-line tool adds options for filtering job lists by job type and PodGroup status, and supports a summary option that sums up the resource consumption of the selected job list
  • Add a totalgpu field to the command line to count the actual number of GPU cards when nvidia and cgpu descriptors are mixed
Bug Fixes:
  • Fix Pods in the terminating phase being selected during GPU card selection
  • Fix Grafana monitoring display issues for NotReady nodes
  • Fix scheduling freezes caused by terminating Pods in the predicate phase
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.

Version 1.7.6 (compatible with CCE v1.18+, released 2023.09.22)
New Functions:
  • Add cluster resource views and scheduling problem diagnosis tools
  • Support multiple shared cards per container
  • TOR architecture awareness adds support for MPIJob-type tasks and is compatible with Training Operator 1.5+ / Baidu Deep Learning Framework component 1.6+
Optimizations:
  • Log optimization: support dynamic adjustment of log levels; switch logs to JSON format
Bug Fixes:
  • Resolve scheduler panics caused by errors in elastic queue resource calculations. For PodGroups that support minResources in versions 1.7.x and above, address scheduler panics when some Pods in the PodGroup are already running and all resources defined in minResources are excluded. For details, see https://github.com/volcano-sh/volcano/issues/3105
  • Fix a scheduler panic caused by an empty candidate node list after the device affinity strategy calculation during Pod scheduling
  • Fix failure to obtain PodGroup labels for jobs due to insufficient controller permissions
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.

Version 1.7.4 (compatible with CCE v1.18+, released 2023.06.14)
New Functions:
  • Support high availability for the volcano scheduler/admission/controller with a default 3-replica mode
Optimizations:
  • Queues support usage statistics
  • Optimize the admission certificate issuance process and use Secrets to store access certificates
  • Add resource configuration parameters to the scheduler/admission/controller
Bug Fixes:
  • Fix a scheduler panic caused by concurrent reading/writing of node resources
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.

Version 1.7.3 (compatible with CCE v1.18+, released 2023.05.06)
New Functions:
  • Support custom preemption strategies
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.

Version 1.7.2 (compatible with CCE v1.18+, released 2023.04.24)
New Functions:
  • Support exclusive/shared mode for A800 chips
  • Support custom scheduler names and scheduling resource groups
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.

Version 1.7.0 (compatible with CCE v1.18+, released 2023.04.14)
New Functions:
  • volcano upgraded to community version 1.7
Impact:
  • This upgrade will not affect service.
  • Upgrading from versions below v1.5.8 to this version is not supported.
