AI Job Scheduler component
Updated at: 2025-10-27
The AI Job Scheduler component provides insight into key scheduler metrics, including queue backlogs, cluster/queue overages, Pod scheduling performance, task scheduling efficiency, and scheduler phase latency.
Prerequisites
- The CCE AI Job Scheduler component has been installed, and its version is 1.7.9 or later (a quick check is sketched after this list).
- A monitoring instance has been accessed.
- Collection tasks have been enabled. For details, see the document Access Monitoring Instance and Enable Collection Tasks.
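If you prefer to confirm the component version and replica count from outside the console, the sketch below uses the official Kubernetes Python client. The deployment name and namespace ("cce-ai-job-scheduler" in "kube-system") are assumptions, not values taken from this document; substitute whatever your cluster actually uses.

```python
# A minimal sketch, assuming the scheduler runs as a Deployment named
# "cce-ai-job-scheduler" in the "kube-system" namespace (both hypothetical).
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside a Pod
apps = client.AppsV1Api()

deploy = apps.read_namespaced_deployment(
    name="cce-ai-job-scheduler",  # hypothetical component name
    namespace="kube-system",      # hypothetical namespace
)

image = deploy.spec.template.spec.containers[0].image
ready = deploy.status.ready_replicas or 0
print(f"image: {image}")           # the image tag should be 1.7.9 or later
print(f"ready replicas: {ready}")  # the default is 3; 0 means the component is unhealthy
```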
How to use
- Sign in to the Cloud Container Engine (CCE) console.
- Click Cluster Management in the left navigation pane. In the cluster list, find the target cluster, then choose More > Prometheus Monitoring in the Actions column to go to the Prometheus Monitoring service.

- At the bottom of the Prometheus Monitoring page, select Cloud-Native AI Monitoring, and then select AI Job Scheduler Component.
The AI Job Scheduler dashboard is shown as follows.

You can use the controls in the upper-right corner to set the monitoring time range and to refresh the dashboard manually or automatically.
Detailed description of the AI Job Scheduler component
Component health overview
| Monitoring items | Description |
|---|---|
| Scheduler instance count | Number of AI Job Scheduler Pod instances; the default is 3 replicas. 1-3 running instances indicates normal status, while 0 indicates an abnormality (see the query sketch after this table). |
| CPU utilization rate | CPU utilization of the AI Job Scheduler master instance |
| Memory usage | Memory usage of the AI Job Scheduler master instance |
| Pod scheduling throughput CPM (counts/minute) | Number of Pods scheduled by the scheduler per minute |
| Task scheduling throughput CPM (counts/minute) | Number of tasks scheduled by the scheduler per minute |
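These health panels can also be checked programmatically through the Prometheus HTTP API of the monitoring instance. The sketch below is only an illustration: the endpoint placeholder and the PromQL expression (including the `job` label value "ai-job-scheduler") are assumptions and must be replaced with what your instance actually exposes.

```python
# A minimal sketch that queries the Prometheus HTTP API for the scheduler
# instance count. The endpoint and the PromQL expression are assumptions.
import requests

PROM_URL = "http://<prometheus-endpoint>:9090"  # replace with your monitoring instance endpoint

def instant_query(expr: str) -> float:
    """Run a PromQL instant query and return the first sample value (0.0 if empty)."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

# Hypothetical expression mirroring the "Scheduler instance count" panel.
instances = instant_query('count(up{job="ai-job-scheduler"} == 1)')
if instances == 0:
    print("No scheduler instance is up: the component is abnormal")
else:
    print(f"{int(instances)} scheduler instance(s) running (1-3 is normal)")
```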

Overview of pending Pod/task queue
| Monitoring items | Description |
|---|---|
| Cluster scheduling queue status | Task queuing status of the cluster scheduling queue, including pending tasks, queued tasks, total scheduling tasks, pending Pods, and the task queuing rate. |
| Cluster card allocation | GPU card allocation in the cluster, including total, allocated, idle, and unavailable GPU cards, and the GPU card allocation rate (a worked example of the rates follows this table). |
| Cluster pending Pod/task trend chart | Trends of pending Pods/tasks in the cluster, displaying pending Pods, pending tasks, queued tasks, total tasks, and the task queuing rate. |
| Card allocation rate | Detailed GPU card allocation rate in the cluster. |
| Task retry scheduling count | Number of consecutive scheduling attempts made by the scheduler for a task, shown with the task type, task name, namespace, and queue status. |
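The two rates in this table are simple ratios. The worked example below uses made-up numbers and assumes the allocation rate is allocated cards divided by total cards and the queuing rate is queued tasks divided by total tasks.

```python
# Worked example with made-up numbers; the rate definitions are assumptions
# (allocated / total cards, queued / total tasks).
total_cards, allocated_cards, unavailable_cards = 64, 48, 2
total_tasks, queued_tasks = 120, 30

idle_cards = total_cards - allocated_cards - unavailable_cards
card_allocation_rate = allocated_cards / total_cards
task_queuing_rate = queued_tasks / total_tasks

print(f"idle GPU cards: {idle_cards}")                      # 14
print(f"card allocation rate: {card_allocation_rate:.0%}")  # 75%
print(f"task queuing rate: {task_queuing_rate:.0%}")        # 25%
```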

Overview of Pod/task scheduling performance
| Monitoring items | Description |
|---|---|
| Task scheduling latency | Time from task creation to task scheduling completion |
| Total Pod scheduling latency | Time from Pod creation to Pod scheduling completion (a per-Pod approximation is sketched after this table) |
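For a single Pod, the total Pod scheduling latency can be approximated directly from the Kubernetes API as the time between Pod creation and the PodScheduled condition turning True. The sketch below only illustrates what the metric measures (the dashboard aggregates it across all Pods handled by the scheduler); the `default` namespace is an assumption.

```python
# A minimal per-Pod approximation of scheduling latency: PodScheduled
# transition time minus creation time. The "default" namespace is an assumption.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

for pod in core.list_namespaced_pod("default").items:
    for cond in pod.status.conditions or []:
        if cond.type == "PodScheduled" and cond.status == "True":
            latency = cond.last_transition_time - pod.metadata.creation_timestamp
            print(f"{pod.metadata.name}: scheduled in {latency.total_seconds():.1f}s")
```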

Overview of scheduler phase latency
| Monitoring items | Description |
|---|---|
| Scheduler phase latency distribution | Distribution of latency across scheduler phases (a sample P99 query is sketched after this table) |
| Pod scheduling latency | Sum of the time spent in the Pod scheduling algorithm and binding phases |
| Scheduler phase latency | Time spent in each scheduler phase |
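Phase-latency distributions are typically read as a per-phase quantile. The sketch below computes a P99 per phase from a Prometheus histogram; the endpoint, the metric name "scheduler_phase_duration_seconds_bucket", and the "phase" label are all assumptions and must be replaced with the names your monitoring instance actually exposes. Under this reading, the Pod scheduling latency panel corresponds to the sum of the algorithm and binding phases.

```python
# A minimal sketch: P99 latency per scheduler phase from a (hypothetical)
# Prometheus histogram metric. Endpoint and metric/label names are assumptions.
import requests

PROM_URL = "http://<prometheus-endpoint>:9090"  # replace with your monitoring instance endpoint
expr = (
    "histogram_quantile(0.99, sum by (phase, le) "
    "(rate(scheduler_phase_duration_seconds_bucket[5m])))"
)

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr}, timeout=10)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    phase = sample["metric"].get("phase", "unknown")
    print(f"{phase}: p99 = {float(sample['value'][1]):.3f}s")
```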

