GPU resource pool overview
Updated at: 2025-10-27
The GPU resource pool overview covers node usage, cluster GPU card allocation, GPU card usage, CPU and memory utilization, allocated GPU cards, GPU card allocation rate, GPU card utilization rate, GPU card memory usage rate, node details, and workloads that use GPU resources.
Prerequisites
- The CCE AI Job Scheduler component (version 1.7.9 or later) has been installed.
- The CCE GPU Manager component has been installed.
- A monitoring instance has been connected.
- Collection tasks have been enabled. For details, see Access Monitoring Instance and Enable Collection Tasks.
Procedure
- Log in to the Cloud Container Engine (CCE) console.
- In the left navigation pane, click Cluster Management. In the cluster list, find the target cluster, then choose More > Prometheus Monitoring in the Actions column to go to the Prometheus monitoring service.

- At the bottom of the Prometheus monitoring page, select Cloud-Native AI Monitoring, then select GPU Resource Pool Overview.
The GPU resource pool overview page is shown below:

You can use the controls in the upper right corner to set the monitoring time range, refresh manually, or enable automatic refresh.
Detailed explanation of the GPU resource pool overview
Node usage
| Monitoring items | Description |
|---|---|
| Total node count | Total number of nodes in the cluster |
| Allocated node count | Nodes with no available GPU cards left (all cards allocated) |
| Idle node count | Nodes with one or more idle GPU cards, including tainted nodes |
| Unavailable node count | Nodes that are cordoned or not ready |
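To make the four counters concrete, here is a minimal sketch of the classification logic. It assumes each node record exposes its GPU totals and status flags; the `Node` fields below are illustrative, not the actual exporter schema.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    total_gpus: int       # GPU cards installed on the node
    allocated_gpus: int   # GPU cards already allocated to Pods
    cordoned: bool        # node has been cordoned
    ready: bool           # node reports Ready

def node_usage(nodes: list[Node]) -> dict[str, int]:
    """Reproduce the node-usage counters as the table above defines them."""
    counts = {"total": len(nodes), "allocated": 0, "idle": 0, "unavailable": 0}
    for n in nodes:
        if n.cordoned or not n.ready:
            counts["unavailable"] += 1   # cordoned or not-ready nodes
        elif n.total_gpus - n.allocated_gpus == 0:
            counts["allocated"] += 1     # no available GPU cards left
        else:
            counts["idle"] += 1          # >0 idle cards; tainted nodes count here too
    return counts
```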
Cluster card allocation
| Monitoring items | Description |
|---|---|
| Total GPU cards | Total number of GPU cards across all nodes in the cluster |
| Allocated cards | GPU cards that have been allocated and are in use |
| Idle GPU cards | Idle (unallocated) GPU cards across all nodes, including idle cards on tainted nodes |
| Unavailable GPU cards | GPU cards on unavailable (cordoned or not-ready) nodes |
GPU card usage
| Monitoring items | Description |
|---|---|
| Average GPU card utilization rate | Real-time average utilization rate of the GPU cards across all nodes in the current cluster: average GPU utilization = sum(utilization of each GPU card across all nodes) / total number of GPU cards |
| Average GPU card memory utilization rate | Real-time average memory utilization rate of the GPU cards across all nodes in the current cluster: average memory utilization = sum(memory utilization of each GPU card across all nodes) / total number of GPU cards |
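Both averages divide by the total number of GPU cards in the cluster, so idle cards pull the average down. A minimal sketch of the calculation, which applies to either metric since the two formulas have the same shape:

```python
def cluster_average(per_card_values: list[float]) -> float:
    """Average = sum of per-card values / total number of GPU cards.

    Works for both GPU utilization and GPU memory utilization,
    since the two formulas above differ only in the metric summed.
    """
    return sum(per_card_values) / len(per_card_values) if per_card_values else 0.0

# Example: four cards at 80%, 60%, 0%, 0% give a cluster average of 35%.
print(cluster_average([80.0, 60.0, 0.0, 0.0]))  # 35.0
```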
CPU & memory
| Monitoring items | Description |
|---|---|
| CPU core count | Total CPU core count in the current cluster |
| Average CPU utilization rate | Real-time average utilization rate of all CPUs in the current cluster |
| Total memory | Total memory of the current cluster |
| Average memory utilization rate | Real-time average utilization rate of all memory in the current cluster |
Utilization rate & allocation rate
| Monitoring items | Description |
|---|---|
| Allocated GPU cards | Number of GPU cards currently allocated in the cluster |
| Card allocation rate | Allocation rate = allocated GPU cards / total GPU cards |
| Overall average GPU utilization rate | Real-time average utilization rate of GPU cards across all nodes in the current cluster: average utilization = sum(utilization of each GPU card across all nodes) / total number of GPU cards |
| Average GPU utilization rate of running tasks | Average utilization = sum(utilization of each allocated GPU card) / number of allocated GPU cards |
| Overall average GPU memory utilization rate | Real-time average memory utilization rate of GPU cards across all nodes in the current cluster: average memory utilization = sum(memory utilization of each GPU card across all nodes) / total number of GPU cards |
| Average GPU memory utilization rate of running tasks | Average memory utilization = sum(memory utilization of each allocated GPU card) / number of allocated GPU cards |
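The key difference between the overall and running-task metrics is the denominator: overall averages divide by all cards (idle ones included), while running-task averages divide only by allocated cards. A minimal sketch under assumed inputs, with per-card utilization percentages keyed by illustrative card IDs:

```python
def pool_metrics(per_card_util: dict[str, float], allocated: set[str]) -> dict[str, float]:
    """Compute allocation rate plus overall and running-task averages."""
    total = len(per_card_util)                       # assumes at least one card
    used = [per_card_util[c] for c in allocated]
    return {
        # Allocation rate = allocated cards / total cards
        "allocation_rate": len(allocated) / total,
        # Overall average divides by *all* cards, idle ones included
        "overall_avg_util": sum(per_card_util.values()) / total,
        # Running-task average divides only by *allocated* cards
        "task_avg_util": sum(used) / len(allocated) if allocated else 0.0,
    }

# Example: 4 cards, 2 allocated and busy at 90% / 70%, 2 idle at 0%.
print(pool_metrics({"c0": 90.0, "c1": 70.0, "c2": 0.0, "c3": 0.0}, {"c0", "c1"}))
# allocation_rate 0.5, overall_avg_util 40.0, task_avg_util 80.0
```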

Idle node statistics
| Monitoring items | Description |
|---|---|
| Real-time statistics of idle node distribution | Current distribution of idle nodes in the cluster |
| Trend of idle node distribution | Historical trend of the idle node distribution in the cluster |
| Card model | GPU card models present in the cluster |
| Allocated cards | Number of GPU cards of this model that have been allocated and are in use |
| Idle GPU cards | Number of idle, unallocated GPU cards of this model in the cluster |
| Total GPU cards | Total number of GPU cards of this model in the cluster |
| Idle x-card node count | Number of nodes in the cluster that each have exactly x idle GPU cards (see the sketch below) |
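The idle x-card node rows bucket nodes by how many idle GPU cards each one has. A minimal sketch of that bucketing, assuming the per-node idle card counts are already known:

```python
from collections import Counter

def idle_card_distribution(idle_cards_per_node: list[int]) -> Counter:
    """Count 'idle x-card nodes': how many nodes have exactly x idle GPU cards.

    Nodes with zero idle cards are fully allocated, so they are excluded.
    """
    return Counter(x for x in idle_cards_per_node if x > 0)

# Example: five nodes with 0, 1, 1, 4, 8 idle cards.
print(idle_card_distribution([0, 1, 1, 4, 8]))
# Counter({1: 2, 4: 1, 8: 1}) -> two idle 1-card nodes, one idle 4-card node, one idle 8-card node
```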

Node information
| Monitoring items | Description |
|---|---|
| Node name | Names of nodes in the current cluster |
| Count of allocated cards | Number of GPU cards allocated on the node |
| Count of GPU Pods | Number of Pods using GPU resources on the node |
| CPU utilization rate | Real-time average utilization rate of all CPUs in the current node |
| Memory utilization rate | Real-time average utilization rate of all memory in the current node |
| Node status | Current node status |
| CPU core count | Total CPU core count in the current node |
| Total memory | Total memory of the current node |

GPU workload information
| Monitoring items | Description |
|---|---|
| Workload name | Name of each workload using GPU resources in the current cluster |
| Type | Type of the workload |
| Namespace | Namespace of the workload |
| Start time | Time at which the workload started |
| Runtime | How long the workload has been running |
| Allocated GPU cards | Number of GPU cards allocated to the workload |
| Average memory utilization rate | Real-time average memory utilization rate of the GPU cards allocated to the workload |
| Average GPU utilization rate | Real-time average utilization rate of the GPU cards allocated to the workload |
| Memory usage | Memory usage of the workload |
| CPU core count | Number of CPU cores used by the workload |

