Fault Diagnosis
Overview
The fault diagnosis feature of Baidu AI Cloud Container Engine (CCE) offers automated anomaly detection and root cause identification. It provides comprehensive health checks of core cluster components, enabling fast detection of common configuration issues, resource constraints, and component malfunctions.
Prerequisites
- A CCE Kubernetes cluster has been created. For specific operations, see Create Cluster.
- The Kubernetes cluster is running normally (a quick manual check is sketched below).
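Before enabling fault diagnosis, you can optionally confirm cluster health from the command line. The following is a minimal sketch, assuming your kubeconfig already points at the target CCE cluster:

```bash
# The API server endpoint should be reachable
kubectl cluster-info

# All nodes should report STATUS "Ready"
kubectl get nodes -o wide

# System components (kube-proxy, CoreDNS, CNI, etc.) should be Running
kubectl get pods -n kube-system
```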
Diagnosis function introduction
The current version focuses on diagnosing resource anomalies at the node level and pod level. For detailed diagnostic items, refer to the descriptions below.
Enable fault diagnosis
Important: The fault diagnosis feature collects information such as the system version, load, the operational status of components like Docker and Kubelet, and critical error details from system logs. The entire diagnostic process adheres to data security standards and excludes any business or sensitive data.
The operations for node diagnosis and pod diagnosis are similar. The following uses enabling node diagnosis as an example to explain how to use the fault diagnosis feature.
Method I:
- Sign in to Baidu AI Cloud Management Console, navigate to Product Services - Cloud Native - Cloud Container Engine (CCE), and click Cluster Management - Cluster List to enter the cluster list page.
- Click the name of the target cluster. In the left navigation bar under Inspection & Diagnosis, select Fault Diagnosis.
- On the Cluster Inspection page, select the Node Diagnosis tab and click Diagnose Now.
- In the Node Diagnosis pop-up window, select the node name. After carefully reading the notes, check the box for I understand and agree, then click OK to start the diagnosis.
Method II:
- On the node list page of the target cluster, select Fault Diagnosis in the Operation column.
- In the Node Diagnosis pop-up window, after carefully reading the notes, check the box for I understand and agree, then click OK to start the diagnosis.
Once the diagnosis is initiated, you can monitor its progress in the Status column of the task list.
View diagnosis results
On the Fault Diagnosis page, go to the diagnosis list, locate the target diagnosis report, and click Diagnosis Details. The Diagnosis Details page provides a complete view of the diagnostic results.
Node diagnostic items & descriptions
Node
| Diagnostic item name | Description | Recommendations |
|---|---|---|
| Node status | Check whether the node status is Ready. | Try restarting the node. |
| Node scheduling status | Confirm that the node is not flagged as unschedulable. | The node is unschedulable; check node taint settings. For specific operations, see Set Node Taints. |
| BCC instance existence | Verify whether the BCC instance associated with the node exists. | Check the status of the BCC instance for any issues. |
| BCC instance health | Ensure that the BCC instance linked to the node is functioning properly. | Check the status of the BCC instance for any issues. |
| Node CPU usage | Confirm that the node’s current CPU usage falls within the expected normal range. | None |
| Node memory usage | Verify that the node’s current memory usage is within normal parameters. | None |
| Node weekly CPU utilization | Ensure that the node’s CPU usage has not been consistently high over the past week to prevent resource contention from impacting services. | To minimize business impact: 1. Configure appropriate pod requests and limits; 2. Avoid deploying too many pods on a single node. |
| Node weekly memory utilization | Confirm that the node’s memory usage has not been consistently high over the past week to prevent OOM (Out of Memory) issues that might affect services. | To minimize business impact: 1. Configure appropriate pod requests and limits; 2. Avoid deploying too many pods on a single node. |
| Node OOM event | Check whether the node has encountered any OOM (Out of Memory) events. | Log in to the node and view the kernel logs: /var/log/messages |
| Kubelet status | Ensure the node’s kubelet process is running correctly. | Review the kubelet logs of the node for any anomalies. |
| Containerd status | Verify that the node’s containerd service is functioning as expected. | Log in to the node and view the kernel logs of the node: /var/log/messages |
| Docker status | Ensure the node’s Docker service is running smoothly. | Log in to the node and view the kernel logs of the node: /var/log/messages |
| Docker hang check | Check if the node has experienced Docker service hangs. | If necessary, log in to the node and restart the Docker service with the command systemctl restart docker. |
| API server connectivity | Verify that the node can connect to the cluster API server without issues. | Inspect the cluster-related configurations. |
| Node DNS service | Ensure the node can utilize the host DNS service correctly. | Check if the host DNS service is normal. For more information, refer to DNS Troubleshooting Guide |
| Cluster DNS service | Check if the node can access the cluster IP of the cluster’s kube-dns service and use the cluster’s DNS service normally. | Check the running status and logs of CoreDNS pods. For more information, refer to DNS Troubleshooting Guide |
| Cluster CoreDNS pod availability | Confirm the node can access the pod IP of the cluster’s CoreDNS without problems. | Ensure the node can reach the pod IP of CoreDNS successfully. |
| Containerd image pull | Verify that the node’s containerd is able to pull images properly. | Examine the node’s network and image settings. |
| Docker image pull status | Check whether the node’s Docker engine can pull images as expected. | Examine the node’s network and image settings. |
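Several of the node checks above can also be reproduced manually if you want to confirm a finding. The commands below are an illustrative sketch rather than the diagnosis logic itself; <node-name> is a placeholder, and the systemctl/journalctl/grep commands must be run on the node:

```bash
# Node conditions and taints (compare with the "Node status" and "Node scheduling status" items)
kubectl describe node <node-name> | grep -A5 -E "Taints|Conditions"

# Remove an unwanted NoSchedule taint; the key "example-key" is hypothetical
kubectl taint nodes <node-name> example-key:NoSchedule-

# On the node: kubelet and container runtime health (query only the runtime your node uses),
# recent kubelet errors, and OOM events in the kernel log
systemctl status kubelet containerd docker
journalctl -u kubelet --since "1 hour ago" | tail -n 50
grep -i "out of memory" /var/log/messages | tail -n 20
```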
Node core component (NodeComponent)
| Diagnostic item name | Description | Recommendations |
|---|---|---|
| CNI component status | Ensure the node’s CNI component is operating normally. | Submit a support ticket for further assistance. |
| CSI component status | Ensure the node’s CSI component is functioning correctly. | Navigate to CCE Cluster > O&M & Management > Component Management > Storage, and verify the status of the cluster’s storage components. |
| Network agent status | Check that the node’s network agent service is operating as expected. | Submit a support ticket for further assistance. |
| Network operator status | Confirm that the cluster’s network operator service is running appropriately. | Submit a support ticket for further assistance. |
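If a component check fails, you can cross-check it yourself before opening a ticket. A sketch, assuming the CNI, CSI, and network agent components run as pods in the kube-system namespace (the exact namespace and pod names vary by cluster):

```bash
# System pods scheduled on a specific node; <node-name> is a placeholder
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<node-name>

# Drill into a failing component pod; <component-pod-name> is a placeholder
kubectl describe pod <component-pod-name> -n kube-system
kubectl logs <component-pod-name> -n kube-system --tail=100
```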
Cluster component (ClusterComponent)
| Diagnostic item name | Description | Recommendations |
|---|---|---|
| Pod CIDR block remaining | Ensure the cluster has at least five available pod CIDR blocks in VPC routing mode to prevent new nodes from failing due to CIDR block exhaustion. | Submit a support ticket for further assistance. |
| DNS service cluster IP | Verify the cluster IP of the cluster’s DNS service is normally assigned (DNS service anomalies will cause cluster function failures and affect services). | Check the running status and logs of CoreDNS pods. For more information, refer to DNS Troubleshooting Guide |
| API server BLB instance status | Make sure the API server BLB instance is functioning properly. | Access the application BLB instance page, locate the BLB instance associated with the cluster, and review the instance status in the details section. |
| API server BLB instance existence | Verify the existence of the API server BLB instance. | Confirm the existence of the BLB instance linked to the cluster on the application BLB instance page. |
| API server BLB 6443 port listening | Ensure the API server BLB port 6443 is configured correctly for listening. | Navigate to the application BLB instance page, locate the BLB linked to the cluster, and examine the target group configuration of the BLB instance. |
| Availability zone consistency between node and container subnet | Ensure that the node and container subnet are within the same availability zone when operating in VPC-ENI mode. | Access the CCE cluster details, locate the container network, add a subnet, and confirm that the node and subnet belong to the same availability zone. |
| Subnet IP remaining | Confirm that there are enough available IPs remaining in the subnet for VPC-ENI mode. | Access the CCE cluster details, locate the container network, and add an appropriate subnet. |
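For the DNS and API server items, the sketch below shows one way to verify them manually. It assumes the cluster DNS service is named kube-dns in the kube-system namespace and its pods carry the label k8s-app=kube-dns (the common Kubernetes convention); <apiserver-blb-address> is a placeholder:

```bash
# The DNS service should have a cluster IP assigned
kubectl get svc kube-dns -n kube-system

# CoreDNS pods should be Running; review their recent logs
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# From a node, confirm the API server BLB address is listening on port 6443
nc -vz <apiserver-blb-address> 6443
```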
GPU node (GPUNode)
| Diagnostic item name | Description | Recommendations |
|---|---|---|
| GPU node status | Verify that the node’s GPUs are functioning normally. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU allocable count | Ensure that the number of GPUs allocable on the node is as expected. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID48Error | Inspect for double-bit ECC errors in the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID62Error | Check the NVIDIA GPU for internal micro-controller halts (applicable to newer drivers). | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID64Error | Inspect the NVIDIA GPU for ECC page retirement or row remapper recording issues. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID74Error | Check for NVLINK errors on the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID79Error | Verify if the NVIDIA GPU has disconnected from the bus. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID95Error | Look for uncontained ECC errors on the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID109Error | Check for Context Switch Timeout errors on the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID140Error | Inspect for unrecovered ECC errors in the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XIDError | Check for XID errors related to the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA SXIDError | Investigate for SXID errors on the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA row remapper failure | Verify if the NVIDIA GPU has experienced row remapper failures. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA Device Plugin GPU disconnection | Check whether the NVIDIA Device Plugin indicates any GPU disconnection. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA infoROM integrity | Verify if the NVIDIA GPU infoROM has been corrupted. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA ECC error | Inspect for any NVIDIA GPU ECC errors. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU high temperature alert | Ensure the NVIDIA GPU temperature is within the normal range. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU operation mode | Confirm if the NVIDIA GPU is operating in the normal mode. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA-SMI status code | Review the nvidia-smi status code. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| PCI configuration read/write | Determine if PCI configuration read/write operations are failing. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| PCI address access | Ensure lspci is able to read the GPU configuration space. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU bandwidth | Validate if the GPU bandwidth is functioning properly. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU power consumption alert | Verify if the GPU power consumption is within normal levels. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU driver accessibility | Ensure the GPU driver is being accessed correctly. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU recognition | Validate that the NVIDIA GPUs on the bus are recognized by both the driver and nvidia-smi. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA-Container-Toolkit version | Confirm if the NVIDIA-Container-Toolkit version matches the cluster version. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA-Container-Toolkit configuration | Ensure that the NVIDIA-Container-Toolkit is configured correctly in the container runtime. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA-Container-Toolkit status | Check if the NVIDIA-Container-Toolkit is functioning normally. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| Abnormal process on GPU nodes | Identify any abnormal processes running on the GPU node. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| HAS status | Verify that the HAS status is operating normally. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| HAS version | Ensure the HAS version is supported. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU disconnection status | Check if the NVIDIA GPU has been disconnected from the bus or is otherwise inaccessible. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA ECC error limit | Inspect if NVIDIA GPU ECC memory correction errors have exceeded the permissible limit. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU interconnect link mode | Determine if the NVIDIA GPU interconnect link mode is functioning properly (SYS or NODE mode may cause speed degradation). | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU interconnect link alert | Check for NVLink & NVSwitch errors. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU interconnect service error | Verify if the GPU interconnect service (FabricManager) is operating normally. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVLink status | Verify if NVLink is either disconnected or inactive. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| CUDA version | Confirm whether the installed CUDA version is supported. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU driver version | Confirm whether the GPU driver version is supported. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA device power cable connection | Ensure that the NVIDIA GPU device power cables are properly connected. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVLink quantity | Verify if there is any reduction in the number of NVLink connections. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU-NIC connection type | Ensure that the NIC is inserted into the correct slot. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| Node network interface card status | Check the overall status of the node’s NIC. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NIC PCI address availability | Confirm whether the NIC PCI address is accessible. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NIC channel quantity | Verify if the number of NIC channels has reached the maximum supported. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NIC bandwidth | Check whether the NIC bandwidth has reached its maximum supported capacity. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
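Many of the GPU checks above map to standard NVIDIA tooling, so they can be reproduced manually on the GPU node. The sketch below assumes the NVIDIA driver and nvidia-smi are installed; it is illustrative rather than an exact replica of the diagnostic logic:

```bash
# Overall GPU visibility, temperature, power draw and utilization
nvidia-smi

# ECC error counters and retired pages
nvidia-smi -q -d ECC,PAGE_RETIREMENT

# Per-GPU NVLink status
nvidia-smi nvlink --status

# Recent XID errors reported by the driver in the kernel log
dmesg | grep -i xid | tail -n 20

# GPUs visible on the PCI bus (compare against the count the driver reports)
lspci | grep -i nvidia
```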
Pod diagnostic items & descriptions
| Diagnostic item name | Description | Recommendations |
|---|---|---|
| Number of pod container restarts | Count the number of container restarts in the pod to identify any abnormal restart patterns. | Check the Pod status and logs. For more information, see Pod Anomaly Troubleshooting |
| Pod container image download | Verify if the node hosting the pod is facing image download interference from other pods (to prevent resource contention). | Check the Pod status and logs. For more information, see Pod Anomaly Troubleshooting |
| Pod image pull secrets validity | Confirm whether the secrets required for the pod’s image pull are valid (to avoid errors during image retrieval). | Check the Pod status and logs. For more information, see Pod Anomaly Troubleshooting |
| Pod memory utilization | Check whether the pod’s memory usage is ≤ 95% (to avoid OOM issues affecting services). | 1. Edit the workload YAML via the Workloads page or kubectl, find the resources field, and adjust resource quotas (request, limit). 2. Increase the desired number of pods via the Workloads page (click Scale) or edit the workload YAML via kubectl. 3. Configure auto scaling (HPA). |
| Pod CPU utilization | Ensure that the pod’s CPU usage is ≤ 95% (to avoid resource contention affecting services). | 1. Edit the workload YAML via the Workloads page or kubectl, find the resources field, and adjust resource quotas (request, limit). 2. Increase the desired number of pods via the Workloads page (click Scale) or edit the workload YAML via kubectl. 3. Configure auto scaling (HPA). |
| Pod-to-CoreDNS pod connectivity | Check if the pod can access CoreDNS pods normally. | Test the connectivity between the pod and the CoreDNS pod. |
| Pod-to-CoreDNS service connectivity | Check if the pod can access the CoreDNS service normally. | Test the connectivity between the pod and the CoreDNS service. |
| Pod-to-host network DNS connectivity | Check if the pod can access the host network’s DNS server normally. | Test the DNS connectivity between the pod and the host network. |
| Pod initialization status | Check if the pod has completed its initialization and entered the normal running state. | Check the Pod status and logs. For more information, see Pod Anomaly Troubleshooting |
| Pod scheduling status | Verify if the pod has been successfully scheduled onto its target node. | Check the Pod status and logs. For more information, see Pod Anomaly Troubleshooting |
| Pod schedulability | Confirm whether the pod meets scheduling requirements and can be allocated to an appropriate node. | Check the Pod status and logs. For more information, see Pod Anomaly Troubleshooting |
| Pod status | Verify if the current status of the pod aligns with expectations (e.g., running, pending). | Check the Pod status and logs. For more information, see Pod Anomaly Troubleshooting |
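Most of the recommendations above can be carried out with kubectl. The sketch below is illustrative; <pod-name> is a placeholder and the deployment name my-app is hypothetical:

```bash
# Restart count, scheduling state, status and recent events for a pod
kubectl get pod <pod-name> -o wide
kubectl describe pod <pod-name>

# Logs from the last terminated container, useful after restarts
kubectl logs <pod-name> --previous --tail=100

# Current CPU/memory usage (requires metrics-server)
kubectl top pod <pod-name>

# Scale out or enable HPA for the owning workload
kubectl scale deployment my-app --replicas=3
kubectl autoscale deployment my-app --min=2 --max=5 --cpu-percent=80

# DNS connectivity from inside the pod (the container image must provide nslookup)
kubectl exec <pod-name> -- nslookup kubernetes.default
```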
