Fault Diagnosis
Overview
The fault diagnosis feature of Baidu AI Cloud Container Engine (CCE) offers automated anomaly detection and root cause identification. It provides comprehensive health checks of core cluster components, enabling fast detection of common configuration issues, resource constraints, and component malfunctions.
Prerequisites
- A CCE Kubernetes cluster has been created. For specific operations, see Create Cluster.
- The Kubernetes cluster is running normally (a quick manual check is sketched below).
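Before enabling fault diagnosis, you can optionally confirm cluster health from the command line. The following is a minimal sketch, assuming your kubeconfig already points at the target CCE cluster:

```bash
# The API server endpoint should be reachable
kubectl cluster-info

# All nodes should report STATUS "Ready"
kubectl get nodes -o wide

# System components (kube-proxy, CoreDNS, CNI, etc.) should be Running
kubectl get pods -n kube-system
```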
Diagnosis function introduction
The current version focuses on diagnosing resource anomalies at the node level and pod level. For detailed diagnostic items, refer to the descriptions below.
Enable fault diagnosis
Important: The fault diagnosis feature collects information such as the system version, load, the operational status of components like Docker and Kubelet, and critical error details from system logs. The entire diagnostic process adheres to data security standards and excludes any business or sensitive data.
The operations for node diagnosis and pod diagnosis are similar. The following uses enabling node diagnosis as an example to explain how to use the fault diagnosis feature.
Method I:
- Sign in to Baidu AI Cloud Management Console, navigate to Product Services - Cloud Native - Cloud Container Engine (CCE), and click Cluster Management - Cluster List to enter the cluster list page.
- Click the name of the target cluster. In the left navigation bar under Inspection & Diagnosis, select Fault Diagnosis.
- On the Cluster Inspection page, select the Node Diagnosis tab and click Diagnose Now.
- In the Node Diagnosis pop-up window, select the node name. After carefully reading the notes, check the box for I understand and agree, then click OK to start the diagnosis.
Method II:
- On the node list page of the target cluster, select Fault Diagnosis in the Operation column.
- In the Node Diagnosis pop-up window, after carefully reading the notes, check the box for I understand and agree, then click OK to start the diagnosis.
Once the diagnosis is initiated, you can monitor its progress in the Status column of the task list.
View diagnosis results
On the Fault Diagnosis page, go to the diagnosis list, locate the target diagnosis report, and click Diagnosis Details. The Diagnosis Details page provides a complete view of the diagnostic results.
Node diagnostic items & descriptions
Node
| Diagnostic item name | Description | Recommendations |
|---|---|---|
| Node status | Check whether the node status is Ready. | Try restarting the node. |
| Node scheduling status | Confirm that the node is not flagged as unschedulable. | The node is unschedulable; check node taint settings. For specific operations, see Set Node Taints. |
| BCC instance existence | Verify whether the BCC instance associated with the node exists. | Check the status of the BCC instance for any issues. |
| BCC instance health | Ensure that the BCC instance linked to the node is functioning properly. | Check the status of the BCC instance for any issues. |
| Node CPU usage | Confirm that the node’s current CPU usage falls within the expected normal range. | None |
| Node memory usage | Verify that the node’s current memory usage is within normal parameters. | None |
| Node weekly CPU utilization | Ensure that the node’s CPU usage has not been consistently high over the past week to prevent resource contention from impacting services. | To minimize business impact: 1. Configure appropriate pod requests and limits; 2. Avoid deploying too many pods on a single node. |
| Node weekly memory utilization | Confirm that the node’s memory usage has not been consistently high over the past week to prevent OOM (Out of Memory) issues that might affect services. | To minimize business impact: 1. Configure appropriate pod requests and limits; 2. Avoid deploying too many pods on a single node. |
| Node OOM event | Check whether the node has encountered any OOM (Out of Memory) events. | Log in to the node and view the kernel logs: /var/log/messages |
| Kubelet status | Ensure the node’s kubelet process is running correctly. | Review the kubelet logs of the node for any anomalies. |
| Containerd status | Verify that the node’s containerd service is functioning as expected. | Log in to the node and view the kernel logs of the node: /var/log/messages |
| Docker status | Ensure the node’s Docker service is running smoothly. | Log in to the node and view the kernel logs of the node: /var/log/messages |
| Docker hang check | Check if the node has experienced Docker service hangs. | If necessary, log in to the node and restart the Docker service with the command systemctl restart docker. |
| API server connectivity | Verify that the node can connect to the cluster API server without issues. | Inspect the cluster-related configurations. |
| Node DNS service | Ensure the node can utilize the host DNS service correctly. | Check if the host DNS service is normal. For more information, refer to DNS Troubleshooting Guide |
| Cluster DNS service | Check if the node can access the cluster IP of the cluster’s kube-dns service and use the cluster’s DNS service normally. | Check the running status and logs of CoreDNS pods. For more information, refer to DNS Troubleshooting Guide |
| Cluster CoreDNS pod availability | Confirm the node can access the pod IP of the cluster’s CoreDNS without problems. | Ensure the node can reach the pod IP of CoreDNS successfully. |
| Containerd image pull | Verify that the node’s containerd is able to pull images properly. | Examine the node’s network and image settings. |
| Docker image pull status | Check whether the node’s Docker engine can pull images as expected. | Examine the node’s network and image settings. |
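Several of the node checks above can also be reproduced manually if you want to confirm a finding. The commands below are an illustrative sketch rather than the diagnosis logic itself; <node-name> is a placeholder, and the systemctl/journalctl/grep commands must be run on the node:

```bash
# Node conditions and taints (compare with the "Node status" and "Node scheduling status" items)
kubectl describe node <node-name> | grep -A5 -E "Taints|Conditions"

# Remove an unwanted NoSchedule taint; the key "example-key" is hypothetical
kubectl taint nodes <node-name> example-key:NoSchedule-

# On the node: kubelet and container runtime health (query only the runtime your node uses),
# recent kubelet errors, and OOM events in the kernel log
systemctl status kubelet containerd docker
journalctl -u kubelet --since "1 hour ago" | tail -n 50
grep -i "out of memory" /var/log/messages | tail -n 20
```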
Node core component (NodeComponent)
| Diagnostic item name | Description | Recommendations |
|---|---|---|
| CNI component status | Ensure the node’s CNI component is operating normally. | Submit a support ticket for further assistance. |
| CSI component status | Ensure the node’s CSI component is functioning correctly. | Navigate to CCE Cluster > O&M & Management > Component Management > Storage, and verify the status of the cluster’s storage components. |
| Network agent status | Check that the node’s network agent service is operating as expected. | Submit a support ticket for further assistance. |
| Network operator status | Confirm that the cluster’s network operator service is running appropriately. | Submit a support ticket for further assistance. |
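If a component check fails, you can cross-check it yourself before opening a ticket. A sketch, assuming the CNI, CSI, and network agent components run as pods in the kube-system namespace (the exact namespace and pod names vary by cluster):

```bash
# System pods scheduled on a specific node; <node-name> is a placeholder
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<node-name>

# Drill into a failing component pod; <component-pod-name> is a placeholder
kubectl describe pod <component-pod-name> -n kube-system
kubectl logs <component-pod-name> -n kube-system --tail=100
```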
Cluster component (ClusterComponent)
| Diagnostic item name | Description | Recommendations |
|---|---|---|
| Pod CIDR block remaining | Ensure the cluster has at least five available pod CIDR blocks in VPC routing mode to prevent new nodes from failing due to CIDR block exhaustion. | Submit a support ticket for further assistance. |
| DNS service cluster IP | Verify the cluster IP of the cluster’s DNS service is normally assigned (DNS service anomalies will cause cluster function failures and affect services). | Check the running status and logs of CoreDNS pods. For more information, refer to DNS Troubleshooting Guide |
| API server BLB instance status | Make sure the API server BLB instance is functioning properly. | Access the application BLB instance page, locate the BLB instance associated with the cluster, and review the instance status in the details section. |
| API server BLB instance existence | Verify the existence of the API server BLB instance. | Confirm the existence of the BLB instance linked to the cluster on the application BLB instance page. |
| API server BLB 6443 port listening | Ensure the API server BLB port 6443 is configured correctly for listening. | Navigate to the application BLB instance page, locate the BLB linked to the cluster, and examine the target group configuration of the BLB instance. |
| Availability zone consistency between node and container subnet | Ensure that the node and container subnet are within the same availability zone when operating in VPC-ENI mode. | Access the CCE cluster details, locate the container network, add a subnet, and confirm that the node and subnet belong to the same availability zone. |
| Subnet IP remaining | Confirm that there are enough available IPs remaining in the subnet for VPC-ENI mode. | Access the CCE cluster details, locate the container network, and add an appropriate subnet. |
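For the DNS and API server items, the sketch below shows one way to verify them manually. It assumes the cluster DNS service is named kube-dns in the kube-system namespace and its pods carry the label k8s-app=kube-dns (the common Kubernetes convention); <apiserver-blb-address> is a placeholder:

```bash
# The DNS service should have a cluster IP assigned
kubectl get svc kube-dns -n kube-system

# CoreDNS pods should be Running; review their recent logs
kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50

# From a node, confirm the API server BLB address is listening on port 6443
nc -vz <apiserver-blb-address> 6443
```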
GPU node (GPUNode)
| Diagnostic item name | Description | Recommendations |
|---|---|---|
| GPU node status | Verify that the node’s GPUs are functioning normally. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU allocable count | Ensure that the number of GPUs allocable on the node is as expected. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID48Error | Inspect for double-bit ECC errors in the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID62Error | Check the NVIDIA GPU for internal micro-controller halts (applicable to newer drivers). | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID64Error | Inspect the NVIDIA GPU for ECC page retirement or row remapper recording issues. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID74Error | Check for NVLINK errors on the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID79Error | Verify if the NVIDIA GPU has disconnected from the bus. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID95Error | Look for uncontained ECC errors on the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID109Error | Check for Context Switch Timeout errors on the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID140Error | Inspect for unrecovered ECC errors in the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XIDError | Check for XID errors related to the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA SXIDError | Investigate for SXID errors on the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA row remapper failure | Verify if the NVIDIA GPU has experienced row remapper failures. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA Device Plugin GPU disconnection | Check whether the NVIDIA Device Plugin indicates any GPU disconnection. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA infoROM integrity | Verify if the NVIDIA GPU infoROM has been corrupted. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA ECC error | Inspect for any NVIDIA GPU ECC errors. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU high temperature alert | Ensure the NVIDIA GPU temperature is within the normal range. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU operation mode | Confirm if the NVIDIA GPU is operating in the normal mode. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA-SMI status code | Review the nvidia-smi status code. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| PCI configuration read/write | Determine if PCI configuration read/write operations are failing. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| PCI address access | Ensure lspci is able to read the GPU configuration space. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU bandwidth | Validate if the GPU bandwidth is functioning properly. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU power consumption alert | Verify if the GPU power consumption is within normal levels. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU driver accessibility | Ensure the GPU driver is being accessed correctly. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU recognition | Validate that the NVIDIA GPUs on the bus are recognized by both the driver and nvidia-smi. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA-Container-Toolkit version | Confirm if the NVIDIA-Container-Toolkit version matches the cluster version. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA-Container-Toolkit configuration | Ensure that the NVIDIA-Container-Toolkit is configured correctly in the container runtime. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA-Container-Toolkit status | Check if the NVIDIA-Container-Toolkit is functioning normally. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| Abnormal process on GPU nodes | Identify any abnormal processes running on the GPU node. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| HAS status | Verify that the HAS status is operating normally. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| HAS version | Ensure the HAS version is supported. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU disconnection status | Check if the NVIDIA GPU has been disconnected from the bus or is otherwise inaccessible. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA ECC error limit | Inspect if NVIDIA GPU ECC memory correction errors have exceeded the permissible limit. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU interconnect link mode | Determine if the NVIDIA GPU interconnect link mode is functioning properly (SYS or NODE mode may cause speed degradation). | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU interconnect link alert | Check for NVLink & NVSwitch errors. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU interconnect service error | Verify if the GPU interconnect service (FabricManager) is operating normally. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVLink status | Verify if NVLink is either disconnected or inactive. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| CUDA version | Confirm whether the installed CUDA version is supported. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU driver version | Confirm whether the GPU driver version is supported. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA device power cable connection | Ensure that the NVIDIA GPU device power cables are properly connected. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVLink quantity | Verify if there is any reduction in the number of NVLink connections. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU-NIC connection type | Ensure that the NIC is inserted into the correct slot. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| Node network interface card status | Check the overall status of the node’s NIC. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NIC PCI address availability | Confirm whether the NIC PCI address is accessible. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NIC channel quantity | Verify if the number of NIC channels has reached the maximum supported. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NIC bandwidth | Check whether the NIC bandwidth has reached its maximum supported capacity. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
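Many of the GPU checks above map to standard NVIDIA tooling, so they can be reproduced manually on the GPU node. The sketch below assumes the NVIDIA driver and nvidia-smi are installed; it is illustrative rather than an exact replica of the diagnostic logic:

```bash
# Overall GPU visibility, temperature, power draw and utilization
nvidia-smi

# ECC error counters and retired pages
nvidia-smi -q -d ECC,PAGE_RETIREMENT

# Per-GPU NVLink status
nvidia-smi nvlink --status

# Recent XID errors reported by the driver in the kernel log
dmesg | grep -i xid | tail -n 20

# GPUs visible on the PCI bus (compare against the count the driver reports)
lspci | grep -i nvidia
```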
Pod diagnostic items & descriptions
| Diagnostic item name | Description | Recommendations |
|---|---|---|
| Number of pod container restarts | Count the number of container restarts in the pod to identify any abnormal restart patterns. | Check the Pod status and logs. For more information, see Pod Anomaly Troubleshooting |
| Pod container image download | Verify if the node hosting the pod is facing image download interference from other pods (to prevent resource contention). | Check the Pod status and logs. For more information, see Pod Anomaly Troubleshooting |
| Pod image pull secrets validity | Confirm whether the secrets required for the pod’s image pull are valid (to avoid errors during image retrieval). | Check the Pod status and logs. For more information, see Pod Anomaly Troubleshooting |
| Pod memory utilization | Check whether the pod’s memory usage is ≤ 95% (to avoid OOM issues affecting services). | 1. Edit the workload YAML via the Workloads page or kubectl, find the resources field, and adjust resource quotas (request, limit). 2. Increase the desired number of pods via the Workloads page (click Scale) or edit the workload YAML via kubectl. 3. Configure auto scaling (HPA). |
| Pod CPU utilization | Ensure that the pod’s CPU usage is ≤ 95% (to avoid resource contention affecting services). | 1. Edit the workload YAML via the Workloads page or kubectl, find the resources field, and adjust resource quotas (request, limit). 2. Increase the desired number of pods via the Workloads page (click Scale) or edit the workload YAML via kubectl. 3. Configure auto scaling (HPA). |
| Pod-to-CoreDNS pod connectivity | Check if the pod can access CoreDNS pods normally. | Test the connectivity between the pod and the CoreDNS pod. |
| Pod-to-CoreDNS service connectivity | Check if the pod can access the CoreDNS service normally. | Test the connectivity between the pod and the CoreDNS service. |
| Pod-to-host network DNS connectivity | Check if the pod can access the host network’s DNS server normally. | Test the DNS connectivity between the pod and the host network. |
| Pod initialization status | Check if the pod has completed its initialization and entered the normal running state. | Check the Pod status and logs. For more information, see Pod Anomaly Troubleshooting |
| Pod scheduling status | Verify if the pod has been successfully scheduled onto its target node. | Check the Pod status and logs. For more information, see Pod Anomaly Troubleshooting |
| Pod schedulability | Confirm whether the pod meets scheduling requirements and can be allocated to an appropriate node. | Check the Pod status and logs. For more information, see Pod Anomaly Troubleshooting |
| Pod status | Verify if the current status of the pod aligns with expectations (e.g., running, pending). | Check the Pod status and logs. For more information, see Pod Anomaly Troubleshooting |
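Most of the recommendations above can be carried out with kubectl. The sketch below is illustrative; <pod-name> is a placeholder and the deployment name my-app is hypothetical:

```bash
# Restart count, scheduling state, status and recent events for a pod
kubectl get pod <pod-name> -o wide
kubectl describe pod <pod-name>

# Logs from the last terminated container, useful after restarts
kubectl logs <pod-name> --previous --tail=100

# Current CPU/memory usage (requires metrics-server)
kubectl top pod <pod-name>

# Scale out or enable HPA for the owning workload
kubectl scale deployment my-app --replicas=3
kubectl autoscale deployment my-app --min=2 --max=5 --cpu-percent=80

# DNS connectivity from inside the pod (the container image must provide nslookup)
kubectl exec <pod-name> -- nslookup kubernetes.default
```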
