Fault Diagnosis

Updated at: 2025-10-27

Overview

The fault diagnosis feature of Baidu AI Cloud Container Engine (CCE) offers automated anomaly detection and root cause identification. It provides comprehensive health checks of core cluster components, enabling fast detection of common configuration issues, resource constraints, and component malfunctions.

Prerequisites

  • A CCE Kubernetes cluster has been created. For specific operations, see Create Cluster.
  • The Kubernetes cluster is running normally.

Diagnosis function introduction

The current version focuses on diagnosing resource anomalies at the node level and pod level. For detailed diagnostic items, refer to the descriptions below.

Enable fault diagnosis

Important: The fault diagnosis feature collects information such as the system version, load, the operational status of components like Docker and Kubelet, and critical error details from system logs. The entire diagnostic process adheres to data security standards and excludes any business or sensitive data.

The operations for node diagnosis and pod diagnosis are similar. The following uses node diagnosis as an example to explain how to use the fault diagnosis feature.
Method I:

  1. Sign in to the Baidu AI Cloud Management Console, navigate to Product Services - Cloud Native - Cloud Container Engine (CCE), and click Cluster Management - Cluster List to enter the cluster list page.
  2. Click the name of the target cluster. In the left navigation bar, under Inspection & Diagnosis, select Fault Diagnosis.
  3. On the Fault Diagnosis page, select the Node Diagnosis tab and click Diagnose Now.
  4. In the Node Diagnosis pop-up window, select the node name. After carefully reading the notes, check the box for I understand and agree, then click OK to start the diagnosis.

Method II:

  1. On the node list page of the target cluster, select Fault Diagnosis in the Operation column.
  2. In the Node Diagnosis pop-up window, after carefully reading the notes, check the box for I understand and agree, then click OK to start the diagnosis.

Once the diagnosis is initiated, you can monitor its progress in the task list under the status column.

View diagnosis results

On the Fault Diagnosis page, go to the diagnosis list, locate the target diagnosis record, and click Diagnosis Details. The Diagnosis Details page provides a complete view of the diagnostic results.

Node diagnostic items & descriptions


Node


| Diagnostic item name | Description | Fix solution |
| --- | --- | --- |
| Node status | Ensure that the node status is "Ready". | Try restarting the node. |
| Node scheduling status | Confirm that the node is not flagged as unschedulable. | If the node is unschedulable, check its taint settings. For specific operations, see Set Node Taints. |
| BCC instance existence | Verify whether the BCC instance associated with the node exists. | Check the status of the BCC instance for any issues. |
| BCC instance health | Ensure that the BCC instance linked to the node is functioning properly. | Check the status of the BCC instance for any issues. |
| Node CPU usage | Confirm that the node's current CPU usage falls within the expected normal range. | None |
| Node memory usage | Verify that the node's current memory usage is within normal parameters. | None |
| Node weekly CPU utilization | Ensure that the node's CPU usage has not been consistently high over the past week, to prevent resource contention from impacting services. | To minimize business impact: 1. Configure appropriate pod requests and limits; 2. Avoid deploying too many pods on a single node. |
| Node weekly memory utilization | Confirm that the node's memory usage has not been consistently high over the past week, to prevent OOM (Out of Memory) issues that might affect services. | To minimize business impact: 1. Configure appropriate pod requests and limits; 2. Avoid deploying too many pods on a single node. |
| Node OOM event | Check whether the node has encountered any OOM (Out of Memory) events. | Log in to the node and view its kernel logs: /var/log/messages |
| Kubelet status | Ensure the node's kubelet process is running correctly. | Review the node's kubelet logs for any anomalies. |
| Containerd status | Verify that the node's containerd service is functioning as expected. | Log in to the node and view its kernel logs: /var/log/messages |
| Docker status | Ensure the node's Docker service is running smoothly. | Log in to the node and view its kernel logs: /var/log/messages |
| Docker hang check | Check whether the node has experienced Docker service hangs. | If necessary, log in to the node and restart the Docker service with systemctl restart docker. |
| API server connectivity | Verify that the node can connect to the cluster API server without issues. | Inspect the cluster-related configurations. |
| Node DNS service | Ensure the node can use the host DNS service correctly. | Check whether the host DNS service is normal. For more information, refer to DNS Troubleshooting Guide. |
| Cluster DNS service | Check whether the node can access the cluster IP of the kube-dns service and use the cluster's DNS service normally. | Check the running status and logs of the CoreDNS pods. For more information, refer to DNS Troubleshooting Guide. |
| Cluster CoreDNS pod availability | Confirm the node can access the pod IPs of the cluster's CoreDNS pods without problems. | Ensure the node can reach the CoreDNS pod IPs successfully. |
| Containerd image pull | Verify that the node's containerd is able to pull images properly. | Examine the node's network and image settings. |
| Docker image pull status | Check whether the node's Docker engine can pull images as expected. | Examine the node's network and image settings. |

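Several of the node checks in the table above can be reproduced manually when a result needs a closer look. Below is a minimal sketch using standard kubectl and Linux commands; the node name demo-node and the bracketed addresses are placeholders, and the kernel log path may differ by OS image (for example, /var/log/syslog on some distributions).

```bash
# From a machine with kubectl access: node readiness, taints, and recent conditions
kubectl get node demo-node -o wide
kubectl describe node demo-node | grep -A 5 -i taints

# On the node itself (for example via SSH): status of kubelet and the container runtime
systemctl status kubelet
systemctl status containerd docker 2>/dev/null   # usually only one runtime is installed

# Look for OOM events in the kernel log
grep -iE 'out of memory|oom-killer' /var/log/messages | tail -n 20

# Check API server connectivity and cluster DNS resolution from the node
curl -sk https://<apiserver-address>:6443/healthz                 # replace with your cluster endpoint
nslookup kubernetes.default.svc.cluster.local <cluster-dns-ip>    # replace with the kube-dns cluster IP
```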
Node core component (NodeComponent)


| Diagnostic item name | Description | Recommendations |
| --- | --- | --- |
| CNI component status | Ensure the node's CNI component is operating normally. | Submit a support ticket for further assistance. |
| CSI component status | Ensure the node's CSI component is functioning correctly. | Navigate to CCE Cluster > O&M & Management > Component Management > Storage, and verify the status of the cluster's storage components. |
| Network agent status | Check that the node's network agent service is operating as expected. | Submit a support ticket for further assistance. |
| Network operator status | Confirm that the cluster's network operator service is running appropriately. | Submit a support ticket for further assistance. |
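If one of these component checks fails, listing the system pods scheduled on the node often narrows down the cause before a ticket is filed. A minimal sketch, assuming kubectl access; demo-node and the pod name are placeholders, and the actual component pod names depend on which CCE components are installed in your cluster.

```bash
# All kube-system pods running on the affected node (CNI, CSI, network agent, etc.)
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=demo-node

# Drill into a component pod that is not Running (name is illustrative)
kubectl -n kube-system describe pod <component-pod-name>
kubectl -n kube-system logs <component-pod-name> --tail=50
```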

Cluster component (ClusterComponent)


| Diagnostic item name | Description | Recommendations |
| --- | --- | --- |
| Pod CIDR block remaining | Ensure the cluster has at least five available pod CIDR blocks in VPC routing mode, to prevent new nodes from failing due to CIDR block exhaustion. | Submit a support ticket for further assistance. |
| DNS service cluster IP | Verify that the cluster IP of the cluster's DNS service is assigned normally (DNS service anomalies cause cluster function failures and affect services). | Check the running status and logs of the CoreDNS pods. For more information, refer to DNS Troubleshooting Guide. |
| API server BLB instance status | Make sure the API server BLB instance is functioning properly. | Access the application BLB instance page, locate the BLB instance associated with the cluster, and review the instance status in the details section. |
| API server BLB instance existence | Verify the existence of the API server BLB instance. | Confirm the existence of the BLB instance linked to the cluster on the application BLB instance page. |
| API server BLB 6443 port listening | Ensure the API server BLB is listening on port 6443 as expected. | Navigate to the application BLB instance page, locate the BLB linked to the cluster, and examine the target group configuration of the BLB instance. |
| Availability zone consistency between node and container subnet | Ensure that the node and container subnet are within the same availability zone when operating in VPC-ENI mode. | Access the CCE cluster details, locate the container network, add a subnet, and confirm that the node and subnet belong to the same availability zone. |
| Subnet IP remaining | Confirm that there are enough available IPs remaining in the subnet in VPC-ENI mode. | Access the CCE cluster details, locate the container network, and add an appropriate subnet. |
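A few of these cluster-level items can be spot-checked with kubectl and curl before escalating. A minimal sketch; it assumes the cluster DNS Service uses the conventional kube-dns name and k8s-app=kube-dns label in kube-system, which may differ in your cluster, and the bracketed address is a placeholder.

```bash
# DNS service cluster IP and CoreDNS pod health
kubectl -n kube-system get svc kube-dns
kubectl -n kube-system get pods -l k8s-app=kube-dns -o wide

# Pod CIDR allocated to each node (VPC routing mode), to compare against the cluster container CIDR
kubectl get nodes -o custom-columns=NAME:.metadata.name,POD_CIDR:.spec.podCIDR

# Verify the API server endpoint behind the BLB answers on port 6443
curl -sk https://<apiserver-address>:6443/healthz   # replace with your cluster endpoint
```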

GPU node (GPUNode)


| Diagnostic item name | Description | Recommendations |
| --- | --- | --- |
| GPU node status | Verify that the GPU status of the node is normal. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU allocable count | Ensure that the number of GPUs allocable on the node is as expected. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID48Error | Inspect for double-bit ECC errors in the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID62Error | Check the NVIDIA GPU for internal micro-controller halts (applicable to newer drivers). | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID64Error | Inspect the NVIDIA GPU for ECC page retirement or row remapper recording issues. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID74Error | Check for NVLINK errors on the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID79Error | Verify whether the NVIDIA GPU has disconnected from the bus. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID95Error | Look for uncontained ECC errors on the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID109Error | Check for Context Switch Timeout errors on the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XID140Error | Inspect for unrecovered ECC errors in the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA XIDError | Check for XID errors related to the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA SXIDError | Investigate for SXID errors on the NVIDIA GPU. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA row remapper failure | Verify whether the NVIDIA GPU has experienced row remapper failures. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA Device Plugin GPU disconnection | Check whether the NVIDIA Device Plugin indicates any GPU disconnection. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA infoROM integrity | Verify whether the NVIDIA GPU infoROM has been corrupted. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA ECC error | Inspect for any NVIDIA GPU ECC errors. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU high temperature alert | Ensure the NVIDIA GPU temperature is within the normal range. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU operation mode | Confirm whether the NVIDIA GPU is operating in the normal mode. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA-SMI status code | Review the nvidia-smi status code. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| PCI configuration read/write | Determine whether PCI configuration read/write operations are failing. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| PCI address access | Ensure lspci is able to read the GPU configuration space. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU bandwidth | Validate that the GPU bandwidth is normal. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU power consumption alert | Verify that the GPU power consumption is within normal levels. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU driver accessibility | Ensure the GPU driver can be accessed correctly. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU recognition | Validate that the NVIDIA GPUs on the bus are recognized by both the driver and nvidia-smi. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA-Container-Toolkit version | Confirm that the NVIDIA-Container-Toolkit version matches the cluster version. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA-Container-Toolkit configuration | Ensure that the NVIDIA-Container-Toolkit is configured correctly in the container runtime. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA-Container-Toolkit status | Check whether the NVIDIA-Container-Toolkit is functioning normally. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| Abnormal process on GPU nodes | Identify any abnormal processes running on the GPU node. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| HAS status | Verify that the HAS status is normal. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| HAS version | Ensure the HAS version is supported. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU disconnection status | Check whether the NVIDIA GPU has been disconnected from the bus or is otherwise inaccessible. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA ECC error limit | Inspect whether NVIDIA GPU ECC memory correction errors have exceeded the permissible limit. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU interconnect link mode | Determine whether the NVIDIA GPU interconnect link mode is normal (SYS or NODE mode may cause speed degradation). | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU interconnect link alert | Check for NVLink and NVSwitch errors. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA GPU interconnect service error | Verify that the GPU interconnect service (FabricManager) is operating normally. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVLink status | Verify whether NVLink is disconnected or inactive. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| CUDA version | Confirm whether the installed CUDA version is supported. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU driver version | Confirm whether the GPU driver version is supported. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVIDIA device power cable connection | Ensure that the NVIDIA GPU device power cables are properly connected. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NVLink quantity | Verify whether the number of NVLink connections has decreased. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| GPU-NIC connection type | Ensure that the NIC is inserted into the correct slot. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| Node network interface card status | Check the overall status of the node's NIC. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NIC PCI address unavailability | Confirm whether the NIC PCI address is accessible. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NIC channel quantity | Verify whether the number of NIC channels has reached the maximum supported. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
| NIC bandwidth | Check whether the NIC bandwidth has reached its maximum supported capacity. | Attempt to restart the GPU node; if the issue persists after the restart, submit a support ticket. |
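When a GPU item fails, standard NVIDIA tooling can confirm the symptom before restarting the node or opening a ticket. A minimal sketch, run on the GPU node itself; it assumes the NVIDIA driver and nvidia-smi are installed, and kernel log access may vary by OS image.

```bash
# Overall GPU health: temperature, power draw, and ECC counters
nvidia-smi
nvidia-smi -q -d TEMPERATURE,POWER,ECC

# XID / SXID events are reported in the kernel log
dmesg -T | grep -iE 'xid|nvrm' | tail -n 20

# Interconnect topology (SYS/NODE vs NVLink) and NVLink status
nvidia-smi topo -m
nvidia-smi nvlink --status

# Confirm the GPUs are visible on the PCI bus
lspci | grep -i nvidia
```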

Pod diagnostic items & descriptions


| Diagnostic item name | Description | Recommendations |
| --- | --- | --- |
| Number of pod container restarts | Count the number of container restarts in the pod to identify any abnormal restart patterns. | Check the pod status and logs. For more information, see Pod Anomaly Troubleshooting. |
| Pod container image download | Verify whether the node hosting the pod is facing image download interference from other pods (to prevent resource contention). | Check the pod status and logs. For more information, see Pod Anomaly Troubleshooting. |
| Pod image pull secrets validity | Confirm whether the secrets required for the pod's image pull are valid (to avoid errors during image retrieval). | Check the pod status and logs. For more information, see Pod Anomaly Troubleshooting. |
| Pod memory utilization | Check that the pod's memory usage is ≤ 95% (to avoid OOM affecting services). | 1. Edit the workload YAML via the Workloads page or kubectl, find the resources field, and adjust resource quotas (request, limit). 2. Increase the desired number of pods via the Workloads page (click Scale) or edit the workload YAML via kubectl. 3. Configure auto scaling (HPA). |
| Pod CPU utilization | Ensure that the pod's CPU usage is ≤ 95% (to avoid resource contention affecting services). | 1. Edit the workload YAML via the Workloads page or kubectl, find the resources field, and adjust resource quotas (request, limit). 2. Increase the desired number of pods via the Workloads page (click Scale) or edit the workload YAML via kubectl. 3. Configure auto scaling (HPA). |
| Pod-to-CoreDNS pod connectivity | Check whether the pod can access the CoreDNS pods normally. | Test the connectivity between the pod and the CoreDNS pods. |
| Pod-to-CoreDNS service connectivity | Check whether the pod can access the CoreDNS service normally. | Test the connectivity between the pod and the CoreDNS service. |
| Pod-to-host network DNS connectivity | Check whether the pod can access the host network's DNS server normally. | Test the DNS connectivity between the pod and the host network. |
| Pod initialization status | Check whether the pod has completed initialization and entered the normal running state. | Check the pod status and logs. For more information, see Pod Anomaly Troubleshooting. |
| Pod scheduling status | Verify whether the pod has been successfully scheduled onto its target node. | Check the pod status and logs. For more information, see Pod Anomaly Troubleshooting. |
| Pod schedulability | Confirm whether the pod meets scheduling requirements and can be allocated to an appropriate node. | Check the pod status and logs. For more information, see Pod Anomaly Troubleshooting. |
| Pod status | Verify whether the current status of the pod aligns with expectations (e.g., Running, Pending). | Check the pod status and logs. For more information, see Pod Anomaly Troubleshooting. |
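Most of the pod-level items above can also be checked directly with kubectl. A minimal sketch with placeholder names (demo-pod and demo-deploy in the default namespace); kubectl top needs a metrics provider such as metrics-server, and the DNS test requires the pod image to include nslookup.

```bash
# Restart counts, scheduling status, and recent events
kubectl -n default get pod demo-pod -o wide
kubectl -n default describe pod demo-pod

# Current CPU / memory usage (requires a metrics provider)
kubectl -n default top pod demo-pod

# DNS and CoreDNS connectivity from inside the pod
kubectl -n default exec demo-pod -- nslookup kubernetes.default

# Remediation sketches: adjust resources, scale out, or configure HPA
kubectl -n default edit deployment demo-deploy          # adjust resources.requests/limits
kubectl -n default scale deployment demo-deploy --replicas=3
kubectl -n default autoscale deployment demo-deploy --min=2 --max=5 --cpu-percent=80
```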
