Cluster Inspection

Updated at: 2025-10-27

Cloud Container Engine (CCE) provides a cluster inspection feature that helps identify potential risks in container service clusters. The inspection items are updated regularly and cover resource quotas, cluster vulnerabilities, resource status, and more. For abnormal inspection results, the feature also suggests solutions, which improves O&M efficiency. This document explains how to use cluster inspection to detect potential cluster issues.

Prerequisites

  • A CCE Kubernetes cluster has been created. For details, see Create Cluster.
  • The Kubernetes cluster is running normally. To confirm this, open the CCE console, go to the Cluster List page, and check the status of the target cluster. A status of Running means the cluster is functioning as expected.

Enable cluster inspection

Important: When you use the cluster inspection feature, certain checks start containers in your cluster to gather data. The collected data includes the system version, load, the operational status of Docker and kubelet, and critical error information in system logs. The data collection process does not collect your business data or sensitive information.

  1. Log in to the Baidu AI Cloud Management Console, navigate to Product Services > Cloud Native > Cloud Container Engine (CCE), and click Cluster Management > Cluster List to access the cluster list page.
  2. Click the name of the target cluster. In the left navigation bar, select Inspection & Diagnosis > Cluster Inspection.
  3. In the upper-right corner of the Cluster Inspection page, click Automatic Inspection & Subscription Settings to set up the schedule for automatic inspections, as well as the delivery time and method for subscribed inspection reports.
  4. Alternatively, you can manually inspect the cluster by clicking Start Inspection on the Cluster Inspection page. Once the inspection is complete, the relevant information will be displayed in the Report List section.

View inspection results

  1. Log in to the Baidu AI Cloud Management Console, navigate to Product Services > Cloud Native > Cloud Container Engine (CCE), and click Cluster Management > Cluster List to access the cluster list page.
  2. Click the name of the target cluster. In the left navigation bar, select Inspection & Diagnosis > Cluster Inspection.
  3. In the Operation column of the inspection report list (on the Cluster Inspection page), click the ID of the target inspection report to view details.
    • Cluster inspection risks are categorized by severity: low, medium, and high. If an inspection item fails to execute for an unknown reason, its status is displayed as Unknown. If needed, you can submit a support ticket.
    • Detailed cluster inspection content includes the risk level, name of the risk item, impact of anomalies, and solutions. For more information on common risk warnings and remedies for cluster inspections, refer to Cluster inspection items and solutions.
  4. On the inspection report page, review the risk items, their impacts, and the recommended solutions.
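
When a report lists many risk items, it can help to work through them from the most severe to the least. The following is a minimal sketch of that triage order, assuming the report items have been copied by hand into a list of dictionaries. The field names (`level`, `name`) and the sample items are hypothetical and do not reflect any CCE API or export format.

```python
# Hypothetical triage helper. Field names and sample data are illustrative
# only; they are not the format of any CCE API or report export.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2, "unknown": 3}


def triage(items):
    """Return the items sorted so that the highest-severity risks come first."""
    return sorted(
        items,
        # Unrecognized levels sort after all known ones.
        key=lambda item: SEVERITY_ORDER.get(item["level"].lower(), len(SEVERITY_ORDER)),
    )


# Made-up sample report; the level assigned to each item is arbitrary.
report = [
    {"level": "Low", "name": "High node disk usage"},
    {"level": "Unknown", "name": "Tight EIP instance quota"},
    {"level": "High", "name": "Insufficient cluster allocable CPU"},
    {"level": "Medium", "name": "High pod memory utilization"},
]

for item in triage(report):
    print(f"{item['level']:<8}{item['name']}")
```

Items with the Unknown status are placed last here because they indicate a check that did not run rather than a confirmed risk. Sorting them first is equally defensible if you prefer to investigate failed checks before anything else.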

Cluster inspection items and solutions

Types Inspection item Impact of anomaly Recommendations
Resource quota Tight VPC routing rule quota Checks if the remaining route table entry quota in the VPC is less than 5.
In VPC routing mode, each cluster node consumes one route table rule. When route table rules are exhausted, new nodes cannot be added to the cluster. (In VPC-ENI mode, clusters do not use VPC route tables)
Go to the Quota Center to apply for an increase in VPC routing rule quota.
Tight EIP instance quota Checks if the number of personal/enterprise EIP instances creatable in the cluster’s region is less than 5.
Insufficient EIP quota may affect public network access for clusters, nodes, and services.
Go to the Quota Center to apply for an increase in EIP instance quota.
Tight ENI instance quota Checks if the number of elastic network interfaces that can be created (but not attached to instances) per VPC is less than 5.
Insufficient ENI quota may prevent node creation and addition.
Go to the Quota Center to apply for an increase in ENI instance quota.
Tight BLB instance quota Checks if the number of BLB instances creatable in the cluster’s region is less than 5.
Insufficient BLB quota may affect the creation of services and ingresses.
Go to the Quota Center to apply for an increase in BLB instance quota.
Tight on-demand BCC instance quota Checks if the number of on-demand BCC instances in the cluster’s region exceeds 95% of the quota.
Insufficient quota affects node creation.
Go to the Quota Center to apply for an increase in on-demand BCC instance quota.
Tight CDS capacity Checks if CDS disk usage in the cluster’s region exceeds 95% of the total capacity (TB).
Insufficient quota affects node and persistent volume creation.` | Go to the Quota Center to apply for an increase in CDS capacity quota.
Tight available stock for node group instance specifications Check if the available stock of node group instance specifications is less than 15. Insufficient stock may impact the scaling of the node group. Submit a BCC ticket to increase the available stock of the instance specification, or use other BCC instance specifications
Resource utilization Insufficient cluster allocable CPU Check if the allocated CPU on nodes exceeds 80%.
If allocable CPU is less than the pod request value, new pods cannot be created.
1. Increase the number of nodes.
2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource quotas (request, limit).
Insufficient cluster allocable memory Check if the allocated memory on nodes exceeds 80%.
If allocable memory is less than the pod request value, new pods cannot be created.
1. Increase the number of nodes.
2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource quotas (request, limit).
High node real-time CPU usage Check if node CPU usage exceeds 80%.
Excessively high usage may cause CPU resource preemption and affect normal business operations.
1. Increase the number of nodes.
2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource settings (requests, limits).
High node real-time memory utilization Check if node memory usage exceeds 80%.
Excessively high usage may cause OOM (Out of Memory) and affect normal business operations.
1. Increase the number of nodes.
2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource settings (requests, limits).
High pod CPU usage Check if the CPU usage of the workload exceeds 95%.
Excessively high usage may cause CPU resource preemption and affect normal business operations.
1. Edit the workload YAML via the Workloads page or kubectl, find the resources field, and adjust the resource settings (requests, limits).
2. Increase the desired number of pods via the Workloads page (click Scale) or edit the workload YAML via kubectl.
3. Configure an auto scaling policy (HPA).
High pod memory utilization Check if the memory usage of the workload exceeds 95%.
Excessively high usage may cause OOM (Out of Memory) and affect normal business operations.
1. Edit the workload YAML via the Workloads page or kubectl, find the resources field, and adjust the resource settings (requests, limits).
2. Increase the desired number of pods via the Workloads page (click Scale) or edit the workload YAML via kubectl.
3. Configure an auto scaling policy (HPA).
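When choosing an HPA target, it can help to see how the Kubernetes Horizontal Pod Autoscaler derives a replica count from observed utilization. The sketch below follows the formula in the upstream Kubernetes documentation, desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), with the default 10% tolerance. The function name and sample values are hypothetical, and the sketch ignores min/max replica bounds and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     tolerance=0.1):
    """Replica count the HPA formula would suggest for one metric."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical workload: 4 pods at 95% CPU utilization with a 60% target.
scaled = desired_replicas(4, 95, 60)
```

For these sample values the formula suggests scaling from 4 to 7 pods. HPA utilization is measured relative to the pod's resource requests, so the requests need to be set for the policy to work.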
High node disk usage Check if node disk usage exceeds 80%.
Excessively high usage may cause pod eviction and affect normal business operations.
Clean up temporary files.
Increase disk capacity.
High node root directory usage Check if node root directory usage exceeds 80%.
Excessively high usage may affect normal business operations.
Clean up temporary files.
Increase disk capacity.
High cluster PFS usage Check if PFS usage exceeds 80%.
When PFS usage reaches 100%, no incremental data can be written to the file system, affecting normal business operations.
Submit a PFS ticket to request capacity expansion.
Insufficient remaining pod CIDR blocks (VPC routing mode) Check if fewer than 5 pod CIDR blocks remain available in the cluster (VPC routing mode).
Each node consumes one pod CIDR block, so fewer than 5 remaining blocks means that fewer than 5 new nodes can be added. Once the blocks are depleted, pods on new nodes will not function properly.
Submit a CCE ticket to request capacity expansion.
Insufficient remaining subnet IPs (VPC-ENI mode) Check if fewer than 10 IPs remain in the cluster's assigned subnet (VPC-ENI mode).
Each pod requires one IP. When IPs are depleted, new pods cannot be assigned IPs and will fail to start.
Go to the CCE cluster details, locate the container network, and add an appropriate subnet.
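The two network capacity checks above come down to simple address arithmetic. The sketch below estimates how many per-node pod CIDR blocks and how many subnet IPs remain. The CIDR ranges, the per-node prefix length, and the count of reserved addresses are all assumptions for illustration. The actual values depend on how the cluster and VPC were configured.

```python
import ipaddress


def remaining_pod_cidr_blocks(cluster_cidr, node_prefix, node_count):
    """Blocks left when each node takes one /node_prefix block from the cluster CIDR."""
    network = ipaddress.ip_network(cluster_cidr)
    if node_prefix < network.prefixlen:
        raise ValueError("node block cannot be larger than the cluster CIDR")
    total_blocks = 2 ** (node_prefix - network.prefixlen)
    return total_blocks - node_count


def remaining_subnet_ips(subnet_cidr, used_ips, reserved=0):
    """IPs left in a subnet after subtracting used and reserved addresses."""
    return ipaddress.ip_network(subnet_cidr).num_addresses - reserved - used_ips


# Hypothetical VPC routing cluster: /16 pod CIDR, one /24 per node, 253 nodes.
blocks_left = remaining_pod_cidr_blocks("172.16.0.0/16", 24, 253)

# Hypothetical VPC-ENI subnet: /24 with 3 reserved addresses and 240 pod IPs in use.
ips_left = remaining_subnet_ips("192.168.1.0/24", 240, reserved=3)
```

In this example 3 pod CIDR blocks remain, which is below the threshold of 5, while 13 subnet IPs remain, which is above the threshold of 10.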
High weekly node CPU usage Check if node CPU usage in the past week exceeds 80%.
Excessively high usage may cause CPU resource preemption and affect normal business operations.
1. Increase the number of nodes.
2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource settings (requests, limits).
High weekly node memory utilization Check if the average memory usage of the node in the past week exceeds 80%.
Excessively high usage may cause OOM (Out of Memory) and affect normal business operations.
1. Increase the number of nodes.
2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource settings (requests, limits).
High daily node CPU usage Check if node CPU usage in the past day exceeds 80%.
Excessively high usage may cause CPU resource preemption and affect normal business operations.
1. Increase the number of nodes.
2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource settings (requests, limits).
High daily node memory utilization Check if the average memory usage of the node in the past day exceeds 80%.
Excessively high usage may cause OOM (Out of Memory) and affect normal business operations.
1. Increase the number of nodes.
2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource settings (requests, limits).
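The weekly and daily items differ from the real-time items in that they look at usage over a window rather than at a single moment. As a rough illustration, assuming the check compares the window average with the 80% threshold (the memory rows say "average"; the CPU rows do not state the aggregation), the comparison might look like this. The sample values and function name are hypothetical.

```python
from statistics import mean


def average_exceeds(samples, threshold=80.0):
    """True when the mean of the usage samples (in percent) is above the threshold."""
    if not samples:
        raise ValueError("at least one sample is required")
    return mean(samples) > threshold


# Hypothetical daily usage percentages for one node over a week.
weekly_usage = [70, 75, 95, 78, 72, 74, 76]
flagged = average_exceeds(weekly_usage)
```

Here a single 95% peak does not trigger the windowed check because the weekly average stays below 80%, although the same peak would trigger the real-time check at the moment it occurred.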
Cluster risks Outdated Kubernetes version Check if the cluster’s Kubernetes version is about to expire or has expired.
CCE only ensures stable operation for the latest three even-numbered Kubernetes versions. Outdated clusters risk unstable operation and upgrade failures.
Refer to Upgrade Cluster Kubernetes Version.
Excessive node count Check if the number of cluster nodes exceeds the cluster specification limit.
Exceeding the limit may exhaust control plane resources and cause node group scaling failures.
Submit a CCE ticket to upgrade the cluster specifications.
Cluster deletion protection disabled Check if cluster deletion protection is enabled.
If disabled, the cluster may be accidentally deleted via the console or API, causing business failures.
Enable the cluster deletion protection feature. (Go to Cluster Details > Basic Info > Cluster Deletion Protection)
Audit logs disabled Check if the audit logs are enabled.
Enabling cluster audit logs facilitates daily troubleshooting.
Enable cluster audit.
Insufficient worker nodes (ready) Check if the cluster has fewer than 2 ready worker nodes.
A cluster with only one node has a single point of failure.
Add a node to the cluster.
Abnormal CoreDNS status Check if the CoreDNS component is in a non-Running state.
Component anomalies cause intra-cluster DNS resolution errors, preventing access via service names.
Check the status of the CoreDNS component and fix any detected anomalies.
Outdated CoreDNS version Check if CoreDNS is running the latest version.
An outdated CoreDNS version in the cluster may cause business DNS resolution issues. The latest CoreDNS version offers better stability and new features.
Upgrade CoreDNS (in the cluster's left navigation pane, go to O&M & Management > Component Management > Network > CoreDNS, and click the Upgrade button located in the component’s lower right corner). For detailed manual upgrade instructions, visit: https://cloud.baidu.com/doc/CCE/s/glto9zt0l
CoreDNS high availability not ensured Check if there are at least 2 CoreDNS replicas and they are deployed on different nodes.
If either condition is not met, CoreDNS lacks high availability and has a single point of failure. CoreDNS becomes unavailable when its node goes down or restarts, affecting business.
Ensure there are at least 2 CoreDNS replicas, and distribute them across different nodes.
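The high availability condition above has two parts: enough replicas, and replicas spread across nodes. A minimal sketch of that check follows, assuming you already have a mapping from running CoreDNS pod names to the nodes they run on (for example, from the NODE column of `kubectl get pods -n kube-system -o wide`). The function name and the pod and node names are hypothetical, and "different nodes" is interpreted here as at least two distinct nodes.

```python
def coredns_is_highly_available(pod_to_node, min_replicas=2):
    """True when there are enough replicas and they span at least two nodes."""
    distinct_nodes = set(pod_to_node.values())
    return len(pod_to_node) >= min_replicas and len(distinct_nodes) >= 2


# Two replicas on the same node: replica count is fine, placement is not.
colocated = coredns_is_highly_available(
    {"coredns-a": "node-1", "coredns-b": "node-1"}
)

# Two replicas on different nodes.
spread = coredns_is_highly_available(
    {"coredns-a": "node-1", "coredns-b": "node-2"}
)
```

The first case fails the check even though it has 2 replicas, which is why the fix suggestion asks for both a replica count and a distribution across nodes.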
Abnormal DNS service Check if the cluster IP of the cluster DNS service is normally assigned.
DNS service anomalies cause cluster function failures and affect business.
Review the running status and logs of CoreDNS pods to diagnose and resolve DNS-related issues.
Abnormal APIServer BLB 6443 port listening Check the cluster API server BLB 6443 port listening configuration.
Anomalies prevent cluster access.
1. Go to the application BLB instance page, find the BLB associated with the cluster, and check the BLB instance’s listening configuration.
2. If no BLB instance is found, submit a CCE ticket.
API server BLB instance existence Check if the cluster APIServer load balancing instance exists.
Missing instances render the cluster unavailable.
1. Check if the BLB instance associated with the cluster exists (go to the application BLB instance page).
2. If no BLB instance is found, submit a CCE ticket.
Abnormal APIServer BLB instance status Check the cluster API server BLB instance status.
Anomalies affect cluster availability.
1. Go to the application BLB instance page, find the BLB instance associated with the cluster, and check the BLB instance status in the instance details.
2. If no BLB instance is found, submit a CCE ticket.
Kubelet version lower than control plane Check if the kubelet version is lower than the control plane version.
A lower kubelet version may cause compatibility and security issues.
Update the kubelet version.
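Version strings must be compared numerically, not as text, or v1.9 would sort after v1.10. The sketch below shows one way to compare a kubelet version with the control plane version. The function names and sample versions are hypothetical, and any build suffix after the patch number is ignored.

```python
import re


def parse_version(version):
    """Extract (major, minor, patch) from a string such as 'v1.28.3'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version: " + version)
    return tuple(int(part) for part in match.groups())


def kubelet_is_behind(kubelet_version, control_plane_version):
    """True when the kubelet version is lower than the control plane version."""
    return parse_version(kubelet_version) < parse_version(control_plane_version)


behind = kubelet_is_behind("v1.26.9", "v1.28.3")
```

The kubelet version of each node appears in the VERSION column of `kubectl get nodes`, and `kubectl version` reports the server version to compare against.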
Security group rule Check if inbound/outbound rules of node security groups meet cluster access requirements.
Inadequate rules may affect container network connectivity.
Access VPC Access Control > Security Groups to modify the rules as needed.
Node not associated with CCE security group Check if each cluster node is associated with the CCE security group.
Missing association may affect container network connectivity.
Locate the target BCC instance, view its details, select the network card on the instance’s security group page, and bind it to the CCE default security group.
Abnormal API server BLB 6443 port target group Check if the target group configuration for the APIServer BLB 6443 port is normal.
Anomalies prevent cluster access.
1. Go to the application BLB instance page, find the BLB associated with the cluster, and check the BLB instance’s target group configuration.
2. If no BLB instance is found, submit a CCE ticket.
Expired APIServer Loopback certificate Check if the APIServer Loopback certificate is expired.
Expiration may affect internal APIServer communication.
Restart the API Server.
Component risks Abnormal cluster component status Check if installed components (in Component Management) are in the expected state.
Abnormal status may disable component services and affect business.
Verify the status of the components.
Outdated cluster components Check if key cluster components need version updates.
New versions offer new features and better stability.
Perform a component upgrade.
Resource status Node status Checks for NotReady nodes in the cluster.
Abnormal node status prevents pod scheduling to the node.
Inspect the node status and scale nodes up or down as required.
Mismatched workload replicas Checks if the desired number of workload replicas matches the actual number.
A mismatch means the workload may not meet its high reliability requirements.
Identify abnormal workloads, address the issues, and update the replica count accordingly.
DaemonSet replica mismatch Check if the number of DaemonSet replicas matches the number of nodes.
Mismatch may cause related function anomalies.
Analyze the causes of the abnormal replica count, resolve any issues, and adjust the replica count.
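When analyzing a DaemonSet mismatch, it helps to separate two gaps: nodes the DaemonSet does not target at all (for example, because of node selectors or taints) and targeted nodes where the pod is not yet ready. The sketch below reads the standard `desiredNumberScheduled` and `numberReady` fields from a DaemonSet's status, such as the `status` section of `kubectl get daemonset <name> -o json`. The function name and sample numbers are hypothetical.

```python
def daemonset_replica_gaps(status, node_count):
    """Break a DaemonSet replica mismatch into its two possible causes."""
    desired = status.get("desiredNumberScheduled", 0)
    ready = status.get("numberReady", 0)
    return {
        # Nodes that the DaemonSet is not scheduled onto at all.
        "nodes_not_targeted": node_count - desired,
        # Targeted nodes whose DaemonSet pod is not ready.
        "pods_not_ready": desired - ready,
    }


# Hypothetical cluster with 6 nodes and a DaemonSet status reported by the API.
gaps = daemonset_replica_gaps(
    {"desiredNumberScheduled": 5, "numberReady": 4}, node_count=6
)
```

In this example one node is not targeted and one targeted pod is not ready. The first gap may be intentional, while the second usually points to a pod that needs troubleshooting.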
