Cluster Inspection
Cloud Container Engine (CCE) provides a cluster inspection feature to help identify potential risks in container service clusters. The inspection items, which are updated regularly, cover resource quotas, cluster risks, resource status, and more. The feature also provides solutions for abnormal inspection results, improving O&M efficiency. This document explains how to use cluster inspection to detect potential issues in a cluster.
Prerequisites
- A CCE Kubernetes cluster has been created. For details, see Create Cluster.
- The Kubernetes cluster is running normally. Log in to the CCE console, go to the Cluster List page, and check the status of the target cluster. If the status is Running, the cluster is working as expected. You can also verify this from the command line, as shown in the sketch after this list.
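If you prefer the command line, the same readiness check can be made with kubectl. A minimal sketch, assuming your kubeconfig already points at the target CCE cluster:

```bash
# Assumes kubeconfig is already configured for the target CCE cluster.
# Confirm that the API server responds.
kubectl cluster-info

# List nodes; every node should report STATUS "Ready".
kubectl get nodes -o wide

# Print only nodes that are not Ready (no output means all nodes are healthy).
kubectl get nodes --no-headers | awk '$2 != "Ready" {print $1, $2}'
```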
Enable cluster inspection
Important: When using the cluster inspection feature, certain checks launch containers in your cluster to gather data. The collected data includes the system version, system load, the operational status of Docker and kubelet, and critical error information from system logs. The collection process does not touch your business data or other sensitive information.
- Log in to the Baidu AI Cloud Management Console, navigate to Product Services > Cloud Native > Cloud Container Engine (CCE), and click Cluster Management > Cluster List to access the cluster list page.
- Click the name of the target cluster. In the left navigation bar, select Inspection & Diagnosis > Cluster Inspection.
- In the upper-right corner of the Cluster Inspection page, click Automatic Inspection & Subscription Settings to set up the schedule for automatic inspections, as well as the delivery time and method for subscribed inspection reports.
- Alternatively, you can manually inspect the cluster by clicking Start Inspection on the Cluster Inspection page. Once the inspection is complete, the relevant information will be displayed in the Report List section.
View inspection results
- Log in to the Baidu AI Cloud Management Console, navigate to Product Services > Cloud Native > Cloud Container Engine (CCE), and click Cluster Management > Cluster List to access the cluster list page.
- Click the name of the target cluster. In the left navigation bar, select Inspection & Diagnosis > Cluster Inspection.
- In the inspection report list on the Cluster Inspection page, click the ID of the target inspection report to view its details.
- Cluster inspection risks are categorized by severity: low, medium, and high. If an inspection item fails to execute for unknown reasons, its status displays as Unknown; if needed, you can submit a support ticket.
- Detailed cluster inspection content includes the risk level, name of the risk item, impact of anomalies, and solutions. For more information on common risk warnings and remedies for cluster inspections, refer to Cluster Inspection Items and Solutions.
- On the inspection report page, review the risk items, their impacts, and the recommended solutions.
Cluster inspection items and solutions
| Type | Inspection item | Impact of anomaly | Recommendation |
|---|---|---|---|
| Resource quota | Tight VPC routing rule quota | Checks whether the remaining route table entry quota in the VPC is less than 5. In VPC routing mode, each cluster node consumes one route table rule; when route table rules are exhausted, new nodes cannot be added to the cluster. (Clusters in VPC-ENI mode do not use VPC route tables.) | Go to the Quota Center to apply for an increase in the VPC routing rule quota. |
| | Tight EIP instance quota | Checks whether the number of personal/enterprise EIP instances that can still be created in the cluster's region is less than 5. Insufficient EIP quota may affect public network access for clusters, nodes, and services. | Go to the Quota Center to apply for an increase in the EIP instance quota. |
| | Tight ENI instance quota | Checks whether the number of elastic network interfaces that can be created (but not yet attached to instances) per VPC is less than 5. Insufficient ENI quota may prevent nodes from being created and added. | Go to the Quota Center to apply for an increase in the ENI instance quota. |
| | Tight BLB instance quota | Checks whether the number of BLB instances that can still be created in the cluster's region is less than 5. Insufficient BLB quota may affect the creation of Services and Ingresses. | Go to the Quota Center to apply for an increase in the BLB instance quota. |
| | Tight on-demand BCC instance quota | Checks whether the number of on-demand BCC instances in the cluster's region exceeds 95% of the quota. Insufficient quota affects node creation. | Go to the Quota Center to apply for an increase in the on-demand BCC instance quota. |
| | Tight CDS capacity | Checks whether CDS disk usage in the cluster's region exceeds 95% of the total capacity (in TB). Insufficient quota affects node and persistent volume creation. | Go to the Quota Center to apply for an increase in the CDS capacity quota. |
| | Tight available stock for node group instance specifications | Checks whether the available stock of the node group's instance specification is less than 15. Insufficient stock may affect scaling of the node group. | Submit a BCC ticket to increase the available stock of the instance specification, or switch to another BCC instance specification. |
| Resource utilization | Insufficient cluster allocatable CPU | Checks whether allocated CPU on nodes exceeds 80%. If allocatable CPU is less than a pod's request value, new pods cannot be created. | 1. Increase the number of nodes. 2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource quotas (request, limit); see the resource sketch below this table. |
| | Insufficient cluster allocatable memory | Checks whether allocated memory on nodes exceeds 80%. If allocatable memory is less than a pod's request value, new pods cannot be created. | 1. Increase the number of nodes. 2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource quotas (request, limit). |
| | High node real-time CPU usage | Checks whether node CPU usage exceeds 80%. Excessively high usage may cause CPU resource contention and affect normal business operations. | 1. Increase the number of nodes. 2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource quotas (request, limit). |
| | High node real-time memory utilization | Checks whether node memory usage exceeds 80%. Excessively high usage may cause OOM (out of memory) errors and affect normal business operations. | 1. Increase the number of nodes. 2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource quotas (request, limit). |
| | High pod CPU usage | Checks whether the CPU load of a workload exceeds 95%. Excessively high load may cause CPU resource contention and affect normal business operations. | Edit the workload YAML via the Workloads page or kubectl, find the resources field, and adjust the resource quotas (request, limit). Increase the desired number of pods via the Workloads page (click Scale) or by editing the workload YAML via kubectl. Configure an auto scaling policy (HPA); see the HPA sketch below this table. |
| | High pod memory utilization | Checks whether the memory load of a workload exceeds 95%. Excessively high load may cause OOM (out of memory) errors and affect normal business operations. | Edit the workload YAML via the Workloads page or kubectl, find the resources field, and adjust the resource quotas (request, limit). Increase the desired number of pods via the Workloads page (click Scale) or by editing the workload YAML via kubectl. Configure an auto scaling policy (HPA). |
| | High node disk usage | Checks whether node disk usage exceeds 80%. Excessively high usage may cause pod eviction and affect normal business operations. | Clean up temporary files, or increase the disk capacity. |
| | High node root directory usage | Checks whether node root directory usage exceeds 80%. Excessively high usage may affect normal business operations. | Clean up temporary files, or increase the disk capacity. |
| | High cluster PFS usage | Checks whether PFS usage exceeds 80%. When PFS usage reaches 100%, no new data can be written to the file system, affecting normal business operations. | Submit a PFS ticket to request capacity expansion. |
| | Insufficient remaining pod CIDR blocks (VPC routing mode) | Checks whether the remaining available pod CIDR blocks in the cluster (VPC routing mode) number fewer than 5. Each node consumes one pod CIDR block, so fewer than 5 remaining blocks means fewer than 5 new nodes can be added. Pods on new nodes will not function properly once pod CIDR blocks are depleted. | Submit a CCE ticket to request capacity expansion. |
| | Insufficient remaining subnet IPs (VPC-ENI mode) | Checks whether the remaining IPs in the cluster's assigned subnet (VPC-ENI mode) number fewer than 10. Each pod requires one IP. When IP resources are depleted, new pods cannot be assigned IPs and will fail to start properly. | Go to the CCE cluster details page, locate the container network, and add an appropriate subnet. |
| | High weekly node CPU usage | Checks whether average node CPU usage over the past week exceeds 80%. Excessively high usage may cause CPU resource contention and affect normal business operations. | 1. Increase the number of nodes. 2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource quotas (request, limit). |
| | High weekly node memory utilization | Checks whether average node memory usage over the past week exceeds 80%. Excessively high usage may cause OOM (out of memory) errors and affect normal business operations. | 1. Increase the number of nodes. 2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource quotas (request, limit). |
| | High daily node CPU usage | Checks whether average node CPU usage over the past day exceeds 80%. Excessively high usage may cause CPU resource contention and affect normal business operations. | 1. Increase the number of nodes. 2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource quotas (request, limit). |
| | High daily node memory utilization | Checks whether average node memory usage over the past day exceeds 80%. Excessively high usage may cause OOM (out of memory) errors and affect normal business operations. | 1. Increase the number of nodes. 2. Obtain the pod YAML via Workloads > Pods or kubectl, find the resources field, and check the pod resource quotas (request, limit). |
| Cluster risks | Outdated Kubernetes version | Checks whether the cluster's Kubernetes version is about to expire or has expired. CCE ensures stable operation only for the latest three even-numbered Kubernetes versions. Outdated clusters risk unstable operation and upgrade failures. | See Upgrade Cluster Kubernetes Version. |
| | Excessive node count | Checks whether the number of cluster nodes exceeds the cluster specification limit. Exceeding the limit may exhaust control plane resources and cause node group scaling failures. | Submit a CCE ticket to upgrade the cluster specification. |
| | Cluster deletion protection disabled | Checks whether cluster deletion protection is enabled. If disabled, the cluster may be accidentally deleted via the console or API, causing business failures. | Enable cluster deletion protection (go to Cluster Details > Basic Info > Cluster Deletion Protection). |
| | Audit logs disabled | Checks whether audit logs are enabled. Enabling cluster audit logs facilitates daily troubleshooting. | Enable cluster auditing. |
| | Insufficient ready worker nodes | Checks whether the number of ready worker nodes in the cluster is less than 2. A cluster with a single node has a single point of failure. | Add a node to the cluster. |
| | Abnormal CoreDNS status | Checks whether the CoreDNS component is in a non-Running state. Component anomalies cause intra-cluster DNS resolution errors, preventing access via Service names. | Check the status of the CoreDNS component and fix any detected anomalies; see the CoreDNS sketch below this table. |
| | Outdated CoreDNS version | Checks whether CoreDNS is at the latest version. An outdated CoreDNS version may cause business DNS resolution issues; the latest version offers better stability and new features. | Upgrade CoreDNS (in the cluster's left navigation pane, go to O&M & Management > Component Management > Network > CoreDNS, and click Upgrade in the component's lower-right corner). For manual upgrade instructions, see https://cloud.baidu.com/doc/CCE/s/glto9zt0l. |
| | CoreDNS high availability not ensured | Checks whether there are at least 2 CoreDNS replicas and whether the replicas are deployed on different nodes. Otherwise, CoreDNS lacks high availability and faces single-point-of-failure risks: it becomes unavailable when its node goes down or restarts, affecting business. | Ensure there are at least 2 CoreDNS replicas and distribute them across different nodes. |
| | Abnormal DNS service | Checks whether the cluster IP of the cluster DNS Service is assigned normally. DNS service anomalies cause cluster function failures and affect business. | Review the running status and logs of the CoreDNS pods to diagnose and resolve DNS-related issues. |
| | Abnormal APIServer BLB 6443 port listener | Checks the listener configuration of the cluster APIServer BLB on port 6443. Anomalies prevent cluster access. | 1. Go to the application BLB instance page, find the BLB associated with the cluster, and check the BLB instance's listener configuration. 2. If no BLB instance is found, submit a CCE ticket. |
| | APIServer BLB instance existence | Checks whether the cluster APIServer load balancing instance exists. A missing instance renders the cluster unavailable. | 1. Check whether the BLB instance associated with the cluster exists (go to the application BLB instance page). 2. If no BLB instance is found, submit a CCE ticket. |
| | Abnormal APIServer BLB instance status | Checks the status of the cluster APIServer BLB instance. Anomalies affect cluster availability. | 1. Go to the application BLB instance page, find the BLB instance associated with the cluster, and check its status in the instance details. 2. If no BLB instance is found, submit a CCE ticket. |
| | Kubelet version lower than control plane | Checks whether the kubelet version is lower than the control plane version. A lower kubelet version may cause compatibility and security issues. | Update the kubelet version; see the version-check sketch below this table. |
| | Security group rules | Checks whether the inbound/outbound rules of the node security groups meet cluster access requirements. Inadequate rules may affect container network connectivity. | Go to VPC Access Control > Security Groups and modify the rules as needed. |
| | Node not associated with CCE security group | Checks whether cluster nodes are associated with the CCE security group. A missing association may affect container network connectivity. | Locate the target BCC instance, view its details, select the network card on the instance's security group page, and bind it to the CCE default security group. |
| | Abnormal APIServer BLB 6443 port target group | Checks whether the target group configuration for the APIServer BLB port 6443 is normal. Anomalies prevent cluster access. | 1. Go to the application BLB instance page, find the BLB associated with the cluster, and check the BLB instance's target group configuration. 2. If no BLB instance is found, submit a CCE ticket. |
| | Expired APIServer loopback certificate | Checks whether the APIServer loopback certificate has expired. Expiration may affect internal APIServer communication. | Restart the APIServer. |
| Component risks | Abnormal cluster component status | Checks whether installed components (in Component Management) are in the expected state. Abnormal status may disable component services and affect business. | Verify the status of the components. |
| | Outdated cluster components | Checks whether key cluster components need version updates. New versions offer new features and better stability. | Perform a component upgrade. |
| Resource status | Node status | Checks for NotReady nodes in the cluster. Abnormal node status prevents pods from being scheduled to the node. | Inspect the node status and scale nodes up or down as required. |
| | Mismatched workload replicas | Checks whether the desired number of workload replicas matches the actual number. A mismatch fails to meet high-reliability requirements. | Identify the abnormal workloads, address the issues, and update the replica count accordingly. |
| | DaemonSet replica mismatch | Checks whether the number of DaemonSet replicas matches the number of nodes. A mismatch may cause anomalies in related functions. | Analyze the causes of the abnormal replica count, resolve any issues, and adjust the replica count. |
