CCE Node Problem Detector Description
Component introduction
In real-world Kubernetes cluster usage, nodes may experience various issues that impact overall service availability. In addition, the Kubernetes control plane does not detect node anomalies promptly by default, so instances may be scheduled onto unavailable nodes. Cluster users are often not immediately aware of node or instance failures, which delays corrective action. When such failures occur, users expect the platform to support automatic fault recovery, such as migrating instances away from faulty nodes.
Node-Problem-Detector enhances the cluster's node fault detection capabilities. After users install this component in the cluster, it runs as a DaemonSet, monitors node anomalies in real time, and reports the results to the upstream Kube-APIServer.
Component function
- Provide node fault detection capabilities
Supported fault reporting methods include:
- Node condition: a problem that may prevent Pods from running on this node
- Event: a temporary issue affecting the node, but meaningful for system diagnosis (see the sample event sketched after this list)
- Metrics: Prometheus metrics are generated for both of the above scenarios, available for collection and display
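For reference, a Kubernetes Event reported by the detector might look roughly like the following. This is a minimal, illustrative sketch: the reason, message, monitor name, and node name are assumptions and will vary with the actual detection rules.

```yaml
# Illustrative only: field values are assumptions, not actual CCE Node-Problem-Detector output.
apiVersion: v1
kind: Event
metadata:
  name: cce-node-example.task-hung        # hypothetical Event name
  namespace: default                      # node-related Events are typically recorded in "default"
type: Warning
reason: TaskHung                          # example reason; actual reasons depend on the detection rules
message: "task docker:1234 blocked for more than 120 seconds."   # example message
source:
  component: kernel-monitor               # the monitor that reported the problem (assumed)
  host: cce-node-example                  # the affected node (hypothetical name)
involvedObject:
  kind: Node
  name: cce-node-example
```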
Limitations
- Cluster version 1.18.9 or higher
Install component
- Sign in to the Baidu AI Cloud official website and enter the management console.
- Go to Product Services - Cloud Native - Cloud Container Engine (CCE) to access the CCE management console.
- Click on Cluster Management > Cluster List in the left navigation bar.
- Click on the target cluster name in the Cluster List page to navigate to the cluster management page.
- On the Cluster Management page, navigate to Component Management.
- In the component management list, find the CCE Node Problem Detector component and click Install.
- Click the OK button to finalize the component installation.

View cluster health check status
After installing the component, you can view fault statuses on the node health check interface.
- Sign in to the Baidu AI Cloud Official Website and enter the management console.
- Go to Product Services - Cloud Native - Cloud Container Engine (CCE) to access the CCE management console.
- Click on Cluster Management > Cluster List in the left navigation bar.
- Click on the target cluster name in the Cluster List page to navigate to the cluster management page.
- On the Cluster Management page, go to Node Management > Worker to access the node management interface.
- Click the Health Check icon to display the node's health check status.

- Review the health check statuses. Currently, only NetworkUnavailable and Ready states affect node availability.

Fault status
Node Conditions
After installing CCE-Node-Problem-Detector, the following conditions will be added to nodes (a sample of how these conditions appear in node status follows the table):
| Condition Type | Default value | Description | Impact |
|---|---|---|---|
| FrequentKubeletRestart | False | Whether Kubelet has restarted more than 5 times within 20 min | Affects container creation |
| FrequentContainerRuntimeRestart | False | Whether Docker or Containerd has restarted more than 5 times within 20 min | Affects container creation |
| ContainerRuntimeUnhealthy | False | Whether the node's container runtime is unavailable | Affects container creation |
| CorruptDockerOverlay2 | False | Whether the Docker Overlay2 file system has errors | Affects the normal operation of the Docker runtime |
| InodesPressure | False | Whether node inode usage exceeds 80% | Affects directory/file creation |
| KernelDeadlock | False | Whether there is a deadlock in the kernel | Container creation, deletion, and operation may be affected |
| ReadonlyFilesystem | False | Whether the file system is read-only | Processes cannot create new directories/files or perform write operations |
| GPUUnhealthy | False | Whether there is a GPU fault (EBC models with GPUs only) | 1. The GPU fails and becomes unavailable; 2. GPU training/inference tasks may be interrupted |
| NICUnhealthy | False | Whether there is a network interface card fault (EBC models with GPUs only) | 1. Network communication between nodes/Pods may be interrupted; 2. GPU training/inference tasks may be interrupted |
| MemoryUnhealthy | False | Whether there is a memory fault (EBC models only) | Memory becomes unavailable and tasks are interrupted |
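After installation, these conditions show up under a node's status.conditions, for example in the output of kubectl get node <node-name> -o yaml. The following is a minimal, illustrative sketch; the reason and message strings are assumptions and depend on the actual detection rules.

```yaml
# Illustrative only: reason/message values are assumptions, not guaranteed CCE NPD output.
status:
  conditions:
    - type: KernelDeadlock
      status: "False"                    # no kernel deadlock detected
      reason: KernelHasNoDeadlock        # assumed reason string
      message: kernel has no deadlock
    - type: GPUUnhealthy
      status: "True"                     # a GPU fault was detected on this node
      reason: GPUXidError                # assumed reason string
      message: "Xid 79 detected on GPU"  # assumed message
      lastHeartbeatTime: "2024-03-21T08:00:00Z"
      lastTransitionTime: "2024-03-21T07:58:00Z"
```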
Introduction to the EBC model fault detection
For elastic bare metal compute (EBC) instances, Node-Problem-Detector integrates with Baidu AI Cloud's hardware awareness component HAS-agent, adding hardware health detection for GPUs, RDMA network interface cards, CPUs, memory, etc.
Prerequisites
- Baidu AI Cloud's hardware awareness component HAS-agent is installed on the machine.
GPU detection
| Detection dimension | Description | Handling methods |
|---|---|---|
| GPU drop-out | The GPU drops out and can no longer be recognized | Update Node Condition, GPUUnhealthy: True |
| GPU memory | Scenarios such as EccError and RemappedPending | Update Node Condition, GPUUnhealthy: True |
| GPU link | Scenarios such as NVLink failure or bandwidth anomalies | Update Node Condition, GPUUnhealthy: True |
| XID | Xid fault of Nvidia GPU | Handling varies by XID as follows: 1. For XIDs 48, 62, 64, 74, 79, 95, 109, 122, 123, 124: Update Node Condition (GPUUnhealthy: True); 2. For other XIDs not listed above: Only print events |
| Other faults | Scenarios like overheating, abnormal power consumption, or driver anomalies | Update Node Condition, GPUUnhealthy: True |
CPU/memory detection
| Detection dimension | Description | Handling methods |
|---|---|---|
| CPU | Common scenarios such as CPU cache read/write errors | Only print events |
| Memory | Common scenarios like unrecoverable ECC faults | Update Node Condition, MemoryUnhealthy: True |
| Memory | Scenarios like recoverable ECC error storms and isolatable faults | Only print events |
Network interface card detection
| Detection dimension | Description | Handling methods |
|---|---|---|
| RDMA network interface card up/down | Frequent RDMA network interface card up/down | Update Node Condition, NICUnhealthy: True |
| RDMA network interface card slowdown | RDMA network interface card PCIe slowdown, failing to reach the designed speed | Update Node Condition, NICUnhealthy: True |
Implement fault alarms using monitoring metrics
Introduction to Node-Problem-Detector metrics
Node-Problem-Detector exposes the following Prometheus metrics:
| Metric name | Metric type | Metric description |
|---|---|---|
| problem_counter | counter | The cumulative number of times each fault has occurred |
| problem_gauge | gauge | Whether a specific type of fault exists at the current moment |
The LabelSet included in the metric is as follows:
- instance: Fault node
- namespace: The namespace where the NPD instance is deployed
- type: Fault type, corresponding to Node Condition Type
- reason: Fault reason, corresponding to Node Condition Reason
- region: The region where the node is located
Users can configure Prometheus monitoring and alerting rules based on the above metrics as needed.
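As a reference, the following is a minimal alerting-rule sketch based on problem_gauge. The rule group name, alert name, duration, and severity label are assumptions chosen for illustration, as is the example series shown in the comment.

```yaml
# Illustrative Prometheus rule file; names, thresholds, and the example series are assumptions.
groups:
  - name: node-problem-detector          # hypothetical rule group name
    rules:
      - alert: NodeProblemDetected       # hypothetical alert name
        # problem_gauge is 1 while a fault of the given type exists on the node, e.g.:
        # problem_gauge{instance="192.168.0.10", type="GPUUnhealthy", reason="GPUXidError"} 1
        expr: problem_gauge > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} reports {{ $labels.type }}"
          description: "Reason: {{ $labels.reason }} (namespace {{ $labels.namespace }}, region {{ $labels.region }})"
```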
Add a metric collection task for Node-Problem-Detector in CProm
Before adding metric collection rules for Node-Problem-Detector in CProm, you need to first associate the CCE Kubernetes cluster with a CProm instance and add a metric collection task. Operation references:
Add the following metric collection task:
```yaml
job_name: node-problem-detector
scrape_interval: 3s
kubernetes_sd_configs:
  - role: endpoints
relabel_configs:
  - source_labels: [ __meta_kubernetes_service_annotation_npd_prometheus_scrape ]
    action: keep
    regex: true
  - source_labels: [ __meta_kubernetes_service_annotation_npd_prometheus_scheme ]
    action: replace
    target_label: __scheme__
    regex: (https?)
  - source_labels: [ __meta_kubernetes_service_annotation_npd_prometheus_path ]
    action: replace
    target_label: __metrics_path__
    regex: (.+)
  - source_labels: [ __address__, __meta_kubernetes_service_annotation_npd_prometheus_port ]
    action: replace
    target_label: __address__
    regex: ([^:]+)(?::\d+)?;(\d+)
    replacement: $1:$2
  - source_labels: [ __meta_kubernetes_endpoint_node_name ]
    action: replace
    target_label: node
  - source_labels: [ __meta_kubernetes_namespace ]
    action: replace
    target_label: namespace
  - source_labels: [ __meta_kubernetes_service_name ]
    action: replace
    target_label: service
  - source_labels: [ __meta_kubernetes_pod_name ]
    action: replace
    target_label: pod
```
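The relabel rules above keep only endpoints whose backing Service carries matching annotations: Prometheus exposes a Service annotation as the meta label __meta_kubernetes_service_annotation_<name>, with non-alphanumeric characters replaced by underscores. The following is a minimal sketch of a Service that this job would scrape; the annotation keys, names, namespace, and port are assumptions for illustration, and the Service actually installed by the component may already carry suitable annotations.

```yaml
# Illustrative only: annotation keys, names, namespace, and port are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: node-problem-detector           # hypothetical Service name
  namespace: kube-system                # hypothetical namespace
  annotations:
    npd_prometheus_scrape: "true"       # matched by the "keep" rule above
    npd_prometheus_scheme: "http"       # rewritten into __scheme__
    npd_prometheus_path: "/metrics"     # rewritten into __metrics_path__
    npd_prometheus_port: "20257"        # rewritten into the scrape target port (assumed port)
spec:
  selector:
    app: node-problem-detector          # hypothetical Pod label
  ports:
    - name: metrics
      port: 20257
      targetPort: 20257
```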
Version records
| Version No. | Cluster version compatibility | Update time | Update content | Impact |
|---|---|---|---|---|
| 0.8.25 | v1.18+ | 2024-03-21 | The component adds fault detection for machine gid sequence jumps | -- |
| 0.8.24 | v1.18+ | 2024-03-01 | The component adds Kubelet fault detection; supports configuring to disable bce-instance-id in Metrics Labels; optimizes component configuration | -- |
| 0.8.23 | v1.18+ | 2024-01-08 | Adapt to the Ubuntu 22.04 OS; update the base image to reduce the image size from 2 G to 278 M; remove the dependency on the systemctl command; optimize dependencies on container runtimes (Docker and Containerd) | -- |
| 0.8.22 | v1.18+ | 2023-12-28 | Optimize MPS fault detection performance and support detection based on the user who starts the mps-server; provide an MPS fault detection switch (disabled by default, can be enabled on demand); users who have already used MPS detection need to re-enable MPS detection after upgrading | -- |
| 0.8.21 | v1.18+ | 2023-12-27 | The data source for network interface card fault detection is connected to has-agent; support configuring to ignore expected Xids when detecting GPU Xid faults | -- |
| 0.8.20 | v1.18+ | 2023-12-07 | Support MPS fault detection and monitoring | -- |
| 0.8.19 | v1.18+ | 2023-11-30 | Fault detection collection tasks support collecting the short ID of the instance corresponding to the node | -- |
| 0.8.18 | v1.18+ | 2023-11-23 | Optimize NPD deployment affinity configuration and monitoring configuration | -- |
| 0.8.17 | v1.18+ | 2023-11-20 | Optimize the timeout duration for memory fault monitoring; fix the error when accessing has-agent on non-EBC nodes where NPD is deployed but no HAS service exists | -- |
| 0.8.16 | v1.18+ | 2023-10-17 | The data source for GPU Xid fault detection is connected to has-agent, and kernel log retrieval is canceled; custom ignoring of Xids is no longer supported in GPU fault detection | -- |
| 0.8.15 | v1.18+ | 2023-08-11 | Support CPU and memory fault detection, available only for EBC models, adding the fault types CPUUnhealthy and MemoryUnhealthy; optimize GPU Xid fault detection logic; optimize RDMA network interface card fault detection logic | -- |
| 0.8.14 | v1.18+ | 2023-06-03 | Bugfix for network interface card fault detection logic | -- |
| 0.8.13 | v1.18+ | 2023-04-08 | NPD supports GPU fault detection and network interface card fault detection, available only for EBC models, adding the fault types GPUUnhealthy and NICUnhealthy; introduce community optimizations and bugfixes | -- |
| 0.8.12 | v1.18+ | 2023-03-07 | NPD is released through the component center. The following fault detections are supported: 1. FrequentKubeletRestart 2. FrequentContainerRuntimeRestart 3. ContainerRuntimeUnhealthy 4. CorruptDockerOverlay2 5. InodesPressure 6. KernelDeadlock 7. ReadonlyFilesystem | -- |
