NodeGroup Node Fault Detection and Self-Healing

Updated at: 2025-10-27

Nodes are the backbone of a cluster, so their operational stability is crucial for uninterrupted service delivery. Unstable infrastructure and unpredictable environments can cause node failures in many ways. To reduce maintenance effort, Cloud Container Engine (CCE) offers built-in failure detection and self-healing for nodes. This guide explains how to set up the node detection and self-healing features.

Function overview

In Cloud Container Engine (CCE), node failure detection integrates with the BCC Repair Platform to automatically detect and self-heal node server failures. For more information, refer to Repair Platform Overview.

Detection item introduction

Category	Item	Description	Recommended self-healing actions
Node server hardware failure	Hard disk hardware failure	Server hard disk hardware failure requires repair for recovery	Node cordoning, node drain, node removal, repair authorization
	Other hardware failures	Server CPU, GPU or other hardware failures require repair for recovery	Node cordoning, node drain, repair authorization, recovery detection

Description

Node server hardware failure: Hardware issues that disrupt normal service operations and necessitate the repair or replacement of hardware components.
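The detection item table above can be read as a lookup from detection item to recommended self-healing actions. The following is a minimal Python sketch of that lookup. The dictionary name and the helper function are illustrative assumptions and are not part of any CCE API; only the item and action names come from the table.

```python
from typing import Dict, List

# Illustrative only: this mapping mirrors the detection item table.
# The variable and function names are hypothetical, not CCE identifiers.
RECOMMENDED_ACTIONS: Dict[str, List[str]] = {
    "Hard disk hardware failure": [
        "Node cordoning", "Node drain", "Node removal", "Repair authorization",
    ],
    "Other hardware failures": [
        "Node cordoning", "Node drain", "Repair authorization", "Recovery detection",
    ],
}


def recommended_actions(item: str) -> List[str]:
    """Return a copy of the recommended actions for a detection item."""
    if item not in RECOMMENDED_ACTIONS:
        raise ValueError(f"unknown detection item: {item!r}")
    return list(RECOMMENDED_ACTIONS[item])
```

Read this way, the two items differ in one step: a hard disk failure recommends removing the node, while other hardware failures recommend keeping the node and waiting for recovery detection.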

Self-healing operations introduction

Operation item	Description
Node cordoning	The node is marked unschedulable, so new pods cannot be scheduled onto it
Node drain	Pods running on the node are evicted
Node removal	The node will be removed from the cluster with the option to either retain or synchronously delete the server instance.
Repair authorization	Servers will undergo authorized repair (auto or manual), and repair records are generated; users can check the repair progress on the Repair Platform
Recovery detection	The node is automatically uncordoned once the failure is detected as successfully repaired
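To make the effect of these operations concrete, here is a small Python sketch that applies cordoning, draining, and recovery detection to a toy in-memory node. It is a simplified model under stated assumptions, not how CCE Node Remedier is implemented: the `Node` class and the three functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy in-memory stand-in for a cluster node (hypothetical model)."""
    name: str
    schedulable: bool = True
    pods: List[str] = field(default_factory=list)


def cordon(node: Node) -> None:
    # Node cordoning: no new pods may be scheduled onto the node.
    node.schedulable = False


def drain(node: Node) -> List[str]:
    # Node drain: evict every pod currently on the node and report them.
    evicted, node.pods = node.pods, []
    return evicted


def recover(node: Node, repaired: bool) -> None:
    # Recovery detection: release the cordon only after a successful repair.
    if repaired:
        node.schedulable = True


node = Node("node-1", pods=["web-0", "web-1"])
cordon(node)
evicted = drain(node)
recover(node, repaired=True)
print(node.schedulable, evicted)
```

In this model the node stays cordoned for as long as `recover` is called with `repaired=False`, which matches the table: the cordon is released only after the repair is detected as successful.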

Operation steps

Prerequisites

The CCE Node Problem Detector and CCE Node Remedier have been installed in the Cluster Component Management module.

Step 1: Create detection and self-healing rules

Sign in to Cloud Container Engine Console (CCE).
On the Cluster List page, click the target cluster name, and select Node Management > Failure Self-healing in the left navigation bar.
Click Create Self-Healing Rules to enter the Create Self-Healing Rules page and complete the configuration.

Item	Description
Rule name	Set a name for the detection and self-healing rules.
Rule configuration	Choose the detection items from the predefined detection list and configure whether self-healing should activate upon anomalies.
Self-healing configuration	Define customized self-healing actions for the chosen detection item.

After configuration, select the "I acknowledge and have installed HAS-agent" checkbox, then click OK to save the rule.

Description

Node server hardware failure detection relies on HAS-agent, which must be installed on the server. If HAS-agent is not installed successfully, failures cannot be detected. For more information, see HAS-agent Component Installation and Upgrade.
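The three fields configured in Step 1 can be pictured as a small record. The Python sketch below is only a mental model: the class name, field names, and the completeness checks are assumptions made for illustration, and it treats the self-healing switch as one setting per rule for simplicity. It does not reflect the actual CCE rule schema or the console's validation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SelfHealingRule:
    """Hypothetical record of the console fields described in Step 1."""
    name: str                    # Rule name
    detection_items: List[str]   # Rule configuration: chosen detection items
    self_healing_enabled: bool   # Rule configuration: act on anomalies or not
    actions: List[str] = field(default_factory=list)  # Self-healing configuration

    def problems(self) -> List[str]:
        """Return issues found; an empty list means the rule looks complete."""
        issues = []
        if not self.name.strip():
            issues.append("rule name is empty")
        if not self.detection_items:
            issues.append("no detection item selected")
        if self.self_healing_enabled and not self.actions:
            issues.append("self-healing is enabled but no action is configured")
        return issues


rule = SelfHealingRule(
    name="disk-failure-rule",
    detection_items=["Hard disk hardware failure"],
    self_healing_enabled=True,
    actions=["Node cordoning", "Node drain", "Repair authorization"],
)
print(rule.problems())
```

A rule with detection enabled but self-healing switched off would carry no actions and still pass these checks, which corresponds to a detect-only configuration.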

Step 2: Bind self-healing rules to node group

You can bind detection and self-healing rules when creating a node group or when editing an existing one. The steps below use node group creation as an example.

Sign in to Cloud Container Engine Console (CCE).
On the Cluster List page, click the target cluster name, and select Node Management > Node Group in the left navigation bar.
Click Create Node Group, then bind the detection and self-healing rules in the Failure Detection and Self-Healing section.

Item	Description
Failure detection and self-healing	Decide whether to activate the node detection and self-healing feature.
Failure self-healing rules	Choose the self-healing rules to associate with the node group.

Click OK to complete binding.

Subsequent operations

View the self-healing task details.

Sign in to Cloud Container Engine Console (CCE).
On the Cluster List page, click the target cluster name, and select Node Management > Node Group in the left navigation bar.
On the Node Group List page, click the target node group to open its management page.
Select the Self-Healing Activities tab to view task records and details of self-healing operations triggered by failures detected in all nodes of the current node group.
