NodeGroup Node Fault Detection and Self-Healing
Nodes are the backbone of a cluster, and their stability is crucial for uninterrupted service delivery. Unstable infrastructure and unpredictable environments can cause node failures in many ways. To reduce maintenance effort, Cloud Container Engine (CCE) offers built-in failure detection and self-healing for nodes. This guide explains how to configure both features.
Function overview
In Cloud Container Engine (CCE), node failure detection integrates with the BCC repair platform to automatically detect and self-heal failures on node servers. For more information, refer to Repair Platform Overview.
Detection item introduction
| Category | Item | Description | Recommended self-healing actions |
|---|---|---|---|
| Node server hardware failure | Hard disk hardware failure | A server hard disk hardware failure that requires repair for recovery | Node cordoning, node drain, node removal, repair authorization |
| Node server hardware failure | Other hardware failures | Server CPU, GPU, or other hardware failures that require repair for recovery | Node cordoning, node drain, repair authorization, recovery detection |
Description
- Node server hardware failure: Hardware issues that disrupt normal service operations and necessitate the repair or replacement of hardware components.
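The detection table can be read as a mapping from each detection item to an ordered list of recommended actions. The following is a minimal sketch of that mapping in Python. The names `RECOMMENDED_ACTIONS` and `recommended_actions` and the string keys are illustrative assumptions that mirror the table; they are not identifiers from any CCE API.

```python
from typing import Dict, List

# Illustrative only: keys and action names mirror the detection table.
# They are NOT identifiers from the CCE API.
RECOMMENDED_ACTIONS: Dict[str, List[str]] = {
    "Hard disk hardware failure": [
        "Node cordoning", "Node drain", "Node removal", "Repair authorization",
    ],
    "Other hardware failures": [
        "Node cordoning", "Node drain", "Repair authorization", "Recovery detection",
    ],
}


def recommended_actions(item: str) -> List[str]:
    """Return a copy of the recommended self-healing actions for a detection item."""
    try:
        return list(RECOMMENDED_ACTIONS[item])
    except KeyError:
        raise ValueError(f"unknown detection item: {item!r}") from None
```

Note that, per the table, the hard disk recommendation removes the node from the cluster, while the recommendation for other hardware failures keeps the node and ends with recovery detection.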
Self-healing operations introduction
| Operation item | Description |
|---|---|
| Node cordoning | The node is marked unschedulable, so new pods cannot be scheduled onto it. |
| Node drain | Pods running on the node are evicted. |
| Node removal | The node is removed from the cluster, with the option to either retain or synchronously delete the server instance. |
| Repair authorization | Repair of the server is authorized (automatically or manually), and repair records are generated. Users can check the repair progress on the Repair Platform. |
| Recovery detection | The node cordon is automatically released once the failure is detected as successfully repaired. |
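In standard Kubernetes terms, cordoning a node corresponds to setting `spec.unschedulable` to `true` on the Node object (the effect of `kubectl cordon`), and releasing the cordon sets it back to `false`. The sketch below builds the corresponding patch bodies. It assumes CCE follows this standard behavior, which this guide does not state explicitly, and the function name `cordon_patch` is hypothetical.

```python
import json


def cordon_patch(unschedulable: bool) -> str:
    """Build a JSON patch body for a Kubernetes Node object.

    True  -> cordon the node (no new pods are scheduled onto it).
    False -> release the cordon, as recovery detection would after repair.
    """
    return json.dumps({"spec": {"unschedulable": unschedulable}})


print(cordon_patch(True))
print(cordon_patch(False))
```

Cordoning alone does not affect pods that are already running on the node, which is why the recommended actions pair it with a node drain.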
Operation steps
Prerequisites
- The CCE Node Problem Detector and CCE Node Remedier have been installed in the Cluster Component Management module.
Step 1: Create detection and self-healing rules
- Sign in to Cloud Container Engine Console (CCE).
- On the Cluster List page, click the target cluster name, and select Node Management > Failure Self-healing in the left navigation bar.
- Click Create Self-Healing Rules to enter the Create Self-Healing Rules page and complete the configuration.
| Item | Description |
|---|---|
| Rule name | Set a name for the detection and self-healing rules. |
| Rule configuration | Choose the detection items from the predefined detection list and configure whether self-healing should activate upon anomalies. |
| Self-healing configuration | Define customized self-healing actions for the chosen detection item. |
- After configuration, check I acknowledge and have installed HAS-agent, then click OK to complete the configuration.
Description
- Node server hardware failure detection relies on HAS-agent, which must be installed on the server. Make sure that HAS-agent is installed successfully; otherwise, failures cannot be detected. For more information, see HAS-agent Component Installation and Upgrade.
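To make the relationship between the three configuration items concrete, the sketch below models a rule as a small data structure. All class names, field names, and validation checks here are assumptions made for illustration; the console may enforce different constraints, and this is not the CCE API schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DetectionItemConfig:
    """Hypothetical per-item settings: the self-healing switch plus actions."""
    self_healing_enabled: bool = False
    actions: List[str] = field(default_factory=list)


@dataclass
class SelfHealingRule:
    """Hypothetical model of a rule: a name plus the chosen detection items."""
    name: str
    items: Dict[str, DetectionItemConfig] = field(default_factory=dict)

    def validate(self) -> None:
        # Assumed checks, not documented console behavior.
        if not self.name.strip():
            raise ValueError("rule name must not be empty")
        if not self.items:
            raise ValueError("select at least one detection item")
        for item, cfg in self.items.items():
            if cfg.self_healing_enabled and not cfg.actions:
                raise ValueError(f"{item}: self-healing is enabled but no actions are set")


rule = SelfHealingRule(
    name="disk-failure-rule",
    items={
        "Hard disk hardware failure": DetectionItemConfig(
            self_healing_enabled=True,
            actions=["Node cordoning", "Node drain", "Node removal", "Repair authorization"],
        )
    },
)
rule.validate()  # raises ValueError if the rule is inconsistent
```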
Step 2: Bind self-healing rules to node group
You can bind detection and self-healing rules when creating a node group or when editing an existing one. The following steps use node group creation as an example.
- Sign in to Cloud Container Engine Console (CCE).
- On the Cluster List page, click the target cluster name, and select Node Management > Node Group in the left navigation bar.
- Click Create Node Group to create a node group, and bind detection and self-healing rules in the Failure Detection and Self-Healing section.
| Item | Description |
|---|---|
| Failure detection and self-healing | Decide whether to activate the node detection and self-healing feature. |
| Failure self-healing rules | Choose the self-healing rules to associate with the node group. |
- Click OK to complete binding.
Subsequent operations
View the self-healing task details.
- Sign in to Cloud Container Engine Console (CCE).
- On the Cluster List page, click the target cluster name, and select Node Management > Node Group in the left navigation bar.
- On the Node Group List page, click the target node group to open its management page.
- Select the Self-Healing Activities tab to view task records and details of the self-healing operations triggered by failures detected on nodes in the current node group.
