Evacuate Faulty Instance
This document explains how to redeploy malfunctioning instances through evacuation.
Function description
The evacuation process is a key method for improving service availability and ensuring stability.
If you use EBC (Elastic Baremetal Compute) or EHC (Elastic High-performance Computing Cluster) instances in baremetal form, Baidu AI Cloud provides an evacuation capability to recover instances that encounter CRITICAL-level alert events or become unresponsive due to unexpected operations. This capability restores your instance on a fault-free host while keeping all key information consistent with the original instance, including:
- Instance ID, name, hostname, and other basic instance information
- VPC, subnet, and IP address, as well as the secondary IP addresses of the primary network interface and the elastic network interface IPs
- RDMA IP (if applicable) and other information
- Attachment status of cloud disks (data disks), elastic network interfaces, etc.
Prerequisites
To use the evacuation function normally, ensure that every data disk entry in the /etc/fstab configuration file of your EBC/EHC instance includes the nofail option. You can follow the example below and edit the file with vim to add the nofail option to each data disk entry:
/dev/nvme0n1 /data1 ext4 defaults,barrier=0,nofail 0 0
| Parameters | Description |
|---|---|
| /dev/nvme0n1 | Device name of the local disk. You can view it inside the instance with commands such as df -h or lsblk. The format varies by local disk type and follows Linux device naming, e.g., /dev/sda, /dev/vdb, or /dev/nvme0n1, with no partition number. |
| /data1 | Mount point of the local disk. You can find it by filtering the output of the mount command with grep, or directly from the MOUNTPOINT column of lsblk. |
| ext4 | File system type. You can query it with the blkid /dev/nvme0n1 command; it is ext4 by default. |
| barrier=0 | Mount option that disables write barriers in the file system. |
| nofail | Mount option that keeps the instance's startup process from being interrupted when a local disk listed in /etc/fstab is physically missing. |
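
The check described above can be automated before requesting an evacuation. The following is a minimal sketch, not an official Baidu AI Cloud tool: it parses fstab text, reports data disk entries that lack nofail, and returns a corrected copy. The function names and the rule for deciding what counts as a data disk (anything other than the root, /boot, swap, and pseudo file systems) are assumptions made for illustration.

```python
# Assumption for this sketch: these mount points and file system types
# are treated as "not a data disk" and are never modified.
SYSTEM_MOUNTS = {"/", "/boot", "/boot/efi"}
SKIP_FSTYPES = {"swap", "proc", "sysfs", "tmpfs", "devpts"}


def parse_fstab(text):
    """Yield (line_no, fields) for each non-comment, non-blank fstab entry."""
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        if len(fields) >= 4:  # device, mount point, fs type, options
            yield line_no, fields


def missing_nofail(text):
    """Return (line_no, device, mount_point) for data disk entries without nofail."""
    problems = []
    for line_no, fields in parse_fstab(text):
        device, mount_point, fstype, options = fields[:4]
        if mount_point in SYSTEM_MOUNTS or fstype in SKIP_FSTYPES:
            continue
        if "nofail" not in options.split(","):
            problems.append((line_no, device, mount_point))
    return problems


def add_nofail(text):
    """Return fstab text with nofail appended to flagged entries.

    Modified lines are re-joined with single spaces; other lines are kept as-is.
    """
    flagged = {line_no for line_no, _, _ in missing_nofail(text)}
    out = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if line_no in flagged:
            fields = line.split()
            fields[3] = fields[3] + ",nofail"
            line = " ".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"
```

To use it, read the contents of /etc/fstab into a string, pass it to missing_nofail to list the entries that need attention, and review the output of add_nofail before writing anything back. Keeping a copy of the original file is advisable, since a malformed /etc/fstab can itself prevent the instance from booting.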
Usage restrictions
- Because the evacuation feature places specific requirements on the underlying resources and typically requires support from a technical account manager, it is currently available only to selected users. In the future, this capability will be offered as an additional feature of the repair platform to a broader range of users.
- Instance evacuation can fail. To prevent abnormal deletion or data loss, Baidu AI Cloud automatically rolls back your instance if the evacuation fails.
- Currently, only EBC/EHC products support evacuation actions for faulty instances.
- Data on the local disk will be completely lost after evacuating EBC/EHC instances. Please back up relevant data before using this feature.
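
Because local disk data does not survive evacuation, one way to back it up is to archive the local disk's mount point to a location that is not on the local disk. The sketch below uses Python's standard tarfile module; the function name is hypothetical, and the choice of destination (for example a cloud disk or remote storage) is an assumption you should adapt to your environment.

```python
import tarfile
from pathlib import Path


def backup_directory(source_dir, archive_path):
    """Archive source_dir into a gzip-compressed tar file.

    Returns the list of member names in the archive so the caller can
    verify that the expected files were captured.
    """
    source = Path(source_dir)
    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a directory")
    # Store entries under the directory's own name rather than its full path.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(str(source), arcname=source.name)
    # Re-open the archive to confirm it is readable and list its contents.
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

For example, backup_directory("/data1", "/mnt/backup/data1.tar.gz") would archive the mount point from the fstab example; both paths here are placeholders. Make sure the archive is written somewhere other than the local disk being evacuated.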
