High Availability

Last Updated: 2021-05-06

Service High Availability

Service high availability is provided by the XMaster, XAgent, Noah, and cold backup service modules. Together with OPS staff on duty 24/7, these modules keep the data link service available, protect data, and handle database exceptions. When an exception cannot be repaired automatically, an alarm is raised and the OPS staff are notified.

In regions that support multi-AZ, you can also create a multi-AZ instance and choose a suitable data replication mode, which further improves the high availability of the RDS service.

XMaster

The XMaster module is deployed on the central control unit of a cluster. It receives heartbeat information (health state) from the XAgent module and analyzes the heartbeats of master nodes, hot backup nodes, and read-only nodes. Based on this analysis, it automatically initiates tasks such as master-backup hot switching and automatic node repair, ensuring high availability of the RDS service.

For example:

Automatically repairs abnormal interruptions of master-slave synchronization.
Automatically repairs table-level corruption on master, backup, and read-only nodes.
Preserves the failure site and automatically repairs crashed master, backup, and read-only nodes.
When the master node crashes or its network is unavailable, automatically promotes the hot standby node to master to keep the service running.

The XMaster module also reports identity and status changes of the master, slave, and read-only nodes to the Load Balance and Proxy, so that users always access the correct nodes. For exceptional states that cannot be repaired, the module reports them to Noah so that the OPS staff are notified and can take immediate action.
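
The mapping from heartbeat contents to repair tasks can be sketched as follows. This is a minimal illustration only, not the actual XMaster logic: the `Heartbeat` fields, the `Action` names, and the `decide` function are all hypothetical and chosen to mirror the cases listed above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Role(Enum):
    MASTER = "master"
    HOT_BACKUP = "hot_backup"
    READ_ONLY = "read_only"


class Action(Enum):
    FAILOVER = "promote_hot_backup_to_master"
    REPAIR_NODE = "preserve_site_and_repair_node"
    REPAIR_TABLE = "repair_table_corruption"
    REPAIR_REPLICATION = "repair_master_slave_sync"


@dataclass
class Heartbeat:
    # Hypothetical heartbeat fields; the real payload is not documented here.
    node_id: str
    role: Role
    alive: bool = True            # False on crash or unreachable network
    tables_ok: bool = True        # False on table-level corruption
    replication_ok: bool = True   # False when master-slave sync is interrupted


def decide(hb: Heartbeat) -> List[Action]:
    """Return the tasks to initiate for one heartbeat (empty if healthy)."""
    if not hb.alive:
        actions = [Action.REPAIR_NODE]
        if hb.role is Role.MASTER:
            # Keep the service running first, then repair the old master.
            actions.insert(0, Action.FAILOVER)
        return actions
    actions = []
    if not hb.tables_ok:
        actions.append(Action.REPAIR_TABLE)
    if not hb.replication_ok:
        actions.append(Action.REPAIR_REPLICATION)
    return actions
```

For example, `decide(Heartbeat("m1", Role.MASTER, alive=False))` returns the failover task followed by the node repair task, while a healthy heartbeat returns an empty list.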

XAgent

The XAgent module detects the health status of RDS service nodes, for example whether the service is running normally and whether the usage of resources such as CPU, memory, and disk is normal. It reports each node's health status to the XMaster in heartbeat messages sent every 3 seconds. The XMaster module then decides whether to initiate high-availability maintenance tasks such as master-backup hot switching and automatic repair.

The XAgent module also monitors exceptions that cannot be repaired automatically, such as non-jitter network exceptions, disk usage beyond the alert line, abnormal interruption of master-slave synchronization, and excessive master-slave delay. It reports this information to the Noah module, which raises an alarm and notifies the OPS staff to take immediate action, ensuring the high availability of the RDS service in extreme cases.
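
A rough sketch of an agent loop of this kind is shown below. Only the 3-second interval comes from the text; the `NodeSample` fields, the threshold values, and the callback names (`read_sample`, `send_heartbeat`, `raise_alarm`) are assumptions made for illustration and do not describe the real XAgent interface.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, List

HEARTBEAT_INTERVAL_S = 3        # interval stated in the text
DISK_ALERT_RATIO = 0.85         # assumed alert line, not from the source
MAX_REPLICATION_DELAY_S = 60.0  # assumed delay limit, not from the source


@dataclass
class NodeSample:
    # Hypothetical health sample collected on one RDS node.
    node_id: str
    service_up: bool
    cpu_ratio: float
    memory_ratio: float
    disk_ratio: float
    replication_running: bool
    replication_delay_s: float


def build_heartbeat(sample: NodeSample, now: float) -> str:
    """Serialize a sample into a heartbeat message for the control unit."""
    payload = asdict(sample)
    payload["timestamp"] = now
    return json.dumps(payload, sort_keys=True)


def alerts_for(sample: NodeSample) -> List[str]:
    """List conditions that need an alarm rather than automatic repair."""
    alerts = []
    if sample.disk_ratio > DISK_ALERT_RATIO:
        alerts.append("disk usage beyond alert line")
    if not sample.replication_running:
        alerts.append("master-slave synchronization interrupted")
    elif sample.replication_delay_s > MAX_REPLICATION_DELAY_S:
        alerts.append("master-slave delay too long")
    return alerts


def run_agent(
    read_sample: Callable[[], NodeSample],
    send_heartbeat: Callable[[str], None],
    raise_alarm: Callable[[str, str], None],
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Send one heartbeat per interval and raise alarms for open alerts."""
    for _ in range(rounds):
        sample = read_sample()
        send_heartbeat(build_heartbeat(sample, time.time()))
        for alert in alerts_for(sample):
            raise_alarm(sample.node_id, alert)
        sleep(HEARTBEAT_INTERVAL_S)
```

Passing `sleep` as a parameter keeps the loop testable without waiting for real time to elapse.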

BCM

BCM, a monitoring and alarm module, receives exception information from the XAgent and notifies the OPS staff via email, SMS, and telephone. The staff can then take immediate action, ensuring high availability of the RDS service in extreme cases.

Cold Backup Service

The Cold Backup Service periodically takes full backups of Baidu AI Cloud database RDS service data and works with the hot backup nodes to ensure the security of user data.

How the High-availability Modules Work Together

Based on health information from the XAgent, the XMaster detects an anomaly on the master node and initiates a master-slave hot switch. User traffic is then directed to the newly promoted master node, and the node that was switched out is repaired automatically. If the repair fails, the XMaster sends the exception information to the BCM, which in turn notifies the OPS staff.

Based on health information from the XAgent, the XMaster module can also detect that master-slave synchronization on the hot backup node has been interrupted. If an automatic repair attempt does not resolve the interruption, the module reports the anomaly to BCM, which in turn notifies the OPS staff.
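
The two flows above share one pattern: act automatically first, and escalate to BCM only when automatic repair fails. A minimal sketch of that pattern follows; the function names and the callback parameters (`promote_backup`, `repair_node`, `repair_replication`, `notify_bcm`) are hypothetical stand-ins, not real module interfaces.

```python
from typing import Callable


def handle_master_failure(
    failed_node: str,
    promote_backup: Callable[[], str],
    repair_node: Callable[[str], bool],
    notify_bcm: Callable[[str], None],
) -> str:
    """Switch to the hot backup, then try to repair the old master."""
    new_master = promote_backup()  # user traffic now goes to this node
    if not repair_node(failed_node):
        notify_bcm(f"auto repair failed for {failed_node}")
    return new_master


def handle_replication_interruption(
    backup_node: str,
    repair_replication: Callable[[str], bool],
    notify_bcm: Callable[[str], None],
) -> bool:
    """Try to restore synchronization; escalate if the attempt fails."""
    repaired = repair_replication(backup_node)
    if not repaired:
        notify_bcm(f"replication on {backup_node} could not be repaired")
    return repaired
```

In both functions the service-preserving step does not wait for a human: the notification is sent only after the automatic path has been exhausted.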

Multi-AZ

A multi-AZ is a logical zone that combines multiple single AZs in the same region. The master and slave nodes are deployed in different single AZs. Compared with the single-AZ Baidu AI Cloud database RDS service, the multi-AZ service can withstand higher-level disasters.

For example, the single-AZ RDS service can handle faults at the server and rack levels, whereas the multi-AZ RDS service can handle faults at the data center level.

Presently, the multi-AZ RDS service incurs no additional fees. In a region where multi-AZ is enabled, you can purchase a multi-AZ RDS instance directly, or use the RDS service to migrate an existing single-AZ RDS instance to a multi-AZ deployment that spans different availability zones.

Note: Because of network latency between availability zones, a multi-AZ Baidu AI Cloud database RDS instance takes longer to respond to a single update than a single-AZ instance when semi-synchronous data replication is used. In this case, raise concurrency to improve overall throughput.
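
To see why higher concurrency offsets the added latency, consider a simplified model in which each connection issues one update at a time and the database is not otherwise saturated, so throughput is roughly concurrency divided by per-update latency. The latency figures below (2 ms and 4 ms) are made-up illustrative values, not measurements of the RDS service.

```python
def throughput_tps(concurrency: int, latency_ms: float) -> float:
    """Approximate updates per second for serial updates on each connection."""
    return concurrency * 1000.0 / latency_ms


# Illustrative numbers only: cross-AZ acknowledgement doubles the latency.
single_az = throughput_tps(concurrency=8, latency_ms=2.0)        # 4000.0
multi_az = throughput_tps(concurrency=8, latency_ms=4.0)         # 2000.0
multi_az_scaled = throughput_tps(concurrency=16, latency_ms=4.0)  # 4000.0
```

Under these assumptions, doubling the number of concurrent connections restores the original throughput even though each individual update still takes longer.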

Data Replication Mode

Select a data replication mode that suits your service characteristics to improve availability. Presently, Baidu AI Cloud database RDS for MySQL provides two data replication modes:

Asynchronous replication: When the application sends an update request (a data insertion, deletion, or modification), the master database responds to the application as soon as it completes the operation, and replicates the data to the backup database asynchronously. Operations on the master database are therefore unaffected when the backup database is unavailable. However, if the master database goes down after a transaction is committed but before its data has been sent to the backup database, the backup database does not contain that committed transaction. If the backup database is then promoted to master, the data is lost. Overall, there is a small probability of data inconsistency between the master and backup databases when the master database becomes unavailable.
Semi-synchronous replication: MySQL introduced semi-synchronous replication to address data loss in asynchronous replication. When a transaction is committed on the master database, its log must be written on the master, transmitted to at least one backup database, and written there; the commit succeeds on the master only after the backup database acknowledges it. Transactions committed on the master database can therefore be found on at least one backup database, which addresses the data loss seen in asynchronous replication. The cost is that at least one backup database must persist the data and send a confirmation, so commit latency on the master database increases and TPS decreases. If master-backup replication becomes abnormal, the replication mode falls back to asynchronous replication and is restored only after the abnormal situation is resolved. Thus, an unavailable master database can still lead to data inconsistency with a small probability.
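
The difference between the two modes can be illustrated with a toy in-memory model. This is a conceptual sketch only and does not reflect how MySQL implements replication: the `Master` and `Backup` classes, the `pending` queue, and `lost_on_failover` are invented for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Backup:
    reachable: bool = True
    log: List[str] = field(default_factory=list)

    def receive(self, txn: str) -> bool:
        """Persist the transaction and acknowledge it, if reachable."""
        if not self.reachable:
            return False
        self.log.append(txn)
        return True


@dataclass
class Master:
    backup: Backup
    semi_sync: bool
    log: List[str] = field(default_factory=list)
    pending: List[str] = field(default_factory=list)  # not yet replicated

    def commit(self, txn: str) -> None:
        self.log.append(txn)
        if self.semi_sync:
            if self.backup.receive(txn):
                return              # acknowledged before the commit returns
            self.semi_sync = False  # no acknowledgement: fall back to async
        self.pending.append(txn)    # replicated later, in the background

    def ship_pending(self) -> None:
        """Background replication step used in asynchronous mode."""
        while self.pending and self.backup.receive(self.pending[0]):
            self.pending.pop(0)


def lost_on_failover(master: Master) -> List[str]:
    """Transactions committed on the master but missing on the backup."""
    return [t for t in master.log if t not in master.backup.log]
```

In this model, a commit in asynchronous mode leaves the transaction exposed until `ship_pending` runs, whereas a semi-synchronous commit is already on the backup when it returns. Once the backup stops acknowledging, the master degrades to asynchronous behavior and new commits become exposed again, which matches the residual risk described above.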
