RDS Monitoring and Alarm Configuration

Last Updated: 2021-05-10

Background

After an RDS instance is created, two alarm policies (disk usage rate and CPU occupancy rate) are configured automatically by default. To learn about the database's running status more promptly and accurately, we recommend that you configure more comprehensive monitoring policies in BCM yourself. BCM provides a monitoring data collection solution for RDS, and you can select and configure monitoring items according to your needs.

BCM for RDS Monitoring Configuration Method

See Monitoring and Alarm Operations Guide

Recommended Alarm Thresholds for RDS for MySQL Monitoring Items

Monitoring Item	Statistical Cycle	Statistical Method	Recommended Threshold	Alarm after User-defined Repetitions
CPU occupancy rate	1min	Mean	> 80%	3
Data space disk usage rate	1min	Mean	> 80%	3
System space disk usage rate	1min	Mean	> 80%	3
Memory utilization	1min	Mean	> 90%	3
Slow log	1min	Mean	> 2 × the current instance's CPU core count	3
Master-slave delay	1min	Mean	> 300	3
Total number of connections	1min	Mean	> 80% of the current instance's "max_connections" parameter	3
Number of active connections	1min	Mean	> 2 × the current instance's CPU core count	3
Maximum transaction execution time	1min	Mean	> 60	3
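
The instance-dependent thresholds in the table can be derived from two values: the CPU core count and "max_connections". The following Python sketch shows one way to compute them and to check the repetition rule. It is only an illustration: the function names and dictionary keys are hypothetical and are not BCM metric identifiers, and it assumes that "alarm after 3 repetitions" means the mean exceeded the threshold in 3 consecutive 1-minute cycles.

```python
from typing import Dict, Sequence


def recommended_thresholds(cpu_cores: int, max_connections: int) -> Dict[str, float]:
    """Return the recommended thresholds from the table for one instance.

    The keys are hypothetical labels chosen for this example.
    Units follow the table; it does not state units for delay or transaction time.
    """
    return {
        "cpu_usage_percent": 80.0,
        "data_disk_usage_percent": 80.0,
        "system_disk_usage_percent": 80.0,
        "memory_usage_percent": 90.0,
        "slow_log_count": 2.0 * cpu_cores,
        "master_slave_delay": 300.0,
        "total_connections": 0.8 * max_connections,
        "active_connections": 2.0 * cpu_cores,
        "max_transaction_time": 60.0,
    }


def should_alarm(cycle_means: Sequence[float], threshold: float, repetitions: int = 3) -> bool:
    """Return True if the latest `repetitions` cycle means all exceed the threshold.

    Assumption: repetitions are counted over consecutive statistical cycles.
    """
    if repetitions <= 0 or len(cycle_means) < repetitions:
        return False
    return all(value > threshold for value in cycle_means[-repetitions:])


if __name__ == "__main__":
    thresholds = recommended_thresholds(cpu_cores=4, max_connections=1000)
    # Hypothetical CPU means for the last four 1-minute cycles.
    print(should_alarm([70.0, 85.0, 90.0, 95.0], thresholds["cpu_usage_percent"]))
```

With these hypothetical inputs, the last three cycle means are all above 80%, so the check reports that an alarm should fire. A single spike followed by a normal cycle would not.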

Best Practices for RDS Disk Monitoring

Disk monitoring curve

Data space disk usage rate

Note: The usage rate of the data space disk. Calculation formula: disk space used by data / purchased disk space, that is, user data (including table files, the shared tablespace, and temporary files) / purchased disk space. See the blue curve in the monitoring chart. Influence: If the data space disk usage rate exceeds 100%, the RDS instance is set to read-only mode, and users cannot write data.

System space disk usage rate

Note: The usage rate of the system space disk. Calculation formula: (disk space used by data + disk space used by logs) / purchased disk space, that is, (user data + logs (mysql.log, slow.log, mysql.err, binlog, and system collection logs)) / purchased disk space. See the red curve in the monitoring chart. Influence: If the system space disk usage rate reaches 100%, the disk is full and no further data can be written.
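
The two formulas above differ only in whether log files are counted. A minimal Python sketch of both calculations follows. The function and parameter names are hypothetical, and the sample sizes are made up for illustration, not taken from a real instance.

```python
def data_space_usage_percent(user_data_gb: float, purchased_gb: float) -> float:
    """Data space disk usage rate: user data / purchased disk space."""
    if purchased_gb <= 0:
        raise ValueError("purchased_gb must be positive")
    return 100.0 * user_data_gb / purchased_gb


def system_space_usage_percent(user_data_gb: float, log_gb: float, purchased_gb: float) -> float:
    """System space disk usage rate: (user data + logs) / purchased disk space."""
    if purchased_gb <= 0:
        raise ValueError("purchased_gb must be positive")
    return 100.0 * (user_data_gb + log_gb) / purchased_gb


if __name__ == "__main__":
    # Hypothetical instance: 200 GB purchased, 50 GB of user data, 30 GB of logs.
    print(data_space_usage_percent(50.0, 200.0))          # 25.0
    print(system_space_usage_percent(50.0, 30.0, 200.0))  # 40.0
```

Because logs count only toward the system space figure, the system space curve can climb quickly while the data space curve stays flat. The case below shows this pattern.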

Case

A customer purchases a Dual High-availability instance, initializes the data, and then views the following disk monitoring information:

Usage rate of the data space disk

System space disk usage: 14.42%

For data security and auditing purposes, the customer enables the full log and sets a relatively long retention period for binlog. After the instance has been running for some time, the customer receives a call from RDS: the disk usage rate has risen sharply, reaching 87% within one hour, so there is a risk of the disk becoming full. See the following figure:

The customer authorizes a DBA to locate the cause of the sharp rise in disk usage: SQL statements that violate usage rules caused log files such as mysql.log, slow.log, and mysql.err to grow sharply.

The solution is to upgrade the disk package appropriately, optimize the SQL, and purge the abnormal log files. Afterward, the system space disk usage rate returns to normal. See the following figure:

