RDS Monitoring and Alarm Configuration
Background
After an RDS instance is created, two alarm policies (disk usage rate and CPU occupancy rate) are configured automatically by default. To learn about the database's running status more promptly and accurately, we recommend that you configure more comprehensive monitoring policies in BCM yourself. BCM provides a monitoring data collection solution for RDS, and you can select and configure monitoring items according to your needs.
Configuring RDS Monitoring in BCM
See the Monitoring and Alarm Operations Guide.
Recommended Alarm Thresholds for RDS for MySQL Monitoring Items
Monitoring Item | Statistical Cycle | Statistical Method | Recommended Threshold | Alarm after User-defined Repetitions |
---|---|---|---|---|
CPU occupancy rate | 1min | Mean | > 80% | 3 |
Data space disk usage rate | 1min | Mean | > 80% | 3 |
System space disk usage rate | 1min | Mean | > 80% | 3 |
Memory utilization | 1min | Mean | > 90% | 3 |
Slow log | 1min | Mean | > 2 times the number of CPU cores of the current instance | 3 |
Master-slave delay | 1min | Mean | > 300 | 3 |
Total number of connections | 1min | Mean | > 80% of the current instance's "max_connections" parameter | 3 |
Number of active connections | 1min | Mean | > 2 times the number of CPU cores of the current instance | 3 |
Maximum transaction execution time | 1min | Mean | > 60 | 3 |
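Several thresholds in the table depend on the instance itself (CPU core count and "max_connections"). The following is a minimal Python sketch of how those values could be derived and how a repetition count might be applied. The function names are hypothetical and are not part of BCM, and the sketch assumes that "Alarm after User-defined Repetitions" means the threshold is exceeded in that many consecutive statistical cycles.

```python
def recommended_thresholds(cpu_cores: int, max_connections: int) -> dict:
    """Derive the instance-dependent thresholds listed in the table.

    Hypothetical helper for illustration only; BCM itself is configured
    through its console, not through this function.
    """
    return {
        # Slow log: more than 2 times the number of CPU cores
        "slow_log": 2 * cpu_cores,
        # Active connections: more than 2 times the number of CPU cores
        "active_connections": 2 * cpu_cores,
        # Total connections: more than 80% of max_connections
        "total_connections": max_connections * 80 / 100,
    }


def should_alarm(samples: list, threshold: float, repetitions: int = 3) -> bool:
    """Return True when the latest `repetitions` samples all exceed the threshold.

    `samples` holds one mean value per 1-minute statistical cycle, oldest first.
    Treating the repetitions as consecutive cycles is an assumption.
    """
    if len(samples) < repetitions:
        return False
    return all(value > threshold for value in samples[-repetitions:])


if __name__ == "__main__":
    # Example instance: 4 CPU cores, max_connections = 1000 (made-up values)
    print(recommended_thresholds(cpu_cores=4, max_connections=1000))
    # CPU occupancy means for four consecutive minutes, threshold 80%
    print(should_alarm([70, 85, 90, 95], threshold=80))
```

For an instance with different specifications, only the two arguments to `recommended_thresholds` change; the percentage-based thresholds (CPU, disk, memory) stay the same regardless of instance size.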
Best Practices for RDS Disk Monitoring
Disk monitoring curve
- Data space disk usage rate
Note: The usage rate of the data space disk. Calculation formula: disk space used by data / purchased disk space, i.e., user data (including table files, shared tablespace, and temporary files) / purchased disk space. See the following blue monitoring curve. Influence: If the data space disk usage rate exceeds 100%, RDS sets the instance to read-only mode, so that the user cannot write data.
- System space disk usage rate
Note: The usage rate of the system space disk. Calculation formula: (disk space used by data + disk space used by logs) / purchased disk space, i.e., (user data + logs (mysql.log, slow.log, mysql.err, binlog, and system collection logs)) / purchased disk space. See the following red monitoring curve. Influence: If the system space disk usage rate reaches 100%, the disk is full and no more data can be written.
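The two calculation formulas can be expressed as a short Python sketch. The function and parameter names are hypothetical, the sizes can be in any unit as long as all arguments use the same one, and the numbers in the example are made up for illustration.

```python
def data_space_usage(user_data: float, purchased: float) -> float:
    """Data space disk usage rate in percent: user data / purchased disk space."""
    if purchased <= 0:
        raise ValueError("purchased disk space must be positive")
    return 100.0 * user_data / purchased


def system_space_usage(user_data: float, logs: float, purchased: float) -> float:
    """System space disk usage rate in percent: (user data + logs) / purchased disk space."""
    if purchased <= 0:
        raise ValueError("purchased disk space must be positive")
    return 100.0 * (user_data + logs) / purchased


if __name__ == "__main__":
    # Made-up example: 100 GB purchased, 50 GB of user data, 37 GB of logs
    print(data_space_usage(user_data=50, purchased=100))
    print(system_space_usage(user_data=50, logs=37, purchased=100))
```

Because logs count toward the system space rate but not the data space rate, the system space rate is never lower than the data space rate, and a large gap between the two points to log growth rather than data growth.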
Case
A customer purchases a Dual High-availability instance, initializes the data, and then views the following disk monitoring information:
Usage rate of the data space disk
System space disk usage: 14.42%
For data security and auditing purposes, the customer enables the full log and sets a relatively long retention period for "binlog". After the instance has been running for some time, the customer receives a call from RDS: the disk usage rate has risen sharply and hit 87% within one hour, so there is a risk of a full disk. See the following figure:
The customer authorizes a DBA to locate the cause of the sharp rise in the disk usage rate: SQL statements that violate the usage rules caused log files such as "mysql.log", "slow.log", and "mysql.err" to grow sharply.
The solution is to upgrade the disk package appropriately, optimize the SQL, and purge the abnormal log files. After that, the system space disk usage rate returns to normal. See the following figure: