Recommended Monitoring and Maintenance Tasks
This section lists monitoring and maintenance operations recommended to ensure high availability and consistent performance of your Cloudberry Database cluster.
The tables in the following sections suggest operations that a Cloudberry Database system administrator can perform periodically to ensure that all components of the system are operating optimally. Monitoring operations help you to detect and diagnose problems early. Maintenance operations help you to keep the system up-to-date and avoid deteriorating performance, for example, from bloated system tables or diminishing free disk space.
It is not necessary to implement all of these suggestions in every cluster; use the frequency and severity recommendations as a guide to implement measures according to your service requirements.
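One way to act on the frequency and severity recommendations is to record them as data and let a scheduler decide which checks are due. The following is a minimal sketch, not a Cloudberry Database facility: the `Check` class, the check names, and `due_checks` are hypothetical, and each interval is the lower bound of the range given in the tables below.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Check:
    name: str            # hypothetical label, not a Cloudberry identifier
    interval: timedelta  # recommended frequency from the tables
    severity: str        # CRITICAL, IMPORTANT, or WARNING


# A few entries taken from the tables in this section.
CHECKS = [
    Check("segments_down", timedelta(minutes=5), "IMPORTANT"),
    Check("distributed_query", timedelta(minutes=5), "CRITICAL"),
    Check("raid_status", timedelta(minutes=5), "CRITICAL"),
    Check("network_interface_errors", timedelta(hours=1), "IMPORTANT"),
    Check("catalog_consistency", timedelta(weeks=1), "IMPORTANT"),
    Check("vacuum_user_tables", timedelta(days=1), "CRITICAL"),
]

SEVERITY_ORDER = {"CRITICAL": 0, "IMPORTANT": 1, "WARNING": 2}


def due_checks(checks, last_run, now):
    """Return checks whose interval has elapsed, most severe first.

    last_run maps a check name to the datetime it last ran; a check
    that has never run is always due.
    """
    due = [
        c for c in checks
        if c.name not in last_run or now - last_run[c.name] >= c.interval
    ]
    return sorted(due, key=lambda c: SEVERITY_ORDER.get(c.severity, len(SEVERITY_ORDER)))


if __name__ == "__main__":
    now = datetime.now()
    for check in due_checks(CHECKS, last_run={}, now=now):
        print(f"{check.severity:<9} {check.name}")
```

Keeping the schedule as data makes it easy to drop checks that your service requirements do not call for, or to lengthen an interval, without touching the scheduling logic.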
Database state monitoring operations
Operations | Procedure | Corrective Actions |
---|---|---|
List segments that are currently down. If any rows are returned, this should generate a warning or alert. Recommended frequency: run every 5 to 10 minutes Severity: IMPORTANT | Run the following query in the
| If the query returns any rows, follow these steps to correct the problem:
|
Check for segments that are up and not in sync. If rows are returned, this should generate a warning or alert. Recommended frequency: run every 5 to 10 minutes | Execute the following query in the
| If the query returns rows, then the segment might be in the process of
moving from |
Check for segments that are not operating in their preferred role but
are marked as up and Recommended frequency: run every 5 to 10 minutes Severity: IMPORTANT | Execute the following query in the
| When the segments are not running in their preferred role, processing
might be skewed. Run |
Run a distributed query to test that it runs on all segments. One row should be returned for each primary segment. Recommended frequency: run every 5 to 10 minutes Severity: CRITICAL | Execute the following query in the
| If this query fails, there is an issue dispatching to some segments in the cluster. This is a rare event. Check the hosts that are not able to be dispatched to ensure there is no hardware or networking issue. |
Test the state of coordinator mirroring on Cloudberry Database. If the value is not "STREAMING", this should generate a warning or alert. Recommended frequency: run every 5 to 10 minutes Severity: IMPORTANT | Run the following
| Check the log file from the coordinator and standby coordinator for
errors. If there are no unexpected errors and the machines are up, run
the |
Perform a basic check to see whether the coordinator is up and functioning. Recommended frequency: run every 5 to 10 minutes Severity: CRITICAL | Run the following query in the
| If this query fails, the active coordinator might be down. Try to start the database on the original coordinator if the server is up and running. If that fails, try to activate the standby coordinator as coordinator. |
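The three segment checks in the table above can be expressed as one classification over segment configuration rows. This is a sketch under stated assumptions: it assumes each row is a dict with the columns `dbid`, `content`, `role`, `preferred_role`, `mode`, and `status`, as in the `gp_segment_configuration` catalog of Greenplum-derived systems, with `status` `'u'`/`'d'` for up/down, `mode` `'s'`/`'n'` for synchronized/not in sync, and `content` `-1` for the coordinator. Verify these names and codes against your Cloudberry Database version. How the rows are fetched is left out, and `classify_segments` is a hypothetical helper.

```python
def classify_segments(rows):
    """Group segment dbids by the problem they exhibit.

    Assumed row keys: dbid, content, role, preferred_role, mode, status.
    """
    findings = {"down": [], "not_in_sync": [], "role_switched": []}
    for row in rows:
        if row["status"] == "d":
            # Segment is marked down.
            findings["down"].append(row["dbid"])
            continue
        # From here on the segment is up.
        if row["mode"] == "n" and row["content"] != -1:
            # Up but not synchronized (coordinator rows excluded).
            findings["not_in_sync"].append(row["dbid"])
        elif row["mode"] == "s" and row["role"] != row["preferred_role"]:
            # Up and synchronized, but not in its preferred role.
            findings["role_switched"].append(row["dbid"])
    return findings


if __name__ == "__main__":
    sample = [
        {"dbid": 1, "content": -1, "role": "p", "preferred_role": "p", "mode": "n", "status": "u"},
        {"dbid": 2, "content": 0, "role": "p", "preferred_role": "p", "mode": "s", "status": "u"},
        {"dbid": 3, "content": 0, "role": "m", "preferred_role": "m", "mode": "s", "status": "d"},
    ]
    for problem, dbids in classify_segments(sample).items():
        if dbids:
            print(f"ALERT {problem}: dbids {dbids}")
```

Any non-empty list corresponds to "rows returned" in the table and would generate a warning or alert.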
Hardware and operating system monitoring
Operations | Procedure | Corrective Actions |
---|---|---|
Check disk space usage on volumes used for Cloudberry Database data storage and the OS. Recommended frequency: every 5 to 30 minutes Severity: CRITICAL |
| Use |
Check for errors or dropped packets on the network interfaces. Recommended frequency: hourly Severity: IMPORTANT | Set up network interface checks. | Work with network and OS teams to resolve errors. |
Check for RAID errors or degraded RAID performance. Recommended frequency: every 5 minutes Severity: CRITICAL | Set up a RAID check. |
|
Check for adequate I/O bandwidth and I/O skew. Recommended frequency: when creating a cluster or when hardware issues are suspected. | Run the Cloudberry Database
| The cluster might be under-specified if data transfer rates are not similar to the following:
If transfer rates are lower than expected, consult with your data architect regarding performance expectations. If the machines on the cluster display an uneven performance profile, work with the system administration team to fix faulty machines. |
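The disk space check in the table above could be scripted roughly as follows. This is a sketch, not a Cloudberry Database utility: `volumes_over_threshold` is a hypothetical helper, and the paths and the threshold in the usage example are placeholders to replace with your own data directories and alerting policy. It relies only on `shutil.disk_usage` from the Python standard library, so it must run on each host whose volumes you want to check.

```python
import shutil


def volumes_over_threshold(paths, threshold_pct, usage_fn=shutil.disk_usage):
    """Return (path, used_percent) for each volume at or above the threshold.

    usage_fn must return an object with `total` and `used` attributes in
    bytes; it is a parameter so the function can be tested without real disks.
    """
    over = []
    for path in paths:
        usage = usage_fn(path)
        used_pct = 100.0 * usage.used / usage.total
        if used_pct >= threshold_pct:
            over.append((path, round(used_pct, 1)))
    return over


if __name__ == "__main__":
    # Placeholder path and threshold; adjust to your environment.
    for path, pct in volumes_over_threshold(["/"], threshold_pct=75):
        print(f"ALERT: {path} is {pct}% full")
```

The percentage is computed as used bytes over total bytes, which can differ slightly from the figure reported by `df`, because `df` excludes blocks reserved for the superuser.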
Catalog monitoring
Operations | Procedure | Corrective Actions |
---|---|---|
Run catalog consistency checks in each database to ensure the catalog on each host in the cluster is consistent and in a good state. You can run this command while the database is up and running. Recommended frequency: weekly Severity: IMPORTANT | Run the Cloudberry Database
Note: With the
| Run the repair scripts for any issues identified. |
Check for Recommended frequency: monthly Severity: IMPORTANT | With no users on the system, run the Cloudberry Database
| Run the repair scripts for any issues identified. |
Check for leaked temporary schema and missing schema definition. Recommended frequency: monthly Severity: IMPORTANT | During a downtime, with no users on the system, run the Cloudberry Database
| Run the repair scripts for any issues identified. |
Check constraints on randomly distributed tables. Recommended frequency: monthly Severity: IMPORTANT | With no users on the system, run the Cloudberry Database
| Run the repair scripts for any issues identified. |
Check for dependencies on non-existent objects. Recommended frequency: monthly Severity: IMPORTANT | During a downtime, with no users on the system, run the Cloudberry Database
| Run the repair scripts for any issues identified. |
Data maintenance
Operations | Procedure | Corrective Actions |
---|---|---|
Check for missing statistics on tables. | Check the
| Run |
Check for tables that have bloat (dead space) in data files that cannot
be recovered by a regular Recommended frequency: weekly or monthly Severity: WARNING | Check the
|
|
Database maintenance
Operations | Procedure | Corrective Actions |
---|---|---|
Reclaim space occupied by deleted rows in the heap tables so that the space they occupy can be reused. Recommended frequency: daily Severity: CRITICAL | Vacuum user tables:
| Vacuum updated tables regularly to prevent bloating. |
Update table statistics. Recommended frequency: after loading data and before executing queries Severity: CRITICAL | Analyze user tables. You can use the
| Analyze updated tables regularly so that the optimizer can produce efficient query execution plans. |
Back up the database data. Recommended frequency: daily, or as required by your backup plan Severity: CRITICAL | Run the | Best practice is to have a current backup ready in case the database must be restored. |
Vacuum, reindex, and analyze system catalogs to maintain an efficient catalog. Recommended frequency: weekly, or more often if database objects are created and dropped frequently |
| The optimizer retrieves information from the system tables to create
query plans. If system tables and indexes are allowed to become bloated
over time, scanning the system tables increases query execution time. It
is important to run |
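The routine vacuum and analyze tasks in the table above lend themselves to a generated script. The sketch below builds standard `VACUUM` and `ANALYZE` statements for a list of user tables. The helper names and the sample table are hypothetical, and how the statements are run (for example, through `psql` during a quiet period) is left to your environment.

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def maintenance_statements(tables, analyze=True):
    """Build VACUUM (and optionally ANALYZE) statements.

    tables is an iterable of (schema, table) pairs.
    """
    statements = []
    for schema, table in tables:
        target = f"{quote_ident(schema)}.{quote_ident(table)}"
        statements.append(f"VACUUM {target};")
        if analyze:
            statements.append(f"ANALYZE {target};")
    return statements


if __name__ == "__main__":
    # Hypothetical table list; in practice this would come from the catalog.
    for stmt in maintenance_statements([("public", "sales")]):
        print(stmt)
```

Generating one statement per table keeps each operation short and makes it easy to restrict daily maintenance to the tables that were actually updated.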