Apache Cloudberry Crash Course
This crash course provides an extensive overview of Apache Cloudberry, an open-source Massively Parallel Processing (MPP) database. It covers key concepts, features, utilities, and hands-on exercises to help you become proficient with Cloudberry.
Topics include:
- Lesson 0. Prerequisite
- Lesson 1. Where to read the official documentation
- Lesson 2. How to install Cloudberry
- Lesson 3. Cluster architecture
- Lesson 4. Management utilities
- Lesson 5. Start and stop a cluster
- Lesson 6. Check cluster state
- Lesson 7. How Cloudberry segment mirroring works
- Lesson 8. Cloudberry's fault tolerance and segment recovery
- Lesson 9. Set up and manage the standby coordinator instance in Cloudberry
- Lesson 10. Expand a cluster
- Lesson 11. Check cluster performance
- Lesson 12. User data and table distribution
- Lesson 13. Database catalog
- Lesson 14. Cloudberry data directories
- Lesson 15. Instance processes
- Lesson 16. Database log files
- Lesson 17. Table types in Cloudberry: heap, AO, and AOCO
- Lesson 18. External tables
- Lesson 19. Workload management
Lesson 0. Prerequisite
Before starting this crash course, spend some time going through the Apache Cloudberry Tutorials Based on Single-Node Installation to get familiar with Apache Cloudberry and how it works.
Lesson 1. Where to read the official documentation
Take a quick look at the official Cloudberry Documentation. No need to worry if you do not understand everything.
Lesson 2. How to install Cloudberry
To begin, install Cloudberry in your preferred environment. The following options are available:
- For testing or trying out Cloudberry in a sandbox environment, see Install Cloudberry in a Sandbox.
- For deploying Cloudberry in other environments (including the production environment) and the prerequisite software/hardware configuration, see Cloudberry Deployment Guide.
Lesson 3. Cluster architecture
A Cloudberry cluster has one coordinator host (usually named cdw) and multiple segment hosts (usually named sdwXX).
When someone refers to cdw, they mean the coordinator host. Similarly, sdw10 refers to the 10th segment host.
A coordinator host usually contains only one instance: the coordinator instance. The segment hosts might contain many worker instances. Every instance has its own set of processes, data directory, and listening port. For example, the listening port of the coordinator instance, to which all clients connect, is usually 5432.
Every segment instance has its own listening port, and the base port is specified in the cluster configuration file.
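As a rough illustration of how segment ports follow from the base port, the sketch below assumes that the first primary segment on a host listens on the base port and each additional primary on that host takes the next port number. This agrees with the sample output later in this lesson (40000, 40001), but the function name and its parameters are hypothetical and not part of any Cloudberry utility.

```python
def primary_segment_ports(port_base: int, segments_per_host: int) -> list[int]:
    """Return the listening ports of the primary segments on one host.

    Assumption: ports start at port_base and increase by one per segment.
    """
    if segments_per_host < 0:
        raise ValueError("segments_per_host must not be negative")
    return [port_base + i for i in range(segments_per_host)]


# Two primary segments on one host with a base port of 40000.
print(primary_segment_ports(40000, 2))  # [40000, 40001]
```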
Instances can have 2 roles - primary and mirror. Primary instances serve database queries. Mirror instances simply track and record data changes in primary instances, but do not serve database queries. If the primary instance goes down for some reason, then the corresponding mirror instance transitions to the primary role and starts serving queries (the original primary instance, currently down, is marked as mirror).
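The role swap described above can be modeled in a few lines. This is only a toy model of the bookkeeping, not how Cloudberry implements failover; the Instance class and fail_over function are hypothetical names, and the field names mirror columns of the gp_segment_configuration table.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    dbid: int
    content: int
    role: str            # current role: "p" (primary) or "m" (mirror)
    preferred_role: str  # role in the original configuration
    status: str          # "u" (up) or "d" (down)


def fail_over(primary: Instance, mirror: Instance) -> None:
    """Swap roles in a segment pair after the primary goes down."""
    if primary.content != mirror.content:
        raise ValueError("instances do not belong to the same segment pair")
    if primary.role != "p" or mirror.role != "m":
        raise ValueError("expected a primary and its mirror")
    # The failed primary is marked as a mirror that is down.
    primary.role, primary.status = "m", "d"
    # The mirror takes over the primary role and starts serving queries.
    mirror.role = "p"


primary = Instance(dbid=2, content=0, role="p", preferred_role="p", status="u")
mirror = Instance(dbid=4, content=0, role="m", preferred_role="m", status="u")
fail_over(primary, mirror)
print(primary.role, primary.status, mirror.role, mirror.preferred_role)  # m d p m
```

Note that preferred_role is left untouched on both instances, so the original layout can still be read back after the swap.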
The cluster information is stored in the gp_segment_configuration system table, which looks like this (use the psql command to connect to the database and execute queries):
[gpadmin@cdw ~]$ psql
psql (14.4, server 14.4)
Type "help" for help.
gpadmin=# SELECT * FROM gp_segment_configuration;
 dbid | content | role | preferred_role | mode | status | port  | hostname | address | datadir
------+---------+------+----------------+------+--------+-------+----------+---------+-------------------------------------
    1 |      -1 | p    | p              | n    | u      |  5432 | cdw      | cdw     | /data0/database/coordinator/gpseg-1
    2 |       0 | p    | p              | n    | u      | 40000 | cdw      | cdw     | /data0/database/primary/gpseg0
    3 |       1 | p    | p              | n    | u      | 40001 | cdw      | cdw     | /data0/database/primary/gpseg1
(3 rows)
The columns of this system table are described as follows.
- dbid: uniquely identifies a segment.
- content: uniquely identifies a segment pair (primary and mirror). The primary and its corresponding mirror have the same content ID but different dbid values. The coordinator has the content value -1. The worker instances have incremental content values of 0, 1, 2, 3, and so on.
- role: the current role of the segment.
- preferred_role: the role of the segment in the original configuration. If an original mirror instance has taken over and become primary, its role changes, but this column still records the original role.
- mode: the mode of the segment. The value options are s (in sync), c (in change tracking), and r (in recovery). The sample output above shows n (not in sync).
- status: the status of the segment. The value options are u (up) and d (down).
- port: the listening port of the segment. For clients, only the listening port of the coordinator is important. The coordinator uses the segment listening ports to communicate with the segments.
- hostname: the hostname of the segment.
- address: the address of the segment. Each host can have several network controllers, each with its own IP address and name.
- datadir: the data directory where data is stored for each segment.
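To make the column meanings concrete, here is a small sketch that works on rows copied from the sample output in this lesson (reduced to a few columns). The helper names coordinator and needs_attention are made up for illustration. On a live cluster you would query the table with psql instead.

```python
# Rows copied from the sample output in this lesson, reduced to a few columns.
rows = [
    {"dbid": 1, "content": -1, "role": "p", "preferred_role": "p", "status": "u", "port": 5432},
    {"dbid": 2, "content": 0, "role": "p", "preferred_role": "p", "status": "u", "port": 40000},
    {"dbid": 3, "content": 1, "role": "p", "preferred_role": "p", "status": "u", "port": 40001},
]


def coordinator(rows):
    """Return the coordinator row: content is -1 and the current role is primary."""
    return next(r for r in rows if r["content"] == -1 and r["role"] == "p")


def needs_attention(rows):
    """Return rows that are down or not running in their preferred role."""
    return [r for r in rows if r["status"] == "d" or r["role"] != r["preferred_role"]]


print(coordinator(rows)["port"])  # 5432
print(needs_attention(rows))      # []
```

An empty result from needs_attention means every instance is up and running in the role it had in the original configuration.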
Exercise
Connect to the Cloudberry cluster that you have created and take a look at the gp_segment_configuration table. Try to learn the rows and columns. Take a look at the cluster configuration file that you used to create the cluster.
Lesson 4. Management utilities
Management utilities in Cloudberry are command-line tools used to administer and manage the database cluster. Some key points:
- They perform tasks such as starting, stopping, and configuring the database.
- They help monitor the health and status of the cluster.
- They are used for maintenance, such as recovering nodes and rebalancing data.
- They help scale out the cluster by expanding it with more nodes.
- They work across the coordinator, standby coordinator, and multiple segment instances.
In summary, management utilities are command-line programs and scripts used by DBAs to administer, monitor, maintain and manage a Cloudberry cluster. The following are some common utilities.
- gpstop: stops a database cluster.
- gpstart: starts a database cluster.
- psql: a command-line client.
- gpconfig: shows or changes configuration parameters.
- gpdeletesystem: deletes a cluster.
- pg_dump, gpbackup, gprestore: perform backup and restore operations.
- gpinitstandby, gpactivatestandby: manage the standby coordinator instance.
- gprecoverseg: recovers segments.
- gpfdist, gpload: operate with external tables.
- gpssh, gpscp, gpssh-exkeys: for cluster navigation.
- Logging: all utilities write log files under ~/gpAdminLogs/, one file per day.
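When looking for a utility's log, it helps to know which file to open. The sketch below builds the expected path for a given utility and day. It assumes that log files are named <utility>_<YYYYMMDD>.log, which is an assumption not stated in this lesson, so check the actual file names in ~/gpAdminLogs/ on your system. The function name is hypothetical.

```python
from datetime import date
from pathlib import PurePosixPath


def utility_log_path(utility: str, day: date, home: PurePosixPath) -> PurePosixPath:
    """Build the expected log path for a utility on a given day.

    Assumption: files are named <utility>_<YYYYMMDD>.log.
    """
    return home / "gpAdminLogs" / f"{utility}_{day:%Y%m%d}.log"


path = utility_log_path("gpstart", date(2024, 1, 31), PurePosixPath("/home/gpadmin"))
print(path)  # /home/gpadmin/gpAdminLogs/gpstart_20240131.log
```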
Exercise
Read the help information for these tools (<tool_name> --help).
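If you prefer to skim the help text of several tools in one go, a small loop like the following may save some typing. It is only a convenience sketch: the TOOLS list is an arbitrary subset of the utilities named in this lesson, the helper names are made up, and tools that are not found on the PATH are skipped.

```python
import shutil
import subprocess

# Arbitrary subset of the utilities listed in this lesson.
TOOLS = ["gpstart", "gpstop", "gpconfig", "gprecoverseg", "gpinitstandby"]


def help_command(tool: str) -> list[str]:
    """Return the argument list that prints a tool's help text."""
    return [tool, "--help"]


def installed(tools: list[str]) -> list[str]:
    """Keep only the tools that can be found on the current PATH."""
    return [t for t in tools if shutil.which(t) is not None]


if __name__ == "__main__":
    for tool in installed(TOOLS):
        result = subprocess.run(
            help_command(tool), capture_output=True, text=True, timeout=30
        )
        # Show only the first lines; run <tool_name> --help for the full text.
        print(f"== {tool} ==")
        print("\n".join(result.stdout.splitlines()[:5]))
```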
Lesson 5. Start and stop a cluster
- Start a Cloudberry cluster using gpstart:
  [gpadmin@cdw ~]$ gpstart -a