Apache Cloudberry Crash Course
This crash course provides an extensive overview of Apache Cloudberry, an open-source Massively Parallel Processing (MPP) database. It covers key concepts, features, and utilities, and includes hands-on exercises to help you become proficient with Cloudberry.
Topics include:
- Lesson 0. Prerequisite
- Lesson 1. Where to read the official documentation
- Lesson 2. How to install Cloudberry
- Lesson 3. Cluster architecture
- Lesson 4. Management utilities
- Lesson 5. Start and stop a cluster
- Lesson 6. Check cluster state
- Lesson 7. How Cloudberry segment mirroring works
- Lesson 8. Cloudberry's fault tolerance and segment recovery
- Lesson 9. Set up and manage the standby coordinator instance in Cloudberry
- Lesson 10. Expand a cluster
- Lesson 11. Check cluster performance
- Lesson 12. User data and table distribution
- Lesson 13. Database catalog
- Lesson 14. Cloudberry data directories
- Lesson 15. Instance processes
- Lesson 16. Database log files
- Lesson 17. Table types in Cloudberry: heap, AO, and AOCO
- Lesson 18. External tables
- Lesson 19. Workload management
Lesson 0. Prerequisite
Before starting this crash course, spend some time going through the Apache Cloudberry Tutorials Based on Single-Node Installation to get familiar with Apache Cloudberry and how it works.
Lesson 1. Where to read the official documentation
Take a quick look at the official Cloudberry Documentation. No need to worry if you do not understand everything.
Lesson 2. How to install Cloudberry
To begin your journey with Cloudberry, first install it in your preferred environment. The following options are available:
- For testing or trying out Cloudberry in a sandbox environment, see Install Cloudberry in a Sandbox.
- For deploying Cloudberry in other environments (including production) and for the prerequisite software and hardware configuration, see the Cloudberry Deployment Guide.
Lesson 3. Cluster architecture
A Cloudberry cluster has one coordinator host (usually named cdw) and multiple segment hosts (usually named sdwXX).
When someone refers to cdw, they mean the coordinator host. Similarly, "sdw10" means the 10th segment host.
A coordinator host usually contains only one instance, the coordinator instance. The segment hosts might contain many worker instances. Every instance has its own set of processes, data directory, and listening port. For example, the listening port of the coordinator instance (to which all clients connect) is usually 5432.
Every segment instance has its own listening port, and the base port is specified in the cluster configuration file.
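To make the idea of a base port concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the segment instances on a host receive consecutive ports starting at the base port; the function name segment_ports and the base port value 6000 are hypothetical, so check your own cluster configuration for the actual values.

```python
def segment_ports(base_port, segments_per_host):
    """Return one listening port per segment instance on a host.

    Assumption for illustration: ports are assigned consecutively,
    starting at the base port.
    """
    return [base_port + i for i in range(segments_per_host)]


# Hypothetical example: base port 6000 and four segment instances on one host.
ports = segment_ports(6000, 4)
print(ports)
```

Each segment instance on the host ends up with a distinct port, which is what lets several instances share one machine.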
An instance has one of two roles: primary or mirror. Primary instances serve database queries. Mirror instances track and record the data changes made on their primary instances, but do not serve database queries. If a primary instance goes down, the corresponding mirror instance transitions to the primary role and starts serving queries, and the original primary instance, which is now down, is marked as the mirror.
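The role swap described above can be sketched as a small Python model. This is a simplified illustration, not how Cloudberry implements failover internally; the Instance class, the fail_over function, and the host names and ports are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    host: str
    port: int
    role: str          # "primary" or "mirror"
    up: bool = True    # whether the instance is currently running


def fail_over(primary, mirror):
    """Return the instance that serves queries, swapping roles if needed."""
    if primary.up:
        return primary                  # nothing to do, the primary is healthy
    if not mirror.up:
        raise RuntimeError("both instances of the pair are down")
    mirror.role = "primary"             # the mirror takes over query serving
    primary.role = "mirror"             # the failed primary is marked as mirror
    return mirror


# Hypothetical primary/mirror pair placed on two different segment hosts.
primary = Instance("sdw1", 6000, "primary")
mirror = Instance("sdw2", 7000, "mirror")

primary.up = False                      # simulate a primary failure
serving = fail_over(primary, mirror)
print(serving.host, serving.role)
print(primary.host, primary.role, "up" if primary.up else "down")
```

After the swap, the instance on sdw2 holds the primary role, while the instance on sdw1 holds the mirror role and stays down until it is recovered.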
The cluster information is stored in the gp_segment_configuration system table, which looks like this (use the "psql" command to connect to the database to execute queries):
[gpadmin@cdw ~]$ psql
psql (14.4, server 14.4)
Type "help" for help.