Lesson 5: Queries and Performance Tuning
This lesson provides an overview of how Apache Cloudberry processes queries. Understanding this process can be useful when you write and tune queries.
Concepts
Users submit queries to Apache Cloudberry as they would to any database management system. They connect to the database instance on the Apache Cloudberry coordinator host using a client application such as psql and submit SQL statements.
Understand query planning and dispatch
The coordinator host receives, parses, and optimizes the query. The resulting query plan is either parallel or targeted. The coordinator dispatches parallel query plans to all segments, as shown in Figure 1. Each segment is responsible for executing local database operations on its own set of data.
Most database operations, such as table scans, joins, aggregations, and sorts, execute across all segments in parallel. Each operation is performed on a segment database independently of the data stored on the other segment databases.
Figure 1. Dispatch the parallel query plan
Understand query plans
A query plan is a set of operations Apache Cloudberry will perform to produce the answer to a query. Each node or step in the plan represents a database operation such as a table scan, join, aggregation or sort. Plans are read and executed from bottom to top.
In addition to common database operations such as table scans and joins, Apache Cloudberry has an additional operation type called "motion". A motion operation involves moving tuples between segments during query processing.
To achieve maximum parallelism during query execution, Apache Cloudberry divides the work of a query plan into slices. A slice is a portion of the plan that segments can work on independently. A query plan is sliced wherever a motion operation occurs in the plan with one slice on each side of the motion.
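For example, joining two tables on a column that is not their distribution key typically requires redistributing rows, so the plan contains a motion and is sliced at that point. The following is a minimal sketch; the table and column names are hypothetical, and the exact plan shape will vary:

-- Hypothetical tables: if o.customer_id is not the distribution key of
-- orders, the optimizer must move tuples between segments to perform the join.
EXPLAIN SELECT c.name, o.amount
FROM customers c JOIN orders o ON c.id = o.customer_id;
-- Each Motion node in the resulting plan ("Redistribute Motion",
-- "Broadcast Motion", or "Gather Motion") marks a slice boundary.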
Understand parallel query execution
Apache Cloudberry creates a number of database processes to handle the work of a query. On the coordinator, the query worker process is called the "query dispatcher" (QD). The QD is responsible for creating and dispatching the query plan; it also accumulates and presents the final results. On the segments, the query worker processes are called "query executors" (QE). A QE is responsible for completing its portion of the work and communicating its intermediate results to the other worker processes.
There is at least one worker process assigned to each slice of the query plan. A worker process works on its assigned portion of the query plan independently. During query execution, each segment will have a number of processes working on the query in parallel.
Related processes that are working on the same slice of the query plan but on different segments are called "gangs". As a portion of work is completed, tuples flow up the query plan from one gang of processes to the next. This inter-process communication between segments is referred to as the interconnect component of Apache Cloudberry.
The following section introduces some of the basic principles of query and performance tuning in Apache Cloudberry.
Some items to consider in performance tuning:
- VACUUM and ANALYZE
- Explain plans
- Indexing
- Column or row orientation
- Set based vs. row based
- Distribution and partitioning
Exercises
Before doing the following exercises, you are expected to have finished the previous tutorial, Lesson 4: Data Loading.
Analyze the tables
Apache Cloudberry uses Multi-version Concurrency Control (MVCC) to guarantee data isolation, one of the ACID properties of relational databases. MVCC allows multiple users of the database to obtain consistent results for a query, even if the data is changing as the query is being executed. There can be multiple versions of rows in the database, but a query sees a snapshot of the database at a single point in time, containing only the versions of rows that are valid at that point in time. When a row is updated or deleted and no active transactions continue to reference it, it can be removed. The VACUUM command removes older versions that are no longer needed, leaving free space that can be reused.
In Apache Cloudberry, regular OLTP operations do not create the need for vacuuming out old rows, but loading data while tables are in use might create such a need. It is a best practice to VACUUM a table after a load. If the table is partitioned, and only a single partition is being altered, then a VACUUM on that partition might suffice.
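For example, after the bulk load in Lesson 4, the post-load maintenance for the load table might look like this sketch (adjust the table name to whatever you loaded):

tutorial=# VACUUM faa.faa_otp_load;
tutorial=# ANALYZE faa.faa_otp_load;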
The VACUUM FULL command behaves quite differently from VACUUM, and its use is not recommended in Apache Cloudberry. It can be expensive in CPU and I/O consumption, cause bloat in indexes, and lock data for long periods of time.
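If a table does become badly bloated, a common alternative to VACUUM FULL in Greenplum-derived systems is to rewrite the table in place. A hedged sketch; verify that your Cloudberry release supports this ALTER TABLE form:

tutorial=# ALTER TABLE faa.faa_otp_load SET WITH (REORGANIZE=true);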
The ANALYZE command generates statistics about the distribution of data in a table. In particular, it stores histograms about the values in each of the columns. The query optimizer depends on these statistics to select the best plan for executing a query. For example, the optimizer can use distribution data to decide on join orders. One of the optimizer's goals in a join is to minimize the volume of data that must be analyzed and potentially moved between segments by using the statistics to choose the smallest result set to work with first.
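If you want to inspect the statistics that ANALYZE collects, they are exposed through the standard PostgreSQL pg_stats view; for example, for one of the tables analyzed below:

tutorial=# SELECT attname, n_distinct, most_common_vals
tutorial-# FROM pg_stats
tutorial-# WHERE schemaname = 'faa' AND tablename = 'otp_r';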
- Connect to the database as gpadmin and run the ANALYZE command on each of the tables:
$ psql -U gpadmin tutorial
tutorial=# ANALYZE faa.d_airports;
ANALYZE
tutorial=# ANALYZE faa.d_airlines;
ANALYZE
tutorial=# ANALYZE faa.d_wac;
ANALYZE
tutorial=# ANALYZE faa.d_cancellation_codes;
ANALYZE
tutorial=# ANALYZE faa.faa_otp_load;
ANALYZE
tutorial=# ANALYZE faa.otp_r;
ANALYZE
tutorial=# ANALYZE faa.otp_c;
ANALYZE
View EXPLAIN plans
An EXPLAIN plan explains the method the optimizer has chosen to produce a result set. Depending on the query, there can be a variety of methods to produce a result set. The optimizer calculates the cost for each method and chooses the one with the lowest cost. In large queries, cost is generally measured by the amount of I/O to be performed.
An EXPLAIN plan does not do any actual query processing work. Explain plans use statistics generated by the ANALYZE command, so plans generated before and after running ANALYZE can be quite different. This is especially true for queries with multiple joins, because the order of the joins can have a great impact on performance.
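To see the effect yourself, you can compare the row-count estimates in a plan before and after collecting statistics. A sketch, using this tutorial's fact table (on a freshly loaded, never-analyzed table, the first plan's estimates come from defaults):

tutorial=# EXPLAIN SELECT * FROM faa.otp_r WHERE flightdate > '2010-01-01';
tutorial=# ANALYZE faa.otp_r;
tutorial=# EXPLAIN SELECT * FROM faa.otp_r WHERE flightdate > '2010-01-01';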
In the following exercise, you will generate a small table that you can query, and view some explain plans.
- Enable timing so that you can see the effects of different performance tuning measures.
tutorial=# \timing on
- View the create_sample_table.sql script, and then run it.
tutorial=# \i create_sample_table.sql
psql:create_sample_table.sql:1: NOTICE: table "sample" does not exist, skipping
DROP TABLE
Time: 8.996 ms
SET
Time: 0.509 ms
CREATE TABLE
Time: 20.419 ms
INSERT 0 1000000
Time: 28598.022 ms (00:28.598)
UPDATE 1000000
Time: 5176.394 ms (00:05.176)
UPDATE 50000
Time: 408.038 ms
UPDATE 1000000
Time: 3148.945 ms (00:03.149)
- Request the explain plan for the COUNT() aggregate.
tutorial=# EXPLAIN SELECT COUNT(*) FROM sample WHERE id > 100;
                                     QUERY PLAN
------------------------------------------------------------------------------------
 Aggregate  (cost=0.00..431.00 rows=1 width=8)
   ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=0.00..431.00 rows=1 width=1)
         ->  Seq Scan on sample  (cost=0.00..431.00 rows=1 width=1)
               Filter: (id > 100)
 Optimizer: Pivotal Optimizer (GPORCA)
(5 rows)
Time: 5.635 ms

Read query plans from bottom to top. In this example, there are three steps. First, there is a sequential scan on each segment server to access the rows, applying the filter id > 100. Then the qualifying rows are gathered from the segments to a single location. Finally, the rows are aggregated to produce the final count.
The cost number on each step has a start and stop value. For the sequential scan, this begins at zero and goes until 431.00. The cost is an arbitrary measure created by the optimizer; it is not a number of seconds or I/O operations.
The cost numbers are cumulative, so the cost for the second operation includes the cost for the first operation. Notice that nearly all the time to process this query is in the sequential scan.
- The EXPLAIN ANALYZE command actually runs the query (without returning the result set). In addition to the optimizer's cost estimates, it shows the actual execution times and row counts. It also produces some memory and I/O statistics.
tutorial=# EXPLAIN ANALYZE SELECT COUNT(*) FROM sample WHERE id > 100;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=0.00..463.54 rows=1 width=8) (actual time=329.600..329.602 rows=1 loops=1)
   ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=0.00..463.54 rows=1 width=8) (actual time=325.897..329.586 rows=2 loops=1)
         ->  Partial Aggregate  (cost=0.00..463.54 rows=1 width=8) (actual time=324.713..324.716 rows=1 loops=1)
               ->  Seq Scan on sample  (cost=0.00..463.54 rows=499954 width=1) (actual time=30.992..296.384 rows=500184 loops=1)
                     Filter: (id > 100)
                     Rows Removed by Filter: 53
 Planning Time: 5.192 ms
   (slice0)    Executor memory: 37K bytes.
   (slice1)    Executor memory: 122K bytes avg x 2 workers, 122K bytes max (seg0).
 Memory used:  128000kB
 Optimizer: Pivotal Optimizer (GPORCA)
 Execution Time: 338.004 ms
(12 rows)
Time: 343.866 ms
Change optimizers
By default, the sandbox instance enables the Pivotal Query Optimizer (GPORCA), and the EXPLAIN output lists "Pivotal Optimizer (GPORCA)" on the Optimizer line. When GPORCA is disabled, queries fall back to the legacy planner, which the EXPLAIN output lists as "Postgres query optimizer".
- Check whether the Pivotal Query Optimizer is enabled.
$ gpconfig -s optimizer
Values on all segments are consistent
GUC              : optimizer
Coordinator value: on
Segment value    : on
- Disable the Pivotal Query Optimizer.
$ gpconfig -c optimizer -v off --coordinatoronly
20230726:14:42:31:031343 gpconfig:cdw:gpadmin-[INFO]:-completed successfully with parameters '-c optimizer -v off --coordinatoronly'
- Reload the configuration on coordinator and segment instances.
$ gpstop -u
20230726:14:42:49:031465 gpstop:cdw:gpadmin-[INFO]:-Starting gpstop with args: -u
20230726:14:42:49:031465 gpstop:cdw:gpadmin-[INFO]:-Gathering information and validating the environment...
20230726:14:42:49:031465 gpstop:cdw:gpadmin-[INFO]:-Obtaining Cloudberry Coordinator catalog information
20230726:14:42:49:031465 gpstop:cdw:gpadmin-[INFO]:-Obtaining Segment details from coordinator...
20230726:14:42:49:031465 gpstop:cdw:gpadmin-[INFO]:-Cloudberry Version: 'postgres (Apache Cloudberry) 1.0.0 build dev'
20230726:14:42:49:031465 gpstop:cdw:gpadmin-[INFO]:-Signalling all postmaster processes to reload
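gpconfig changes the setting for the whole cluster. For quick experiments, you can also toggle the same GUC for just your session with SET; this is a convenience sketch and assumes your role is allowed to change the setting:

tutorial=# SET optimizer = off;  -- this session now uses the Postgres planner
tutorial=# SET optimizer = on;   -- back to the Pivotal Query Optimizer (GPORCA)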
Indexes and performance
Apache Cloudberry does not depend upon indexes to the same degree as traditional data warehouse systems. Because the segments execute table scans in parallel, each segment scanning a small part of the table, the traditional performance advantage from indexes is diminished. Indexes also consume large amounts of space and require considerable CPU time to maintain during data loads. There are, however, times when indexes are useful, especially for highly selective queries. When a query looks up a single row, an index can dramatically improve performance.
In this exercise, you work with the legacy (Postgres) optimizer to see how an index can improve performance. You first run a single-row lookup on the sample table without an index, then rerun the query after creating an index.
tutorial=# SELECT * FROM sample WHERE big = 12345;
id | big | wee | stuff
-------+-------+-----+-------
12345 | 12345 | 0 |
(1 row)
Time: 251.304 ms
tutorial=#
tutorial=# EXPLAIN SELECT * FROM sample WHERE big = 12345;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Gather Motion 2:1  (slice1; segments: 2)  (cost=0.00..8552.02 rows=1 width=15)
   ->  Seq Scan on sample  (cost=0.00..8552.00 rows=1 width=15)
         Filter: (big = 12345)
 Optimizer: Postgres query optimizer
(4 rows)
Time: 0.709 ms
tutorial=#
tutorial=#
tutorial=# CREATE INDEX sample_big_index ON sample(big);
CREATE INDEX
Time: 1574.117 ms (00:01.574)
tutorial=#
tutorial=#
tutorial=# SELECT * FROM sample WHERE big = 12345;
id | big | wee | stuff
-------+-------+-----+-------
12345 | 12345 | 0 |
(1 row)
Time: 2.774 ms
tutorial=# EXPLAIN SELECT * FROM sample WHERE big = 12345;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Gather Motion 2:1  (slice1; segments: 2)  (cost=0.17..8.21 rows=1 width=15)
   ->  Index Scan using sample_big_index on sample  (cost=0.17..8.19 rows=1 width=15)
         Index Cond: (big = 12345)
 Optimizer: Postgres query optimizer
(4 rows)
Time: 0.627 ms
Notice the difference in timing between the single-row SELECT with and without the index. The difference would have been much greater for a larger table, because indexes improve performance for queries on large datasets. Note that even when an index exists, the optimizer might choose not to use it if another, more efficient plan is available.
View the following EXPLAIN plans to compare plans for some other common types of queries.
tutorial=# EXPLAIN SELECT * FROM sample WHERE big = 12345;
tutorial=# EXPLAIN SELECT * FROM sample WHERE big > 12345;
tutorial=# EXPLAIN SELECT * FROM sample WHERE big = 12345 OR big = 12355;
tutorial=# DROP INDEX sample_big_index;
tutorial=# EXPLAIN SELECT * FROM sample WHERE big = 12345 OR big = 12355;
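If the planner keeps preferring a sequential scan and you want to see the index plan for comparison, you can temporarily discourage sequential scans with the standard PostgreSQL planner setting enable_seqscan. A sketch (recreate the index first, since it was dropped above):

tutorial=# CREATE INDEX sample_big_index ON sample(big);
tutorial=# SET enable_seqscan = off;
tutorial=# EXPLAIN SELECT * FROM sample WHERE big > 12345;
tutorial=# RESET enable_seqscan;
tutorial=# DROP INDEX sample_big_index;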
Row vs. column orientation
Apache Cloudberry offers the ability to store a table in either row or column orientation. Both storage options have advantages, depending upon data compression characteristics, the kinds of queries executed, the row length, and the complexity and number of join columns.
As a general rule, very wide tables are better stored in row orientation, especially if there are joins on many columns. Column orientation works well to save space with compression and to reduce I/O when there is much duplicated data in columns.
In this exercise, you will create a column-oriented version of the fact table and compare it with the row-oriented version.
- Create a column-oriented version of the FAA On Time Performance fact table and insert the data from the row-oriented version.
tutorial=# CREATE TABLE FAA.OTP_C (LIKE faa.otp_r) WITH (appendonly=true,
orientation=column)
DISTRIBUTED BY (UniqueCarrier, FlightNum) PARTITION BY RANGE(FlightDate)
( PARTITION mth START('2009-06-01'::date) END ('2010-10-31'::date)
EVERY ('1 mon'::interval));
CREATE TABLE
tutorial=#
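The insert from the row-oriented table described in this step might then look like the following sketch, with a quick row-count check to confirm the copy:

tutorial=# INSERT INTO faa.otp_c SELECT * FROM faa.otp_r;
tutorial=# SELECT COUNT(*) FROM faa.otp_c;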