Group By and Having Clauses
After passing the WHERE
filter, the derived input table might be subject to grouping, using the GROUP BY
clause, and elimination of group rows using the HAVING
clause.
SELECT select_list
FROM ...
[WHERE ...]
GROUP BY grouping_column_reference [, grouping_column_reference]...
The GROUP BY
clause is used to group together those rows in a table that have the same values in all the columns listed. The order in which the columns are listed does not matter. The effect is to combine each set of rows having common values into one group row that represents all rows in the group. This is done to eliminate redundancy in the output and/or compute aggregates that apply to these groups. For instance:
=> SELECT * FROM test1;
x | y
---+---
a | 3
c | 2
b | 5
a | 1
(4 rows)
=> SELECT x FROM test1 GROUP BY x;
x
---
a
b
c
(3 rows)
In the second query, we could not have written SELECT * FROM test1 GROUP BY x
, because there is no single value for the column y that could be associated with each group. The grouped-by columns can be referenced in the select list because they have a single value in each group.
In general, if a table is grouped, columns that are not listed in GROUP BY
cannot be referenced except in aggregate expressions. An example with aggregate expressions is:
=> SELECT x, sum(y) FROM test1 GROUP BY x;
x | sum
---+-----
a | 4
b | 5
c | 2
(3 rows)
Here sum is an aggregate function that computes a single value over the entire group.
Grouping without aggregate expressions effectively calculates the set of distinct values in a column. This can also be achieved using the DISTINCT
clause.
Here is another example: it calculates the total sales for each product (rather than the total sales of all products):
SELECT product_id, p.name, (sum(s.units) * p.price) AS sales
FROM products p LEFT JOIN sales s USING (product_id)
GROUP BY product_id, p.name, p.price;
In this example, the columns product_id
, p.name
, and p.price
must be in the GROUP BY
clause because they are referenced in the query select list (but see below). The column s.units
does not have to be in the GROUP BY
list because it is only used in an aggregate expression (sum(...)
), which represents the sales of a product. For each product, the query returns a summary row about all sales of the product.
If the products table is set up so that, say, product_id
is the primary key, then it would be enough to group by product_id
in the above example, because name and price would be functionally dependent on the product ID, and so there would be no ambiguity about which name and price value to return for each product ID group.
In strict SQL, GROUP BY
can only group by columns of the source table but Apache Cloudberry extends this to also allow GROUP BY
to group by columns in the select list. Grouping by value expressions instead of simple column names is also allowed.
If a table has been grouped using GROUP BY
, but only certain groups are of interest, the HAVING
clause can be used, much like a WHERE
clause, to eliminate groups from the result. The syntax is:
SELECT select_list FROM ... [WHERE ...] GROUP BY ... HAVING boolean_expression
Expressions in the HAVING
clause can refer both to grouped expressions and to ungrouped expressions (which necessarily involve an aggregate function).
Example:
=> SELECT x, sum(y) FROM test1 GROUP BY x HAVING sum(y) > 3;
x | sum
---+-----
a | 4
b | 5
(2 rows)
=> SELECT x, sum(y) FROM test1 GROUP BY x HAVING x < 'c';
x | sum
---+-----
a | 4
b | 5
(2 rows)
Again, a more realistic example:
SELECT product_id, p.name, (sum(s.units) * (p.price - p.cost)) AS profit
FROM products p LEFT JOIN sales s USING (product_id)
WHERE s.date > CURRENT_DATE - INTERVAL '4 weeks'
GROUP BY product_id, p.name, p.price, p.cost
HAVING sum(p.price * s.units) > 5000;
In the example above, the WHERE
clause is selecting rows by a column that is not grouped (the expression is only true for sales during the last four weeks), while the HAVING
clause restricts the output to groups with total gross sales over 5000. Note that the aggregate expressions do not necessarily need to be the same in all parts of the query.
If a query contains aggregate function calls, but no GROUP BY
clause, grouping still occurs: the result is a single group row (or perhaps no rows at all, if the single row is then eliminated by HAVING
). The same is true if it contains a HAVING
clause, even without any aggregate function calls or GROUP BY
clause.