Created 2024/12/08 at 10:44PM
Last Modified 2025/01/04 at 01:01AM
I have often seen people forget about a powerful SQL clause, PARTITION BY, and end up writing queries using GROUP BY and joins, which are slower and harder to read. Let's do a quick run-through of GROUP BY and PARTITION BY first.
GROUP BY collapses multiple rows into a single row for each unique combination of the grouped columns, and is used with aggregation functions (such as COUNT(), SUM(), MIN() etc.).
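For instance, a minimal sketch against the executions table introduced below; millions of execution rows collapse into one summary row per exchange:
SELECT exchange, COUNT(*) AS executions_count, SUM(quantity) AS total_quantity
FROM executions
GROUP BY exchange;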
Database engines typically implement GROUP BY using either a sort-based approach or a hash-based approach. And when you need the aggregates reported alongside row-level details (exactly where PARTITION BY could be your friend), you often have to either pre-compute the aggregate and join it back, or fall back to a correlated subquery.
PARTITION BY does not collapse rows; it produces a value for each row based on a subset ("partition") of the data. It is used with window functions (such as SUM(...) OVER (PARTITION BY ...), AVG(...) OVER (PARTITION BY ...) etc.) which are applied over each subset.
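Contrast this with the GROUP BY sketch above: the window query below returns every execution row, each annotated with its exchange's total, instead of one summary row per exchange.
SELECT id, exchange, quantity,
       SUM(quantity) OVER (PARTITION BY exchange) AS exchange_total_quantity
FROM executions;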
Database engines likewise follow either a sort-based approach or a hash-based approach to split the data into partitions for PARTITION BY. Let's consider the following example to analyze the difference between the two and the power of the PARTITION BY clause.
We have an executions table with millions of records, and its schema looks like this:
> DESC executions;
+----------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+----------------+
| id | int | NO | PRI | NULL | auto_increment |
| exchange | varchar(16) | NO | | NULL | |
| ticker | varchar(16) | NO | | NULL | |
| execution_date | date | NO | | NULL | |
| quantity | int unsigned | NO | | NULL | |
+----------------+--------------+------+-----+---------+----------------+
> SELECT * FROM executions LIMIT 5;
+----+----------+--------+----------------+----------+
| id | exchange | ticker | execution_date | quantity |
+----+----------+--------+----------------+----------+
| 1 | NYSE | FSM | 2024-09-01 | 637 |
| 2 | LSE | RKLB | 2024-09-01 | 880 |
| 3 | CSE | APPF | 2024-09-01 | 108 |
| 4 | NYSE | DLB | 2024-09-01 | 6 |
| 5 | JPX | CHWY | 2024-09-01 | 692 |
+----+----------+--------+----------------+----------+
Requirement: Calculate the daily total quantity traded for each ticker, and report the results alongside the individual row details.
First, the GROUP BY approach: we calculate the aggregate and join it back to our original table.
SELECT
e.*,
t.daily_ticker_quantity
FROM executions e
JOIN (
SELECT ticker, execution_date, SUM(quantity) AS daily_ticker_quantity
FROM executions
GROUP BY ticker, execution_date
) t ON e.ticker = t.ticker AND e.execution_date = t.execution_date;
# 9540371 rows in set (42.51 sec) [average over 10 runs]
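The correlated-subquery alternative mentioned earlier would look like the sketch below; it asks the engine to recompute the sum for every row and is typically even slower, so it is not timed here.
SELECT e.*,
       (SELECT SUM(quantity)
          FROM executions
         WHERE ticker = e.ticker
           AND execution_date = e.execution_date) AS daily_ticker_quantity
FROM executions e;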
Now the PARTITION BY approach, which expresses the same thing in a single query:
SELECT *,
SUM(quantity) OVER (
PARTITION BY ticker, execution_date
) AS daily_ticker_quantity
FROM executions;
# 9540371 rows in set (41.67 sec) [average over 10 runs]
So we see that PARTITION BY is slightly faster, and is easier on the eyes to read and understand.
Let's say the database engine follows a hash-based approach, with N input rows and M resulting groups.
The GROUP BY + join approach takes O(N) on average to build and populate the grouping hash table (assuming good hashing and enough memory), which produces the M aggregate rows. The join back to the original table then requires building a hash table over those M grouped rows and scanning the N rows to probe it, which is O(M + N). The space complexity is O(M) for the grouping hash table and O(M) for the join hash table, both at most O(N) since M ≤ N.
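The actual value of M here is easy to check with a one-off query (MySQL's COUNT(DISTINCT ...) accepts multiple expressions):
SELECT COUNT(DISTINCT ticker, execution_date) AS m FROM executions;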
The PARTITION BY query would require only O(N) on average: a single pass to build the partitions via hashing and compute the sums. The space complexity is O(N), since the engine buffers each partition's rows in order to emit every input row with its partition's sum.
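Whether your engine actually hashes or sorts is visible in the query plan. A sketch for MySQL 8.0+ (the exact plan output varies by version and data; in practice MySQL tends to sort rows by the partition keys for window functions):
EXPLAIN FORMAT=TREE
SELECT *,
       SUM(quantity) OVER (
           PARTITION BY ticker, execution_date
       ) AS daily_ticker_quantity
FROM executions;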