Created 2024/12/08 at 10:44PM

Last Modified 2025/01/04 at 01:01AM

I have often seen people forget about PARTITION BY, a powerful SQL clause, and instead write queries using GROUP BY and joins, which are slower and clunkier to read.

Let's quickly go through GROUP BY and PARTITION BY first.

GROUP BY

What is it?

GROUP BY collapses multiple rows into a single row for each unique combination of the grouped columns, and is used with aggregate functions (such as COUNT(), SUM(), MIN(), etc.).

How it works?

The engine assigns every row to a group keyed by the grouped columns (typically via hashing or sorting), runs the aggregate function within each group, and emits one output row per group.

Characteristics

- Reduces the row count: N input rows become M output rows, one per unique group.
- Every selected column must either appear in the GROUP BY list or sit inside an aggregate function.
- Individual row details are lost in the output.
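As a tiny illustration of the collapsing behaviour, sketched here with Python's built-in sqlite3 module (the table and values are invented for the demo; the SQL itself is standard):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (ticker TEXT, quantity INT)")
conn.executemany("INSERT INTO trades VALUES (?, ?)",
                 [("AAPL", 10), ("AAPL", 20), ("MSFT", 5)])

# Three input rows collapse into two output rows, one per ticker.
rows = conn.execute(
    "SELECT ticker, SUM(quantity) FROM trades GROUP BY ticker ORDER BY ticker"
).fetchall()
print(rows)  # [('AAPL', 30), ('MSFT', 5)]
```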

PARTITION BY

What is it?

PARTITION BY does not collapse rows; it produces a value for each row based on a subset ("partition") of the data. It is used with window functions (such as SUM(...) OVER (PARTITION BY ...), AVG(...) OVER (PARTITION BY ...), etc.), which are applied over each partition.

How it works?

The engine splits the rows into partitions keyed by the partitioning columns, evaluates the window function over each partition, and attaches the result to every row, so the output has exactly as many rows as the input.

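The same toy data through a window function (again with Python's sqlite3; SQLite needs version 3.25+ for window functions): every row survives, annotated with its partition's total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (ticker TEXT, quantity INT)")
conn.executemany("INSERT INTO trades VALUES (?, ?)",
                 [("AAPL", 10), ("AAPL", 20), ("MSFT", 5)])

# All three rows are kept; each carries the SUM over its ticker partition.
rows = conn.execute(
    "SELECT ticker, quantity, SUM(quantity) OVER (PARTITION BY ticker) "
    "FROM trades ORDER BY ticker, quantity"
).fetchall()
print(rows)  # [('AAPL', 10, 30), ('AAPL', 20, 30), ('MSFT', 5, 5)]
```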
Let's consider the following example to analyze the difference between the two and the power of the PARTITION BY clause.

We have an executions table with millions of records, and its schema looks like this:

> DESC executions;
+----------------+--------------+------+-----+---------+----------------+
| Field          | Type         | Null | Key | Default | Extra          |
+----------------+--------------+------+-----+---------+----------------+
| id             | int          | NO   | PRI | NULL    | auto_increment |
| exchange       | varchar(16)  | NO   |     | NULL    |                |
| ticker         | varchar(16)  | NO   |     | NULL    |                |
| execution_date | date         | NO   |     | NULL    |                |
| quantity       | int unsigned | NO   |     | NULL    |                |
+----------------+--------------+------+-----+---------+----------------+

> SELECT * FROM executions LIMIT 5;
+----+----------+--------+----------------+----------+
| id | exchange | ticker | execution_date | quantity |
+----+----------+--------+----------------+----------+
|  1 | NYSE     | FSM    | 2024-09-01     |      637 |
|  2 | LSE      | RKLB   | 2024-09-01     |      880 |
|  3 | CSE      | APPF   | 2024-09-01     |      108 |
|  4 | NYSE     | DLB    | 2024-09-01     |        6 |
|  5 | JPX      | CHWY   | 2024-09-01     |      692 |
+----+----------+--------+----------------+----------+

Requirement: For each ticker, calculate the daily total quantity traded and report it alongside the individual row details.

Solving using GROUP BY

We calculate the aggregate in a subquery and join it back to the original table:

SELECT
    e.*,
    t.daily_ticker_quantity
FROM executions e
JOIN (
    SELECT ticker, execution_date, SUM(quantity) AS daily_ticker_quantity
    FROM executions
    GROUP BY ticker, execution_date
) t ON e.ticker = t.ticker AND e.execution_date = t.execution_date;

# 9540371 rows in set (42.51 sec) [average over 10 runs]

Solving using PARTITION BY

SELECT *,
       SUM(quantity) OVER (
              PARTITION BY ticker, execution_date
       ) AS daily_ticker_quantity
FROM executions;

# 9540371 rows in set (41.67 sec) [average over 10 runs]

So we see that PARTITION BY is slightly faster, and noticeably easier to read and understand.
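To sanity-check that the two queries really are interchangeable, here is a small reproduction with Python's sqlite3 on a four-row stand-in for the executions table (the data is invented; SQLite 3.25+ is needed for the window function):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE executions (
    id INTEGER PRIMARY KEY, exchange TEXT, ticker TEXT,
    execution_date TEXT, quantity INT)""")
conn.executemany(
    "INSERT INTO executions (exchange, ticker, execution_date, quantity) "
    "VALUES (?, ?, ?, ?)",
    [("NYSE", "FSM", "2024-09-01", 637),
     ("NYSE", "FSM", "2024-09-01", 100),
     ("LSE", "RKLB", "2024-09-01", 880),
     ("LSE", "RKLB", "2024-09-02", 10)])

group_by_sql = """
SELECT e.*, t.daily_ticker_quantity
FROM executions e
JOIN (SELECT ticker, execution_date,
             SUM(quantity) AS daily_ticker_quantity
      FROM executions
      GROUP BY ticker, execution_date) t
  ON e.ticker = t.ticker AND e.execution_date = t.execution_date
ORDER BY e.id"""

partition_by_sql = """
SELECT *, SUM(quantity) OVER (
    PARTITION BY ticker, execution_date) AS daily_ticker_quantity
FROM executions
ORDER BY id"""

a = conn.execute(group_by_sql).fetchall()
b = conn.execute(partition_by_sql).fetchall()
assert a == b  # both plans produce identical rows
print(a[0])    # (1, 'NYSE', 'FSM', '2024-09-01', 637, 737)
```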

Time & Space Complexity Analysis

Let's say that the database engine follows a hash-based approach.

GROUP BY

Building and populating the hash table takes O(N) on average (assuming good hashing and enough memory) and produces M aggregate rows. The join back to the original table then builds a hash table over those M grouped rows and scans all N rows to probe it, for another O(M + N). Space is O(M) for the grouping hash table and O(M) for the join's build side; since M ≤ N, both are bounded by O(N).

PARTITION BY

It also takes O(N) on average: the rows are hashed into partitions and the sums are computed in what amounts to a single pass, with no separate join step. Space is again O(N), since every input row must be kept around to be emitted alongside its window value.
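The counting above can be sketched in plain Python, with dicts standing in for the engine's hash tables (a conceptual sketch only, not how any real engine is implemented):

```python
from collections import defaultdict

rows = [("FSM", "2024-09-01", 637), ("FSM", "2024-09-01", 100),
        ("RKLB", "2024-09-01", 880)]          # N = 3 input rows

# Pass 1 (shared by both plans): O(N) to hash every row into its group,
# leaving M entries in the table (here M = 2).
totals = defaultdict(int)
for ticker, date, qty in rows:
    totals[(ticker, date)] += qty

# GROUP BY + join: the M aggregate rows are materialized as a separate
# result, then the join probes them once per original row -- O(M + N).
aggregate_rows = dict(totals)                 # O(M) extra space
joined = [(t, d, q, aggregate_rows[(t, d)]) for t, d, q in rows]

# PARTITION BY: the engine streams the rows back out with the window
# value attached -- the same O(N) probe, but no separate join operator.
windowed = [(t, d, q, totals[(t, d)]) for t, d, q in rows]

assert joined == windowed                     # identical output
```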