Created 2024/12/08 at 10:44PM
Last Modified 2025/01/04 at 01:01AM
I have often seen people forget about a powerful SQL clause, PARTITION BY, and end up writing queries using GROUP BY and joins, which are slower and harder to read. Let's do a quick run-through of GROUP BY and PARTITION BY first.
GROUP BY collapses multiple rows into a single row for each unique combination of the grouped columns, and is used with aggregation functions (such as COUNT(), SUM(), MIN() etc.).
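For instance, a minimal sketch against the executions table introduced below; millions of execution rows collapse into one summary row per exchange:
SELECT exchange, COUNT(*) AS executions_count, SUM(quantity) AS total_quantity
FROM executions
GROUP BY exchange;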
Database engines typically implement GROUP BY using either a sort-based approach or a hash-based approach. And when you need the aggregates reported alongside row-level details (exactly where PARTITION BY could be your friend), you often have to either pre-compute the aggregate and join it back, or fall back to a correlated subquery.
PARTITION BY does not collapse rows; it produces a value for each row based on a subset ("partition") of the data. It is used with window functions (such as SUM(...) OVER (PARTITION BY ...), AVG(...) OVER (PARTITION BY ...) etc.) which are applied over each subset.
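Contrast this with the GROUP BY sketch above: the window query below returns every execution row, each annotated with its exchange's total, instead of one summary row per exchange.
SELECT id, exchange, quantity,
       SUM(quantity) OVER (PARTITION BY exchange) AS exchange_total_quantity
FROM executions;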
Database engines likewise follow either a sort-based approach or a hash-based approach to split the data into partitions for PARTITION BY. Let's consider the following example to analyze the difference between the two and the power of the PARTITION BY clause.
We have an executions table with millions of records, and its schema looks like this:
> DESC executions;
+----------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+----------------+
| id | int | NO | PRI | NULL | auto_increment |
| exchange | varchar(16) | NO | | NULL | |
| ticker | varchar(16) | NO | | NULL | |
| execution_date | date | NO | | NULL | |
| quantity | int unsigned | NO | | NULL | |
+----------------+--------------+------+-----+---------+----------------+
> SELECT * FROM executions LIMIT 5;
+----+----------+--------+----------------+----------+
| id | exchange | ticker | execution_date | quantity |
+----+----------+--------+----------------+----------+
| 1 | NYSE | FSM | 2024-09-01 | 637 |
| 2 | LSE | RKLB | 2024-09-01 | 880 |
| 3 | CSE | APPF | 2024-09-01 | 108 |
| 4 | NYSE | DLB | 2024-09-01 | 6 |
| 5 | JPX | CHWY | 2024-09-01 | 692 |
+----+----------+--------+----------------+----------+
Requirement: Calculate the daily total quantity traded for each ticker, and report the results alongside the individual row details.
First, the GROUP BY approach: we calculate the aggregate and join it back to our original table.
SELECT
e.*,
t.daily_ticker_quantity
FROM executions e
JOIN (
SELECT ticker, execution_date, SUM(quantity) AS daily_ticker_quantity
FROM executions
GROUP BY ticker, execution_date
) t ON e.ticker = t.ticker AND e.execution_date = t.execution_date;
# 9540371 rows in set (42.51 sec) [average over 10 runs]
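The correlated-subquery alternative mentioned earlier would look like the sketch below; it asks the engine to recompute the sum for every row and is typically even slower, so it is not timed here.
SELECT e.*,
       (SELECT SUM(quantity)
          FROM executions
         WHERE ticker = e.ticker
           AND execution_date = e.execution_date) AS daily_ticker_quantity
FROM executions e;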
Now the PARTITION BY approach, which expresses the same thing in a single query:
SELECT *,
SUM(quantity) OVER (
PARTITION BY ticker, execution_date
) AS daily_ticker_quantity
FROM executions;
# 9540371 rows in set (41.67 sec) [average over 10 runs]
So we see that PARTITION BY is slightly faster, and is easier on the eyes to read and understand.
Let's say the database engine follows a hash-based approach, with N input rows and M resulting groups.
The GROUP BY + join approach takes O(N) on average to build and populate the grouping hash table (assuming good hashing and enough memory), which produces the M aggregate rows. The join back to the original table then requires building a hash table over those M grouped rows and scanning the N rows to probe it, which is O(M + N). The space complexity is O(M) for the grouping hash table and O(M) for the join hash table, both at most O(N) since M ≤ N.
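The actual value of M here is easy to check with a one-off query (MySQL's COUNT(DISTINCT ...) accepts multiple expressions):
SELECT COUNT(DISTINCT ticker, execution_date) AS m FROM executions;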
The PARTITION BY query would require only O(N) on average: a single pass to build the partitions via hashing and compute the sums. The space complexity is O(N), since the engine buffers each partition's rows in order to emit every input row with its partition's sum.
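Whether your engine actually hashes or sorts is visible in the query plan. A sketch for MySQL 8.0+ (the exact plan output varies by version and data; in practice MySQL tends to sort rows by the partition keys for window functions):
EXPLAIN FORMAT=TREE
SELECT *,
       SUM(quantity) OVER (
           PARTITION BY ticker, execution_date
       ) AS daily_ticker_quantity
FROM executions;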