SQL - Getting oldest/youngest from associated table
Goal: Getting the most recently changed record of a “child” table for each record of a “parent” table.
This is, by no means, advanced SQL wizardry. Still, it may be handy to some people. By the way, this post uses PostgreSQL but it would be basically the same for MySQL and others.
For the example I’ll use the concept of API and API Version. Essentially, we want to build a query that returns every API together with its most recently updated version (another example could be to get users and their most recently updated order).
Here’s our sample schema and data:
CREATE TABLE apis (
    id SERIAL PRIMARY KEY,
    name TEXT UNIQUE,
    updated_at TIMESTAMP DEFAULT now()
);
CREATE TABLE api_versions (
    id SERIAL PRIMARY KEY,
    api_id INTEGER REFERENCES apis(id),
    title TEXT,
    version TEXT,
    updated_at TIMESTAMP DEFAULT now()
);
INSERT INTO apis (name) VALUES
    ('petstore'),
    ('httpbin'),
    ('futurama');
INSERT INTO api_versions (api_id, title, version, updated_at) VALUES
    (1, 'pets', 'v1.0.0', '2019-11-03 21:00'),
    (1, 'pets', 'v2.0.0', '2020-11-08 21:00'),
    (2, 'httbin v3', '3.10.1342', '2020-12-02 13:00'),
    (2, 'httbin v3', '4.01.0002', '2019-07-01 13:00'),
    (3, 'futurama v10', '10', '2018-01-31 09:00'),
    (3, 'futurama v11', '11', '2020-06-24 09:00'),
    (3, 'futurama v12', '12', '2019-02-21 09:00');
And the expected result:
 id |   name   |    title     |  version  |     updated_at
----+----------+--------------+-----------+---------------------
  1 | petstore | pets         | v2.0.0    | 2020-11-08 21:00:00
  2 | httpbin  | httbin v3    | 3.10.1342 | 2020-12-02 13:00:00
  3 | futurama | futurama v11 | 11        | 2020-06-24 09:00:00
WHERE Subquery
Straightforward approach. Makes use of a WHERE clause to keep only the api_versions rows whose updated_at equals the maximum updated_at for that API.
A potential problem arises if two versions of the same API have the same update timestamp: the query then returns both rows for that API. Highly unlikely but still worth considering. If all we want is some most recent version, it doesn’t really matter which of the two we end up using. This may not be the case in other situations.
It would probably make sense to add an index on api_versions.updated_at.
SELECT [select_fields]
FROM apis
JOIN api_versions ON apis.id = api_versions.api_id
WHERE api_versions.updated_at = (
    SELECT MAX(v.updated_at)
    FROM api_versions v
    WHERE v.api_id = apis.id
);
And the query execution plan:
Hash Join (cost=38.25..787.00 rows=4 width=112)
  Hash Cond: ((api_versions.api_id = apis.id) AND (api_versions.updated_at = (SubPlan 1)))
  -> Seq Scan on api_versions (cost=0.00..17.50 rows=750 width=80)
  -> Hash (cost=21.30..21.30 rows=1130 width=36)
       -> Seq Scan on apis (cost=0.00..21.30 rows=1130 width=36)
  SubPlan 1
    -> Aggregate (cost=19.39..19.40 rows=1 width=8)
         -> Seq Scan on api_versions v (cost=0.00..19.38 rows=4 width=8)
              Filter: (api_id = apis.id)
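To try the WHERE subquery against the sample data without a database server, here is a minimal sketch using Python's built-in sqlite3 module. Several things in it are assumptions rather than part of the original setup: SQLite stands in for PostgreSQL, INTEGER PRIMARY KEY replaces SERIAL, timestamps are stored as text, the select list is spelled out to match the expected result, and the index name is made up. The index covers (api_id, updated_at) rather than updated_at alone, on the assumption that the subquery filters on api_id first (the Seq Scan with that filter in SubPlan 1 above); whether a planner actually uses it depends on table size and statistics.

```python
import sqlite3

# SQLite stands in for PostgreSQL here so the sketch needs no server.
# Assumed adaptations: INTEGER PRIMARY KEY instead of SERIAL, and timestamps
# stored as ISO-8601 text, which sorts chronologically.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE apis (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE api_versions (
        id INTEGER PRIMARY KEY,
        api_id INTEGER REFERENCES apis(id),
        title TEXT,
        version TEXT,
        updated_at TEXT
    );
    INSERT INTO apis (name) VALUES ('petstore'), ('httpbin'), ('futurama');
    INSERT INTO api_versions (api_id, title, version, updated_at) VALUES
        (1, 'pets', 'v1.0.0', '2019-11-03 21:00:00'),
        (1, 'pets', 'v2.0.0', '2020-11-08 21:00:00'),
        (2, 'httbin v3', '3.10.1342', '2020-12-02 13:00:00'),
        (2, 'httbin v3', '4.01.0002', '2019-07-01 13:00:00'),
        (3, 'futurama v10', '10', '2018-01-31 09:00:00'),
        (3, 'futurama v11', '11', '2020-06-24 09:00:00'),
        (3, 'futurama v12', '12', '2019-02-21 09:00:00');
""")

# The select list is an assumption chosen to match the expected result table.
QUERY = """
    SELECT apis.id, apis.name, api_versions.title,
           api_versions.version, api_versions.updated_at
    FROM apis
    JOIN api_versions ON apis.id = api_versions.api_id
    WHERE api_versions.updated_at = (
        SELECT MAX(v.updated_at)
        FROM api_versions v
        WHERE v.api_id = apis.id
    )
    ORDER BY apis.id
"""

before = conn.execute(QUERY).fetchall()

# Hypothetical index name. api_id comes first because the subquery filters on
# it; updated_at second so the maximum can be read from the index itself.
conn.execute(
    "CREATE INDEX idx_api_versions_api_id_updated_at "
    "ON api_versions (api_id, updated_at)"
)
after = conn.execute(QUERY).fetchall()

# The index must not change the result, only (possibly) how it is computed.
index_names = [row[1] for row in conn.execute("PRAGMA index_list('api_versions')")]

for row in after:
    print(row)
```

On a table this small a sequential scan is cheaper than any index, so the effect would only show up in EXPLAIN output on a realistically sized dataset.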
JOIN Subquery
Here’s an alternative where the max updated_at is determined per API in a join subquery. The same comments about the updated_at comparison from above apply.
SELECT [select_fields]
FROM (
    SELECT
        api_id,
        MAX(updated_at) AS updated_at
    FROM api_versions
    GROUP BY api_id
) recent_versions
JOIN apis ON apis.id = recent_versions.api_id
JOIN api_versions
    ON api_versions.api_id = recent_versions.api_id
    AND api_versions.updated_at = recent_versions.updated_at;
And the query execution plan:
Nested Loop (cost=28.40..50.67 rows=1 width=112)
  Join Filter: (api_versions_1.api_id = apis.id)
  -> Hash Join (cost=28.25..49.69 rows=4 width=84)
       Hash Cond: ((api_versions.api_id = api_versions_1.api_id) AND (api_versions.updated_at = (max(api_versions_1.updated_at))))
       -> Seq Scan on api_versions (cost=0.00..17.50 rows=750 width=80)
       -> Hash (cost=25.25..25.25 rows=200 width=12)
            -> HashAggregate (cost=21.25..23.25 rows=200 width=12)
                 Group Key: api_versions_1.api_id
                 -> Seq Scan on api_versions api_versions_1 (cost=0.00..17.50 rows=750 width=12)
  -> Index Scan using apis_pkey on apis (cost=0.15..0.23 rows=1 width=36)
       Index Cond: (id = api_versions.api_id)
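To see what the JOIN subquery does step by step, the sketch below first runs the inner GROUP BY on its own and then the full query. As before, this is only an illustration under assumptions: SQLite via Python's sqlite3 stands in for PostgreSQL, INTEGER PRIMARY KEY replaces SERIAL, timestamps are text, and the select list is filled in to match the expected result.

```python
import sqlite3

# SQLite stands in for PostgreSQL (assumption); timestamps are ISO-8601 text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE apis (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE api_versions (
        id INTEGER PRIMARY KEY,
        api_id INTEGER REFERENCES apis(id),
        title TEXT,
        version TEXT,
        updated_at TEXT
    );
    INSERT INTO apis (name) VALUES ('petstore'), ('httpbin'), ('futurama');
    INSERT INTO api_versions (api_id, title, version, updated_at) VALUES
        (1, 'pets', 'v1.0.0', '2019-11-03 21:00:00'),
        (1, 'pets', 'v2.0.0', '2020-11-08 21:00:00'),
        (2, 'httbin v3', '3.10.1342', '2020-12-02 13:00:00'),
        (2, 'httbin v3', '4.01.0002', '2019-07-01 13:00:00'),
        (3, 'futurama v10', '10', '2018-01-31 09:00:00'),
        (3, 'futurama v11', '11', '2020-06-24 09:00:00'),
        (3, 'futurama v12', '12', '2019-02-21 09:00:00');
""")

# Step 1: the derived table alone, one (api_id, latest timestamp) pair per API.
recent_versions = conn.execute("""
    SELECT api_id, MAX(updated_at) AS updated_at
    FROM api_versions
    GROUP BY api_id
    ORDER BY api_id
""").fetchall()

# Step 2: join that derived table back to apis and api_versions.
# The select list is an assumption chosen to match the expected result table.
rows = conn.execute("""
    SELECT apis.id, apis.name, api_versions.title,
           api_versions.version, api_versions.updated_at
    FROM (
        SELECT api_id, MAX(updated_at) AS updated_at
        FROM api_versions
        GROUP BY api_id
    ) recent_versions
    JOIN apis ON apis.id = recent_versions.api_id
    JOIN api_versions
        ON api_versions.api_id = recent_versions.api_id
        AND api_versions.updated_at = recent_versions.updated_at
    ORDER BY apis.id
""").fetchall()

print(recent_versions)
for row in rows:
    print(row)
```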
I’m sure there are a few more solutions to this. Which one to use? I guess it depends on how often you expect the query to be executed, how large the dataset is, and how maintainable you want the query to be (in terms of readability).
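One such further solution, sketched here as an illustration rather than a recommendation, selects the version by id instead of by timestamp, using a correlated subquery with ORDER BY and LIMIT 1. It also addresses the tie concern raised earlier: when two versions of an API share the same updated_at, the MAX comparison returns both, while this variant returns exactly one. Breaking the tie on the highest id is an assumption, as are the small dataset below and the use of SQLite (via Python's sqlite3) in place of PostgreSQL.

```python
import sqlite3

# SQLite stands in for PostgreSQL (assumption); timestamps are ISO-8601 text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE apis (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE api_versions (
        id INTEGER PRIMARY KEY,
        api_id INTEGER REFERENCES apis(id),
        title TEXT,
        version TEXT,
        updated_at TEXT
    );
    INSERT INTO apis (name) VALUES ('petstore');
    -- Illustrative data: the last two rows deliberately share a timestamp.
    INSERT INTO api_versions (api_id, title, version, updated_at) VALUES
        (1, 'pets', 'v0.9.0', '2019-01-01 09:00:00'),
        (1, 'pets', 'v1.0.0', '2020-11-08 21:00:00'),
        (1, 'pets', 'v2.0.0', '2020-11-08 21:00:00');
""")

# The MAX(updated_at) comparison keeps every row that ties for the maximum.
with_max = conn.execute("""
    SELECT apis.id, apis.name, api_versions.title,
           api_versions.version, api_versions.updated_at
    FROM apis
    JOIN api_versions ON apis.id = api_versions.api_id
    WHERE api_versions.updated_at = (
        SELECT MAX(v.updated_at)
        FROM api_versions v
        WHERE v.api_id = apis.id
    )
    ORDER BY api_versions.id
""").fetchall()

# Picking a single id per API removes the ambiguity. The second sort key
# (highest id wins) is an assumed tie-break rule.
tie_broken = conn.execute("""
    SELECT apis.id, apis.name, api_versions.title,
           api_versions.version, api_versions.updated_at
    FROM apis
    JOIN api_versions ON apis.id = api_versions.api_id
    WHERE api_versions.id = (
        SELECT v.id
        FROM api_versions v
        WHERE v.api_id = apis.id
        ORDER BY v.updated_at DESC, v.id DESC
        LIMIT 1
    )
    ORDER BY apis.id
""").fetchall()

print(len(with_max), "rows with MAX comparison")
print(len(tie_broken), "row with id-based subquery")
```

The same ORDER BY / LIMIT 1 subquery is valid PostgreSQL, so it should carry over unchanged.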
Extra
What if an API has no version and we want to include it in the result set?
Taking the first solution, we just need to change the JOIN into a LEFT (OUTER) one and extend the WHERE clause to include APIs with no version (api_versions.id IS NULL) OR what we already had.
SELECT [select_fields]
FROM apis
LEFT JOIN api_versions ON apis.id = api_versions.api_id
WHERE api_versions.id IS NULL
    OR api_versions.updated_at = (
        SELECT MAX(v.updated_at)
        FROM api_versions v
        WHERE v.api_id = apis.id
    );
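A quick way to check the LEFT JOIN variant is to add an API that has no versions at all and confirm that it still shows up, with NULLs in the version columns. The sketch below does that with Python's sqlite3 module; the 'unreleased' API, the reduced dataset, the select list, and the use of SQLite in place of PostgreSQL are all assumptions made for illustration.

```python
import sqlite3

# SQLite stands in for PostgreSQL (assumption); timestamps are ISO-8601 text.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE apis (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE api_versions (
        id INTEGER PRIMARY KEY,
        api_id INTEGER REFERENCES apis(id),
        title TEXT,
        version TEXT,
        updated_at TEXT
    );
    -- 'unreleased' is a hypothetical API with no rows in api_versions.
    INSERT INTO apis (name) VALUES ('petstore'), ('unreleased');
    INSERT INTO api_versions (api_id, title, version, updated_at) VALUES
        (1, 'pets', 'v1.0.0', '2019-11-03 21:00:00'),
        (1, 'pets', 'v2.0.0', '2020-11-08 21:00:00');
""")

rows = conn.execute("""
    SELECT apis.id, apis.name, api_versions.version, api_versions.updated_at
    FROM apis
    LEFT JOIN api_versions ON apis.id = api_versions.api_id
    WHERE api_versions.id IS NULL
        OR api_versions.updated_at = (
            SELECT MAX(v.updated_at)
            FROM api_versions v
            WHERE v.api_id = apis.id
        )
    ORDER BY apis.id
""").fetchall()

# SQL NULLs come back as Python None for the API without versions.
for row in rows:
    print(row)
```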
2020-11-08