Re: Do we want a hashset type? - Mailing list pgsql-hackers
From | Joel Jacobson |
---|---|
Subject | Re: Do we want a hashset type? |
Date | |
Msg-id | f4ad4ec9-160f-48a6-851d-b2b34db49ebc@app.fastmail.com Whole thread Raw |
In response to | Re: Do we want a hashset type? (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Responses |
Re: Do we want a hashset type?
|
List | pgsql-hackers |
On Wed, May 31, 2023, at 18:59, Tomas Vondra wrote: > How does storing just the IDs solves anything? Isn't the main challenge > the random I/O when fetching the adjacent nodes? This does not really > improve that, no? I'm thinking of a recursive query where a lot of time is just spent following all friends-of-friends, where the graph traversal is the heavy part, where the final set of nodes are only fetched at the end. > It's entirely possible I'm missing some important aspect. It'd be very > useful to have a couple example queries illustrating the issue, that we > could use to actually test different approaches. Here is an example using a real anonymised social network. wget https://snap.stanford.edu/data/soc-pokec-relationships.txt.gz gunzip soc-pokec-relationships.txt.gz CREATE TABLE edges (from_node INT, to_node INT); \copy edges from soc-pokec-relationships.txt; ALTER TABLE edges ADD PRIMARY KEY (from_node, to_node); SET work_mem TO '1GB'; CREATE VIEW friends_of_friends AS WITH RECURSIVE friends_of_friends AS ( SELECT ARRAY[5867::bigint] AS current, ARRAY[5867::bigint] AS found, 0 AS depth UNION ALL SELECT new_current, found || new_current, friends_of_friends.depth + 1 FROM friends_of_friends CROSS JOIN LATERAL ( SELECT array_agg(DISTINCT edges.to_node) AS new_current FROM edges WHERE from_node = ANY(friends_of_friends.current) ) q WHERE friends_of_friends.depth < 3 ) SELECT depth, coalesce(array_length(current, 1), 0) FROM friends_of_friends WHERE depth = 3; ; SELECT COUNT(*) FROM edges; count ---------- 30622564 (1 row) SELECT COUNT(DISTINCT from_node) FROM edges; count --------- 1432693 (1 row) -- Most connected user (worst-case) is user 5867 with 8763 friends: SELECT from_node, COUNT(*) FROM edges GROUP BY from_node ORDER BY COUNT(*) DESC LIMIT 1; from_node | count -----------+------- 5867 | 8763 (1 row) -- Friend-of-friends query exactly at depth three: EXPLAIN ANALYZE SELECT * FROM friends_of_friends; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- CTE Scan on friends_of_friends (cost=6017516.90..6017517.60 rows=1 width=8) (actual time=2585.881..2586.334 rows=1 loops=1) Filter: (depth = 3) Rows Removed by Filter: 3 CTE friends_of_friends -> Recursive Union (cost=0.00..6017516.90 rows=31 width=68) (actual time=0.005..2581.664 rows=4 loops=1) -> Result (cost=0.00..0.01 rows=1 width=68) (actual time=0.002..0.002 rows=1 loops=1) -> Subquery Scan on "*SELECT* 2" (cost=200583.71..601751.66 rows=3 width=68) (actual time=645.036..645.157 rows=1loops=4) -> Nested Loop (cost=200583.71..601751.54 rows=3 width=68) (actual time=641.880..641.972 rows=1 loops=4) -> WorkTable Scan on friends_of_friends friends_of_friends_1 (cost=0.00..0.22 rows=3 width=68) (actualtime=0.001..0.002 rows=1 loops=4) Filter: (depth < 3) Rows Removed by Filter: 0 -> Aggregate (cost=200583.71..200583.72 rows=1 width=32) (actual time=850.997..850.998 rows=1 loops=3) -> Bitmap Heap Scan on edges (cost=27656.38..196840.88 rows=1497133 width=4) (actual time=203.239..423.534rows=3486910 loops=3) Recheck Cond: (from_node = ANY (friends_of_friends_1.current)) Heap Blocks: exact=117876 -> Bitmap Index Scan on edges_pkey (cost=0.00..27282.10 rows=1497133 width=0) (actualtime=198.047..198.047 rows=3486910 loops=3) Index Cond: (from_node = ANY (friends_of_friends_1.current)) Planning Time: 1.414 ms Execution Time: 2588.288 ms (19 rows) I tested on PostgreSQL 15.2. For some reason I got a different slower query on HEAD: SELECT * FROM friends_of_friends; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- CTE Scan on friends_of_friends (cost=6576.67..6577.37 rows=1 width=8) (actual time=6412.693..6413.335 rows=1 loops=1) Filter: (depth = 3) Rows Removed by Filter: 3 CTE friends_of_friends -> Recursive Union (cost=0.00..6576.67 rows=31 width=68) (actual time=0.008..6407.134 rows=4 loops=1) -> Result (cost=0.00..0.01 rows=1 width=68) (actual time=0.005..0.005 rows=1 loops=1) -> Subquery Scan on "*SELECT* 2" (cost=219.05..657.64 rows=3 width=68) (actual time=1600.747..1600.934 rows=1loops=4) -> Nested Loop (cost=219.05..657.52 rows=3 width=68) (actual time=1594.906..1595.035 rows=1 loops=4) -> WorkTable Scan on friends_of_friends friends_of_friends_1 (cost=0.00..0.22 rows=3 width=68) (actualtime=0.001..0.002 rows=1 loops=4) Filter: (depth < 3) Rows Removed by Filter: 0 -> Aggregate (cost=219.05..219.06 rows=1 width=32) (actual time=2118.105..2118.105 rows=1 loops=3) -> Sort (cost=207.94..213.49 rows=2221 width=4) (actual time=1780.770..1925.853 rows=3486910loops=3) Sort Key: edges.to_node Sort Method: quicksort Memory: 393217kB -> Index Only Scan using edges_pkey on edges (cost=0.56..84.48 rows=2221 width=4) (actualtime=0.077..762.408 rows=3486910 loops=3) Index Cond: (from_node = ANY (friends_of_friends_1.current)) Heap Fetches: 0 Planning Time: 8.229 ms Execution Time: 6446.421 ms (20 rows)
pgsql-hackers by date: