Recursively traversing a partially ordered set - Mailing list pgsql-sql
From | Jason Grout |
---|---|
Subject | Recursively traversing a partially ordered set |
Date | |
Msg-id | 465C3380.1080207@creativetrax.com Whole thread Raw |
Responses |
Re: Recursively traversing a partially ordered set
|
List | pgsql-sql |
I have a performance related question (and a few code snippets to contribute to the list archives). I am trying to write a function that recursively walks a hierarchal structure to find all ancestors or descendants of a particular item or list of items. My relationships are set up with the table: CREATE TABLE relation (parent_id integer, child_id integer); CREATE INDEX relation_parent ON relation(parent_id); CREATE INDEX relation_child ON relation(child_id); My data is *not* a tree (otherwise I'd probably use Celko's method). Instead, my structure is a (graded) partially ordered set, which is more general than a tree. In other words, each item has a level in the hierarchy and can be connected to a number of nodes in the level above it and a number of nodes in the level below it. Items on the same level have no relation between them, but they generally have many common parents and many common children. In my case, items will generally have far more parents than children. Right now my relation table has several hundred thousand rows and 9 levels, but will eventually have tens of millions of rows and more levels. Note that all of these algorithms work just fine for trees too. Given a list of item ids, I need to determine all items above them in the hierarchy (or below them). Since determining ancestors is the inverse problem of determining descendants, I'll only treat the ancestors problem in the rest of this email. Note that the list I am given may include items on different levels. After playing with it for a few days, I ended up with four functions (after trying lots of variations on these, based on things found in the mailing list, documentation, and thinking about the problem). The functions are listed from slowest to fastest. My questions are at the end of the email. 1. this rather nice, succinct, recursive (and slow) SQL function: CREATE OR REPLACE FUNCTION get_ancestors_sql(integer[]) RETURNS SETOF integer AS $BODY$ SELECT DISTINCT parent_id from subgraphs WHERE child_id = ANY ($1 || CASE WHEN EXISTS(SELECT child_id FROM relation WHERE child_id = ANY($1)) THEN ARRAY(SELECT * FROM get_ancestors_sql( ARRAY(SELECT parent_id FROM relation WHERE child_id=ANY($1)))) ELSE '{}'::int[] END) $BODY$ LANGUAGE 'sql' STABLE; 2. this more verbose recursive (and still slow) pl/pgsql function: CREATE OR REPLACE FUNCTION get_ancestors_pg(integer[]) RETURNS SETOF integer AS $BODY$ DECLARE r record; s record; BEGIN SELECT INTO s child_id FROM relation WHERE child_id=ANY($1) LIMIT 1; IF FOUND THEN FOR r IN SELECT DISTINCT parent_id from relation WHERE child_id = ANY ($1) OR child_id IN (SELECT * FROM get_ancestors_pg( ARRAY(SELECT DISTINCT parent_id FROM relation WHERE child_id=ANY($1)))) LOOP RETURNNEXT r.parent_id; END LOOP; END IF; END; $BODY$ LANGUAGE 'plpgsql' STABLE; 3. this faster function that uses a table to cache results. This could be modified so that consecutive calls use the cache to provide much quicker responses, but currently I delete the information at the end of a function call. CREATE OR REPLACE FUNCTION get_ancestors_pg_temp(ids integer[]) RETURNS SETOF integer AS $BODY$ DECLARE -- This is a "session" variable so that two concurrent calls -- to the function don't step on each other and mutilate -- each other's data. randvar integer:=round(random()*2000000)::int; i integer:=0; r record; BEGIN INSERT INTO temp SELECT DISTINCT g.parent_id, randvar from relation as g WHERE g.child_id=ANY(ids); LOOP INSERT INTO temp SELECT DISTINCT g.parent_id, randvar+1 from relation as g WHERE g.child_id IN (SELECT parent_idFROM temp WHERE session=randvar); EXIT WHEN NOT FOUND; i:=i+1; randvar:=randvar+1; END LOOP; FOR r IN SELECT parent_id FROM temp WHERE session BETWEEN randvar-i AND randvar LOOP RETURN NEXT r.parent_id; END LOOP; DELETE FROM temp WHERE session BETWEEN randvar-i and randvar; END; $BODY$ LANGUAGE 'plpgsql' VOLATILE; 4. This much, much faster version that loops and just throws everything into an array and then spits out the array at the end. CREATE OR REPLACE FUNCTION get_ancestors_pg_array(ids integer[]) RETURNS integer[] AS $BODY$ DECLARE result int[]:='{}'::int[]; nextiter int[]:='{}'::int[]; temp int[]:=ids; BEGIN LOOP SELECT INTO nextiter ARRAY(SELECT DISTINCT g.parent_id from relation as g WHERE g.child_id=ANY(temp)); result:=nextiter||result; EXIT WHEN nextiter[1] IS NULL; temp:=nextiter; END LOOP; RETURN result; END; $BODY$ LANGUAGE 'plpgsql' STABLE; The fourth function, get_ancestors_pg_array, seems to execute in about a tenth the time that the other functions execute in. My questions: 1. Is there a huge overhead for recursion? In general, is it better to stuff temporary results into a table (incurring the cost of an INSERT) rather than recurse? 2. Is there a big difference in speed between using an array versus using a SELECT in a WHERE condition? In other words, which is generally going to be faster: SELECT * from table where field IN (some function returning a SETOF); or SELECT * from table where field = ANY(some function returning an array); 3. Is there a strong reason I should strip out duplicates in either of the two cases in question 2? Or is the performance about the same when doing the queries whether or not the SETOF or arrays contain duplicates? 4. Can you see any obvious optimizations to the above functions (particularly the last one)? Thanks for your help. Thanks for the absolutely wonderful database and solid documentation. I originally did this project in MySQL and had the weirdest errors (the errors turned out to be due to the default case-insensitive collation of MySQL!). That's when I decided to move to postgresql when I updated the project. Thanks, Jason