Re: Making a tree with "millions and millions" of dynamic nodes - Mailing list pgsql-general
From | Arjen van der Meijden |
---|---|
Subject | Re: Making a tree with "millions and millions" of dynamic nodes |
Date | |
Msg-id | 3FD62927.70904@vulcanus.its.tudelft.nl |
In response to | Re: Making a tree with "millions and millions" of dynamic nodes (Christian Fowler <google@NOSPAM.gravesweeper.com>) |
List | pgsql-general |
Christian Fowler wrote:
> Here is another approach (only first half of the page applies here)
> I was emailed personally in response to my original post:
>
> http://fungus.teststation.com/~jon/treehandling/TreeHandling.htm
>
> Basically, the (too obvious for me to think of ;-) concept appears to
> be storing a mapping of the Node / Parent for every parent in the
> chain. Basically it splits up the materialized view concept into
> multiple rows, or one can think of it as the adjacency list with all
> parents stored. Perhaps call this the "Materialized List".

Afaik, this is what Celko (in SQL for Smarties) calls the "transitive
closure version of an adjacency list", or something like that. It's in
the same chapter as the adjacency-list-based trees, near the end, if you
want to read that book.

> Personally, this is my likely favorite since I tend to "trust" the
> performance of giant tables that are nothing more than 2 columns of
> integers. Seems like this approach would handle most of the desired
> tree cases (select path & child count, insert, delete, move) very
> well.

I currently have (for now; I'm not sure yet which approach is best or
good enough) a combination of this, nested sets and adjacency lists, but
I'm getting quite a bit of overhead for a few relations ;)
I have two types of data: categories (in a tree, with interlinks between
categories besides the parent-child relations) and documents. I have a
nested set + adjacency list for the tree itself and a "transitive
closure" table for the document-category relations (that saves me some
70% of the querying time compared to looking up those relations via the
nested-set table on the fly); a small sketch of such a table is below.
Anyway, my tables won't be multimillion-row, although they might be
heavily used in webpages, and they will be relatively static. So I can
allow quite a bit of overhead.

> Hopefully postgres can handle a very active, very narrow 50 million
> row table without breaking a sweat.

It probably can. Bear in mind that if you disable OIDs for that table,
it saves you 50M * 4 bytes of storage or so, and I don't think you'd
really need them (unless you want to do some lookups on the "newly
inserted row").
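To make that "materialized list" / transitive-closure idea a bit more
concrete, here is a minimal sketch of what such a table could look like
(all table and column names are just made up for the example):

    -- One row per (ancestor, descendant) pair: every node lists *all*
    -- of its ancestors, not only its direct parent.
    CREATE TABLE tree_closure (
        ancestor   integer NOT NULL,
        descendant integer NOT NULL,
        PRIMARY KEY (ancestor, descendant)
    );

    -- "who are all my ancestors" / "who are all my descendants" become
    -- plain index lookups:
    SELECT ancestor   FROM tree_closure WHERE descendant = 42;
    SELECT descendant FROM tree_closure WHERE ancestor   = 42;

    -- Inserting a new leaf 43 under parent 42 only copies the parent's
    -- ancestor rows plus one row for the parent itself:
    INSERT INTO tree_closure (ancestor, descendant)
    SELECT ancestor, 43 FROM tree_closure WHERE descendant = 42
    UNION ALL
    SELECT 42, 43;

Whether you also store a (node, node) self-pair and a depth column is a
matter of taste; it makes some queries simpler at the cost of extra rows.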
Best regards,

Arjen

> "Arjen van der Meijden" <acmmailing@vulcanus.its.tudelft.nl> wrote:
>
>>> Rick Gigger wrote:
>>>
>>> I was glad to see this topic come up on the list as I was
>>> about to start asking about some of these issues myself. I
>>> would like to discuss each of the methods I have researched
>>> so far for doing trees in SQL and see if anyone has some
>>> experience or insight into the topic that could help me.
>>
>> Have a look here: http://www.scenic-route.com/program/db/lists.htm
>> It is quite a nice overview with some advantages and disadvantages of
>> each method.
>>
>>> I have a lot of thoughts on this but here is my first question:
>>>
>>> In the first article at the end of the section on materialized paths
>>> and the beginning of the nested set section Tropashko basically says
>>> that neither one really does a good job at answering "who are all of
>>> my ancestors" but that the materialized path method has a nice kludge
>>> that you can do with some funky string manipulation.
>>
>> I don't see how the nested-intervals query is better/faster than the
>> nested set approach, but both seem to be able to use indices.
>> The nested set uses a simple "where left between parent.left and
>> parent.right"; if you happen to have those numbers in your
>> application's memory, you could even build your query directly like
>> that and save yourself a self join.
>> Another approach I used is something like:
>>   Select * from nset n1, (select (distinct) * from nset where
>>   nset.id = $parentid) as n2 where n1.left between n2.left and n2.right
>> (the distinct will force a materialization, which might or might not
>> improve the performance; in my tests it did)
>>
>> With the materialized path you'll have to use a like-search (child.path
>> like parent.path || '%'), where hopefully the optimiser notices that it
>> can chop off the end of the string and just look at the first part.
>>
>>> Is this true? Celko in his articles seems to make it sound
>>> like this query will be very fast. But Tropashko it seems
>>
>> It'll use an index range-search, so I don't see how it can be faster,
>> except for a hash/index lookup or so (which is possible with the
>> transitive closure version of an adjacency list).
>>
>>> is saying that it will be slow. Am I reading this right? If
>>> so is Tropashko right? Any ideas on this? Any articles / papers
>>> that might shed some light on this specific question or the
>>> topic in general?
>>
>> The above article is a nice one to read, I think.
>>
>> I'll try to summarize my thoughts on them, since I'm looking into an
>> efficient tree approach myself as well:
>> All methods have both advantages and disadvantages, and it'll depend on
>> your situation whether that is a showstopper or not. All are bad when
>> you want to remove a node from the tree, but the insert/update
>> penalties depend on the situation.
>>
>> * The adjacency list is lightning fast with inserts and updates, but a
>> bit clumsy with deletions (you need to connect the direct children to
>> another element, that's all). It'll be very fast with the two simple
>> queries "who are my direct children" and "who is my parent", but you
>> need either a recursive approach or some enhancement like a
>> transitive-closure graph to handle queries like "who are my
>> predecessors" and "who are all my ancestors".
>>
>> So if you'll have a large tree, only a few levels deep, with a lot of
>> inserts and/or subtree movements, this seems to be the best approach.
>> And you can store "max int"-number of nodes.
>>
>> * The materialized path is not so very efficient in terms of storage;
>> it does inserts fast, but updates and deletions are slow. Retrieval of
>> all ancestors is fast and retrieval of the predecessors is extremely
>> trivial, but retrieval of the direct parent and/or children is somewhat
>> more difficult (you'll need to look at the depth; if you do that often,
>> a functional index might be handy).
>> Perhaps the ltree implementation in contrib/ltree is worth a look, and
>> encoding the trees in much denser encodings (like the 256-bit encoding
>> I saw mentioned earlier, skipping the dot for just a single-bit
>> terminator) makes it less inefficient.
>>
>> So if you do a lot of inserts, not so many updates and deletes, and
>> have much need for the entire list of predecessors/ancestors, I suppose
>> this one is quite nice. But it is not so efficient in terms of storage,
>> so very large/deep trees impose a huge overhead, since each node stores
>> its entire path.
>>
>> * The nested set is inefficient with insertions, updates and deletions,
>> but much more efficient with storage than the MP. To speed up the
>> process of selecting only the direct parent/children, you can use a
>> depth field (redundant, but handy), which'll save you one or two extra
>> self joins.
>> For relatively static trees the nested set is good, but actively
>> changing trees are very bad for your performance. You can store
>> "max int"/2 nodes in this approach.
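As an aside, a small sketch of what I mean with that depth field; the
table and column names, nset(id, lft, rgt, depth), are made up for the
example:

    -- all descendants of node 42: the usual nested-set range condition
    -- (node 42 itself matches too; add "c.id <> p.id" to exclude it)
    SELECT c.*
    FROM   nset c, nset p
    WHERE  p.id = 42
      AND  c.lft BETWEEN p.lft AND p.rgt;

    -- only the *direct* children: the same range condition plus the
    -- redundant depth column, which saves the extra self-joins
    SELECT c.*
    FROM   nset c, nset p
    WHERE  p.id = 42
      AND  c.lft BETWEEN p.lft AND p.rgt
      AND  c.depth = p.depth + 1;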
You can store "max >>int"/2 nodes in this approach. >> >>* The nested intervals use the same approach as the nested set, but uses >>a static encoding. This allows you to do very efficient inserts, so the >>author claims. I have my doubts dough, if you don't know the number of >>children of your parent and just want to do "connect this child to that >>parent", it'll be more difficult, you'll need to find that parent, count >>it's children, decide the encoding-number of this new child and then you >>can calculate its new numbers. >>Still more efficient than the nested set, but not so efficient and you >>DO need to look into your database, while the author claimed you didn't >>need to. >>I havent't worked with this approach yet, but I saw the author use a lot >>of functions in his queries, I'm not sure whether that is a good idea, >>it might affect the usability of your indices... >>Another disadvantage is its very inefficient use of the integer-values >>available, it can only have a depth of 32 with normal integers, afaik >>and not so many children on a level as you'd like perhaps. The author >>suggests defining a unlimited integer, which is not so good for your >>performance at all (search for '"nested interval" sql' in google or so >>and you'll find his messages on the dbforums). >> >>* His other approach, the single-integer version, is even worse in terms >>of efficient integer-use, you can have only 32 levels at the best case >>(the left most branch only!) and only about 30 nodes next to each other >>in the same bestcase (since the numbers increase exponantional: level1 = >>5, 11, 23, 47, 95, etc). The worst case (the right most branch) can >>possibly store only one or two levels with only one or two nodes... >> >>So the claimed advantage of the better performance is a bit limited, >>you'll either need *very* large integers to store large trees, or you >>are simply limited in the size of your tree by some relatively small >>numbers. And that is a bit sad, since bad performance is only a >>problem/visible with large(r) trees. >> >>I'm not sure whether the above summary is correct, anyone any idea? I've >>to use it in a research paper for my study anyway, so expanding my >>thoughts in an email is a nice test :) >>Please comment. >> >>Best regards, >> >>Arjen van der Meijden >> >> >> >> >>---------------------------(end of broadcast)--------------------------- >>TIP 9: the planner will ignore your desire to choose an index scan if your >> joining column's datatypes do not match >> > >