Re: Mark/Restore and avoiding RandomAccess sorts - Mailing list pgsql-hackers
From | Bruce Momjian |
---|---|
Subject | Re: Mark/Restore and avoiding RandomAccess sorts |
Date | |
Msg-id | 200701062206.l06M6av04563@momjian.us Whole thread Raw |
In response to | Mark/Restore and avoiding RandomAccess sorts ("Simon Riggs" <simon@2ndquadrant.com>) |
Responses |
Re: Mark/Restore and avoiding RandomAccess sorts
Re: Mark/Restore and avoiding RandomAccess sorts |
List | pgsql-hackers |
I saw no replies to this. --------------------------------------------------------------------------- Simon Riggs wrote: > Merge Joins require us to potentially Mark and Restore positions in the > tuples arriving from executor sub-nodes. > > This currently means that if the tuples arrive from a Sort node, as they > often do in an MJ, the sort node will be instructed to prepare a random > access version of the sort result. That requires a full final merge of > the output, so as to allow rewinding the input when a Restore operation > is called. > > An MJ doesn't actually need random access, it just needs to be able to > rewind. The question is: how far does it need to rewind? In many cases, > the Restore operation moves back a small number of tuples, with a unique > inner scan requiring a rewind of just one tuple. > > It would certainly be cheaper, in most cases, for the Sort node to > maintain a variable size rewind buffer, where the history of prior > tuples is truncated each time we perform a Mark operation. This could be > implemented as a modified Tuplestore that could then be trimmed down > each time a Mark operation took place. If the tuplestore has not yet > spilled to disk this could be a trivial operation. > > Doing that would almost completely remove the overhead of the final > merge step in the sort. The final merge often doubles elapsed time in > cases where the sort is larger than work_mem, which it often is. > > Implementing the variable mark/restore buffer as a dumb Tuplestore would > mean that the space usage of the Sort could in worst case go as high as > x2 total space. The worst case is where the inner scan is all a single > value. The best case is where the inner scan is sufficiently unique over > all its values that it never writes back to disk at all. > > So a further refinement of this idea would be to simply defer the final > merge operation for the sort until the history required for the Mark > operation exceeded, say, 10% of the sort size. That would then be > sufficient to improve performance for most common cases, without risking > massive space overflow for large and highly non-unique data. There's no > problem with running the final merge slightly later than before; > everything's still there to allow it. Reusing space in the tuplestore is > also straightforward since that's exactly what the final merge already > does, so some rework of that code should be sufficient. > > This is a separate, but related idea of being able to avoid > mark/restores completely when the outer scan is provably unique. > > I'm not intending to implement this idea just yet, but it seemed worth > recording since it occurred to me - and discussing it as a TODO item. > > Comments? > > -- > Simon Riggs > EnterpriseDB http://www.enterprisedb.com > > > > ---------------------------(end of broadcast)--------------------------- > TIP 5: don't forget to increase your free space map settings -- Bruce Momjian bruce@momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
pgsql-hackers by date: