Home > mailing lists

Re: Mark/Restore and avoiding RandomAccess sorts - Mailing list pgsql-hackers

From	Bruce Momjian
Subject	Re: Mark/Restore and avoiding RandomAccess sorts
Date	January 6, 2007 18:06:42
Msg-id	200701062206.l06M6av04563@momjian.us Whole thread Raw
In response to	Mark/Restore and avoiding RandomAccess sorts ("Simon Riggs" <simon@2ndquadrant.com>)
Responses	Re: Mark/Restore and avoiding RandomAccess sorts Re: Mark/Restore and avoiding RandomAccess sorts
List	pgsql-hackers

Tree view

I saw no replies to this.

---------------------------------------------------------------------------

Simon Riggs wrote:
> Merge Joins require us to potentially Mark and Restore positions in the
> tuples arriving from executor sub-nodes.
> 
> This currently means that if the tuples arrive from a Sort node, as they
> often do in an MJ, the sort node will be instructed to prepare a random
> access version of the sort result. That requires a full final merge of
> the output, so as to allow rewinding the input when a Restore operation
> is called.
> 
> An MJ doesn't actually need random access, it just needs to be able to
> rewind. The question is: how far does it need to rewind? In many cases,
> the Restore operation moves back a small number of tuples, with a unique
> inner scan requiring a rewind of just one tuple. 
> 
> It would certainly be cheaper, in most cases, for the Sort node to
> maintain a variable size rewind buffer, where the history of prior
> tuples is truncated each time we perform a Mark operation. This could be
> implemented as a modified Tuplestore that could then be trimmed down
> each time a Mark operation took place. If the tuplestore has not yet
> spilled to disk this could be a trivial operation.
> 
> Doing that would almost completely remove the overhead of the final
> merge step in the sort. The final merge often doubles elapsed time in
> cases where the sort is larger than work_mem, which it often is.
> 
> Implementing the variable mark/restore buffer as a dumb Tuplestore would
> mean that the space usage of the Sort could in worst case go as high as
> x2 total space. The worst case is where the inner scan is all a single
> value. The best case is where the inner scan is sufficiently unique over
> all its values that it never writes back to disk at all. 
> 
> So a further refinement of this idea would be to simply defer the final
> merge operation for the sort until the history required for the Mark
> operation exceeded, say, 10% of the sort size. That would then be
> sufficient to improve performance for most common cases, without risking
> massive space overflow for large and highly non-unique data. There's no
> problem with running the final merge slightly later than before;
> everything's still there to allow it. Reusing space in the tuplestore is
> also straightforward since that's exactly what the final merge already
> does, so some rework of that code should be sufficient.
> 
> This is a separate, but related idea of being able to avoid
> mark/restores completely when the outer scan is provably unique. 
> 
> I'm not intending to implement this idea just yet, but it seemed worth
> recording since it occurred to me - and discussing it as a TODO item.
> 
> Comments?
> 
> -- 
>   Simon Riggs             
>   EnterpriseDB   http://www.enterprisedb.com
> 
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

pgsql-hackers by date:

From: Bruce Momjian
Date: 06 January 2007, 18:02:09
Subject: Re: InitPostgres and flatfiles question

From: Bruce Momjian
Date: 06 January 2007, 18:18:46
Subject: Re: pg_ctl options

Re: Mark/Restore and avoiding RandomAccess sorts - Mailing list pgsql-hackers

Previous

Next