Re: Tuplesort merge pre-reading - Mailing list pgsql-hackers
From | Heikki Linnakangas
---|---
Subject | Re: Tuplesort merge pre-reading
Date | |
Msg-id | 1541ef84-c2e4-26b6-03a3-19d22589caa2@iki.fi
In response to | Re: Tuplesort merge pre-reading (Peter Geoghegan <pg@heroku.com>)
Responses | Re: Tuplesort merge pre-reading
List | pgsql-hackers
On 09/15/2016 10:12 PM, Peter Geoghegan wrote:
> On Wed, Sep 14, 2016 at 10:43 AM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> Spotted another issue with this code just now. Shouldn't it be based
>>> on state->tapeRange? You don't want the destTape to get memory, since
>>> you don't use batch memory for tapes that are written to (and final
>>> on-the-fly merges don't use their destTape at all).
>
>>> Wait, you're using a local variable maxTapes here, which potentially
>>> differs from state->maxTapes:
>
>> I changed that so that it does actually change state->maxTapes. I considered
>> having a separate numTapes field, that can be smaller than maxTapes, but we
>> don't need the original maxTapes value after that point anymore, so it
>> would've been just pro forma to track them separately. I hope the comment
>> now explains that better.
>
> I still don't get why you're doing all of this within mergeruns() (the
> beginning of when we start merging -- we merge all quicksorted runs),
> rather than within beginmerge() (the beginning of one particular merge
> pass, of which there are potentially more than one). As runs are
> merged in a non-final merge pass, fewer tapes will remain active for
> the next merge pass. It doesn't do to do all that up-front when we
> have multiple merge passes, which will happen from time to time.

Now that the pre-reading is done in logtape.c, it doesn't stop at a run
boundary. For example, when we read the last 1 MB of the first run on a
tape, and we're using a 10 MB read buffer, we will merrily also read the
first 9 MB from the next run. You cannot un-read that data, even if the
tape is inactive in the next merge pass.

I don't think it makes much difference in practice, because most merge
passes use all, or almost all, of the available tapes. BTW, I think the
polyphase algorithm prefers to do all the merges that don't use all tapes
upfront, so that the last final merge always uses all the tapes. I'm not
100% sure about that, but that's my understanding of the algorithm, and
that's what I've seen in my testing.

> Correct me if I'm wrong, but I think that you're more skeptical of the
> need for polyphase merge than I am. I at least see no reason to not
> keep it around. I also think it has some value. It doesn't make this
> optimization any harder, really.

We certainly still need multi-pass merges.

BTW, does a 1-way merge make any sense? I was surprised to see this in
the log, even without this patch:

LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY i) t;
LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY i) t;
LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY i) t;
LOG: finished 1-way merge step: CPU 0.62s/7.22u sec elapsed 8.43 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY i) t;
LOG: finished 3-way merge step: CPU 0.62s/7.23u sec elapsed 8.44 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY i) t;
LOG: finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.44 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY i) t;
LOG: finished 6-way merge step: CPU 0.62s/7.24u sec elapsed 8.45 sec
STATEMENT: SELECT COUNT(*) FROM (SELECT * FROM medium.random_ints ORDER BY i) t;

- Heikki
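P.S. To make the run-boundary point concrete, here is a rough sketch of
what the pre-reading amounts to. This is not the actual logtape.c code,
just a simplified illustration with made-up struct and function names:

#include <stdio.h>
#include <stddef.h>

/*
 * Simplified illustration only -- not PostgreSQL code.  The read buffer
 * is filled purely by byte position within the tape, with no knowledge
 * of where one run ends and the next one begins.
 */
typedef struct
{
    FILE   *file;       /* underlying temp file */
    char   *buffer;     /* pre-read buffer, e.g. 10 MB */
    size_t  bufsize;    /* total buffer size */
    size_t  nbytes;     /* valid bytes currently in the buffer */
    size_t  pos;        /* next byte to hand out to the caller */
} PrereadTape;

static void
preread_fill(PrereadTape *tape)
{
    /*
     * Fill the whole buffer if we can.  If the current run has only
     * 1 MB left on this tape, the remaining 9 MB come from the next
     * run, and there is no way to push them back afterwards.
     */
    tape->nbytes = fread(tape->buffer, 1, tape->bufsize, tape->file);
    tape->pos = 0;
}

Once those bytes are in the buffer, they stay there whether or not the
tape is active in the next merge pass, which is why the buffer sizing is
decided once for the whole merge phase rather than per pass.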