Re: Memory usage during sorting - Mailing list pgsql-hackers
From | Hitoshi Harada
---|---
Subject | Re: Memory usage during sorting
Date |
Msg-id | CAP7QgmkKb+81F_fNsheJBT9uc_8+RmfqNhjiJkYP-A-h3Eg8zA@mail.gmail.com
In response to | Re: Memory usage during sorting (Jeff Janes <jeff.janes@gmail.com>)
Responses | Re: Memory usage during sorting
List | pgsql-hackers
On Sat, Feb 11, 2012 at 11:34 AM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Feb 8, 2012 at 1:01 AM, Hitoshi Harada <umi.tanuki@gmail.com> wrote:
>> On Sun, Jan 15, 2012 at 4:59 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>>
>>> The attached patch allows it to reuse that memory.  On my meager
>>> system it reduced the building of an index on an integer column in a
>>> skinny 200 million row totally randomly ordered table by about 3% from
>>> a baseline of 25 minutes.
>>
>> Just to give a standard review, this patch is a one-line change,
>> applies cleanly, and builds OK.
>>
>> I'm not quite sure what exactly you're trying to accomplish, but it
>> seems to me that it's avoiding the first dumptuples cycle by forcing
>> availMem = 0 even if it's negative.
>
> Yes.  Currently when it switches to the TSS_BUILDRUNS part of a
> tape-sort, it starts by calling WRITETUP a large number of times
> consecutively, to work off the memory deficit incurred by the 3 blocks
> per tape of tape overhead, and then after that calls WRITETUP about
> once per puttuple.  Under my patch, it would only call WRITETUP
> about once per puttuple, right from the beginning.
>
>> I read your comments as saying it avoids alternating reads and writes
>> back and forth over scattered memory, which defeats the memory cache
>> during the merge phase; but actually it doesn't affect the merge
>> phase, only the initial dump phase, does it?
>
> It affects the building of the runs.  But this building of the runs is
> not a simple dump; it is itself a mini merge phase, in which it merges
> the existing in-memory priority queue against the still-incoming
> tuples from the node which invoked the sort.  By using less memory
> than it could, the resulting runs are smaller than they could be, and
> so will sometimes necessitate an additional layer of merging later on.
> (This effect is particularly large for the very first run being built.
> Generally, by merging incoming tuples into the memory-tuples, you can
> create runs that are 1.7 times the size of what fits in memory.  By
> wasting some memory, we are getting 1.7 times the size of a smaller
> starting point.  But for the first run, it is worse than that.  Most
> of the benefit that leads to that 1.7 multiplier comes at the very
> early stage of each run-build.  But by initially using the full
> memory, then writing out a bunch of tuples without doing any merge of
> the incoming, we have truncated the part that gives the most benefit.)
>
> My analysis that the "freed" memory is never reused (because we refuse
> to reuse it ourselves and it is too fragmented to be reused by anyone
> else, like the palloc or VM system) only applies to the run-building
> phase.  So "never" was a bit of an overstatement.  By the time the
> last initial run is completely written out to tape, the heap used for
> the priority queue should be totally empty.  So at this point the
> allocator would have the chance to congeal all of the fragmented
> memory back into larger chunks, or maybe it parcels the allocations
> back out again in an order such that the unused space is contiguous
> and could be meaningfully paged out.
>
> But is it worth worrying about how much we fragment memory and whether
> we overshoot our promises by 10 or 20%?
>
>> If so, I'm not so convinced your benchmark gave a 3% gain by this
>> change.  Correct me as I'm probably wrong.
>
> I've now done more complete testing: building an index on a
> 200,000,000-row table with an integer column populated in random order
> with integers from 1..500,000,000, non-unique, on a machine with 2GB
> of RAM and 600MB of shared_buffers.
>
> It improves things by 6-7 percent at the low end of working-mem sizes;
> the rest are in the noise, except at 77936 KB, where it reproducibly
> makes things 4% worse, for reasons I haven't figured out.  So maybe
> the best thing to do is, rather than micromanaging memory usage,
> simply don't set maintenance_work_mem way too low.  (But, it is the
> default.)
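To make the run-building behaviour described above concrete, here is a minimal, self-contained C sketch of replacement selection. It is illustrative only and is not PostgreSQL's tuplesort.c; the heap size, value range, and input size are arbitrary assumptions. A fixed-size min-heap is merged against still-incoming values, values that can no longer join the current ascending run are parked for the next run, and each finished run's length is printed relative to the heap size.

```c
/*
 * Illustrative replacement-selection sketch (not PostgreSQL source).
 * A fixed-size min-heap stands in for the in-memory tuple heap; incoming
 * values are merged against it, and values too small for the current
 * ascending run are parked for the next run.
 */
#include <stdio.h>
#include <stdlib.h>

#define HEAP_SIZE 1000          /* assumed stand-in for work_mem, in tuples */
#define N_INPUT   20000         /* assumed total number of incoming tuples  */

static int heap[HEAP_SIZE];     /* tuples belonging to the current run */
static int parked[HEAP_SIZE];   /* tuples deferred to the next run     */
static int heap_len, parked_len;

static void sift_down(int i)
{
    for (;;)
    {
        int s = i, l = 2 * i + 1, r = 2 * i + 2;

        if (l < heap_len && heap[l] < heap[s]) s = l;
        if (r < heap_len && heap[r] < heap[s]) s = r;
        if (s == i) return;
        int tmp = heap[i]; heap[i] = heap[s]; heap[s] = tmp;
        i = s;
    }
}

static void heapify(void)
{
    for (int i = heap_len / 2 - 1; i >= 0; i--)
        sift_down(i);
}

int main(void)
{
    int consumed = 0, run_no = 1;
    long run_len = 0;

    srand(2012);

    /* Fill "memory" with the first HEAP_SIZE tuples, then heapify. */
    while (heap_len < HEAP_SIZE && consumed < N_INPUT)
    {
        heap[heap_len++] = rand() % 1000000;
        consumed++;
    }
    heapify();

    /* Build runs by merging the heap against the still-incoming tuples. */
    while (heap_len > 0)
    {
        int written = heap[0];      /* "WRITETUP": emit the current minimum */

        run_len++;
        if (consumed < N_INPUT)
        {
            int incoming = rand() % 1000000;

            consumed++;
            if (incoming >= written)
            {
                heap[0] = incoming;             /* can still join this run */
                sift_down(0);
            }
            else
            {
                parked[parked_len++] = incoming;    /* wait for next run */
                heap[0] = heap[--heap_len];
                sift_down(0);
            }
        }
        else
        {
            heap[0] = heap[--heap_len];  /* input exhausted: drain the heap */
            sift_down(0);
        }

        if (heap_len == 0)               /* current run is complete */
        {
            printf("run %d: %ld tuples (%.2f x heap size)\n",
                   run_no++, run_len, (double) run_len / HEAP_SIZE);
            run_len = 0;
            for (heap_len = 0; heap_len < parked_len; heap_len++)
                heap[heap_len] = parked[heap_len];
            parked_len = 0;
            heapify();
        }
    }
    return 0;
}
```

On random input the printed run lengths come out noticeably larger than the heap itself, which is the benefit that an initial forced dump cuts into.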
I've tested here with only a million rows of mixed integer/text (table
size is 80MB or so) with the default settings, running CREATE INDEX
continuously, but couldn't find a performance improvement.  The numbers
vary from -2% to +2%, which I think is just noise.

While I agree with your point that avoiding the first dump would make
sense, I guess it depends on the situation: if the dump goes with lots
of tuples (which should happen when availMem is big), writing a lot of
tuples at a time will be faster than writing them little by little
later.

I'm not sure about the conclusion, but given this discussion, I'm
inclined to mark this Returned with Feedback.

Thanks,
--
Hitoshi Harada
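To put rough numbers on the "first dump" being weighed in this exchange, the sketch below works through the accounting described earlier in the thread. It is a hypothetical back-of-the-envelope model, not tuplesort.c code; the tape count, block size, and per-tuple footprint are assumptions, while the 3-blocks-per-tape overhead and the availMem clamp come from the discussion above.

```c
/*
 * Hypothetical model of the memory deficit at the switch to run building
 * (not PostgreSQL source; all constants below are assumed).  Charging the
 * per-tape buffer overhead against the sort's budget drives the remaining
 * allowance negative, forcing a burst of tuple writes before any merging
 * of incoming tuples happens; clamping the deficit to zero is the
 * alternative discussed in the thread.
 */
#include <stdio.h>

#define N_TAPES         7       /* assumed number of tapes                   */
#define BLOCKS_PER_TAPE 3       /* per-tape overhead mentioned in the thread */
#define BLOCK_SIZE      8192    /* assumed block size, bytes                 */
#define AVG_TUPLE_SIZE  144     /* assumed in-memory footprint per tuple     */

int main(void)
{
    long availMem = 0;          /* budget left once the in-memory heap is full */
    long tape_overhead = (long) N_TAPES * BLOCKS_PER_TAPE * BLOCK_SIZE;

    availMem -= tape_overhead;  /* switch to run building: charge tape buffers */

    long forced = (-availMem + AVG_TUPLE_SIZE - 1) / AVG_TUPLE_SIZE;
    printf("deficit: %ld bytes -> roughly %ld tuples written back-to-back\n",
           -availMem, forced);

    if (availMem < 0)           /* the clamp under discussion */
        availMem = 0;
    printf("with the clamp, run building merges incoming tuples from the start\n");
    return 0;
}
```

With these assumed numbers the deficit is about 168 kB, i.e. on the order of a thousand consecutive WRITETUP calls before the first incoming tuple gets merged, which is the cost being traded against doing that I/O in one large batch.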