Re: Using quicksort for every external sort run - Mailing list pgsql-hackers
| From | Jeff Janes |
| --- | --- |
| Subject | Re: Using quicksort for every external sort run |
| Date | |
| Msg-id | CAMkU=1zKBOzkX-nqE-kJFFMyNm2hMGYL9AsKDEUHhwXASsJEbg@mail.gmail.com |
| In response to | Re: Using quicksort for every external sort run (Robert Haas <robertmhaas@gmail.com>) |
| Responses | Re: Using quicksort for every external sort run; Re: Using quicksort for every external sort run |
| List | pgsql-hackers |
On Wed, Dec 2, 2015 at 10:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> While large sorts are uncommon in queries, they are much more common
> in index builds. Therefore, I think we ought to be worrying more
> about regressions at 64MB than at 4MB, because we ship with
> maintenance_work_mem = 64MB and a lot of people probably don't change
> it before trying to build an index.

You have more sympathy for people who don't tune their settings than I do. Especially now that autovacuum_work_mem exists, there is much less constraint on increasing maintenance_work_mem than there is on work_mem. Unless, perhaps, you have a lot of user-driven temp tables which get indexes created on them.

> If we make those index builds go
> faster, users will be happy. If we make them go slower, users will be
> sad. So I think it's worth asking the question "are there any CREATE
> INDEX commands that someone might type on a system on which they've
> done no other configuration that will be slower with this patch"?

I found a regression on my 2nd attempt. I am indexing random md5 hashes (so they should get the full benefit of key abbreviation), in this case 400,000,000 of them:

create table foobar as select md5(random()::text) as x, random() as y
  from generate_series(1,100000000);
insert into foobar select * from foobar;
insert into foobar select * from foobar;

That gives a 29GB table. The index is built with:

create index on foobar (x);

With 64MB maintenance_work_mem, I get (best time out of 7 or 8 runs):

unpatched    2,436,483.834 ms
allpatches   3,964,875.570 ms   (62% slower)
not_0005     3,794,716.331 ms

The unpatched sort ends with a 118-way merge followed by a 233-way final merge:

LOG: finished 118-way merge step: CPU 98.65s/835.67u sec elapsed 1270.61 sec
LOG: performsort done (except 233-way final merge): CPU 98.75s/835.88u sec elapsed 1276.14 sec
LOG: external sort ended, 2541465 disk blocks used: CPU 194.02s/1635.12u sec elapsed 2435.46 sec

The patched one ends with a 2-way merge, two sequential 233-way merges, and a 233-way final merge:

LOG: finished 2-way merge step: CPU 62.08s/435.70u sec elapsed 587.52 sec
LOG: finished 233-way merge step: CPU 77.94s/660.11u sec elapsed 897.51 sec
LOG: a multi-pass external merge sort is required (234 tape maximum)
HINT: Consider increasing the configuration parameter "maintenance_work_mem".
LOG: finished 233-way merge step: CPU 94.55s/884.63u sec elapsed 1185.17 sec
LOG: performsort done (except 233-way final merge): CPU 94.76s/884.69u sec elapsed 1192.01 sec
LOG: external sort ended, 2541656 disk blocks used: CPU 202.65s/1771.50u sec elapsed 3963.90 sec

If you just look at the final merges of each, they should have the same number of tuples going through them (i.e. all of the tuples), yet the patched one took well over twice as long, and all of that extra time was IO time, not CPU time. I reversed out the memory pooling patch, and that shaved some time off, but nowhere near enough to bring it back to parity.

I think what is going on here is that the different number of runs produced by the patched code just makes it land in an anti-sweet spot of the tape emulation and buffering algorithm. Each tape gets 256kB of buffer, but two of the tapes hold one third of the tuples each, while the remaining third is spread almost equally over all the other tapes (or maybe one tape has 2/3 of the tuples, if the output of one 233-way non-final merge was selected as the input of the other one). By the time the large tape(s) has depleted its buffer, each of the others has had only slightly more than 1kB depleted.
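To spell out the arithmetic behind that ~1kB figure, here is a rough back-of-envelope check, assuming two big input tapes holding about a third of the tuples apiece and the remaining third split evenly across the other 231 of the 233 input tapes (those shares are my reading of the situation, not measured numbers):

-- illustration only; tape count and per-tape tuple shares are assumptions
SELECT round(256.0 * ((1.0/3) / 231) / (1.0/3), 2)
       AS kb_drained_from_each_small_tape;
-- ≈ 1.11 kB drained from each small tape for every 256kB drained from a big one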
Yet when it goes to refill the large tape, it also tops off every other tape while it is there, which is not going to get much read-ahead performance on them, leading to a lot of random IO.

Now, I'm not sure why this same logic wouldn't apply to the unpatched code with its 118-way merge too, so maybe I am all wet here. It seems like that imbalance would be enough to also cause the problem. I have seen this same type of thing years ago, but was never able to analyze it to my satisfaction (and I haven't been able to now, either).

So if this patch with this exact workload just happens to land on a pre-existing infelicity, how big of a deal is that? It wouldn't be creating a regression, just shoving the region that experiences the problem around in such a way that it affects a different group of use cases. And perhaps more importantly, can anyone else reproduce this, or understand it?

Cheers,

Jeff