Re: Abbreviated keys for text cost model fix - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: Abbreviated keys for text cost model fix
Msg-id: 54EB580C.2000904@2ndquadrant.com
In response to: Re: Abbreviated keys for text cost model fix (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers
Hi,

On 22.2.2015 22:30, Peter Geoghegan wrote:
> On Sun, Feb 22, 2015 at 1:19 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> In short, this fixes all the cases except for the ASC sorted data. I
>> haven't done any code review, but I think we want this.
>>
>> I'll use data from the i5-2500k, but it applies to the Xeon too, except
>> that the Xeon results are more noisy and the speedups are not that
>> significant.
>>
>> For the 'text' data type, and 'random' dataset, the results are these:
>>
>>     scale      datum   cost-model
>>     -------------------------------
>>      100000     328%         323%
>>     1000000     392%         391%
>>     2000000      96%         565%
>>     3000000      97%         572%
>>     4000000      97%         571%
>>     5000000      98%         570%
>>
>> The numbers are speedup vs. master, so 100% means exactly the same
>> speed, 200% means twice as fast.
>>
>> So while with 'datum' patch this actually caused very nice speedup for
>> small datasets - about 3-4x speedup up to 1M rows, for larger datasets
>> we've seen small regression (~3% slower). With the cost model fix, we
>> actually see a significant speedup (about 5.7x) for these cases.
>
> Cool.
>
>> I haven't verified whether this produces the same results, but if it
>> does this is very nice.
>>
>> For 'DESC' dataset (i.e. data sorted in reverse order), we do get even
>> better numbers, with up to 6.5x speedup on large datasets.
>>
>> But for 'ASC' dataset (i.e. already sorted data), we do get this:
>>
>>     scale      datum   cost-model
>>     -------------------------------
>>      100000      85%          84%
>>     1000000      87%          87%
>>     2000000      76%          96%
>>     3000000      82%          90%
>>     4000000      91%          83%
>>     5000000      93%          81%
>>
>> Ummm, not that great, I guess :-(
>
> You should try it with the data fully sorted like this, but with one
> tiny difference: The very last tuple is out of order. How does that
> look?

So here are the results for the ASC-ordered dataset, with one 'unsorted'
row added to the end of the dataset.
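Just to make the convention in the tables above explicit, the percentages are the ratio of master's duration to the patched build's duration; a toy sketch of that arithmetic (the function name is mine, not from the benchmark scripts):

```python
def speedup_pct(master_seconds, patched_seconds):
    """Speedup vs. master as used in the tables: 100% means exactly
    the same speed, 200% means the patched build is twice as fast."""
    return 100.0 * master_seconds / patched_seconds

# e.g. if master took 5.7s and the cost-model build 1.0s, that is the
# ~5.7x speedup mentioned above:
print(round(speedup_pct(5.7, 1.0)))
```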
As before, the complete scripts are attached, and the raw results are
available in a spreadsheet: http://bit.ly/18g1nTU

The durations are much higher than without the single unsorted row added
at the end. Queries often take 20x longer to finish (on the same code),
depending on the scale.

The speedup results (compared to master) look like this:

    scale     query#    datum   numeric   cost model
    -------------------------------------------------
     100000   1          859%     861%       856%
     100000   2          811%     814%       805%
     100000   3          100%     100%        97%
    1000000   1          805%     804%       807%
    1000000   2          769%     773%       770%
    1000000   3          100%     100%        98%
    2000000   1           97%      97%       673%
    2000000   2           96%      97%       646%
    2000000   3           99%     101%       678%
    3000000   1           98%      98%       578%
    3000000   2           96%      97%       557%
    3000000   3           99%     101%       579%
    4000000   1           99%      99%       513%
    4000000   2           97%      98%       497%
    4000000   3           99%     101%       510%
    5000000   1           99%      99%       469%
    5000000   2           97%      98%       456%
    5000000   3           99%     101%       466%

What's interesting here is that some queries are much faster, but query
#3 is slow until we hit 2M rows:

    select * from (select * from stuff_int_desc
                    order by randint offset 100000000000) foo

Looking at the previous tests, I see this is exactly what's happening to
this query with the 'random' dataset - it's slightly slower than master
up until 2M rows, when it suddenly jumps to the same speedup as the
other queries. Can we do something about that?

Anyway, I'm wondering what conclusion we can draw from this? I believe
the vast majority of datasets in production won't be perfectly sorted,
because when the table is CLUSTERed by an index we tend to use an index
scan to do the sort (so no problem), or the data are not actually
perfectly sorted (and then we get a significant speedup).

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
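The "one out-of-order row at the end" construction Peter suggested is interesting precisely because a single misplaced tuple defeats any already-sorted fast path: the whole input has to be scanned before the check fails. A toy sketch of the idea (not PostgreSQL code, just an illustration of why the nearly-sorted case behaves so differently from the fully sorted one):

```python
def is_sorted(xs):
    """The kind of pre-sorted check a sort implementation can use as a
    fast path: succeeds only if every adjacent pair is in order."""
    return all(a <= b for a, b in zip(xs, xs[1:]))

asc = list(range(1000))
print(is_sorted(asc))            # fully sorted: the fast path applies

asc_plus_one = asc + [-1]        # one 'unsorted' row appended at the end
print(is_sorted(asc_plus_one))   # the fast path no longer applies; the
                                 # input must be sorted in full
```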