From: Andrew Gierth
Subject: Re: Abbreviated keys for Numeric
Msg-id: 87h9tcbkzn.fsf@news-spur.riddles.org.uk
In response to: Re: Abbreviated keys for Numeric (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers

>>>>> "Peter" == Peter Geoghegan <pg@heroku.com> writes: Peter> I don't think it's that simple. The actual point of abbreviationPeter> is to amortize the total cost of performingO(n log n)Peter> comparisons (the average case, which is usually assumed byPeter> infrastructure like sortsupport),by performing an encodingPeter> process n times. That encoding process may be more expensivePeter> than eachof the comparisons. Sometimes significantly morePeter> expensive. But the cost of the encoding depends only on the number of non-null values. Again, the nulls turn out to be irrelevant. Peter> Forget about abbreviation entirely for a minute - consider plainPeter> old sorting of strxfrm() blobs, which for examplethe glibcPeter> documentation recommends (with some caveats) as the preferredPeter> approach to sorting strings whilerespecting collation. SupposePeter> that your applications happens to frequently encounter fullyPeter> sorted arraysof strings when passed an array of strings toPeter> sort. If you were using our qsort() implementation, which,Peter>rather questionably, has a pre-check for fully sorted inputPeter> (that throws away work when the pre-check fails),then you'llPeter> only have n comparisons. Even if an encoding is only asPeter> expensive as an actual comparison,the potential to lose withPeter> encoding clearly exists. Similarly, we don't abbreviate forPeter> top-n heapsorts, even though all of those abbreviated keys mayPeter> be used for at least one comparison. Nothing in the above paragraph has the slightest relevance to even the text abbreviation code, much less the numeric one. Peter> Now, I hear what you're saying about weighing the costs and thePeter> benefits -- you point out that abbreviationof NULL values isPeter> not a cost, which is certainly true. It's not a cost because it DOESN'T HAPPEN. The abbreviation function is not called for null inputs. Peter> But if the total number of NULL values is so high that theyPeter> dominate, then abbreviation might well be the wrongthing forPeter> that reason alone. No. This is the point that you're getting wrong. The amount of time saved by the abbreviation code depends ONLY on the number and cardinality of the NON-NULL values. Also, the time _lost_ by the abbreviation code if we fail to abort in pathological cases STILL depends only on the number of non-null values, so there's no benefit in aborting even in a case when we have 1000 equal non-null values mixed in with tens of millions of nulls. Let me give you an actual example: create table numtest2 as select floor(random() * 200)::numeric as a from generate_series(1,1000000); create table numtest3 as select a from (select floor(random() * 200)::numeric as a from generate_series(1,1000000) union all select null from generate_series(1,4000000)) s order byrandom(); So one table has a million non-null rows with 200 distinct values, and the other has the same plus 4 million null rows. Using my code, which ignores nulls in the cost model: select * from numtest2 order by a offset 100000000; -- 1.5 seconds select * from numtest3 order by a offset 100000000; -- 3.6 seconds With abbreviation disabled entirely: select * from numtest2 order by a offset 100000000; -- 4.5 seconds select * from numtest3 order by a offset 100000000; -- 6.8 seconds Notice that while proportionally the speedup is only <2x rather than 3x for the version with nulls, the absolute difference in time is the same in both cases - abbreviation saved us ~3 seconds in both cases. 
Your version would have aborted abbreviation on the numtest3 query,
thus losing a 3 second speedup. How on earth is that supposed to be
justified? It's not like there's any realistically possible case where
your version performs faster than mine by more than a tiny margin.

 Peter> Where I think your argument is stronger is around the cost of
 Peter> actually aborting in particular (even though not much work is
 Peter> thrown away). Scanning through the memtuples array once more
 Peter> certainly isn't free.

Yes, the cost of actually aborting abbreviation goes up with the number
of nulls. But your version of the code makes that WORSE, by making it
more likely that we will abort unnecessarily. If we use the raw value
of memtupcount for anything, it should be to make it LESS likely that
we abort abbreviation (since we'd be paying a higher cost), not more.
But benchmarking doesn't suggest that this is necessary, at least not
for numerics.

 Peter> And you could take the view that it's always worth the risk,
 Peter> since it's at most a marginal cost saved if the first ~10k
 Peter> tuples are representative, but a large cost incurred when they
 Peter> are not. So on reflection, I am inclined to put the check for
 Peter> non-null values back in.

Right result but the wrong reasoning.

 Peter> I also concede that the cost of abbreviation is lower with your
 Peter> numeric encoding scheme than it is for that of the text opclass,
 Peter> which favors this approach.

If you do some benchmarks on text columns, you'll find that this is a
win there too.
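For instance, a text analogue of the numeric test above (illustrative
only; the table name is arbitrary, and md5() is just a cheap way to
generate 200 distinct strings of the same cardinality as the numeric
case):

create table texttest as
  select md5(floor(random() * 200)::text) as a
    from generate_series(1,1000000);

select * from texttest order by a offset 100000000;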
-- 
Andrew (irc:RhodiumToad)