From: Andrew Gierth
Subject: Re: Abbreviated keys for Numeric
Msg-id: 87h9tcbkzn.fsf@news-spur.riddles.org.uk
In response to: Re: Abbreviated keys for Numeric (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers

>>>>> "Peter" == Peter Geoghegan <pg@heroku.com> writes: Peter> I don't think it's that simple. The actual point of abbreviationPeter> is to amortize the total cost of performingO(n log n)Peter> comparisons (the average case, which is usually assumed byPeter> infrastructure like sortsupport),by performing an encodingPeter> process n times. That encoding process may be more expensivePeter> than eachof the comparisons. Sometimes significantly morePeter> expensive. But the cost of the encoding depends only on the number of non-null values. Again, the nulls turn out to be irrelevant. Peter> Forget about abbreviation entirely for a minute - consider plainPeter> old sorting of strxfrm() blobs, which for examplethe glibcPeter> documentation recommends (with some caveats) as the preferredPeter> approach to sorting strings whilerespecting collation. SupposePeter> that your applications happens to frequently encounter fullyPeter> sorted arraysof strings when passed an array of strings toPeter> sort. If you were using our qsort() implementation, which,Peter>rather questionably, has a pre-check for fully sorted inputPeter> (that throws away work when the pre-check fails),then you'llPeter> only have n comparisons. Even if an encoding is only asPeter> expensive as an actual comparison,the potential to lose withPeter> encoding clearly exists. Similarly, we don't abbreviate forPeter> top-n heapsorts, even though all of those abbreviated keys mayPeter> be used for at least one comparison. Nothing in the above paragraph has the slightest relevance to even the text abbreviation code, much less the numeric one. Peter> Now, I hear what you're saying about weighing the costs and thePeter> benefits -- you point out that abbreviationof NULL values isPeter> not a cost, which is certainly true. It's not a cost because it DOESN'T HAPPEN. The abbreviation function is not called for null inputs. Peter> But if the total number of NULL values is so high that theyPeter> dominate, then abbreviation might well be the wrongthing forPeter> that reason alone. No. This is the point that you're getting wrong. The amount of time saved by the abbreviation code depends ONLY on the number and cardinality of the NON-NULL values. Also, the time _lost_ by the abbreviation code if we fail to abort in pathological cases STILL depends only on the number of non-null values, so there's no benefit in aborting even in a case when we have 1000 equal non-null values mixed in with tens of millions of nulls. Let me give you an actual example: create table numtest2 as select floor(random() * 200)::numeric as a from generate_series(1,1000000); create table numtest3 as select a from (select floor(random() * 200)::numeric as a from generate_series(1,1000000) union all select null from generate_series(1,4000000)) s order byrandom(); So one table has a million non-null rows with 200 distinct values, and the other has the same plus 4 million null rows. Using my code, which ignores nulls in the cost model: select * from numtest2 order by a offset 100000000; -- 1.5 seconds select * from numtest3 order by a offset 100000000; -- 3.6 seconds With abbreviation disabled entirely: select * from numtest2 order by a offset 100000000; -- 4.5 seconds select * from numtest3 order by a offset 100000000; -- 6.8 seconds Notice that while proportionally the speedup is only <2x rather than 3x for the version with nulls, the absolute difference in time is the same in both cases - abbreviation saved us ~3 seconds in both cases. 
Your version would have aborted abbreviation on the numtest3 query,
thus losing a 3 second speedup. How on earth is that supposed to be
justified? It's not like there's any realistically possible case where
your version performs faster than mine by more than a tiny margin.

 Peter> Where I think your argument is stronger is around the cost of
 Peter> actually aborting in particular (even though not much work is
 Peter> thrown away). Scanning through the memtuples array once more
 Peter> certainly isn't free.

Yes, the cost of actually aborting abbreviation goes up with the number
of nulls. But your version of the code makes that WORSE, by making it
more likely that we will abort unnecessarily. If we use the raw value
of memtupcount for anything, it should be to make it LESS likely that
we abort abbreviation (since we'd be paying a higher cost), not more.
But benchmarking doesn't suggest that this is necessary, at least not
for numerics.

 Peter> And you could take the view that it's always worth the risk,
 Peter> since it's at most a marginal cost saved if the first ~10k
 Peter> tuples are representative, but a large cost incurred when they
 Peter> are not. So on reflection, I am inclined to put the check for
 Peter> non-null values back in.

Right result but the wrong reasoning.

 Peter> I also concede that the cost of abbreviation is lower with your
 Peter> numeric encoding scheme than it is for that of the text opclass,
 Peter> which favors this approach.

If you do some benchmarks on text columns, you'll find that this is a
win there too.
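For instance, a text analogue of the numeric test above (illustrative
only; the table name is arbitrary, and md5() is just a cheap way to
generate 200 distinct strings of the same cardinality as the numeric
case):

create table texttest as
  select md5(floor(random() * 200)::text) as a
    from generate_series(1,1000000);

select * from texttest order by a offset 100000000;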
-- 
Andrew (irc:RhodiumToad)