pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H |
Date | |
Msg-id | 5581B3EE.1070907@2ndquadrant.com |
Responses | Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H |
| Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H |
| Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H |
List | pgsql-hackers |
Hi,

I'm currently running some tests on a 3TB TPC-H data set, and I tripped over a pretty bad n_distinct underestimate, causing OOM in HashAgg (which nicely illustrates the importance of the memory-bounded hashagg patch Jeff Davis is working on).

The problem is Q18, particularly this simple subquery:

    select l_orderkey
    from lineitem
    group by l_orderkey
    having sum(l_quantity) > 313;

which is planned like this:

                                    QUERY PLAN
 ---------------------------------------------------------------------------------
  HashAggregate  (cost=598510163.92..598515393.93 rows=418401 width=12)
    Group Key: l_orderkey
    Filter: (sum(l_quantity) > '313'::double precision)
    ->  Seq Scan on lineitem  (cost=0.00..508509923.28 rows=18000048128 width=12)
 (4 rows)

but sadly, in reality the l_orderkey cardinality looks like this:

    tpch=# select count(distinct l_orderkey) from lineitem;
       count
    ------------
     4500000000
    (1 row)

That's a helluva difference - not the usual one or two orders of magnitude, but a 10000x underestimate.

The usual thing to do in this case is increasing the statistics target, and while this improves the estimate, the improvement is rather small:

    statistics target     estimate      difference
    --------------------------------------------------
    100                     429491           10000
    1000                   4240418            1000
    10000                 42913759             100

I find the pattern rather strange - every time the statistics target increases 10x, the difference decreases 10x - maybe that's natural, but the perfect proportionality is suspicious IMHO.

Also, this is quite a large dataset - the table has ~18 billion rows, and even with target=10000 we're sampling only 3M rows, which is ~0.02%. That's a tiny sample, so some inaccuracy is naturally expected, but OTOH the TPC-H dataset is damn uniform - there's pretty much no skew in the distributions AFAIK. So I'd expect a slightly better result.

With target=10000 the plan switches to GroupAggregate, because the estimate becomes large enough for the expected hash table to exceed work_mem (2GB). But it's still way off, and it's mostly just a lucky coincidence.

So I'm wondering if there's some bug because of the dataset size (an integer overflow or something like that), so I added a bunch of logging into the estimator, logging all the parameters computed:

target=100 (samplerows=30000)
-----------------------------
WARNING:  attnum=1 attname=l_orderkey f1=27976 ndistinct=28977 nmultiple=1001 toowide_cnt=0 d=28977 numer=869310000.000000 denom=2024.046627 stadistinct=429491.094029
WARNING:  ndistinct estimate attnum=1 attname=l_orderkey current=429491.09 adaptive=443730.00

target=1000 (samplerows=300000)
-------------------------------
WARNING:  attnum=1 attname=l_orderkey f1=279513 ndistinct=289644 nmultiple=10131 toowide_cnt=0 d=289644 numer=86893200000.000000 denom=20491.658538 stadistinct=4240418.111618
WARNING:  ndistinct estimate attnum=1 attname=l_orderkey current=4240418.11 adaptive=4375171.00

target=10000 (samplerows=3000000)
---------------------------------
WARNING:  attnum=1 attname=l_orderkey f1=2797888 ndistinct=2897799 nmultiple=99911 toowide_cnt=0 d=2897799 numer=8693397000000.000000 denom=202578.313396 stadistinct=42913759.396282
WARNING:  ndistinct estimate attnum=1 attname=l_orderkey current=42913759.40 adaptive=44449882.00

It's totalrows=18000049031 in all cases. The logs also show the estimate produced by the adaptive estimator (discussed in a separate thread), but apparently that does not change the estimates much :-(

Any ideas?

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
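For reference, the logged numer/denom/stadistinct values are consistent with the Haas-Stokes (Duj1) formula used by the n_distinct estimator in analyze.c, stadistinct = (samplerows * d) / ((samplerows - f1) + f1 * samplerows / totalrows). Below is a minimal standalone C sketch - not the actual PostgreSQL code, with variable names borrowed from the WARNING lines above - that plugs in the target=100 numbers:

    /*
     * Minimal sketch (not the actual PostgreSQL code) of the Haas-Stokes
     * (Duj1) n_distinct computation, using the target=100 inputs from the
     * WARNING lines above.
     */
    #include <stdio.h>

    int main(void)
    {
        double samplerows = 30000.0;        /* rows sampled at target=100 */
        double totalrows  = 18000049031.0;  /* estimated rows in lineitem */
        double d          = 28977.0;        /* distinct values in the sample */
        double f1         = 27976.0;        /* values seen exactly once */

        double numer = samplerows * d;
        double denom = (samplerows - f1) + f1 * samplerows / totalrows;
        double stadistinct = numer / denom;

        /* expected: numer=869310000.000000 denom=2024.046627 stadistinct~=429491.09 */
        printf("numer=%f denom=%f stadistinct=%f\n", numer, denom, stadistinct);
        return 0;
    }

In all three runs the denom is dominated by the (samplerows - f1) term, and the logs show both d and (samplerows - f1) growing roughly 10x with each 10x larger sample, so stadistinct also grows roughly 10x - which would account for the suspiciously exact proportionality noted in the table above.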