Re: strange buildfarm failure on lionfish - Mailing list pgsql-hackers
| From | Tom Lane |
|---|---|
| Subject | Re: strange buildfarm failure on lionfish |
| Date | |
| Msg-id | 13477.1185299054@sss.pgh.pa.us |
| In response to | Re: strange buildfarm failure on lionfish (Gregory Stark <stark@enterprisedb.com>) |
| Responses | Re: strange buildfarm failure on lionfish |
| List | pgsql-hackers |
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> I saw what I think was the identical failure last night on my own
>> machine, but it wasn't repeatable. Evidently the planner is changing to
>> a different plan for those queries, but why has this only started
>> recently? Maybe the recent changes to autovacuum defaults are causing
>> autovac to hit these tables when it never did before?

> Indeed the only alternate plan I can imagine for this is to do the join the
> other way around. And given the large difference in sizes between the two
> tables the only way I could get that to happen was by obliterating the
> statistics entirely for one table but having stats for the other.

Yeah, I turned off autovac and let the tests run through 'horology',
then stopped and looked at the plans for these queries. For the first
one, what you get with no ANALYZE having been done is

```sql
explain SELECT t.d1 + i.f1 AS "102"
   FROM TIMESTAMP_TBL t, INTERVAL_TBL i
   WHERE t.d1 BETWEEN '1990-01-01' AND '2001-01-01'
     AND i.f1 BETWEEN '00:00' AND '23:00';
```

```
                                 QUERY PLAN
------------------------------------------------------------------------------
 Nested Loop  (cost=36.56..80.89 rows=99 width=24)
   ->  Seq Scan on timestamp_tbl t  (cost=0.00..42.10 rows=11 width=8)
         Filter: ((d1 >= '1990-01-01 00:00:00'::timestamp without time zone)
                  AND (d1 <= '2001-01-01 00:00:00'::timestamp without time zone))
   ->  Materialize  (cost=36.56..36.65 rows=9 width=16)
         ->  Seq Scan on interval_tbl i  (cost=0.00..36.55 rows=9 width=16)
               Filter: ((f1 >= '00:00:00'::interval) AND (f1 <= '23:00:00'::interval))
```

If timestamp_tbl is then ANALYZEd, the plan flips around to put i on the
outside of the nestloop (because the estimate of the number of matching
rows rises to 49, which is pretty good because the actual is 52).
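The flip described above can be reproduced by hand against a scratch copy of the regression database — a sketch, assuming the regression tables are loaded and autovacuum is disabled so the statistics only change when you ANALYZE explicitly:

```sql
-- With no stats, timestamp_tbl is estimated at 11 rows and ends up
-- on the outside of the nestloop (plan shown above).

ANALYZE timestamp_tbl;        -- estimate for t rises to ~49 rows (actual: 52)

EXPLAIN SELECT t.d1 + i.f1 AS "102"
   FROM TIMESTAMP_TBL t, INTERVAL_TBL i
   WHERE t.d1 BETWEEN '1990-01-01' AND '2001-01-01'
     AND i.f1 BETWEEN '00:00' AND '23:00';
-- expect interval_tbl on the outside of the nestloop now

ANALYZE interval_tbl;         -- estimate for i drops to ~2 rows

EXPLAIN SELECT t.d1 + i.f1 AS "102"
   FROM TIMESTAMP_TBL t, INTERVAL_TBL i
   WHERE t.d1 BETWEEN '1990-01-01' AND '2001-01-01'
     AND i.f1 BETWEEN '00:00' AND '23:00';
-- expect interval_tbl back on the inside, regardless of whether
-- timestamp_tbl has been analyzed
```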
OTOH, if interval_tbl is ANALYZEd, the estimate for it drops to 2 rows
(again a much better match to reality) and we go back to preferring i on
the inside, with or without timestamp_tbl having been analyzed.

And, at least in the serial-schedule case, the stats at this point look like

```
 relid  | schemaname |    relname    | seq_scan | seq_tup_read | idx_scan | idx_tup_fetch | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup | last_vacuum | last_autovacuum | last_analyze | last_autoanalyze
--------+------------+---------------+----------+--------------+----------+---------------+-----------+-----------+-----------+------------+------------+-------------+-----------------+--------------+------------------
 132885 | public     | interval_tbl  |       22 |          210 |          |               |        10 |         0 |         0 |         10 |          0 |             |                 |              |
 132879 | public     | timestamp_tbl |       45 |         2444 |          |               |        74 |         0 |         8 |         66 |          8 |             |                 |              |
```

So yesterday's change to reduce the analyze threshold to 50 means that
timestamp_tbl is now vulnerable to being asynchronously analyzed while
the tests run.

While I don't have any very strong objection to putting an ORDER BY on
these particular queries, I'm worried about how many other regression
tests will now start showing random failures. We have an awful lot of
small tables in the tests ...

			regards, tom lane
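The ORDER BY workaround mentioned above would look something like the following — a sketch only; pinning the output ordering makes the expected regression output independent of which join order the planner picks, and sorting on the single output column is an assumption about what ordering the test would want:

```sql
SELECT t.d1 + i.f1 AS "102"
   FROM TIMESTAMP_TBL t, INTERVAL_TBL i
   WHERE t.d1 BETWEEN '1990-01-01' AND '2001-01-01'
     AND i.f1 BETWEEN '00:00' AND '23:00'
   ORDER BY "102";
```

Note the arithmetic behind the concern: per the stats above, timestamp_tbl has seen 74 inserts plus 8 deletes (82 tuple changes), well past the new threshold of 50, while interval_tbl's 10 changes stay safely under it — so only timestamp_tbl is at risk of a mid-run auto-analyze.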