Thread: master check fails on Windows Server 2008
Hello, hackers! I got a permanent failure of master (commit 2a41507dab0f293ff241fe8ae326065998668af8) check on Windows Server 2008. Regression output and diffs as well as config.pl are attached. I used the following commands: build.bat > build.txt vcregress.bat check > check.txt Binary search has shown that this failure begins with commit bed9ef5a16239d91d97a1fa2efd9309c3cbbc4b2 (Rework the stats_ext test). On the previous commit (70ec3f1f8f0b753c38a1a582280a02930d7cac5f) the check passes. I'm trying to figure out what went wrong, and any suspicions are welcome. About the system: Windows Server 2008 R2 Standard, Service Pack 1, 64-bit. -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Marina Polyakova <m.polyakova@postgrespro.ru> writes: > Hello, hackers! I got a permanent failure of master (commit > 2a41507dab0f293ff241fe8ae326065998668af8) check on Windows Server 2008. > Regression output and diffs as well as config.pl are attached. Weird. AFAICS the cost estimates for those two plans should be quite different, so this isn't just a matter of the estimates maybe being a bit platform-dependent. (And that test has been there nearly a year without causing reported problems.) To dig into it a bit more, I tweaked the test case to show the costs for both plans, and got an output diff as attached. Could you try the same experiment on your Windows box? In order to force the choice in the other direction, you'd need to temporarily disable enable_sort, not enable_hashagg as I did here, but the principle is the same. regards, tom lane diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql index 46acaad..1082660 100644 *** a/src/test/regress/sql/stats_ext.sql --- b/src/test/regress/sql/stats_ext.sql *************** EXPLAIN (COSTS off) *** 177,184 **** EXPLAIN (COSTS off) SELECT COUNT(*) FROM ndistinct GROUP BY a, b, c, d; ! EXPLAIN (COSTS off) SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d; EXPLAIN (COSTS off) SELECT COUNT(*) FROM ndistinct GROUP BY a, d; --- 177,188 ---- EXPLAIN (COSTS off) SELECT COUNT(*) FROM ndistinct GROUP BY a, b, c, d; ! EXPLAIN --(COSTS off) ! SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d; ! set enable_hashagg = 0; ! EXPLAIN --(COSTS off) SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d; + reset enable_hashagg; EXPLAIN (COSTS off) SELECT COUNT(*) FROM ndistinct GROUP BY a, d; *** /home/postgres/pgsql/src/test/regress/expected/stats_ext.out Mon Feb 12 14:53:46 2018 --- /home/postgres/pgsql/src/test/regress/results/stats_ext.out Fri Feb 16 11:23:11 2018 *************** *** 309,323 **** -> Seq Scan on ndistinct (5 rows) ! EXPLAIN (COSTS off) SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d; ! QUERY PLAN ! ----------------------------- ! HashAggregate Group Key: b, c, d ! -> Seq Scan on ndistinct (3 rows) EXPLAIN (COSTS off) SELECT COUNT(*) FROM ndistinct GROUP BY a, d; QUERY PLAN --- 309,336 ---- -> Seq Scan on ndistinct (5 rows) ! EXPLAIN --(COSTS off) SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d; ! QUERY PLAN ! ---------------------------------------------------------------------- ! HashAggregate (cost=291.00..307.32 rows=1632 width=20) Group Key: b, c, d ! -> Seq Scan on ndistinct (cost=0.00..191.00 rows=10000 width=12) (3 rows) + set enable_hashagg = 0; + EXPLAIN --(COSTS off) + SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d; + QUERY PLAN + ---------------------------------------------------------------------------- + GroupAggregate (cost=1026.89..1168.21 rows=1632 width=20) + Group Key: b, c, d + -> Sort (cost=1026.89..1051.89 rows=10000 width=12) + Sort Key: b, c, d + -> Seq Scan on ndistinct (cost=0.00..191.00 rows=10000 width=12) + (5 rows) + + reset enable_hashagg; EXPLAIN (COSTS off) SELECT COUNT(*) FROM ndistinct GROUP BY a, d; QUERY PLAN ======================================================================
On 16-02-2018 19:31, Tom Lane wrote: > Marina Polyakova <m.polyakova@postgrespro.ru> writes: >> Hello, hackers! I got a permanent failure of master (commit >> 2a41507dab0f293ff241fe8ae326065998668af8) check on Windows Server >> 2008. >> Regression output and diffs as well as config.pl are attached. > > Weird. AFAICS the cost estimates for those two plans should be quite > different, so this isn't just a matter of the estimates maybe being > a bit platform-dependent. (And that test has been there nearly a > year without causing reported problems.) > > To dig into it a bit more, I tweaked the test case to show the costs > for both plans, and got an output diff as attached. Could you try > the same experiment on your Windows box? In order to force the choice > in the other direction, you'd need to temporarily disable enable_sort, > not enable_hashagg as I did here, but the principle is the same. Thank you very much! Your test showed that hash aggregation was not even added to the possible paths (see windows_regression.diffs attached). Exploring this, I found that not allowing float8 to pass by value in config.pl was crucial for the size of the hash table used in this query (see diff.patch attached): From postmaster.log on Windows: 2018-02-17 20:50:48.596 MSK [1592] pg_regress/stats_ext STATEMENT: EXPLAIN SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d; 2018-02-17 20:50:48.596 MSK [1592] pg_regress/stats_ext LOG: rewritten parse tree: ... 2018-02-17 20:50:48.596 MSK [1592] pg_regress/stats_ext STATEMENT: EXPLAIN SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d; # 20 = INT8OID => pg_type.typbyval = FLOAT8PASSBYVAL: get_agg_clause_costs_walker aggtranstype 20 get_typbyval(aggtranstype) 0 get_agg_clause_costs_walker avgwidth 8 sizeof(void *) 8 costs->transitionSpace 24 # add AGG_SORTED path: add_paths_to_grouping_rel 1 create_agg_path (aggstrategy 1) estimate_hashagg_tablesize 1 hashentrysize 32 # add transitionSpace = 24: estimate_hashagg_tablesize 2 hashentrysize 56 estimate_hashagg_tablesize 3 hashentrysize 96 estimate_hashagg_tablesize dNumGroups 1632.000000 # 156672 = 96 * 1632 > 131072: add_paths_to_grouping_rel hashaggtablesize 156672 work_mem 128 work_mem * 1024L 131072 grouped_rel->pathlist == NIL 0 2018-02-17 20:50:48.596 MSK [1592] pg_regress/stats_ext LOG: plan: ... From postmaster.log on my computer (allow float8 to pass by value): 2018-02-17 20:45:57.651 MSK [3012] pg_regress/stats_ext STATEMENT: EXPLAIN SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d; 2018-02-17 20:45:57.651 MSK [3012] pg_regress/stats_ext LOG: rewritten parse tree: ... 2018-02-17 20:45:57.651 MSK [3012] pg_regress/stats_ext STATEMENT: EXPLAIN SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d; # 20 = INT8OID => pg_type.typbyval = FLOAT8PASSBYVAL: get_agg_clause_costs_walker aggtranstype 20 get_typbyval(aggtranstype) 1 # add AGG_SORTED path: add_paths_to_grouping_rel 1 create_agg_path (aggstrategy 1) estimate_hashagg_tablesize 1 hashentrysize 32 # add transitionSpace = 0: estimate_hashagg_tablesize 2 hashentrysize 32 estimate_hashagg_tablesize 3 hashentrysize 72 estimate_hashagg_tablesize dNumGroups 1632.000000 # 117504 = 72 * 1632 < 131072: add_paths_to_grouping_rel hashaggtablesize 117504 work_mem 128 work_mem * 1024L 131072 grouped_rel->pathlist == NIL 0 # add AGG_HASHED path: add_paths_to_grouping_rel 2 create_agg_path (aggstrategy 2) 2018-02-17 20:45:57.651 MSK [3012] pg_regress/stats_ext LOG: plan: ... -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Attachment
Marina Polyakova <m.polyakova@postgrespro.ru> writes: > On 16-02-2018 19:31, Tom Lane wrote: >> Weird. AFAICS the cost estimates for those two plans should be quite >> different, so this isn't just a matter of the estimates maybe being >> a bit platform-dependent. (And that test has been there nearly a >> year without causing reported problems.) > Thank you very much! Your test showed that hash aggregation was not even > added to the possible paths (see windows_regression.diffs attached). > Exploring this, I found that not allowing float8 to pass by value in > config.pl was crucial for the size of the hash table used in this query > (see diff.patch attached): Ah-hah. I can reproduce the described failure if I configure with --disable-float8-byval on an otherwise 64-bit machine. It appears that the minimum work_mem setting that will allow this query to use a hashagg plan on such a configuration is about 155kB (which squares with the results you show). On the other hand, in a normal 64-bit configuration, work_mem settings of 160kB or more cause other failures (due to other plans switching to hashagg), and on a 32-bit machine I see such failures with work_mem of 150kB or more. So there's basically no setting of work_mem that would make all these tests pass everywhere. I see several things we could do about this: 1. Nothing; just say "sorry, we don't promise that the regression tests pass with no plan differences on nonstandard configurations". Given that --disable-float8-byval has hardly any real-world use, there is not a lot of downside to that. 2. Decide that --disable-float8-byval, and for that matter --disable-float4-byval, have no real-world use at all and take them out. There was some point in those options back when we cared about binary compatibility with version-zero C functions, but now I'm not sure why anyone would use them. 3. Drop that one test case from stats_ext.sql; I'm not sure how much additional test value it's actually bringing. 4. Try to tweak the stats_ext.sql test conditions in some more refined way to get the test to pass everywhere. This'd be a lot of work with no guarantee of success, so I'm not too excited about it. 5. Something else? regards, tom lane
On Mon, Feb 19, 2018 at 4:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I see several things we could do about this: > > 1. Nothing; just say "sorry, we don't promise that the regression tests > pass with no plan differences on nonstandard configurations". Given that > --disable-float8-byval has hardly any real-world use, there is not a lot > of downside to that. That would make sense to me. It will also be necessary to formalize what "nonstandard configuration" actually means if we go this way, of course. I believe that this is already true with "dynamic_shared_memory_type == DSM_IMPL_NONE", so that's a second entry for the "nonstandard configuration" list. -- Peter Geoghegan
On 20-02-2018 3:37, Tom Lane wrote: > Ah-hah. I can reproduce the described failure if I configure with > --disable-float8-byval on an otherwise 64-bit machine. It appears that > the minimum work_mem setting that will allow this query to use a > hashagg > plan on such a configuration is about 155kB (which squares with the > results you show). On the other hand, in a normal 64-bit > configuration, > work_mem settings of 160kB or more cause other failures (due to other > plans switching to hashagg), and on a 32-bit machine I see such > failures > with work_mem of 150kB or more. So there's basically no setting of > work_mem that would make all these tests pass everywhere. > > I see several things we could do about this: > > 1. Nothing; just say "sorry, we don't promise that the regression tests > pass with no plan differences on nonstandard configurations". Given > that > --disable-float8-byval has hardly any real-world use, there is not a > lot > of downside to that. > > 2. Decide that --disable-float8-byval, and for that matter > --disable-float4-byval, have no real-world use at all and take them > out. > There was some point in those options back when we cared about binary > compatibility with version-zero C functions, but now I'm not sure why > anyone would use them. > > 3. Drop that one test case from stats_ext.sql; I'm not sure how much > additional test value it's actually bringing. > > 4. Try to tweak the stats_ext.sql test conditions in some more refined > way to get the test to pass everywhere. This'd be a lot of work with > no guarantee of success, so I'm not too excited about it. Thank you for your explanations! I'll try to do something in this direction.. > 5. Something else? > > regards, tom lane -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Marina Polyakova <m.polyakova@postgrespro.ru> writes: > On 20-02-2018 3:37, Tom Lane wrote: >> 4. Try to tweak the stats_ext.sql test conditions in some more refined >> way to get the test to pass everywhere. This'd be a lot of work with >> no guarantee of success, so I'm not too excited about it. > Thank you for your explanations! I'll try to do something in this > direction.. OK. The least painful fix might be to establish a different work_mem setting just for that one query. However, if you're intent on putting work into continued support of --disable-float8-byval, I would *strongly* suggest setting up a buildfarm member that runs that way, because otherwise we're pretty much guaranteed to break it again. I continue to wonder if it's not better to just remove the option and thereby simplify our lives. What's the actual value of having it anymore? regards, tom lane
On 20-02-2018 21:23, Tom Lane wrote: > Marina Polyakova <m.polyakova@postgrespro.ru> writes: >> On 20-02-2018 3:37, Tom Lane wrote: >>> 4. Try to tweak the stats_ext.sql test conditions in some more >>> refined >>> way to get the test to pass everywhere. This'd be a lot of work with >>> no guarantee of success, so I'm not too excited about it. > >> Thank you for your explanations! I'll try to do something in this >> direction.. > > OK. The least painful fix might be to establish a different work_mem > setting just for that one query. > > However, if you're intent on putting work into continued support of > --disable-float8-byval, I would *strongly* suggest setting up a > buildfarm > member that runs that way, because otherwise we're pretty much > guaranteed > to break it again. Oh, thank you again! > I continue to wonder if it's not better to just remove > the option and thereby simplify our lives. What's the actual value of > having it anymore? I agree with you, but I have too little experience to vote for removing this option. -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Marina Polyakova <m.polyakova@postgrespro.ru> writes: > On 20-02-2018 21:23, Tom Lane wrote: >> I continue to wonder if it's not better to just remove >> the option and thereby simplify our lives. What's the actual value of >> having it anymore? > I agree with you, but I have too little experience to vote for removing > this option. I've started a separate thread to propose removal of the option, at https://postgr.es/m/10862.1519228208@sss.pgh.pa.us regards, tom lane
On 21-02-2018 18:51, Tom Lane wrote: > Marina Polyakova <m.polyakova@postgrespro.ru> writes: >> On 20-02-2018 21:23, Tom Lane wrote: >>> I continue to wonder if it's not better to just remove >>> the option and thereby simplify our lives. What's the actual value >>> of >>> having it anymore? > >> I agree with you, but I have too little experience to vote for >> removing >> this option. > > I've started a separate thread to propose removal of the option, at > https://postgr.es/m/10862.1519228208@sss.pgh.pa.us Thank you! -- Marina Polyakova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company