Thread: master check fails on Windows Server 2008

master check fails on Windows Server 2008

From

Marina Polyakova

Date:

16 February 2018, 13:58:37

Hello, hackers! I got a permanent failure of master (commit 
2a41507dab0f293ff241fe8ae326065998668af8) check on Windows Server 2008. 
Regression output and diffs as well as config.pl are attached.

I used the following commands:
build.bat > build.txt
vcregress.bat check > check.txt

Binary search has shown that this failure begins with commit 
bed9ef5a16239d91d97a1fa2efd9309c3cbbc4b2 (Rework the stats_ext test). On 
the previous commit (70ec3f1f8f0b753c38a1a582280a02930d7cac5f) the check 
passes.
I'm trying to figure out what went wrong, and any suspicions are 
welcome.

About the system: Windows Server 2008 R2 Standard, Service Pack 1, 
64-bit.
-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: master check fails on Windows Server 2008

From

Tom Lane

Date:

16 February 2018, 19:31:28

Marina Polyakova <m.polyakova@postgrespro.ru> writes:
> Hello, hackers! I got a permanent failure of master (commit 
> 2a41507dab0f293ff241fe8ae326065998668af8) check on Windows Server 2008. 
> Regression output and diffs as well as config.pl are attached.

Weird.  AFAICS the cost estimates for those two plans should be quite
different, so this isn't just a matter of the estimates maybe being
a bit platform-dependent.  (And that test has been there nearly a
year without causing reported problems.)

To dig into it a bit more, I tweaked the test case to show the costs
for both plans, and got an output diff as attached.  Could you try
the same experiment on your Windows box?  In order to force the choice
in the other direction, you'd need to temporarily disable enable_sort,
not enable_hashagg as I did here, but the principle is the same.

            regards, tom lane

diff --git a/src/test/regress/sql/stats_ext.sql b/src/test/regress/sql/stats_ext.sql
index 46acaad..1082660 100644
*** a/src/test/regress/sql/stats_ext.sql
--- b/src/test/regress/sql/stats_ext.sql
*************** EXPLAIN (COSTS off)
*** 177,184 ****
  EXPLAIN (COSTS off)
   SELECT COUNT(*) FROM ndistinct GROUP BY a, b, c, d;

! EXPLAIN (COSTS off)
   SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d;

  EXPLAIN (COSTS off)
   SELECT COUNT(*) FROM ndistinct GROUP BY a, d;
--- 177,188 ----
  EXPLAIN (COSTS off)
   SELECT COUNT(*) FROM ndistinct GROUP BY a, b, c, d;

! EXPLAIN --(COSTS off)
!  SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d;
! set enable_hashagg = 0;
! EXPLAIN --(COSTS off)
   SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d;
+ reset enable_hashagg;

  EXPLAIN (COSTS off)
   SELECT COUNT(*) FROM ndistinct GROUP BY a, d;
*** /home/postgres/pgsql/src/test/regress/expected/stats_ext.out    Mon Feb 12 14:53:46 2018
--- /home/postgres/pgsql/src/test/regress/results/stats_ext.out    Fri Feb 16 11:23:11 2018
***************
*** 309,323 ****
           ->  Seq Scan on ndistinct
  (5 rows)

! EXPLAIN (COSTS off)
   SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d;
!          QUERY PLAN
! -----------------------------
!  HashAggregate
     Group Key: b, c, d
!    ->  Seq Scan on ndistinct
  (3 rows)

  EXPLAIN (COSTS off)
   SELECT COUNT(*) FROM ndistinct GROUP BY a, d;
           QUERY PLAN
--- 309,336 ----
           ->  Seq Scan on ndistinct
  (5 rows)

! EXPLAIN --(COSTS off)
   SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d;
!                               QUERY PLAN
! ----------------------------------------------------------------------
!  HashAggregate  (cost=291.00..307.32 rows=1632 width=20)
     Group Key: b, c, d
!    ->  Seq Scan on ndistinct  (cost=0.00..191.00 rows=10000 width=12)
  (3 rows)

+ set enable_hashagg = 0;
+ EXPLAIN --(COSTS off)
+  SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d;
+                                  QUERY PLAN
+ ----------------------------------------------------------------------------
+  GroupAggregate  (cost=1026.89..1168.21 rows=1632 width=20)
+    Group Key: b, c, d
+    ->  Sort  (cost=1026.89..1051.89 rows=10000 width=12)
+          Sort Key: b, c, d
+          ->  Seq Scan on ndistinct  (cost=0.00..191.00 rows=10000 width=12)
+ (5 rows)
+
+ reset enable_hashagg;
  EXPLAIN (COSTS off)
   SELECT COUNT(*) FROM ndistinct GROUP BY a, d;
           QUERY PLAN

======================================================================

Re: master check fails on Windows Server 2008

From

Marina Polyakova

Date:

17 February 2018, 21:42:33

On 16-02-2018 19:31, Tom Lane wrote:
> Marina Polyakova <m.polyakova@postgrespro.ru> writes:
>> Hello, hackers! I got a permanent failure of master (commit
>> 2a41507dab0f293ff241fe8ae326065998668af8) check on Windows Server 
>> 2008.
>> Regression output and diffs as well as config.pl are attached.
> 
> Weird.  AFAICS the cost estimates for those two plans should be quite
> different, so this isn't just a matter of the estimates maybe being
> a bit platform-dependent.  (And that test has been there nearly a
> year without causing reported problems.)
> 
> To dig into it a bit more, I tweaked the test case to show the costs
> for both plans, and got an output diff as attached.  Could you try
> the same experiment on your Windows box?  In order to force the choice
> in the other direction, you'd need to temporarily disable enable_sort,
> not enable_hashagg as I did here, but the principle is the same.

Thank you very much! Your test showed that hash aggregation was not even 
added to the possible paths (see windows_regression.diffs attached). 
Exploring this, I found that not allowing float8 to pass by value in 
config.pl was crucial for the size of the hash table used in this query 
(see diff.patch attached):

 From postmaster.log on Windows:

2018-02-17 20:50:48.596 MSK [1592] pg_regress/stats_ext STATEMENT:  
EXPLAIN
     SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d;
2018-02-17 20:50:48.596 MSK [1592] pg_regress/stats_ext LOG:  rewritten 
parse tree:
...
2018-02-17 20:50:48.596 MSK [1592] pg_regress/stats_ext STATEMENT:  
EXPLAIN
     SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d;
# 20 = INT8OID => pg_type.typbyval = FLOAT8PASSBYVAL:
get_agg_clause_costs_walker aggtranstype 20 get_typbyval(aggtranstype) 0
get_agg_clause_costs_walker avgwidth 8 sizeof(void *) 8 
costs->transitionSpace 24
# add AGG_SORTED path:
add_paths_to_grouping_rel 1 create_agg_path (aggstrategy 1)
estimate_hashagg_tablesize 1 hashentrysize 32
# add transitionSpace = 24:
estimate_hashagg_tablesize 2 hashentrysize 56
estimate_hashagg_tablesize 3 hashentrysize 96
estimate_hashagg_tablesize dNumGroups 1632.000000
# 156672 = 96 * 1632 > 131072:
add_paths_to_grouping_rel hashaggtablesize 156672 work_mem 128 work_mem 
* 1024L 131072 grouped_rel->pathlist == NIL 0
2018-02-17 20:50:48.596 MSK [1592] pg_regress/stats_ext LOG:  plan:
...

 From postmaster.log on my computer (allow float8 to pass by value):

2018-02-17 20:45:57.651 MSK [3012] pg_regress/stats_ext STATEMENT:  
EXPLAIN
     SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d;
2018-02-17 20:45:57.651 MSK [3012] pg_regress/stats_ext LOG:  rewritten 
parse tree:
...
2018-02-17 20:45:57.651 MSK [3012] pg_regress/stats_ext STATEMENT:  
EXPLAIN
     SELECT COUNT(*) FROM ndistinct GROUP BY b, c, d;
# 20 = INT8OID => pg_type.typbyval = FLOAT8PASSBYVAL:
get_agg_clause_costs_walker aggtranstype 20 get_typbyval(aggtranstype) 1
# add AGG_SORTED path:
add_paths_to_grouping_rel 1 create_agg_path (aggstrategy 1)
estimate_hashagg_tablesize 1 hashentrysize 32
# add transitionSpace = 0:
estimate_hashagg_tablesize 2 hashentrysize 32
estimate_hashagg_tablesize 3 hashentrysize 72
estimate_hashagg_tablesize dNumGroups 1632.000000
# 117504 = 72 * 1632 < 131072:
add_paths_to_grouping_rel hashaggtablesize 117504 work_mem 128 work_mem 
* 1024L 131072 grouped_rel->pathlist == NIL 0
# add AGG_HASHED path:
add_paths_to_grouping_rel 2 create_agg_path (aggstrategy 2)
2018-02-17 20:45:57.651 MSK [3012] pg_regress/stats_ext LOG:  plan:
...

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: master check fails on Windows Server 2008

From

Tom Lane

Date:

20 February 2018, 03:37:46

Marina Polyakova <m.polyakova@postgrespro.ru> writes:
> On 16-02-2018 19:31, Tom Lane wrote:
>> Weird.  AFAICS the cost estimates for those two plans should be quite
>> different, so this isn't just a matter of the estimates maybe being
>> a bit platform-dependent.  (And that test has been there nearly a
>> year without causing reported problems.)

> Thank you very much! Your test showed that hash aggregation was not even 
> added to the possible paths (see windows_regression.diffs attached). 
> Exploring this, I found that not allowing float8 to pass by value in 
> config.pl was crucial for the size of the hash table used in this query 
> (see diff.patch attached):

Ah-hah.  I can reproduce the described failure if I configure with
--disable-float8-byval on an otherwise 64-bit machine.  It appears that
the minimum work_mem setting that will allow this query to use a hashagg
plan on such a configuration is about 155kB (which squares with the
results you show).  On the other hand, in a normal 64-bit configuration,
work_mem settings of 160kB or more cause other failures (due to other
plans switching to hashagg), and on a 32-bit machine I see such failures
with work_mem of 150kB or more.  So there's basically no setting of
work_mem that would make all these tests pass everywhere.

I see several things we could do about this:

1. Nothing; just say "sorry, we don't promise that the regression tests
pass with no plan differences on nonstandard configurations".  Given that
--disable-float8-byval has hardly any real-world use, there is not a lot
of downside to that.

2. Decide that --disable-float8-byval, and for that matter
--disable-float4-byval, have no real-world use at all and take them out.
There was some point in those options back when we cared about binary
compatibility with version-zero C functions, but now I'm not sure why
anyone would use them.

3. Drop that one test case from stats_ext.sql; I'm not sure how much
additional test value it's actually bringing.

4. Try to tweak the stats_ext.sql test conditions in some more refined
way to get the test to pass everywhere.  This'd be a lot of work with
no guarantee of success, so I'm not too excited about it.

5. Something else?

            regards, tom lane

Re: master check fails on Windows Server 2008

From

Peter Geoghegan

Date:

20 February 2018, 04:21:12

On Mon, Feb 19, 2018 at 4:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I see several things we could do about this:
>
> 1. Nothing; just say "sorry, we don't promise that the regression tests
> pass with no plan differences on nonstandard configurations".  Given that
> --disable-float8-byval has hardly any real-world use, there is not a lot
> of downside to that.

That would make sense to me. It will also be necessary to formalize
what "nonstandard configuration" actually means if we go this way, of
course.

I believe that this is already true with "dynamic_shared_memory_type
== DSM_IMPL_NONE", so that's a second entry for the "nonstandard
configuration" list.

-- 
Peter Geoghegan

Re: master check fails on Windows Server 2008

From

Marina Polyakova

Date:

20 February 2018, 21:15:00

On 20-02-2018 3:37, Tom Lane wrote:
> Ah-hah.  I can reproduce the described failure if I configure with
> --disable-float8-byval on an otherwise 64-bit machine.  It appears that
> the minimum work_mem setting that will allow this query to use a 
> hashagg
> plan on such a configuration is about 155kB (which squares with the
> results you show).  On the other hand, in a normal 64-bit 
> configuration,
> work_mem settings of 160kB or more cause other failures (due to other
> plans switching to hashagg), and on a 32-bit machine I see such 
> failures
> with work_mem of 150kB or more.  So there's basically no setting of
> work_mem that would make all these tests pass everywhere.
> 
> I see several things we could do about this:
> 
> 1. Nothing; just say "sorry, we don't promise that the regression tests
> pass with no plan differences on nonstandard configurations".  Given 
> that
> --disable-float8-byval has hardly any real-world use, there is not a 
> lot
> of downside to that.
> 
> 2. Decide that --disable-float8-byval, and for that matter
> --disable-float4-byval, have no real-world use at all and take them 
> out.
> There was some point in those options back when we cared about binary
> compatibility with version-zero C functions, but now I'm not sure why
> anyone would use them.
> 
> 3. Drop that one test case from stats_ext.sql; I'm not sure how much
> additional test value it's actually bringing.
> 
> 4. Try to tweak the stats_ext.sql test conditions in some more refined
> way to get the test to pass everywhere.  This'd be a lot of work with
> no guarantee of success, so I'm not too excited about it.

Thank you for your explanations! I'll try to do something in this 
direction..

> 5. Something else?
> 
>             regards, tom lane

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: master check fails on Windows Server 2008

From

Tom Lane

Date:

20 February 2018, 21:23:26

Marina Polyakova <m.polyakova@postgrespro.ru> writes:
> On 20-02-2018 3:37, Tom Lane wrote:
>> 4. Try to tweak the stats_ext.sql test conditions in some more refined
>> way to get the test to pass everywhere.  This'd be a lot of work with
>> no guarantee of success, so I'm not too excited about it.

> Thank you for your explanations! I'll try to do something in this 
> direction..

OK.  The least painful fix might be to establish a different work_mem
setting just for that one query.

However, if you're intent on putting work into continued support of
--disable-float8-byval, I would *strongly* suggest setting up a buildfarm
member that runs that way, because otherwise we're pretty much guaranteed
to break it again.  I continue to wonder if it's not better to just remove
the option and thereby simplify our lives.  What's the actual value of
having it anymore?

            regards, tom lane

Re: master check fails on Windows Server 2008

From

Marina Polyakova

Date:

21 February 2018, 13:20:18

On 20-02-2018 21:23, Tom Lane wrote:
> Marina Polyakova <m.polyakova@postgrespro.ru> writes:
>> On 20-02-2018 3:37, Tom Lane wrote:
>>> 4. Try to tweak the stats_ext.sql test conditions in some more 
>>> refined
>>> way to get the test to pass everywhere.  This'd be a lot of work with
>>> no guarantee of success, so I'm not too excited about it.
> 
>> Thank you for your explanations! I'll try to do something in this
>> direction..
> 
> OK.  The least painful fix might be to establish a different work_mem
> setting just for that one query.
> 
> However, if you're intent on putting work into continued support of
> --disable-float8-byval, I would *strongly* suggest setting up a 
> buildfarm
> member that runs that way, because otherwise we're pretty much 
> guaranteed
> to break it again.

Oh, thank you again!

> I continue to wonder if it's not better to just remove
> the option and thereby simplify our lives.  What's the actual value of
> having it anymore?

I agree with you, but I have too little experience to vote for removing 
this option.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: master check fails on Windows Server 2008

From

Tom Lane

Date:

21 February 2018, 18:51:51

Marina Polyakova <m.polyakova@postgrespro.ru> writes:
> On 20-02-2018 21:23, Tom Lane wrote:
>> I continue to wonder if it's not better to just remove
>> the option and thereby simplify our lives.  What's the actual value of
>> having it anymore?

> I agree with you, but I have too little experience to vote for removing 
> this option.

I've started a separate thread to propose removal of the option, at
https://postgr.es/m/10862.1519228208@sss.pgh.pa.us

            regards, tom lane

Re: master check fails on Windows Server 2008

From

Marina Polyakova

Date:

21 February 2018, 19:15:04

On 21-02-2018 18:51, Tom Lane wrote:
> Marina Polyakova <m.polyakova@postgrespro.ru> writes:
>> On 20-02-2018 21:23, Tom Lane wrote:
>>> I continue to wonder if it's not better to just remove
>>> the option and thereby simplify our lives.  What's the actual value 
>>> of
>>> having it anymore?
> 
>> I agree with you, but I have too little experience to vote for 
>> removing
>> this option.
> 
> I've started a separate thread to propose removal of the option, at
> https://postgr.es/m/10862.1519228208@sss.pgh.pa.us

Thank you!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company