Thread: IPC::Run accepts bug reports

IPC::Run accepts bug reports

From
Noah Misch
Date:
Separating this from the pytest thread:

On Sat, Jun 15, 2024 at 01:26:57PM -0400, Robert Haas wrote:
> The one
> thing I know about that *I* think is a pretty big problem about Perl
> is that IPC::Run is not really maintained.

I don't see in https://github.com/cpan-authors/IPC-Run/issues anything
affecting PostgreSQL.  If you know of IPC::Run defects, please report them.
If I knew of an IPC::Run defect affecting PostgreSQL, I likely would work on
it before absurdity like https://github.com/cpan-authors/IPC-Run/issues/175
NetBSD-10-specific behavior coping.



Re: IPC::Run accepts bug reports

From
Robert Haas
Date:
On Sat, Jun 15, 2024 at 7:48 PM Noah Misch <noah@leadboat.com> wrote:
> Separating this from the pytest thread:
>
> On Sat, Jun 15, 2024 at 01:26:57PM -0400, Robert Haas wrote:
> > The one
> > thing I know about that *I* think is a pretty big problem about Perl
> > is that IPC::Run is not really maintained.
>
> I don't see in https://github.com/cpan-authors/IPC-Run/issues anything
> affecting PostgreSQL.  If you know of IPC::Run defects, please report them.
> If I knew of an IPC::Run defect affecting PostgreSQL, I likely would work on
> it before absurdity like https://github.com/cpan-authors/IPC-Run/issues/175
> NetBSD-10-specific behavior coping.

I'm not concerned about any specific open issue; my concern is about
the health of that project. https://metacpan.org/pod/IPC::Run says
that this module is seeking new maintainers, and it looks like the
people listed as current maintainers are mostly inactive. Instead,
you're fixing stuff. That's great, but we ideally want PostgreSQL's
dependencies to be things that are used widely enough that we don't
end up maintaining them ourselves.

I apologize if my comment came across as disparaging your efforts;
that was not my intent.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: IPC::Run accepts bug reports

From
Andres Freund
Date:
Hi,

On 2024-06-15 16:48:24 -0700, Noah Misch wrote:
> Separating this from the pytest thread:
>
> On Sat, Jun 15, 2024 at 01:26:57PM -0400, Robert Haas wrote:
> > The one
> > thing I know about that *I* think is a pretty big problem about Perl
> > is that IPC::Run is not really maintained.
>
> I don't see in https://github.com/cpan-authors/IPC-Run/issues anything
> affecting PostgreSQL.  If you know of IPC::Run defects, please report them.
> If I knew of an IPC::Run defect affecting PostgreSQL, I likely would work on
> it before absurdity like https://github.com/cpan-authors/IPC-Run/issues/175
> NetBSD-10-specific behavior coping.

1) Sometimes hangs hard on windows if started processes have not been shut
   down before script exits.  I've mostly encountered this via the buildfarm /
   CI, so I never had a good way of narrowing this down. It's very painful
   because things seem to often just get stuck once that happens.

2) If a subprocess dies in an inopportune moment, IPC::Run dies with "ack
   Broken pipe:" (in _do_filters()). There's plenty reports of this on the
   list, and I've hit this several times personally. It seems to be timing
   dependent, I've encountered it after seemingly irrelevant ordering changes.

   I suspect I could create a reproducer with a bit of time.

3) It's very slow on windows (in addition to the windows process
   slowness). That got to be fixable to some degree.

Greetings,

Andres Freund



Re: IPC::Run accepts bug reports

From
Noah Misch
Date:
On Mon, Jun 17, 2024 at 11:11:17AM -0700, Andres Freund wrote:
> On 2024-06-15 16:48:24 -0700, Noah Misch wrote:
> > On Sat, Jun 15, 2024 at 01:26:57PM -0400, Robert Haas wrote:
> > > The one
> > > thing I know about that *I* think is a pretty big problem about Perl
> > > is that IPC::Run is not really maintained.
> >
> > I don't see in https://github.com/cpan-authors/IPC-Run/issues anything
> > affecting PostgreSQL.  If you know of IPC::Run defects, please report them.
> > If I knew of an IPC::Run defect affecting PostgreSQL, I likely would work on
> > it before absurdity like https://github.com/cpan-authors/IPC-Run/issues/175
> > NetBSD-10-specific behavior coping.
> 
> 1) Sometimes hangs hard on windows if started processes have not been shut
>    down before script exits.  I've mostly encountered this via the buildfarm /
>    CI, so I never had a good way of narrowing this down. It's very painful
>    because things seem to often just get stuck once that happens.

That's bad.  Do you have a link to a log, a thread discussing it, or even just
one of the test names experiencing it?

> 2) If a subprocess dies in an inopportune moment, IPC::Run dies with "ack
>    Broken pipe:" (in _do_filters()). There's plenty reports of this on the
>    list, and I've hit this several times personally. It seems to be timing
>    dependent, I've encountered it after seemingly irrelevant ordering changes.
> 
>    I suspect I could create a reproducer with a bit of time.

I've seen that one.  If the harness has data to write to a child, the child
exiting before the write is one way to reach that.  Perhaps before exec(),
IPC::Run should do a non-blocking write from each pending IO.  That way, small
writes never experience the timing-dependent behavior.

> 3) It's very slow on windows (in addition to the windows process
>    slowness). That got to be fixable to some degree.

Agreed.  For the next release, today's git has some optimizations.  There are
other known-possible Windows optimizations not implemented.



Re: IPC::Run accepts bug reports

From
Andres Freund
Date:
Hi,

On 2024-06-18 10:10:17 -0700, Noah Misch wrote:
> On Mon, Jun 17, 2024 at 11:11:17AM -0700, Andres Freund wrote:
> > On 2024-06-15 16:48:24 -0700, Noah Misch wrote:
> > > On Sat, Jun 15, 2024 at 01:26:57PM -0400, Robert Haas wrote:
> > > > The one
> > > > thing I know about that *I* think is a pretty big problem about Perl
> > > > is that IPC::Run is not really maintained.
> > >
> > > I don't see in https://github.com/cpan-authors/IPC-Run/issues anything
> > > affecting PostgreSQL.  If you know of IPC::Run defects, please report them.
> > > If I knew of an IPC::Run defect affecting PostgreSQL, I likely would work on
> > > it before absurdity like https://github.com/cpan-authors/IPC-Run/issues/175
> > > NetBSD-10-specific behavior coping.
> > 
> > 1) Sometimes hangs hard on windows if started processes have not been shut
> >    down before script exits.  I've mostly encountered this via the buildfarm /
> >    CI, so I never had a good way of narrowing this down. It's very painful
> >    because things seem to often just get stuck once that happens.
> 
> That's bad.  Do you have a link to a log, a thread discussing it, or even just
> one of the test names experiencing it?

I'm unfortunately blanking on the right keyword right now.

I think it basically required not shutting down a process started in the
background with IPC::Run.

I'll try to repro it by removing some ->finish or ->quit calls.

There's also a bunch of tests that have blocks like

    # some Windows Perls at least don't like IPC::Run's start/kill_kill regime.
    skip "Test fails on Windows perl", 2 if $Config{osname} eq 'MSWin32';

Some of them may have been related to this.


> > 2) If a subprocess dies in an inopportune moment, IPC::Run dies with "ack
> >    Broken pipe:" (in _do_filters()). There's plenty reports of this on the
> >    list, and I've hit this several times personally. It seems to be timing
> >    dependent, I've encountered it after seemingly irrelevant ordering changes.
> > 
> >    I suspect I could create a reproducer with a bit of time.
> 
> I've seen that one.  If the harness has data to write to a child, the child
> exiting before the write is one way to reach that.  Perhaps before exec(),
> IPC::Run should do a non-blocking write from each pending IO.  That way, small
> writes never experience the timing-dependent behavior.

I think the question is rather, why is ipc run choosing to die in this
situation and can that be fixed?


> > 3) It's very slow on windows (in addition to the windows process
> >    slowness). That got to be fixable to some degree.
> 
> Agreed.  For the next release, today's git has some optimizations.  There are
> other known-possible Windows optimizations not implemented.

Yay!

Greetings,

Andres Freund



Re: IPC::Run accepts bug reports

From
Andrew Dunstan
Date:


On 2024-06-18 Tu 3:00 PM, Andres Freund wrote:
Hi,

On 2024-06-18 10:10:17 -0700, Noah Misch wrote:
On Mon, Jun 17, 2024 at 11:11:17AM -0700, Andres Freund wrote:
On 2024-06-15 16:48:24 -0700, Noah Misch wrote:
On Sat, Jun 15, 2024 at 01:26:57PM -0400, Robert Haas wrote:
The one
thing I know about that *I* think is a pretty big problem about Perl
is that IPC::Run is not really maintained.
I don't see in https://github.com/cpan-authors/IPC-Run/issues anything
affecting PostgreSQL.  If you know of IPC::Run defects, please report them.
If I knew of an IPC::Run defect affecting PostgreSQL, I likely would work on
it before absurdity like https://github.com/cpan-authors/IPC-Run/issues/175
NetBSD-10-specific behavior coping.
1) Sometimes hangs hard on windows if started processes have not been shut   down before script exits.  I've mostly encountered this via the buildfarm /   CI, so I never had a good way of narrowing this down. It's very painful   because things seem to often just get stuck once that happens.
That's bad.  Do you have a link to a log, a thread discussing it, or even just
one of the test names experiencing it?
I'm unfortunately blanking on the right keyword right now.

I think it basically required not shutting down a process started in the
background with IPC::Run.

I'll try to repro it by removing some ->finish or ->quit calls.

There's also a bunch of tests that have blocks like
	# some Windows Perls at least don't like IPC::Run's start/kill_kill regime.	skip "Test fails on Windows perl", 2 if $Config{osname} eq 'MSWin32';

Some of them may have been related to this.


I only found one of those, in src/test/recovery/t/006_logical_decoding.pl, which seems to be the only place we use kill_kill at all. That comment dates back to 2017, so maybe a more modern perl and/or IPC::Run will improve matters.

It's not clear to me why that code isn't calling finish() before trying kill_kill(). That's what the IPC::Run docs seem to suggest you should do.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: IPC::Run accepts bug reports

From
Andres Freund
Date:
Hi,

On 2024-06-18 12:00:13 -0700, Andres Freund wrote:
> On 2024-06-18 10:10:17 -0700, Noah Misch wrote:
> > > 1) Sometimes hangs hard on windows if started processes have not been shut
> > >    down before script exits.  I've mostly encountered this via the buildfarm /
> > >    CI, so I never had a good way of narrowing this down. It's very painful
> > >    because things seem to often just get stuck once that happens.
> >
> > That's bad.  Do you have a link to a log, a thread discussing it, or even just
> > one of the test names experiencing it?
>
> I'm unfortunately blanking on the right keyword right now.
>
> I think it basically required not shutting down a process started in the
> background with IPC::Run.
>
> I'll try to repro it by removing some ->finish or ->quit calls.

Yep, that did it.  It reliably reproduces if I comment out
the lines below
  # explicitly shut down psql instances gracefully - to avoid hangs
  # or worse on windows
in 021_row_visibility.pl

The logfile ends in
Warning: unable to close filehandle GEN25 properly: Bad file descriptor during global destruction.
Warning: unable to close filehandle GEN20 properly: Bad file descriptor during global destruction.


Even if I cancel the test, I can't rerun it because due to a leftover psql
a) a new temp install can't be made (could be solved by rm -rf)
b) the test's logfile can't be removed (couldn't even rename the directory)

The psql instance needs to be found and terminated first.


Greetings,

Andres Freund



Re: IPC::Run accepts bug reports

From
Noah Misch
Date:
On Tue, Jun 18, 2024 at 08:07:27PM -0700, Andres Freund wrote:
> > > > 1) Sometimes hangs hard on windows if started processes have not been shut
> > > >    down before script exits.

> It reliably reproduces if I comment out
> the lines below
>   # explicitly shut down psql instances gracefully - to avoid hangs
>   # or worse on windows
> in 021_row_visibility.pl
> 
> The logfile ends in
> Warning: unable to close filehandle GEN25 properly: Bad file descriptor during global destruction.
> Warning: unable to close filehandle GEN20 properly: Bad file descriptor during global destruction.
> 
> 
> Even if I cancel the test, I can't rerun it because due to a leftover psql
> a) a new temp install can't be made (could be solved by rm -rf)
> b) the test's logfile can't be removed (couldn't even rename the directory)
> 
> The psql instance needs to be found and terminated first.

Thanks for that recipe.  I've put that in my queue to fix.

On Tue, Jun 18, 2024 at 12:00:13PM -0700, Andres Freund wrote:
> On 2024-06-18 10:10:17 -0700, Noah Misch wrote:
> > On Mon, Jun 17, 2024 at 11:11:17AM -0700, Andres Freund wrote:
> > > 2) If a subprocess dies in an inopportune moment, IPC::Run dies with "ack
> > >    Broken pipe:" (in _do_filters()). There's plenty reports of this on the
> > >    list, and I've hit this several times personally. It seems to be timing
> > >    dependent, I've encountered it after seemingly irrelevant ordering changes.
> > > 
> > >    I suspect I could create a reproducer with a bit of time.
> > 
> > I've seen that one.  If the harness has data to write to a child, the child
> > exiting before the write is one way to reach that.  Perhaps before exec(),
> > IPC::Run should do a non-blocking write from each pending IO.  That way, small
> > writes never experience the timing-dependent behavior.
> 
> I think the question is rather, why is ipc run choosing to die in this
> situation and can that be fixed?

With default signal handling, the process would die to SIGPIPE.  Since
PostgreSQL::Test ignores SIGPIPE, this happens instead.  The IPC::Run source
tree has no discussion of ignoring SIGPIPE, so I bet it didn't get a conscious
decision.  Perhaps it can do better.



Re: IPC::Run accepts bug reports

From
Noah Misch
Date:
On Mon, Jun 17, 2024 at 01:56:46PM -0400, Robert Haas wrote:
> On Sat, Jun 15, 2024 at 7:48 PM Noah Misch <noah@leadboat.com> wrote:
> > Separating this from the pytest thread:
> >
> > On Sat, Jun 15, 2024 at 01:26:57PM -0400, Robert Haas wrote:
> > > The one
> > > thing I know about that *I* think is a pretty big problem about Perl
> > > is that IPC::Run is not really maintained.
> >
> > I don't see in https://github.com/cpan-authors/IPC-Run/issues anything
> > affecting PostgreSQL.  If you know of IPC::Run defects, please report them.
> > If I knew of an IPC::Run defect affecting PostgreSQL, I likely would work on
> > it before absurdity like https://github.com/cpan-authors/IPC-Run/issues/175
> > NetBSD-10-specific behavior coping.
> 
> I'm not concerned about any specific open issue; my concern is about
> the health of that project. https://metacpan.org/pod/IPC::Run says
> that this module is seeking new maintainers, and it looks like the
> people listed as current maintainers are mostly inactive. Instead,
> you're fixing stuff. That's great, but we ideally want PostgreSQL's
> dependencies to be things that are used widely enough that we don't
> end up maintaining them ourselves.

That's reasonable to want.

> I apologize if my comment came across as disparaging your efforts;
> that was not my intent.

It did not come across as disparaging.



Re: IPC::Run accepts bug reports

From
Alexander Lakhin
Date:
Hello Noah,

16.06.2024 02:48, Noah Misch wrote:
> I don't see in https://github.com/cpan-authors/IPC-Run/issues anything
> affecting PostgreSQL.  If you know of IPC::Run defects, please report them.
> If I knew of an IPC::Run defect affecting PostgreSQL, I likely would work on
> it before absurdity like https://github.com/cpan-authors/IPC-Run/issues/175
> NetBSD-10-specific behavior coping.

It looks like a recent indri failure [1] revealed one more IPC::Run
anomaly. The failure log contains:
# Running: pgbench -n -t 1 -Dfoo=bla -Dnull=null -Dtrue=true -Done=1 -Dzero=0.0 -Dbadtrue=trueXXX 
-Dmaxint=9223372036854775807 -Dminint=-9223372036854775808 -M prepared -f 

/Users/buildfarm/bf-data/HEAD/pgsql.build/src/bin/pgbench/tmp_check/t_001_pgbench_with_server_main_data/001_pgbench_error_sleep_undefined_variable
[22:38:14.887](0.014s) ok 362 - pgbench script error: sleep undefined variable status (got 2 vs expected 2)
[22:38:14.887](0.000s) ok 363 - pgbench script error: sleep undefined variable stdout /(?^:processed: 0/1)/
[22:38:14.887](0.000s) not ok 364 - pgbench script error: sleep undefined variable stderr /(?^:sleep: undefined
variable)/
[22:38:14.887](0.000s)
[22:38:14.887](0.000s) #   Failed test 'pgbench script error: sleep undefined variable stderr /(?^:sleep: undefined 
variable)/'
#   at t/001_pgbench_with_server.pl line 1242.
[22:38:14.887](0.000s) #                   ''
#     doesn't match '(?^:sleep: undefined variable)'

So the pgbench process exited as expected, stdout contained expected
string, but stderr happened to be empty.

Maybe such behavior is specific to macOS, and even on indri it's the
only failure of that ilk out of 2000+ runs (and I couldn't reproduce
this in a Linux VM), but I find this place in IPC::Run suspicious:
sub _read {
...
     my $r = POSIX::read( $_[0], $s, 10_000 );
     croak "$!: read( $_[0] )" if not($r) and !$!{EINTR};

That is, EINTR kind of recognized as an expected error, but there is no
retry in this case. Thus, with the following modification, which simulates
read() failed with EINTR:
  sub _read {
      confess 'undef' unless defined $_[0];
      my $s = '';
-    my $r = POSIX::read( $_[0], $s, 10_000 );
+    my $r;
+if (int(rand(100)) == 0)
+{
+   $r = 0;  $! = Errno::EINTR;
+}
+else
+{
+    $r = POSIX::read( $_[0], $s, 10_000 );
+}
      croak "$!: read( $_[0] )" if not($r) and !$!{EINTR};

I can see failures like the one in question when running that test.

Perhaps, I could reproduce the issue with a program, that sends signals
to running (pgbench) processes (and thus interrupts read()), if it makes
sense.

What do you think?

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=indri&dt=2024-10-02%2002%3A34%3A16

Best regards,
Alexander



Re: IPC::Run accepts bug reports

From
Noah Misch
Date:
On Fri, Oct 04, 2024 at 02:00:00PM +0300, Alexander Lakhin wrote:
> sub _read {
> ...
>     my $r = POSIX::read( $_[0], $s, 10_000 );
>     croak "$!: read( $_[0] )" if not($r) and !$!{EINTR};
> 
> That is, EINTR kind of recognized as an expected error, but there is no
> retry in this case. Thus, with the following modification, which simulates
> read() failed with EINTR:
>  sub _read {
>      confess 'undef' unless defined $_[0];
>      my $s = '';
> -    my $r = POSIX::read( $_[0], $s, 10_000 );
> +    my $r;
> +if (int(rand(100)) == 0)
> +{
> +   $r = 0;  $! = Errno::EINTR;
> +}
> +else
> +{
> +    $r = POSIX::read( $_[0], $s, 10_000 );
> +}
>      croak "$!: read( $_[0] )" if not($r) and !$!{EINTR};
> 
> I can see failures like the one in question when running that test.

That makes sense.  Would you file this at
https://github.com/cpan-authors/IPC-Run/issues?  I suppose that code should
become roughly:

  do { $r = POSIX::read(...) } while (!defined($r) && $!{EINTR});
  croak ... unless defined($r);

> Perhaps, I could reproduce the issue with a program, that sends signals
> to running (pgbench) processes (and thus interrupts read()), if it makes
> sense.

That should work.  Best would be an IPC::Run test case, like the existing
t/eintr.t that makes select() report EINTR.  That said, IPC::Run could adopt
the retry without a test.

> [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=indri&dt=2024-10-02%2002%3A34%3A16



Re: IPC::Run accepts bug reports

From
Alexander Lakhin
Date:
Hello Noah,

04.10.2024 21:57, Noah Misch wrote:
> That makes sense.  Would you file this at
> https://github.com/cpan-authors/IPC-Run/issues?  I suppose that code should
> become roughly:
>
>    do { $r = POSIX::read(...) } while (!defined($r) && $!{EINTR});
>    croak ... unless defined($r);
>

Just for reference: I've filed the bug report at:
https://github.com/cpan-authors/IPC-Run/issues/176

Best regards,
Alexander