Thread: Several buildfarm animals fail tests because of shared memory error

Several buildfarm animals fail tests because of shared memory error

From: Alexander Lakhin
Hello hackers,

I'd like to bring to your attention multiple buildfarm failures that
occurred this month, on master only, caused by "could not open shared
memory segment ...: No such file or directory" errors.

The first such errors were produced on 2024-12-16 by:
leafhopper
Amazon Linux 2023 | gcc 11.4.1 | aarch64/graviton4/r8g.2xl | tharar [ a t ] amazon.com
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2012%3A27%3A01
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2020%3A40%3A09

and batta:
sid | gcc recent | aarch64 | michael [ a t ] paquier.xyz
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=batta&dt=2024-12-16%2008%3A05%3A04

Then there was alligator:
Ubuntu 24.04 LTS | gcc experimental (nightly build) | x86_64 | tharakan [ a t ] gmail.com
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=alligator&dt=2024-12-19%2001%3A30%3A57

and parula:
Amazon Linux 2 | gcc 13.2.0 | aarch64/Graviton3/c7g.2xl | tharar [ a t ] amazon.com
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=parula&dt=2024-12-21%2009%3A56%3A28

Maybe it's a configuration issue (all animals except batta are owned by
Robins), as described here:
https://www.postgresql.org/docs/devel/kernel-resources.html#SYSTEMD-REMOVEIPC
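
For reference, the recommendation on that page is to disable systemd's
RemoveIPC on machines running the server, roughly like this (exact file
locations may differ per distribution):

    # /etc/systemd/logind.conf.d/00-remove-ipc.conf
    [Login]
    RemoveIPC=no

    # then restart logind so the setting takes effect
    systemctl daemon-reload
    systemctl restart systemd-logind.service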

And maybe leafhopper itself is faulty, because it also produced very
weird test outputs (on older branches) like:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-16%2023%3A43%3A03
REL_15_STABLE
-               Rows Removed by Filter: 9990
+               Rows Removed by Filter: 447009543

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=leafhopper&dt=2024-12-21%2022%3A18%3A04
REL_16_STABLE
-               Rows Removed by Filter: 9990
+               Rows Removed by Filter: 9395

But still, why master only?

Unfortunately, I'm unable to reproduce these failures locally, so I apologize
for the rather raw information, but I see no way to investigate this further
without assistance. Perhaps the owners of these animals could shed some light
on this...

Best regards,
Alexander



Re: Several buildfarm animals fail tests because of shared memory error

From: Robins Tharakan
Hi Alexander,

Thanks for collating this list.
I'll try to add as much as I know, in hopes that it helps.

On Sun, 22 Dec 2024 at 16:30, Alexander Lakhin <exclusion@gmail.com> wrote:
I'd like to bring to your attention multiple buildfarm failures that
occurred this month, on master only, caused by "could not open shared
memory segment ...: No such file or directory" errors.


- I am unsure how batta is set up, but until late last week none of my instances had RemoveIPC set correctly. I am sorry, I didn't know about this until Thomas pointed it out to me in another thread. So if that's a key reason here, then probably by this time next week things should settle down. I've begun setting it correctly (2 done with a few more to go) - although given that some machines are at work, I'll try to get to them this coming week.



But still, why master only?

+1. It is interesting, though, why master is affected more often. This may be statistical - master ends up with more commits and thus more test runs? Unsure.

Also:
- I recently (~2 days back) switched parula to the gcc-experimental nightly - after which I have seen 4 of the recent errors - although the most recent run is green.
- The only info about leafhopper that may be relevant is that it's one of the newest machines (Graviton4), so it comes with recent hardware / kernel / stock gcc 11.4.1.

Unfortunately, I'm unable to reproduce these failures locally, so I apologize
for the rather raw information, but I see no way to investigate this further
without assistance. Perhaps the owners of these animals could shed some light
on this...

Since the instances are created with work accounts, it isn't trivial to share access, but I could come back with any outputs / captures if that would help here.

Lastly, alligator has been on gcc nightly for a few months and is on x86_64 - so if alligator is still stuttering by this time next week, I'm pretty sure there's more than just aarch64, gcc, or the IPC config to blame here.

-
robins

Re: Several buildfarm animals fail tests because of shared memory error

From: Robins Tharakan

On Thu, 9 Jan 2025 at 15:30, Alexander Lakhin <exclusion@gmail.com> wrote:
>
> Maybe you could try to reproduce such failures without buildfarm client, just
> by running select_parallel, for example, with the attached patch applied.
> I mean running `make check` with parallel_schedule like:
> ...
> Or
> TESTS="test_setup copy create_misc create_index $(printf "select_parallel %.0s" {1..100})" make check-tests
>

Thanks, Alexander, for pointing to the test steps. I'll try to run these on leafhopper
over the next couple of days and come back if I see anything interesting.
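
For my own notes, roughly what I plan to run there (assuming a plain autoconf
build of the source tree; the 100x repetition is arbitrary, just to raise the
odds of hitting the failure):

    # configure and build as usual
    ./configure --enable-debug --enable-cassert
    make -j"$(nproc)"

    # then hammer select_parallel against a temporary instance
    # (the check-tests target lives in src/test/regress)
    cd src/test/regress
    TESTS="test_setup copy create_misc create_index $(printf "select_parallel %.0s" {1..100})" make check-tests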

-
robins

Re: Several buildfarm animals fail tests because of shared memory error

From: Robins Tharakan
Hi Alexander,

On Sun, 22 Dec 2024 at 17:57, Robins Tharakan <tharakan@gmail.com> wrote:
>
> .... So if that's a key reason here, then probably by this time next week things should
> settle down. I've begun setting it correctly (2 done with a few more to go) - although
> given that some machines are at work, I'll try to get to them this coming week.
>

All of my machines now have the RemoveIPC config set correctly and have been working
well for the past few days, so ideally we should be good there.

Unrelated: parula has been failing the libperl test (only v15 and older) for the past
3 weeks. To clarify, this test started failing (~18 days ago) before I fixed the
'RemoveIPC' configuration (~5 days ago), so it is unrelated to that change.

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=parula&dt=2025-01-09%2003%3A13%3A18&stg=configure

The first REL_15_STABLE test failure points to acd5c28db5 but I didn't see
anything interesting there.

The error seems to be around "annobin.so", so it may be about how
gcc is being compiled (not sure). While I figure out whether the GCC build
needs work, I thought I'd bring it up here, since v16+ seems to work fine on
the same box, and we may want to consider doing something similar for all
older versions too?

"
configure:19818: checking for libperl
configure:19834: ccache gcc -o conftest -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wimplicit-fallthrough=3 -Wcast-function-type -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -Wno-deprecated-non-prototype -Wno-format-truncation -Wno-stringop-truncation -g -O2 -std=gnu17 -fPIC  -D_GNU_SOURCE -I/usr/include/libxml2  -I/usr/lib64/perl5/CORE   conftest.c  -Wl,-z,relro -Wl,--as-needed -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -Wl,--build-id=sha1  -fstack-protector-strong -L/usr/local/lib  -L/usr/lib64/perl5/CORE -lperl -lpthread -lresolv -ldl -lm -lcrypt -lutil -lc >&5
cc1: fatal error: inaccessible plugin file /opt/gcc/prod/bin/../lib/gcc/aarch64-unknown-linux-gnu/15.0.0/plugin/annobin.so expanded from short plugin name annobin: No such file or directory
compilation terminated.
configure:19834: $? = 1
"

A wild guess is that this may be about the flag
"-specs=/usr/lib/rpm/redhat/redhat-annobin-cc1",
which I don't see in v16+, but I don't know enough to be sure whether that's
the right direction.
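
If it helps narrow things down, my assumption is that configure picks those
flags up from the system Perl's build configuration when probing for libperl,
which can be checked directly:

    # compiler flags recorded in the system Perl's own build config
    perl -V:ccflags

    # flags ExtUtils::Embed would pass to anything embedding libperl
    perl -MExtUtils::Embed -e ccopts

If redhat-annobin-cc1 shows up in that output, the flag comes from Perl rather
than from the PostgreSQL build itself.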

-
robins