Re: Cygwin cleanup - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: Cygwin cleanup |
Date | |
Msg-id | CA+hUKG+J4jSFk=-hdoZdcx+p7ru6xuipzCZY-kiKoDc2FjsV7g@mail.gmail.com Whole thread Raw |
In response to | Re: Cygwin cleanup (Thomas Munro <thomas.munro@gmail.com>) |
List | pgsql-hackers |
On Wed, Feb 8, 2023 at 8:06 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Fri, Jan 13, 2023 at 5:17 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > My patch used fsync_fname_ext() which would cause an ERROR rather than a > > PANIC when failing to fsync the logical decoding pathname. > > FTR While analysing a lot of CI logs trying to debug something else I > came across a plain Windows/MSVC (not Cygwin) build that panicked like > this: > > https://cirrus-ci.com/task/6689224833892352 > https://api.cirrus-ci.com/v1/artifact/task/6689224833892352/testrun/build/testrun/subscription/013_partition/log/013_partition_publisher.log > https://api.cirrus-ci.com/v1/artifact/task/6689224833892352/crashlog/crashlog-postgres.exe_0af4_2023-02-05_21-53-20-018.txt Here are some more flapping CI failures due to this phenomenon (nothing to do with Cygwin, this is just regular Windows): 4509011781877760 | Windows - Server 2019, VS 2019 - Meson & ninja 4525770962370560 | Windows - Server 2019, VS 2019 - Meson & ninja 5664518341132288 | Windows - Server 2019, VS 2019 - Meson & ninja 5689846694412288 | Windows - Server 2019, VS 2019 - Meson & ninja 5853025126842368 | Windows - Server 2019, VS 2019 - Meson & ninja 6639943179567104 | Windows - Server 2019, VS 2019 - Meson & ninja 6727728217456640 | Windows - Server 2019, VS 2019 - Meson & ninja 6740158104469504 | Windows - Server 2019, VS 2019 - Meson & ninja They all say something like 'PANIC: could not open file "pg_logical/snapshots/0-1597938.snap": No such file or directory', because they all do rename(some_temporary_file, that_name), then try to re-open and sync it, but rename() on Windows fails to be atomic so a concurrent process can see an intermediate ENOENT state. I see a few 'local' workarounds we could do to fix that, but ... there seems to be a much better idea staring us in the face in the comments! I think this would be fixed as a happy by-product of this TODO in SnapBuildSerialize(): * TODO: Do the fsync() via checkpoints/restartpoints, doing it here has * some noticeable overhead since it's performed synchronously during * decoding? I have done no analysis myself of whether that is sound, but assuming it is, I think the way to achieve that is to tweak FileTag so that it can describe the file to be fsync'd, and use the sync.c machinery to fsync the file in the background. Presumably that would provide a huge speed up for logical decoding, and people would rejoice. Some other topics that came up in this thread: * Now that PostgreSQL seems to be stable enough on Cygwin to get through the basic regression tests reliably, lorikeet might as well run the full TAP test suite? * Justin complained about the weird effects of wal_sync_method, and I finally got around to showing how I think that should be untangled, in https://commitfest.postgresql.org/44/4453/
pgsql-hackers by date: