Re: BUG #16161: pg_ctl stop fails sometimes (on Windows) - Mailing list pgsql-bugs
From | Alexander Lakhin |
---|---|
Subject | Re: BUG #16161: pg_ctl stop fails sometimes (on Windows) |
Date | |
Msg-id | 5c7eaa87-8670-2008-3a15-337c133c12c2@gmail.com Whole thread Raw |
In response to | Re: BUG #16161: pg_ctl stop fails sometimes (on Windows) (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses |
Re: BUG #16161: pg_ctl stop fails sometimes (on Windows)
Re: BUG #16161: pg_ctl stop fails sometimes (on Windows) |
List | pgsql-bugs |
19.12.2019 18:44, Tom Lane wrote: > Alexander Lakhin <exclusion@gmail.com> writes: >> Please look at the patch that implements (2). It makes `vcregress >> recoverycheck` pass (and my demo restart test still passes too). > I think we could be a little smarter here. If the problem is > STATUS_DELETE_PENDING, doesn't that affect stat() as well? That is, > if stat() succeeds, we needn't wait, independently of whether it > says S_ISDIR or not? This seems like a noticeable improvement if > true, because it would mean that ordinary file-permission failures > also need not wait. We'd still get confused if we get a permission > failure on a containing directory rather than the file itself, but > that I'm prepared to dismiss as an uncommon case. > > Entirely-untested patch along this line attached. I have tested it with my demo restart test and it works. Yes, stat() is resulted in the CreateFile() call too, so it should fail the same way. I haven't managed to catch both CreateFile() and stat() failing with STATUS_DELETE_PENDING (CreateFile() failed for me 37 times per 1000 loops), but I believe that sleep() will work correctly in this case. `vcregress recoverytest` is passed too. The DELETE_PENDING conflict now looks like this: 1 "8:50:44.9928200 PM","postgres.exe","2792","CreateFile","...\pgdata\postmaster.pid","SUCCESS","Desired Access: Read Attributes, Delete, ..." 2 "8:50:44.9928951 PM","postgres.exe","2792","QueryAttributeTagFile","...\pgdata\postmaster.pid","SUCCESS",... 3 "8:50:44.9929114 PM","pg_ctl.exe","2996","CreateFile","...\pgdata\postmaster.pid","DELETE PENDING","Desired Access: Generic Read, Disposition: Open, ..." 4 "8:50:44.9929162 PM","postgres.exe","2792","SetDispositionInformationFile","...\pgdata\postmaster.pid","SUCCESS","Delete: True" 5 "8:50:44.9929377 PM","postgres.exe","2792","CloseFile","...\pgdata\postmaster.pid","SUCCESS","" 6 "8:50:44.9932617 PM","pg_ctl.exe","2996","CreateFile","...\pgdata\postmaster.pid","NAME NOT FOUND","Desired Access: Read Attributes, ..." 7 "8:50:45.1042027 PM","pg_ctl.exe","2996","CreateFile","...\pgdata\postmaster.pid","NAME NOT FOUND","Desired Access: Generic Read, ..." 8 "8:50:45.9926445 PM","pg_ctl.exe","2396","CreateFile","...\pgdata\postmaster.pid","NAME NOT FOUND","Desired Access: Generic Read ..." Here 1, 2, 4, 5 are calls behind unlink() in postmaster, 3 is CreateFile() in pgwin32_open(), 6 is stat() in pgwin32_safestat(), then we have sleep for 0.1 sec, 7 is CreateFile() in pgwin32_open() again, and 8 is CreateFile() performed by a new pg_ctl instance. Maybe we should change the condition to 'if (stat(fileName, &st) != 0 && (err = GetLastError()) == ERROR_ACCESS_DENIED)' to avoid unnecessary sleep with a loop iteration... With this change the same conflict looks like this: 1 "10:23:12.9940793 PM","postgres.exe","1908","CreateFile","...\pgdata\postmaster.pid","SUCCESS","Desired Access: Read Attributes, ..." 2 "10:23:12.9941500 PM","postgres.exe","1908","QueryAttributeTagFile","...\pgdata\postmaster.pid","SUCCESS", ... 3 "10:23:12.9941713 PM","postgres.exe","1908","SetDispositionInformationFile","...\pgdata\postmaster.pid","SUCCESS","Delete: True" 4 "10:23:12.9941766 PM","pg_ctl.exe","2420","CreateFile","...\pgdata\postmaster.pid","DELETE PENDING","Desired Access: Generic Read ..." 5 "10:23:12.9941936 PM","postgres.exe","1908","CloseFile","...\pgdata\postmaster.pid","SUCCESS","" 6 "10:23:12.9945254 PM","pg_ctl.exe","2420","CreateFile","...\pgdata\postmaster.pid","NAME NOT FOUND","Desired Access: Read Attributes, ..." 7 "10:23:13.9940261 PM","pg_ctl.exe","1676","CreateFile","...\pgdata\postmaster.pid","NAME NOT FOUND","Desired Access: Generic Read, ..." Here 1, 2, 3, 5 are calls behind unlink() in postmaster, 4 is CreateFile() in pgwin32_open(), 6 is stat() in pgwin32_safestat(), 7 is CreateFile() performed by a new pg_ctl instance. `vcregress recoverytest` is passed. The log of access to the logical decoding slot now looks like this: 1 "10:37:22.9496662 PM","postgres.exe","2956","CreateFile","...\pgdata\pg_replslot\otherdb_slot","IS DIRECTORY","Desired Access: Generic Read, ..." 2 "10:37:22.9498061 PM","postgres.exe","2956","CreateFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS","Desired Access: Read Attributes, ..." 3 "10:37:22.9498478 PM","postgres.exe","2956","QueryInformationVolume","...\pgdata\pg_replslot\otherdb_slot","SUCCESS","VolumeCreationTime: ... " 4 "10:37:22.9498634 PM","postgres.exe","2956","QueryAllInformationFile","...\pgdata\pg_replslot\otherdb_slot","BUFFER OVERFLOW","CreationTime: ..." 5 "10:37:22.9498860 PM","postgres.exe","2956","CloseFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS","" 6 "10:37:22.9499936 PM","postgres.exe","2956","QueryOpen","...\pgdata\pg_replslot\otherdb_slot","FAST IO DISALLOWED","" 7 "10:37:22.9500733 PM","postgres.exe","2956","CreateFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS","Desired Access: ..." 8 "10:37:22.9500973 PM","postgres.exe","2956","QueryNetworkOpenInformationFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS","CreationTime:..." 9 "10:37:22.9501108 PM","postgres.exe","2956","CloseFile","...\pgdata\pg_replslot\otherdb_slot","SUCCESS","" > BTW ... it's likely that stat() here is actually going to invoke > pgwin32_safestat(), which has its own opinions about this, and > indeed seems to think we'll get ERROR_DELETE_PENDING. I think > that's harmless here, but it makes me wonder if we should use > a native Windows API instead of going through stat(). Yes, I would call native stat(), as we don't need to perform an excessive call to get file size (for directories). It seems that the check for ERROR_DELETE_PENDING was added to pgwin32_safestat() blindly, the issue wasn't reproduced at that time: https://www.postgresql.org/message-id/CAB7nPqRJV6trFta-Qzgi6j2feuYR2ZC%2BKHvWdHnbpDG2scTrxw%40mail.gmail.com >> As open(dir) is getting a little more expensive than before, maybe it's >> still worthwhile to patch fsync*(..., isdir,...). I can prepare such a >> patch if so. > I think we should leave that for later, so that the buildfarm can > actually test whatever logic we put in here. Agreed. It doesn't directly affect this bug and should be done separately. Best regards, Alexander
pgsql-bugs by date: