Thread: POSIX shared memory redux
The goal of this work is to address all of the shortcomings of previous POSIX shared memory patches as pointed out mostly by Tom Lane.

Branch: http://git.postgresql.org/gitweb?p=users/agentm/postgresql.git;a=shortlog;h=refs/heads/posix_shmem
Main file: http://git.postgresql.org/gitweb?p=users/agentm/postgresql.git;a=blob;f=src/backend/port/posix_shmem.c;h=da93848d14eeadb182d8bf1fe576d741ae5792c3;hb=refs/heads/posix_shmem

Design goals:
1) ensure that shared memory creation collisions are impossible
2) ensure that shared memory access collisions are impossible
3) ensure proper shared memory cleanup after backend and postmaster close
4) minimize API changes

http://archives.postgresql.org/pgsql-patches/2007-02/msg00527.php
http://archives.postgresql.org/pgsql-patches/2007-02/msg00558.php

This patch addresses the above goals and offers some benefits over SysV shared memory:
1) no kern.sysv management (one documentation page with platform-specific help can disappear)
2) shared memory allocation limited only by mmap usage
3) shared memory regions are completely cleaned up when the postmaster and all of its children have exited or been killed for any reason (including SIGKILL)
4) shared memory creation race conditions or collisions between postmasters or backends are impossible
5) after postmaster startup, the postmaster becomes the sole arbiter of which other processes are granted access to the shared memory region
6) mmap and munmap can be used on the shared memory region- this may be useful for offering the option to expand the memory region dynamically

The design goals are accomplished by a simple change in shared memory creation: after shm_open, the region name is immediately shm_unlink'd. Because POSIX shared memory relies on file descriptors, the shared memory is not deallocated in the kernel until the last referencing file descriptor is closed (in this case, on process exit). The postmaster then becomes the sole arbiter of passing the shared memory file descriptor (either through children or through file descriptor passing, if necessary). The patch is a reworked version of Chris Marcellino <cmarcellino@apple.com>'s patch.

Details:
1) the shared memory name is based on getpid()- this ensures that no two starting postmasters (or other processes) will attempt to acquire the same shared memory segment.
2) the shared memory segment is created and immediately unlinked, preventing outside access to the shared memory region
3) the shared memory file descriptor is passed to backends via a static int file descriptor (normal file descriptor inheritance). *Perhaps there is a better location to store the file descriptor- advice welcomed.
4) shared memory segment detach occurs when the process exits (kernel-based cleanup instead of scheduled in-process cleanup)

Additional notes:
The "feature" whereby arbitrary postgres user processes could connect to the shared memory segment has been removed with this patch. If this is a desirable feature (perhaps for debugging or performance tools), this could be added by implementing a file descriptor passing server in the postmaster which would use SCM_RIGHTS control message passing to a) verify that the remote process is running as the same user as the postmaster b) pass the shared memory file descriptor to the process. I am happy to implement this, if required.

I am happy to continue work on this patch if the pg-hackers deem it worthwhile. Thanks!

Cheers,
M
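To make the create-then-unlink step concrete, here is a minimal sketch of the system-call sequence described above. This is an illustration only, with invented names and error handling, not code from the patch:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static int  shmem_fd = -1;          /* inherited by children across fork() */
static void *shmem_base = NULL;

static void
create_anonymous_shmem(size_t size)
{
    char name[64];

    /* Name derived from getpid(), so two starting postmasters can never
     * collide on the same segment name. */
    snprintf(name, sizeof(name), "/pgsql.%d", (int) getpid());

    shmem_fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (shmem_fd < 0)
    {
        perror("shm_open");
        exit(1);
    }

    /* Unlink the name right away: no outside process can shm_open() it,
     * yet the memory survives until the last file descriptor is closed,
     * i.e. until the postmaster and all of its children have exited. */
    shm_unlink(name);

    if (ftruncate(shmem_fd, (off_t) size) < 0 ||
        (shmem_base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, shmem_fd, 0)) == MAP_FAILED)
    {
        perror("ftruncate/mmap");
        exit(1);
    }
}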
"A.M." <agentm@themactionfaction.com> writes: > The goal of this work is to address all of the shortcomings of previous POSIX shared memory patches as pointed out mostlyby Tom Lane. It seems like you've failed to understand the main shortcoming of this whole idea, which is the loss of ability to detect pre-existing backends still running in a cluster whose postmaster has crashed. The nattch variable of SysV shmem segments is really pretty critical to us, and AFAIK there simply is no substitute for it in POSIX-land. regards, tom lane
On Sat, Nov 13, 2010 at 08:07:52PM -0500, Tom Lane wrote:
> "A.M." <agentm@themactionfaction.com> writes:
> > The goal of this work is to address all of the shortcomings of previous POSIX shared memory patches as pointed out mostly by Tom Lane.
>
> It seems like you've failed to understand the main shortcoming of this
> whole idea, which is the loss of ability to detect pre-existing backends
> still running in a cluster whose postmaster has crashed. The nattch
> variable of SysV shmem segments is really pretty critical to us, and
> AFAIK there simply is no substitute for it in POSIX-land.

I've been looking and there really doesn't appear to be. This is consistent, as there is nothing else in POSIX where you can determine how many other people have the same file, pipe, tty, etc. open.

I asked a few people for ideas and got answers like: just walk through /proc and check. Apart from the portability issues, this won't work if there are different user-IDs in play.

The only real solution seems to me to be to keep a small SysV shared memory segment for the locking and allocate the rest of the shared memory some other way. If all backends map the SysV memory before the other way, then you can use the non-existence of the SysV SHM to determine the non-existence of the other segment. Quite a bit more work, ISTM.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
> - Charles de Gaulle
Martijn van Oosterhout <kleptog@svana.org> writes:
> The only real solution seems to me to be to keep a small SysV shared
> memory segment for the locking and allocate the rest of the shared
> memory some other way.

Yeah, that's been discussed. It throws all the portability gains out the window. It might get you out from under the need to readjust a machine's SHMMAX setting before you can use a large amount of shared memory, but it's not clear that's enough of a win to be worth the trouble.

The other direction that we could possibly go is to find some other way entirely of interlocking access to the data directory. If for example we could rely on a file lock held by the postmaster and all backends, we could check that instead of having to rely on a shmem behavior. The killer objection to that so far is that file locking is unreliable in some environments, particularly NFS. But it'd have some advantages too --- in particular, in the NFS context, the fact that the lock is visible to would-be postmasters on different machines might be thought a huge safety improvement over what we do now.

regards, tom lane
On Sun, Nov 14, 2010 at 11:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
>> The only real solution seems to me to be to keep a small SysV shared
>> memory segment for the locking and allocate the rest of the shared
>> memory some other way.
>
> Yeah, that's been discussed. It throws all the portability gains out
> the window. It might get you out from under the need to readjust a
> machine's SHMMAX setting before you can use a large amount of shared
> memory, but it's not clear that's enough of a win to be worth the
> trouble.

One of the things that would be really nice to be able to do is resize our shm after startup, in response to changes in configuration parameters. That's not so easy to make work, of course, but I feel like this might be going in the right direction, since POSIX shms can be resized using ftruncate().

> The other direction that we could possibly go is to find some other way
> entirely of interlocking access to the data directory. If for example
> we could rely on a file lock held by the postmaster and all backends,
> we could check that instead of having to rely on a shmem behavior.
> The killer objection to that so far is that file locking is unreliable
> in some environments, particularly NFS. But it'd have some advantages
> too --- in particular, in the NFS context, the fact that the lock is
> visible to would-be postmasters on different machines might be thought
> a huge safety improvement over what we do now.

I've never had a lot of luck making filesystem locks work reliably, but I don't discount the possibility that I was doing it wrong.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
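For readers unfamiliar with the ftruncate() point above, a minimal sketch follows (invented names, not PostgreSQL code) of how a POSIX shm object could be grown after creation. Whether the remapping can find usable address space is a separate problem raised later in the thread; this only shows the system-call sequence:

#include <sys/mman.h>
#include <unistd.h>

static void *
grow_posix_shmem(int shm_fd, void *old_base, size_t old_size, size_t new_size)
{
    /* Grow the backing object; this does not move any mappings. */
    if (ftruncate(shm_fd, (off_t) new_size) < 0)
        return NULL;

    /* Every attached process would have to repeat this remap step. */
    munmap(old_base, old_size);
    return mmap(NULL, new_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, shm_fd, 0);
}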
Hello,

Based on feedback from Tom Lane and Robert Haas, I have amended the POSIX shared memory patch to account for multiple-postmaster start race conditions (which is currently handled via SysV shared memory checks).

https://github.com/agentm/postgres/tree/posix_shmem

To ensure that no two postmasters can start up in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries whether anything would block a write (exclusive) lock on the lock file, which returns a lock-holding PID in the case when other postgresql processes are running.

Because POSIX fcntl locking is per-process and released in the kernel when a process ends, as long as there is a single process running and holding a read lock, no new postmaster can be started for that data directory. Furthermore, the fcntl syscall allows us to get a live PID for a conflicting lock-holding process, so the postgresql startup can print a live PID on a conflicting startup. The contents of the data directory lock file remain the same; however, the PID stored in the lock file becomes less vital.

The cost of this change is one additional file descriptor open in each postgresql process (for the full life of the process).

As a gimmick, I also implemented a process-failover feature based on the F_SETLKW flag which allows a new postmaster to start up immediately if the running postmaster and all its children exit for any reason. This may also be useful to queue postmaster startup, which could be controlled by a secondary non-postgresql process, pending some action (such as in a failover scenario). This feature is controlled via "postgres -b" (for "blocking"), but it is not vital to the shared memory implementation.

Note that this implementation of the fcntl locking is effectively independent of the shared memory interface, i.e. this same locking could be used with the existing SysV shared memory scheme.

Is this approach good enough to push to the next CommitFest? I am happy to amend the patch as necessary. Thanks!

Cheers,
M
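A rough sketch of the fcntl() dance described above (function names are my own, not the patch's): every postmaster and child holds a shared lock on the data-directory lock file, and a starting postmaster uses F_GETLK to ask whether an exclusive lock would be blocked, which also yields the PID of one live lock holder.

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Take a shared (read) lock on the data-directory lock file; the kernel
 * releases it automatically when this process exits. */
static int
hold_shared_lock(int lockfd)
{
    struct flock fl = {0};

    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "to end of file" */
    return fcntl(lockfd, F_SETLK, &fl);
}

/* Ask whether an exclusive (write) lock would be blocked.  Returns the PID
 * of a conflicting lock holder, 0 if no other postgres process holds the
 * file, or -1 on error. */
static pid_t
check_for_other_postmaster(int lockfd)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    if (fcntl(lockfd, F_GETLK, &fl) < 0)
        return -1;
    return (fl.l_type == F_UNLCK) ? 0 : fl.l_pid;
}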
On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:
> To ensure that no two postmasters can start up in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries whether anything would block a write (exclusive) lock on the lock file, which returns a lock-holding PID in the case when other postgresql processes are running.

This seems a lot leakier than what we do now (imagine, for example, shared storage) and I'm not sure what the advantage is. I was imagining keeping some portion of the data in sysv shm, and moving the big stuff to a POSIX shm that would operate alongside it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Apr 11, 2011, at 6:06 PM, Robert Haas wrote:

> On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:
>> To ensure that no two postmasters can start up in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries whether anything would block a write (exclusive) lock on the lock file, which returns a lock-holding PID in the case when other postgresql processes are running.
>
> This seems a lot leakier than what we do now (imagine, for example,
> shared storage) and I'm not sure what the advantage is. I was
> imagining keeping some portion of the data in sysv shm, and moving the
> big stuff to a POSIX shm that would operate alongside it.

What do you mean by "leakier"? The goal here is to extinguish SysV shared memory for portability and convenience benefits. The mini-SysV proposal was implemented and shot down by Tom Lane.

Cheers,
M
On Mon, Apr 11, 2011 at 3:11 PM, A.M. <agentm@themactionfaction.com> wrote:
>
> On Apr 11, 2011, at 6:06 PM, Robert Haas wrote:
>
>> On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:
>>> To ensure that no two postmasters can start up in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries whether anything would block a write (exclusive) lock on the lock file, which returns a lock-holding PID in the case when other postgresql processes are running.
>>
>> This seems a lot leakier than what we do now (imagine, for example,
>> shared storage) and I'm not sure what the advantage is. I was
>> imagining keeping some portion of the data in sysv shm, and moving the
>> big stuff to a POSIX shm that would operate alongside it.
>
> What do you mean by "leakier"? The goal here is to extinguish SysV shared memory for portability and convenience benefits. The mini-SysV proposal was implemented and shot down by Tom Lane.

I mean I'm not convinced that fcntl() locking will be as reliable. I know Tom shot that down before, but I still think it's probably the best way forward. The advantage I see is that we would be able to more easily allocate larger chunks of shared memory without changing kernel parameters, and perhaps even to dynamically resize shared memory chunks. That'd be worth the price of admission even if we didn't get all those benefits in one commit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Apr 11, 2011 at 3:11 PM, A.M. <agentm@themactionfaction.com> wrote:
>> What do you mean by "leakier"? The goal here is to extinguish SysV shared memory for portability and convenience benefits. The mini-SysV proposal was implemented and shot down by Tom Lane.

> I mean I'm not convinced that fcntl() locking will be as reliable.

I'm not either. Particularly not on NFS. (Although on NFS you have other issues to worry about too, like postmasters on different machines being able to reach the same data directory. I wonder if we should do both SysV and fcntl locking ...)

> I know Tom shot that down before, but I still think it's probably the
> best way forward.

Did I? I think I pointed out that there's zero gain in portability as long as we still depend on SysV shmem to work. However, if you're doing it for other reasons than portability, it might make sense anyway. The question is whether there are adequate other reasons.

> The advantage I see is that we would be able to
> more easily allocate larger chunks of shared memory without changing
> kernel parameters,

Yes, getting out from under the SHMMAX bugaboo would be awfully nice.

> and perhaps even to dynamically resize shared memory chunks.

This I don't really believe will ever work reliably, especially not in 32-bit machines. Whatever your kernel API is, you still have the problem of finding address space contiguous to what you were already using.

regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:
>> To ensure that no two postmasters can start up in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries whether anything would block a write (exclusive) lock on the lock file, which returns a lock-holding PID in the case when other postgresql processes are running.

> This seems a lot leakier than what we do now (imagine, for example,
> shared storage) and I'm not sure what the advantage is.

BTW, the above-described solution flat out doesn't work anyway, because it has a race condition. Postmaster children have to reacquire the lock after forking, because fcntl locks aren't inherited during fork(). And that means you can't tell whether there's a just-started backend that hasn't yet acquired the lock. It's really critical for our purposes that SysV shmem segments are inherited at fork() and so there's no window where a just-forked backend isn't visible to somebody checking the state of the shmem segment.

regards, tom lane
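To make the window Tom describes concrete, here is a tiny illustration (not code from either patch): fcntl record locks are per-process and are not inherited across fork(), so a freshly forked backend is invisible to an F_GETLK probe until it reacquires the lock itself.

#include <fcntl.h>
#include <unistd.h>

/* Hypothetical fork path: the child must retake the lock on its own. */
static void
fork_backend(int lockfd)
{
    if (fork() == 0)
    {
        struct flock fl = {0};

        /*
         * Window: at this point the child is running but holds no lock,
         * so a concurrent F_GETLK check by another postmaster sees nothing.
         */
        fl.l_type = F_RDLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;
        fcntl(lockfd, F_SETLK, &fl);    /* reacquire after fork() */
        /* ... proceed as a backend ... */
    }
}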
On Apr 11, 2011, at 7:25 PM, Tom Lane wrote:

> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:
>>> To ensure that no two postmasters can start up in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries whether anything would block a write (exclusive) lock on the lock file, which returns a lock-holding PID in the case when other postgresql processes are running.
>
>> This seems a lot leakier than what we do now (imagine, for example,
>> shared storage) and I'm not sure what the advantage is.
>
> BTW, the above-described solution flat out doesn't work anyway, because
> it has a race condition. Postmaster children have to reacquire the lock
> after forking, because fcntl locks aren't inherited during fork(). And
> that means you can't tell whether there's a just-started backend that
> hasn't yet acquired the lock. It's really critical for our purposes
> that SysV shmem segments are inherited at fork() and so there's no
> window where a just-forked backend isn't visible to somebody checking
> the state of the shmem segment.

Then you haven't looked at my patch, because I address this race condition by ensuring that a lock-holding violator is the postmaster or a postmaster child. If such a condition is detected, the child exits immediately without touching the shared memory. POSIX shmem is inherited via file descriptors.

Cheers,
M
On Apr 11, 2011, at 7:13 PM, Tom Lane wrote:

> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Apr 11, 2011 at 3:11 PM, A.M. <agentm@themactionfaction.com> wrote:
>>> What do you mean by "leakier"? The goal here is to extinguish SysV shared memory for portability and convenience benefits. The mini-SysV proposal was implemented and shot down by Tom Lane.
>
>> I mean I'm not convinced that fcntl() locking will be as reliable.
>
> I'm not either. Particularly not on NFS. (Although on NFS you have
> other issues to worry about too, like postmasters on different machines
> being able to reach the same data directory. I wonder if we should do
> both SysV and fcntl locking ...)

Is there an example of a recent system where fcntl is broken (ignoring NFS)? I believe my patch addresses all potential race conditions and uses the APIs properly to guarantee single-postmaster data directory usage, and I tested on Darwin and a two-year-old Linux kernel. In the end, fcntl locking relies on the same kernel which provides the SysV user count, so I'm not sure what makes it less "reliable", but I have heard that twice now, so I am open to hearing about your experiences.

>> I know Tom shot that down before, but I still think it's probably the
>> best way forward.
>
> Did I? I think I pointed out that there's zero gain in portability as
> long as we still depend on SysV shmem to work. However, if you're doing
> it for other reasons than portability, it might make sense anyway. The
> question is whether there are adequate other reasons.

I provided an example of postmaster failover relying on F_SETLKW in the email with the patch. Also, as you point out above, fcntl locking at least has a chance of working over NFS.

>> The advantage I see is that we would be able to
>> more easily allocate larger chunks of shared memory without changing
>> kernel parameters,
>
> Yes, getting out from under the SHMMAX bugaboo would be awfully nice.

Yes, please! That is my primary motivation for this patch.

>> and perhaps even to dynamically resize shared memory chunks.
>
> This I don't really believe will ever work reliably, especially not in
> 32-bit machines. Whatever your kernel API is, you still have the
> problem of finding address space contiguous to what you were already
> using.

Even if expanding shmem involves copying large regions of memory, it could be at least useful to adjust buffer sizes live without a restart.

Cheers,
M
"A.M." <agentm@themactionfaction.com> writes: > On Apr 11, 2011, at 7:13 PM, Tom Lane wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> I mean I'm not convinced that fcntl() locking will be as reliable. >> I'm not either. Particularly not on NFS. > Is there an example of a recent system where fcntl is broken (ignoring NFS)? Well, the fundamental point is that "ignoring NFS" is not the real world. We can't tell people not to put data directories on NFS, and even if we did tell them not to, they'd still do it. And NFS locking is not trustworthy, because the remote lock daemon can crash and restart (forgetting everything it ever knew) while your own machine and the postmaster remain blissfully awake. None of this is to say that an fcntl lock might not be a useful addition to what we do already. It is to say that fcntl can't just replace what we do already, because there are real-world failure cases that the current solution handles and fcntl alone wouldn't. regards, tom lane
On Apr 13, 2011, at 2:06 AM, Tom Lane wrote:

> "A.M." <agentm@themactionfaction.com> writes:
>> On Apr 11, 2011, at 7:13 PM, Tom Lane wrote:
>>> Robert Haas <robertmhaas@gmail.com> writes:
>>>> I mean I'm not convinced that fcntl() locking will be as reliable.
>
>>> I'm not either. Particularly not on NFS.
>
>> Is there an example of a recent system where fcntl is broken (ignoring NFS)?
>
> Well, the fundamental point is that "ignoring NFS" is not the real
> world. We can't tell people not to put data directories on NFS,
> and even if we did tell them not to, they'd still do it. And NFS
> locking is not trustworthy, because the remote lock daemon can crash
> and restart (forgetting everything it ever knew) while your own machine
> and the postmaster remain blissfully awake.
>
> None of this is to say that an fcntl lock might not be a useful addition
> to what we do already. It is to say that fcntl can't just replace what
> we do already, because there are real-world failure cases that the
> current solution handles and fcntl alone wouldn't.

The goal of this patch is to eliminate SysV shared memory, not to implement NFS-capable locking which, as you point out, is virtually impossible.

As far as I can tell, in the worst case, my patch does not change how postgresql handles the NFS case. SysV shared memory won't work across NFS, so that interlock won't catch, so postgresql is left with looking at a lock file with the PID of a process on another machine, so that won't catch either. This patch does not alter the lock file semantics, but merely augments the file with file locking.

At least with this patch, there is a chance the lock might work across NFS. In the best case, it can allow for shared-storage postgresql failover, which is a new feature.

Furthermore, there is an improvement in shared memory handling in that it is unlinked immediately after creation, so only the postmaster and its children have access to it (through file descriptor inheritance). This means shared memory cannot be stomped on by any other process.

Considering that possibly working NFS locking is a side-effect of this patch and not its goal and, in the worst possible scenario, it doesn't change current behavior, I don't see how this can be a ding against this patch.

Cheers,
M
On Wed, Apr 13, 2011 at 7:20 AM, A.M. <agentm@themactionfaction.com> wrote:
> The goal of this patch is to eliminate SysV shared memory, not to implement NFS-capable locking which, as you point out, is virtually impossible.
>
> As far as I can tell, in the worst case, my patch does not change how postgresql handles the NFS case. SysV shared memory won't work across NFS, so that interlock won't catch, so postgresql is left with looking at a lock file with the PID of a process on another machine, so that won't catch either. This patch does not alter the lock file semantics, but merely augments the file with file locking.
>
> At least with this patch, there is a chance the lock might work across NFS. In the best case, it can allow for shared-storage postgresql failover, which is a new feature.
>
> Furthermore, there is an improvement in shared memory handling in that it is unlinked immediately after creation, so only the postmaster and its children have access to it (through file descriptor inheritance). This means shared memory cannot be stomped on by any other process.
>
> Considering that possibly working NFS locking is a side-effect of this patch and not its goal and, in the worst possible scenario, it doesn't change current behavior, I don't see how this can be a ding against this patch.

I don't see why we need to get rid of SysV shared memory; needing less of it seems just as good.

In answer to your off-list question, one of the principal ways I've seen fcntl() locking fall over and die is when someone removes the lock file. You might think that this could be avoided by picking something important like pg_control as the lock file, but it turns out that doesn't really work:

http://0pointer.de/blog/projects/locking.html

Tom's point is valid too. Many storage appliances present themselves as an NFS server, so it's very plausible for the data directory to be on an NFS server, and there's no guarantee that flock() won't be broken there. If our current interlock were known to be unreliable also maybe we wouldn't care very much, but AFAICT it's been extremely robust.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> In answer to your off-list question, one of the principal ways I've
> seen fcntl() locking fall over and die is when someone removes the
> lock file. You might think that this could be avoided by picking
> something important like pg_control as the lock file, but it turns out
> that doesn't really work:
> http://0pointer.de/blog/projects/locking.html

Hm, I wasn't aware of the fact that you lose an fcntl() lock if you close a *different* file descriptor for the same file. My goodness, that's horrid :-(. It basically means that any third-party code running in a backend could accidentally defeat the locking, and most of the time you'd never even know it had happened ... as opposed to what would happen if you detached from shmem for instance. You could run with such code for years, and probably never know why sometimes the interlock failed to work when you needed it to.

regards, tom lane
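For anyone who has not hit this before, a tiny illustration (invented code, not from the patch) of the POSIX behavior being lamented here: a process's fcntl record locks on a file are released when any descriptor referring to that file is closed, not only the one used to take the lock.

#include <fcntl.h>
#include <unistd.h>

static void
accidental_unlock(const char *lockpath)
{
    int fd1 = open(lockpath, O_RDWR);
    int fd2 = open(lockpath, O_RDONLY);     /* e.g. opened by library code */
    struct flock fl = {0};

    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    fcntl(fd1, F_SETLK, &fl);   /* lock taken through fd1 ... */

    close(fd2);                 /* ... and silently dropped here, per POSIX */
}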
On Apr 13, 2011, at 8:36 PM, Robert Haas wrote:
>
> I don't see why we need to get rid of SysV shared memory; needing less
> of it seems just as good.

1. As long as one keeps SysV shared memory around, the postgresql project has to maintain the annoying platform-specific document on how to configure the poorly named kernel parameters. If the SysV region is very small, that means I can run more postgresql instances within the same kernel limits, but one can still hit the limits. My patch allows the postgresql project to delete that page and the hassles with it.

2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.)

>
> In answer to your off-list question, one of the principal ways I've
> seen fcntl() locking fall over and die is when someone removes the
> lock file. You might think that this could be avoided by picking
> something important like pg_control as the lock file, but it turns out
> that doesn't really work:
>
> http://0pointer.de/blog/projects/locking.html
>
> Tom's point is valid too. Many storage appliances present themselves
> as an NFS server, so it's very plausible for the data directory to be
> on an NFS server, and there's no guarantee that flock() won't be
> broken there. If our current interlock were known to be unreliable
> also maybe we wouldn't care very much, but AFAICT it's been extremely
> robust.

Both you and Tom have somehow assumed that the patch alters current postgresql behavior. In fact, the opposite is true. I haven't changed any of the existing behavior. The "robust" behavior remains. I merely added fcntl interlocking on top of the lock file to replace the SysV shmem check. If someone deletes the postgresql lock file over NFS, the data directory is equally screwed, but with my patch there is a chance that two machines sharing a properly-configured NFS mount can properly interlock- postgresql cannot offer that today, so this is a feature upgrade with no loss. The worst case scenario is today's behavior.

My original goal remains to implement POSIX shared memory, but Tom Lane was right to point out that the current interlocking check relies on SysV, so, even though the startup locking is really orthogonal to shared memory, I implemented what could be considered a separate patch for that and rolled it into one.

I would encourage you to take a look at the patch.

Cheers,
M
On Wed, Apr 13, 2011 at 6:11 PM, A.M. <agentm@themactionfaction.com> wrote:
>> I don't see why we need to get rid of SysV shared memory; needing less
>> of it seems just as good.
>
> 1. As long as one keeps SysV shared memory around, the postgresql project has to maintain the annoying platform-specific document on how to configure the poorly named kernel parameters. If the SysV region is very small, that means I can run more postgresql instances within the same kernel limits, but one can still hit the limits. My patch allows the postgresql project to delete that page and the hassles with it.
>
> 2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.)

With all due respect, I think this is an unproductive conversation. Your patch proves that SysV is wholly unnecessary only if we also agree that fcntl() locking is just as reliable as the nattch interlock, and Tom and I are trying to explain why we don't believe that's the case. Saying that we're just wrong without responding to our points substantively doesn't move the conversation forward.

In case it's not clear, here again is what we're concerned about: A System V shm *cannot* be removed until nobody is attached to it. A lock file can be removed, or the lock can be accidentally released by the apparently innocuous operation of closing a file descriptor.

> Both you and Tom have somehow assumed that the patch alters current postgresql behavior. In fact, the opposite is true. I haven't changed any of the existing behavior. The "robust" behavior remains. I merely added fcntl interlocking on top of the lock file to replace the SysV shmem check.

This seems contradictory. If you replaced the SysV shmem check, then it's not there, which means you altered the behavior.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
"A.M." <agentm@themactionfaction.com> writes: > 1. As long one keeps SysV shared memory around, the postgresql project > has to maintain the annoying platform-specific document on how to > configure the poorly named kernel parameters. No, if it's just a small area, I don't see that that's an issue. You're going to max out on other things (like I/O bandwidth) long before you run into the limit on how many postmasters you can have from this. The reason that those parameters are problematic now is that people tend to want *large* shmem segments and the typical defaults aren't friendly to that. > 2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.) You were losing this argument already, but ad hominem attacks are pretty much guaranteed to get people to tune you out. There are real, substantive, unfixable reasons to not trust fcntl locking completely. > I would encourage you to take a look at the patch. Just to be perfectly clear: I have not read your patch, and am not likely to before the next commitfest starts, because I have approximately forty times too many things to do already. I'm just going off your own abbreviated description of the patch. But from what I know about fcntl locking, it's not a sufficient substitute for the SysV shmem interlock, because it has failure modes that the SysV interlock doesn't, and those failure modes occur in real-world cases. Yeah, it'd be nice to also be able to lock against other postmasters on other NFS clients, but we hardly ever hear of somebody getting burnt by the lack of that (and fcntl wouldn't be a bulletproof defense anyway). On the other hand, accidentally trying to start a duplicate postmaster on the same machine is an everyday occurrence. regards, tom lane
* Tom Lane:

> Well, the fundamental point is that "ignoring NFS" is not the real
> world. We can't tell people not to put data directories on NFS,
> and even if we did tell them not to, they'd still do it. And NFS
> locking is not trustworthy, because the remote lock daemon can crash
> and restart (forgetting everything it ever knew) while your own machine
> and the postmaster remain blissfully awake.

Is this still the case with NFSv4? Does the local daemon still keep the lock state?

> None of this is to say that an fcntl lock might not be a useful addition
> to what we do already. It is to say that fcntl can't just replace what
> we do already, because there are real-world failure cases that the
> current solution handles and fcntl alone wouldn't.

If it requires NFS misbehavior (possibly in an older version), and you have to start postmasters on separate nodes (which you normally wouldn't do), doesn't this make it increasingly unlikely that it's going to be triggered in the wild?

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH        http://www.bfk.de/
Kriegsstraße 100               tel: +49-721-96201-1
D-76133 Karlsruhe              fax: +49-721-96201-99
On Apr 13, 2011, at 9:30 PM, Robert Haas wrote:

> On Wed, Apr 13, 2011 at 6:11 PM, A.M. <agentm@themactionfaction.com> wrote:
>>> I don't see why we need to get rid of SysV shared memory; needing less
>>> of it seems just as good.
>>
>> 1. As long as one keeps SysV shared memory around, the postgresql project has to maintain the annoying platform-specific document on how to configure the poorly named kernel parameters. If the SysV region is very small, that means I can run more postgresql instances within the same kernel limits, but one can still hit the limits. My patch allows the postgresql project to delete that page and the hassles with it.
>>
>> 2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.)
>
> With all due respect, I think this is an unproductive conversation.
> Your patch proves that SysV is wholly unnecessary only if we also
> agree that fcntl() locking is just as reliable as the nattch
> interlock, and Tom and I are trying to explain why we don't believe
> that's the case. Saying that we're just wrong without responding to
> our points substantively doesn't move the conversation forward.

Sorry- it wasn't meant to be an attack- just a dumb pun. I am trying to argue that, even if the fcntl is unreliable, the startup procedure is just as reliable as it is now. The reasons being:

1) the SysV nattch method's primary purpose is to protect the shmem region. This is no longer necessary in my patch because the shared memory is unlinked immediately after creation, so only the initial postmaster and its children have access.

2) the standard postgresql lock file remains the same

Furthermore, there is indeed a case which the SysV nattch check cannot catch while the fcntl locking can: if two separate machines have a postgresql data directory mounted over NFS, postgresql will currently allow both machines to start a postmaster in that directory, because the SysV nattch check fails and then the pid in the lock file is the pid on the first machine, so postgresql will say "starting anyway". With fcntl locking, this can be fixed. SysV only has presence on one kernel.

>
> In case it's not clear, here again is what we're concerned about: A
> System V shm *cannot* be removed until nobody is attached to it. A
> lock file can be removed, or the lock can be accidentally released by
> the apparently innocuous operation of closing a file descriptor.
>
>> Both you and Tom have somehow assumed that the patch alters current postgresql behavior. In fact, the opposite is true. I haven't changed any of the existing behavior. The "robust" behavior remains. I merely added fcntl interlocking on top of the lock file to replace the SysV shmem check.
>
> This seems contradictory. If you replaced the SysV shmem check, then
> it's not there, which means you altered the behavior.

From what I understood, the primary purpose of the SysV check was to protect the shared memory from multiple stompers. The interlock was a neat side-effect.

The lock file contents are currently important to get the pid of a potential, conflicting postmaster. With the fcntl API, we can return a live conflicting PID (whether a postmaster or a stuck child), so that's an improvement. This could be used, for example, for STONITH, to reliably kill a dying replication clone- just loop on the pids returned from the lock.

Even if the fcntl check passes, the pid in the lock file is checked, so the lock file behavior remains the same.
If you were to implement a daemon with a shared data directory but no shared memory, how would you implement the interlock? Would you still insist on SysV shmem? Unix daemons generally rely on lock files alone. Perhaps there is a different API on which we can agree.

Cheers,
M
On Apr 14, 2011, at 8:22 AM, Florian Weimer wrote:

> * Tom Lane:
>
>> Well, the fundamental point is that "ignoring NFS" is not the real
>> world. We can't tell people not to put data directories on NFS,
>> and even if we did tell them not to, they'd still do it. And NFS
>> locking is not trustworthy, because the remote lock daemon can crash
>> and restart (forgetting everything it ever knew) while your own machine
>> and the postmaster remain blissfully awake.
>
> Is this still the case with NFSv4? Does the local daemon still keep
> the lock state?

The lock handling has been fixed in NFSv4.

http://nfs.sourceforge.net/
"NFS Version 4 introduces support for byte-range locking and share reservation. Locking in NFS Version 4 is lease-based, so an NFS Version 4 client must maintain contact with an NFS Version 4 server to continue extending its open and lock leases."

http://linux.die.net/man/2/flock
"flock(2) does not lock files over NFS. Use fcntl(2) instead: that does work over NFS, given a sufficiently recent version of Linux and a server which supports locking."

I would need some more time to dig up what "recent version of Linux" specifies, but NFSv4 is likely required.

>
>> None of this is to say that an fcntl lock might not be a useful addition
>> to what we do already. It is to say that fcntl can't just replace what
>> we do already, because there are real-world failure cases that the
>> current solution handles and fcntl alone wouldn't.
>
> If it requires NFS misbehavior (possibly in an older version), and you
> have to start postmasters on separate nodes (which you normally
> wouldn't do), doesn't this make it increasingly unlikely that it's
> going to be triggered in the wild?

With the patch I offer, it would be possible to use shared storage and failover postgresql nodes on different machines over NFS. (The second postmaster blocks and waits for the lock to be released.) Obviously, such a setup isn't as strong as using replication, but given a sufficiently fail-safe shared storage setup, it could be made reliable.

Cheers,
M
On Apr 13, 2011, at 11:37 PM, Tom Lane wrote:

> "A.M." <agentm@themactionfaction.com> writes:
>> 1. As long as one keeps SysV shared memory around, the postgresql project
>> has to maintain the annoying platform-specific document on how to
>> configure the poorly named kernel parameters.
>
> No, if it's just a small area, I don't see that that's an issue.
> You're going to max out on other things (like I/O bandwidth) long before
> you run into the limit on how many postmasters you can have from this.
> The reason that those parameters are problematic now is that people tend
> to want *large* shmem segments and the typical defaults aren't friendly
> to that.

That's assuming that no other processes on the system are using up the available segments (such as older postgresql instances).

>> 2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.)
>
> You were losing this argument already, but ad hominem attacks are pretty
> much guaranteed to get people to tune you out.

I apologized to Robert Haas in another post- no offense was intended.

> There are real,
> substantive, unfixable reasons to not trust fcntl locking completely.

...on NFS, which the postgresql community doesn't recommend anyway. But even in that case, the existing lock file (even without the fcntl lock) can catch that case via the PID in the file contents. That is what I meant when I claimed that the behavior remains the same.

>
>> I would encourage you to take a look at the patch.
>
> Just to be perfectly clear: I have not read your patch, and am not
> likely to before the next commitfest starts, because I have
> approximately forty times too many things to do already. I'm just going
> off your own abbreviated description of the patch. But from what I know
> about fcntl locking, it's not a sufficient substitute for the SysV shmem
> interlock, because it has failure modes that the SysV interlock doesn't,
> and those failure modes occur in real-world cases. Yeah, it'd be nice
> to also be able to lock against other postmasters on other NFS clients,
> but we hardly ever hear of somebody getting burnt by the lack of that
> (and fcntl wouldn't be a bulletproof defense anyway). On the other
> hand, accidentally trying to start a duplicate postmaster on the same
> machine is an everyday occurrence.

I really do appreciate the time you have put into feedback. I posed this question also to Robert Haas: is there a different API which you would find acceptable? I chose fcntl because it seemed well-suited for this task, but the feedback has been regarding NFS v<4 concerns.

Cheers,
M
On Thu, Apr 14, 2011 at 10:26:33AM -0400, A.M. wrote:
> 1) the SysV nattch method's primary purpose is to protect the shmem
> region. This is no longer necessary in my patch because the shared
> memory is unlinked immediately after creation, so only the initial
> postmaster and its children have access.

Umm, you don't unlink SysV shared memory. All the flag does is make sure it goes away when the last user goes away. In the meantime people can still connect to it.

> The lock file contents are currently important to get the pid of a
> potential, conflicting postmaster. With the fcntl API, we can return
> a live conflicting PID (whether a postmaster or a stuck child), so
> that's an improvement. This could be used, for example, for STONITH,
> to reliably kill a dying replication clone- just loop on the pids
> returned from the lock.

SysV shared memory also gives you a PID, that's the point.

> Even if the fcntl check passes, the pid in the lock file is checked, so the lock file behavior remains the same.

The interlock is to make sure there are no living postmaster children. The lockfile won't tell you that. So the issue is that while fcntl can work, sysv can do better.

Also, I think you underestimate the value of the current interlock. Before this people did manage to trash their databases regularly this way. Lockfiles can be deleted and yes, people do it all the time.

Actually, it occurs to me you can solve the NFS problem by putting the lockfile in the socket dir. That can't possibly be on NFS.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
> - Charles de Gaulle
On Thu, Apr 14, 2011 at 7:26 AM, A.M. <agentm@themactionfaction.com> wrote:
> From what I understood, the primary purpose of the SysV check was to protect the shared memory from multiple stompers. The interlock was a neat side-effect.

Not really - the purpose of the interlock is to protect the underlying data files. The nattch interlock allows us to be very confident that there isn't another postmaster running on the same data directory on the same machine, and that is extremely important. You've just about convinced me that it might not be a bad idea to add the fcntl() interlock in addition because, as you say, that has a chance of working even over NFS. But the interlock we have now is *extremely* reliable, and I think we'd need to get some other amazingly compelling benefit to consider changing it (even if we were convinced that the alternate method was also reliable). I don't see that there is one.

Anyone running an existing version of PostgreSQL in an environment where they care *at all* about performance has already adjusted their SysV shm settings way up. The benefit of using POSIX shm is that in, say, PostgreSQL 9.2, it might be possible for shared buffers to have a somewhat higher default setting out of the box, and be further increased from there without kernel parameter changes. And there might be more benefits besides that, but certainly those by themselves seem pretty worthwhile.

SysV shm is extremely portable, so I don't think that we're losing anything by continuing to allocate a small amount of it (a few kilobytes, perhaps) and just push everything else out into POSIX shm.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
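To make the hybrid sketched in this last message concrete, a minimal illustration follows (all names and sizes are my own assumptions, not anyone's patch): a few bytes of SysV memory retained solely for the nattch interlock, with the bulk of shared memory in an unlinked POSIX shm object that is not subject to SHMMAX.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <unistd.h>

static void *
create_hybrid_shmem(key_t key, size_t size, int *sysv_id, int *posix_fd)
{
    char name[64];

    /* A tiny SysV segment: its nattch count is what lets a restarting
     * postmaster detect leftover backends from a crashed predecessor. */
    *sysv_id = shmget(key, 32, IPC_CREAT | IPC_EXCL | 0600);
    if (*sysv_id < 0 || shmat(*sysv_id, NULL, 0) == (void *) -1)
        return NULL;

    /* The bulk of shared memory lives in a POSIX shm object, unlinked
     * immediately so only this process tree can reach it, and sized
     * without regard to SHMMAX. */
    snprintf(name, sizeof(name), "/pgsql.%d", (int) getpid());
    *posix_fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (*posix_fd < 0)
        return NULL;
    shm_unlink(name);
    if (ftruncate(*posix_fd, (off_t) size) < 0)
        return NULL;
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                *posix_fd, 0);
}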