Thread: POSIX shared memory redux
The goal of this work is to address all of the shortcomings of previous POSIX shared memory patches as pointed out mostly by Tom Lane.

Branch: http://git.postgresql.org/gitweb?p=users/agentm/postgresql.git;a=shortlog;h=refs/heads/posix_shmem
Main file: http://git.postgresql.org/gitweb?p=users/agentm/postgresql.git;a=blob;f=src/backend/port/posix_shmem.c;h=da93848d14eeadb182d8bf1fe576d741ae5792c3;hb=refs/heads/posix_shmem

Design goals:
1) ensure that shared memory creation collisions are impossible
2) ensure that shared memory access collisions are impossible
3) ensure proper shared memory cleanup after backend and postmaster close
4) minimize API changes

http://archives.postgresql.org/pgsql-patches/2007-02/msg00527.php
http://archives.postgresql.org/pgsql-patches/2007-02/msg00558.php

This patch addresses the above goals and offers some benefits over SysV shared memory:
1) no kern.sysv management (one documentation page with platform-specific help can disappear)
2) shared memory allocation limited only by mmap usage
3) shared memory regions are completely cleaned up when the postmaster and all of its children have exited or been killed for any reason (including SIGKILL)
4) shared memory creation race conditions or collisions between postmasters or backends are impossible
5) after postmaster startup, the postmaster becomes the sole arbiter of which other processes are granted access to the shared memory region
6) mmap and munmap can be used on the shared memory region- this may be useful for offering the option to expand the memory region dynamically

The design goals are accomplished by a simple change in shared memory creation: after shm_open, the region name is immediately shm_unlink'd. Because POSIX shared memory relies on file descriptors, the shared memory is not deallocated in the kernel until the last referencing file descriptor is closed (in this case, on process exit). The postmaster then becomes the sole arbiter of passing the shared memory file descriptor (either through children or through file descriptor passing, if necessary). The patch is a reworked version of Chris Marcellino <cmarcellino@apple.com>'s patch.

Details:
1) the shared memory name is based on getpid()- this ensures that no two starting postmasters (or other processes) will attempt to acquire the same shared memory segment.
2) the shared memory segment is created and immediately unlinked, preventing outside access to the shared memory region
3) the shared memory file descriptor is passed to backends via a static int file descriptor (normal file descriptor inheritance). *Perhaps there is a better location to store the file descriptor- advice welcomed.
4) shared memory segment detach occurs when the process exits (kernel-based cleanup instead of scheduled in-process cleanup)

Additional notes:
The "feature" whereby arbitrary postgres user processes could connect to the shared memory segment has been removed with this patch. If this is a desirable feature (perhaps for debugging or performance tools), this could be added by implementing a file descriptor passing server in the postmaster which would use SCM_RIGHTS control message passing to a) verify that the remote process is running as the same user as the postmaster b) pass the shared memory file descriptor to the process. I am happy to implement this, if required.

I am happy to continue work on this patch if the pg-hackers deem it worthwhile. Thanks!

Cheers,
M
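To make the create-then-unlink step concrete, here is a minimal sketch of the system-call sequence described above. This is an illustration only, with invented names and error handling, not code from the patch:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static int  shmem_fd = -1;          /* inherited by children across fork() */
static void *shmem_base = NULL;

static void
create_anonymous_shmem(size_t size)
{
    char name[64];

    /* Name derived from getpid(), so two starting postmasters can never
     * collide on the same segment name. */
    snprintf(name, sizeof(name), "/pgsql.%d", (int) getpid());

    shmem_fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (shmem_fd < 0)
    {
        perror("shm_open");
        exit(1);
    }

    /* Unlink the name right away: no outside process can shm_open() it,
     * yet the memory survives until the last file descriptor is closed,
     * i.e. until the postmaster and all of its children have exited. */
    shm_unlink(name);

    if (ftruncate(shmem_fd, (off_t) size) < 0 ||
        (shmem_base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, shmem_fd, 0)) == MAP_FAILED)
    {
        perror("ftruncate/mmap");
        exit(1);
    }
}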
"A.M." <agentm@themactionfaction.com> writes: > The goal of this work is to address all of the shortcomings of previous POSIX shared memory patches as pointed out mostlyby Tom Lane. It seems like you've failed to understand the main shortcoming of this whole idea, which is the loss of ability to detect pre-existing backends still running in a cluster whose postmaster has crashed. The nattch variable of SysV shmem segments is really pretty critical to us, and AFAIK there simply is no substitute for it in POSIX-land. regards, tom lane
On Sat, Nov 13, 2010 at 08:07:52PM -0500, Tom Lane wrote:
> "A.M." <agentm@themactionfaction.com> writes:
> > The goal of this work is to address all of the shortcomings of previous POSIX shared memory patches as pointed out mostly by Tom Lane.
>
> It seems like you've failed to understand the main shortcoming of this
> whole idea, which is the loss of ability to detect pre-existing backends
> still running in a cluster whose postmaster has crashed. The nattch
> variable of SysV shmem segments is really pretty critical to us, and
> AFAIK there simply is no substitute for it in POSIX-land.

I've been looking and there really doesn't appear to be. This is consistent, as there is nothing else in POSIX where you can determine how many other people have the same file, pipe, tty, etc. open.

I asked a few people for ideas and got answers like: just walk through /proc and check. Apart from the portability issues, this won't work if there are different user-IDs in play.

The only real solution seems to me to be to keep a small SysV shared memory segment for the locking and allocate the rest of the shared memory some other way. If all backends map the SysV memory before the other way, then you can use the non-existence of the SysV SHM to determine the non-existence of the other segment. Quite a bit more work, ISTM.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
> - Charles de Gaulle
Martijn van Oosterhout <kleptog@svana.org> writes:
> The only real solution seems to me to be to keep a small SysV shared
> memory segment for the locking and allocate the rest of the shared
> memory some other way.

Yeah, that's been discussed. It throws all the portability gains out the window. It might get you out from under the need to readjust a machine's SHMMAX setting before you can use a large amount of shared memory, but it's not clear that's enough of a win to be worth the trouble.

The other direction that we could possibly go is to find some other way entirely of interlocking access to the data directory. If for example we could rely on a file lock held by the postmaster and all backends, we could check that instead of having to rely on a shmem behavior. The killer objection to that so far is that file locking is unreliable in some environments, particularly NFS. But it'd have some advantages too --- in particular, in the NFS context, the fact that the lock is visible to would-be postmasters on different machines might be thought a huge safety improvement over what we do now.

regards, tom lane
On Sun, Nov 14, 2010 at 11:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
>> The only real solution seems to me to be to keep a small SysV shared
>> memory segment for the locking and allocate the rest of the shared
>> memory some other way.
>
> Yeah, that's been discussed. It throws all the portability gains out
> the window. It might get you out from under the need to readjust a
> machine's SHMMAX setting before you can use a large amount of shared
> memory, but it's not clear that's enough of a win to be worth the
> trouble.

One of the things that would be really nice to be able to do is resize our shm after startup, in response to changes in configuration parameters. That's not so easy to make work, of course, but I feel like this might be going in the right direction, since POSIX shms can be resized using ftruncate().

> The other direction that we could possibly go is to find some other way
> entirely of interlocking access to the data directory. If for example
> we could rely on a file lock held by the postmaster and all backends,
> we could check that instead of having to rely on a shmem behavior.
> The killer objection to that so far is that file locking is unreliable
> in some environments, particularly NFS. But it'd have some advantages
> too --- in particular, in the NFS context, the fact that the lock is
> visible to would-be postmasters on different machines might be thought
> a huge safety improvement over what we do now.

I've never had a lot of luck making filesystem locks work reliably, but I don't discount the possibility that I was doing it wrong.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
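For readers unfamiliar with the ftruncate() point above, a minimal sketch follows (invented names, not PostgreSQL code) of how a POSIX shm object could be grown after creation. Whether the remapping can find usable address space is a separate problem raised later in the thread; this only shows the system-call sequence:

#include <sys/mman.h>
#include <unistd.h>

static void *
grow_posix_shmem(int shm_fd, void *old_base, size_t old_size, size_t new_size)
{
    /* Grow the backing object; this does not move any mappings. */
    if (ftruncate(shm_fd, (off_t) new_size) < 0)
        return NULL;

    /* Every attached process would have to repeat this remap step. */
    munmap(old_base, old_size);
    return mmap(NULL, new_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, shm_fd, 0);
}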
Hello,

Based on feedback from Tom Lane and Robert Haas, I have amended the POSIX shared memory patch to account for multiple-postmaster start race conditions (which is currently handled via SysV shared memory checks).

https://github.com/agentm/postgres/tree/posix_shmem

To ensure that no two postmasters can start up in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries whether anything would block a write (exclusive) lock on the lock file, which returns a lock-holding PID in the case when other postgresql processes are running.

Because POSIX fcntl locking is per-process and released in the kernel when a process ends, as long as there is a single process running and holding a read lock, no new postmaster can be started for that data directory. Furthermore, the fcntl syscall allows us to get a live PID for a conflicting lock-holding process, so the postgresql startup can print a live PID on a conflicting startup. The contents of the data directory lock file remain the same; however, the PID stored in the lock file becomes less vital.

The cost of this change is one additional file descriptor open in each postgresql process (for the full life of the process).

As a gimmick, I also implemented a process-failover feature based on the F_SETLKW flag which allows a new postmaster to start up immediately if the running postmaster and all its children exit for any reason. This may also be useful to queue postmaster startup, which could be controlled by a secondary non-postgresql process, pending some action (such as in a failover scenario). This feature is controlled via "postgres -b" (for "blocking"), but it is not vital to the shared memory implementation.

Note that this implementation of the fcntl locking is effectively independent of the shared memory interface, i.e. this same locking could be used with the existing SysV shared memory scheme.

Is this approach good enough to push to the next CommitFest? I am happy to amend the patch as necessary. Thanks!

Cheers,
M
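A rough sketch of the fcntl() dance described above (function names are my own, not the patch's): every postmaster and child holds a shared lock on the data-directory lock file, and a starting postmaster uses F_GETLK to ask whether an exclusive lock would be blocked, which also yields the PID of one live lock holder.

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Take a shared (read) lock on the data-directory lock file; the kernel
 * releases it automatically when this process exits. */
static int
hold_shared_lock(int lockfd)
{
    struct flock fl = {0};

    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;               /* 0 means "to end of file" */
    return fcntl(lockfd, F_SETLK, &fl);
}

/* Ask whether an exclusive (write) lock would be blocked.  Returns the PID
 * of a conflicting lock holder, 0 if no other postgres process holds the
 * file, or -1 on error. */
static pid_t
check_for_other_postmaster(int lockfd)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    if (fcntl(lockfd, F_GETLK, &fl) < 0)
        return -1;
    return (fl.l_type == F_UNLCK) ? 0 : fl.l_pid;
}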
On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:
> To ensure that no two postmasters can start up in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries whether anything would block a write (exclusive) lock on the lock file, which returns a lock-holding PID in the case when other postgresql processes are running.

This seems a lot leakier than what we do now (imagine, for example, shared storage) and I'm not sure what the advantage is. I was imagining keeping some portion of the data in sysv shm, and moving the big stuff to a POSIX shm that would operate alongside it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Apr 11, 2011, at 6:06 PM, Robert Haas wrote:

> On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:
>> To ensure that no two postmasters can start up in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries whether anything would block a write (exclusive) lock on the lock file, which returns a lock-holding PID in the case when other postgresql processes are running.
>
> This seems a lot leakier than what we do now (imagine, for example,
> shared storage) and I'm not sure what the advantage is. I was
> imagining keeping some portion of the data in sysv shm, and moving the
> big stuff to a POSIX shm that would operate alongside it.

What do you mean by "leakier"? The goal here is to extinguish SysV shared memory for portability and convenience benefits. The mini-SysV proposal was implemented and shot down by Tom Lane.

Cheers,
M
On Mon, Apr 11, 2011 at 3:11 PM, A.M. <agentm@themactionfaction.com> wrote:
>
> On Apr 11, 2011, at 6:06 PM, Robert Haas wrote:
>
>> On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:
>>> To ensure that no two postmasters can start up in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries whether anything would block a write (exclusive) lock on the lock file, which returns a lock-holding PID in the case when other postgresql processes are running.
>>
>> This seems a lot leakier than what we do now (imagine, for example,
>> shared storage) and I'm not sure what the advantage is. I was
>> imagining keeping some portion of the data in sysv shm, and moving the
>> big stuff to a POSIX shm that would operate alongside it.
>
> What do you mean by "leakier"? The goal here is to extinguish SysV shared memory for portability and convenience benefits. The mini-SysV proposal was implemented and shot down by Tom Lane.

I mean I'm not convinced that fcntl() locking will be as reliable. I know Tom shot that down before, but I still think it's probably the best way forward. The advantage I see is that we would be able to more easily allocate larger chunks of shared memory without changing kernel parameters, and perhaps even to dynamically resize shared memory chunks. That'd be worth the price of admission even if we didn't get all those benefits in one commit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Apr 11, 2011 at 3:11 PM, A.M. <agentm@themactionfaction.com> wrote:
>> What do you mean by "leakier"? The goal here is to extinguish SysV shared memory for portability and convenience benefits. The mini-SysV proposal was implemented and shot down by Tom Lane.

> I mean I'm not convinced that fcntl() locking will be as reliable.

I'm not either. Particularly not on NFS. (Although on NFS you have other issues to worry about too, like postmasters on different machines being able to reach the same data directory. I wonder if we should do both SysV and fcntl locking ...)

> I know Tom shot that down before, but I still think it's probably the
> best way forward.

Did I? I think I pointed out that there's zero gain in portability as long as we still depend on SysV shmem to work. However, if you're doing it for other reasons than portability, it might make sense anyway. The question is whether there are adequate other reasons.

> The advantage I see is that we would be able to
> more easily allocate larger chunks of shared memory without changing
> kernel parameters,

Yes, getting out from under the SHMMAX bugaboo would be awfully nice.

> and perhaps even to dynamically resize shared memory chunks.

This I don't really believe will ever work reliably, especially not in 32-bit machines. Whatever your kernel API is, you still have the problem of finding address space contiguous to what you were already using.

regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:
>> To ensure that no two postmasters can start up in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries whether anything would block a write (exclusive) lock on the lock file, which returns a lock-holding PID in the case when other postgresql processes are running.

> This seems a lot leakier than what we do now (imagine, for example,
> shared storage) and I'm not sure what the advantage is.

BTW, the above-described solution flat out doesn't work anyway, because it has a race condition. Postmaster children have to reacquire the lock after forking, because fcntl locks aren't inherited during fork(). And that means you can't tell whether there's a just-started backend that hasn't yet acquired the lock. It's really critical for our purposes that SysV shmem segments are inherited at fork() and so there's no window where a just-forked backend isn't visible to somebody checking the state of the shmem segment.

regards, tom lane
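To make the window Tom describes concrete, here is a tiny illustration (not code from either patch): fcntl record locks are per-process and are not inherited across fork(), so a freshly forked backend is invisible to an F_GETLK probe until it reacquires the lock itself.

#include <fcntl.h>
#include <unistd.h>

/* Hypothetical fork path: the child must retake the lock on its own. */
static void
fork_backend(int lockfd)
{
    if (fork() == 0)
    {
        struct flock fl = {0};

        /*
         * Window: at this point the child is running but holds no lock,
         * so a concurrent F_GETLK check by another postmaster sees nothing.
         */
        fl.l_type = F_RDLCK;
        fl.l_whence = SEEK_SET;
        fl.l_start = 0;
        fl.l_len = 0;
        fcntl(lockfd, F_SETLK, &fl);    /* reacquire after fork() */
        /* ... proceed as a backend ... */
    }
}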
On Apr 11, 2011, at 7:25 PM, Tom Lane wrote:

> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Apr 10, 2011 at 5:03 PM, A.M. <agentm@themactionfaction.com> wrote:
>>> To ensure that no two postmasters can start up in the same data directory, I use fcntl range locking on the data directory lock file, which also works properly on (properly configured) NFS volumes. Whenever a postmaster or postmaster child starts, it acquires a read (non-exclusive) lock on the data directory's lock file. When a new postmaster starts, it queries whether anything would block a write (exclusive) lock on the lock file, which returns a lock-holding PID in the case when other postgresql processes are running.
>
>> This seems a lot leakier than what we do now (imagine, for example,
>> shared storage) and I'm not sure what the advantage is.
>
> BTW, the above-described solution flat out doesn't work anyway, because
> it has a race condition. Postmaster children have to reacquire the lock
> after forking, because fcntl locks aren't inherited during fork(). And
> that means you can't tell whether there's a just-started backend that
> hasn't yet acquired the lock. It's really critical for our purposes
> that SysV shmem segments are inherited at fork() and so there's no
> window where a just-forked backend isn't visible to somebody checking
> the state of the shmem segment.

Then you haven't looked at my patch, because I address this race condition by ensuring that a lock-holding violator is the postmaster or a postmaster child. If such a condition is detected, the child exits immediately without touching the shared memory. POSIX shmem is inherited via file descriptors.

Cheers,
M
On Apr 11, 2011, at 7:13 PM, Tom Lane wrote:

> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Apr 11, 2011 at 3:11 PM, A.M. <agentm@themactionfaction.com> wrote:
>>> What do you mean by "leakier"? The goal here is to extinguish SysV shared memory for portability and convenience benefits. The mini-SysV proposal was implemented and shot down by Tom Lane.
>
>> I mean I'm not convinced that fcntl() locking will be as reliable.
>
> I'm not either. Particularly not on NFS. (Although on NFS you have
> other issues to worry about too, like postmasters on different machines
> being able to reach the same data directory. I wonder if we should do
> both SysV and fcntl locking ...)

Is there an example of a recent system where fcntl is broken (ignoring NFS)? I believe my patch addresses all potential race conditions and uses the APIs properly to guarantee single-postmaster data directory usage, and I tested on Darwin and a two-year-old Linux kernel. In the end, fcntl locking relies on the same kernel which provides the SysV user count, so I'm not sure what makes it less "reliable", but I have heard that twice now, so I am open to hearing about your experiences.

>> I know Tom shot that down before, but I still think it's probably the
>> best way forward.
>
> Did I? I think I pointed out that there's zero gain in portability as
> long as we still depend on SysV shmem to work. However, if you're doing
> it for other reasons than portability, it might make sense anyway. The
> question is whether there are adequate other reasons.

I provided an example of postmaster failover relying on F_SETLKW in the email with the patch. Also, as you point out above, fcntl locking at least has a chance of working over NFS.

>> The advantage I see is that we would be able to
>> more easily allocate larger chunks of shared memory without changing
>> kernel parameters,
>
> Yes, getting out from under the SHMMAX bugaboo would be awfully nice.

Yes, please! That is my primary motivation for this patch.

>> and perhaps even to dynamically resize shared memory chunks.
>
> This I don't really believe will ever work reliably, especially not in
> 32-bit machines. Whatever your kernel API is, you still have the
> problem of finding address space contiguous to what you were already
> using.

Even if expanding shmem involves copying large regions of memory, it could be at least useful to adjust buffer sizes live without a restart.

Cheers,
M
"A.M." <agentm@themactionfaction.com> writes: > On Apr 11, 2011, at 7:13 PM, Tom Lane wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> I mean I'm not convinced that fcntl() locking will be as reliable. >> I'm not either. Particularly not on NFS. > Is there an example of a recent system where fcntl is broken (ignoring NFS)? Well, the fundamental point is that "ignoring NFS" is not the real world. We can't tell people not to put data directories on NFS, and even if we did tell them not to, they'd still do it. And NFS locking is not trustworthy, because the remote lock daemon can crash and restart (forgetting everything it ever knew) while your own machine and the postmaster remain blissfully awake. None of this is to say that an fcntl lock might not be a useful addition to what we do already. It is to say that fcntl can't just replace what we do already, because there are real-world failure cases that the current solution handles and fcntl alone wouldn't. regards, tom lane
On Apr 13, 2011, at 2:06 AM, Tom Lane wrote:

> "A.M." <agentm@themactionfaction.com> writes:
>> On Apr 11, 2011, at 7:13 PM, Tom Lane wrote:
>>> Robert Haas <robertmhaas@gmail.com> writes:
>>>> I mean I'm not convinced that fcntl() locking will be as reliable.
>
>>> I'm not either. Particularly not on NFS.
>
>> Is there an example of a recent system where fcntl is broken (ignoring NFS)?
>
> Well, the fundamental point is that "ignoring NFS" is not the real
> world. We can't tell people not to put data directories on NFS,
> and even if we did tell them not to, they'd still do it. And NFS
> locking is not trustworthy, because the remote lock daemon can crash
> and restart (forgetting everything it ever knew) while your own machine
> and the postmaster remain blissfully awake.
>
> None of this is to say that an fcntl lock might not be a useful addition
> to what we do already. It is to say that fcntl can't just replace what
> we do already, because there are real-world failure cases that the
> current solution handles and fcntl alone wouldn't.

The goal of this patch is to eliminate SysV shared memory, not to implement NFS-capable locking which, as you point out, is virtually impossible.

As far as I can tell, in the worst case, my patch does not change how postgresql handles the NFS case. SysV shared memory won't work across NFS, so that interlock won't catch, so postgresql is left with looking at a lock file with the PID of a process on another machine, so that won't catch either. This patch does not alter the lock file semantics, but merely augments the file with file locking.

At least with this patch, there is a chance the lock might work across NFS. In the best case, it can allow for shared-storage postgresql failover, which is a new feature.

Furthermore, there is an improvement in shared memory handling in that it is unlinked immediately after creation, so only the postmaster and its children have access to it (through file descriptor inheritance). This means shared memory cannot be stomped on by any other process.

Considering that possibly working NFS locking is a side-effect of this patch and not its goal and, in the worst possible scenario, it doesn't change current behavior, I don't see how this can be a ding against this patch.

Cheers,
M
On Wed, Apr 13, 2011 at 7:20 AM, A.M. <agentm@themactionfaction.com> wrote:
> The goal of this patch is to eliminate SysV shared memory, not to implement NFS-capable locking which, as you point out, is virtually impossible.
>
> As far as I can tell, in the worst case, my patch does not change how postgresql handles the NFS case. SysV shared memory won't work across NFS, so that interlock won't catch, so postgresql is left with looking at a lock file with the PID of a process on another machine, so that won't catch either. This patch does not alter the lock file semantics, but merely augments the file with file locking.
>
> At least with this patch, there is a chance the lock might work across NFS. In the best case, it can allow for shared-storage postgresql failover, which is a new feature.
>
> Furthermore, there is an improvement in shared memory handling in that it is unlinked immediately after creation, so only the postmaster and its children have access to it (through file descriptor inheritance). This means shared memory cannot be stomped on by any other process.
>
> Considering that possibly working NFS locking is a side-effect of this patch and not its goal and, in the worst possible scenario, it doesn't change current behavior, I don't see how this can be a ding against this patch.

I don't see why we need to get rid of SysV shared memory; needing less of it seems just as good.

In answer to your off-list question, one of the principal ways I've seen fcntl() locking fall over and die is when someone removes the lock file. You might think that this could be avoided by picking something important like pg_control as the lock file, but it turns out that doesn't really work:

http://0pointer.de/blog/projects/locking.html

Tom's point is valid too. Many storage appliances present themselves as an NFS server, so it's very plausible for the data directory to be on an NFS server, and there's no guarantee that flock() won't be broken there. If our current interlock were known to be unreliable also maybe we wouldn't care very much, but AFAICT it's been extremely robust.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> In answer to your off-list question, one of the principal ways I've
> seen fcntl() locking fall over and die is when someone removes the
> lock file. You might think that this could be avoided by picking
> something important like pg_control as the lock file, but it turns out
> that doesn't really work:
> http://0pointer.de/blog/projects/locking.html

Hm, I wasn't aware of the fact that you lose an fcntl() lock if you close a *different* file descriptor for the same file. My goodness, that's horrid :-(. It basically means that any third-party code running in a backend could accidentally defeat the locking, and most of the time you'd never even know it had happened ... as opposed to what would happen if you detached from shmem for instance. You could run with such code for years, and probably never know why sometimes the interlock failed to work when you needed it to.

regards, tom lane
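For anyone who has not hit this before, a tiny illustration (invented code, not from the patch) of the POSIX behavior being lamented here: a process's fcntl record locks on a file are released when any descriptor referring to that file is closed, not only the one used to take the lock.

#include <fcntl.h>
#include <unistd.h>

static void
accidental_unlock(const char *lockpath)
{
    int fd1 = open(lockpath, O_RDWR);
    int fd2 = open(lockpath, O_RDONLY);     /* e.g. opened by library code */
    struct flock fl = {0};

    fl.l_type = F_RDLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    fcntl(fd1, F_SETLK, &fl);   /* lock taken through fd1 ... */

    close(fd2);                 /* ... and silently dropped here, per POSIX */
}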
On Apr 13, 2011, at 8:36 PM, Robert Haas wrote:
>
> I don't see why we need to get rid of SysV shared memory; needing less
> of it seems just as good.

1. As long as one keeps SysV shared memory around, the postgresql project has to maintain the annoying platform-specific document on how to configure the poorly named kernel parameters. If the SysV region is very small, that means I can run more postgresql instances within the same kernel limits, but one can still hit the limits. My patch allows the postgresql project to delete that page and the hassles with it.

2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.)

>
> In answer to your off-list question, one of the principal ways I've
> seen fcntl() locking fall over and die is when someone removes the
> lock file. You might think that this could be avoided by picking
> something important like pg_control as the lock file, but it turns out
> that doesn't really work:
>
> http://0pointer.de/blog/projects/locking.html
>
> Tom's point is valid too. Many storage appliances present themselves
> as an NFS server, so it's very plausible for the data directory to be
> on an NFS server, and there's no guarantee that flock() won't be
> broken there. If our current interlock were known to be unreliable
> also maybe we wouldn't care very much, but AFAICT it's been extremely
> robust.

Both you and Tom have somehow assumed that the patch alters current postgresql behavior. In fact, the opposite is true. I haven't changed any of the existing behavior. The "robust" behavior remains. I merely added fcntl interlocking on top of the lock file to replace the SysV shmem check. If someone deletes the postgresql lock file over NFS, the data directory is equally screwed, but with my patch there is a chance that two machines sharing a properly-configured NFS mount can properly interlock- postgresql cannot offer that today, so this is a feature upgrade with no loss. The worst case scenario is today's behavior.

My original goal remains to implement POSIX shared memory, but Tom Lane was right to point out that the current interlocking check relies on SysV, so, even though the startup locking is really orthogonal to shared memory, I implemented what could be considered a separate patch for that and rolled it into one.

I would encourage you to take a look at the patch.

Cheers,
M
On Wed, Apr 13, 2011 at 6:11 PM, A.M. <agentm@themactionfaction.com> wrote:
>> I don't see why we need to get rid of SysV shared memory; needing less
>> of it seems just as good.
>
> 1. As long as one keeps SysV shared memory around, the postgresql project has to maintain the annoying platform-specific document on how to configure the poorly named kernel parameters. If the SysV region is very small, that means I can run more postgresql instances within the same kernel limits, but one can still hit the limits. My patch allows the postgresql project to delete that page and the hassles with it.
>
> 2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.)

With all due respect, I think this is an unproductive conversation. Your patch proves that SysV is wholly unnecessary only if we also agree that fcntl() locking is just as reliable as the nattch interlock, and Tom and I are trying to explain why we don't believe that's the case. Saying that we're just wrong without responding to our points substantively doesn't move the conversation forward.

In case it's not clear, here again is what we're concerned about: A System V shm *cannot* be removed until nobody is attached to it. A lock file can be removed, or the lock can be accidentally released by the apparently innocuous operation of closing a file descriptor.

> Both you and Tom have somehow assumed that the patch alters current postgresql behavior. In fact, the opposite is true. I haven't changed any of the existing behavior. The "robust" behavior remains. I merely added fcntl interlocking on top of the lock file to replace the SysV shmem check.

This seems contradictory. If you replaced the SysV shmem check, then it's not there, which means you altered the behavior.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
"A.M." <agentm@themactionfaction.com> writes: > 1. As long one keeps SysV shared memory around, the postgresql project > has to maintain the annoying platform-specific document on how to > configure the poorly named kernel parameters. No, if it's just a small area, I don't see that that's an issue. You're going to max out on other things (like I/O bandwidth) long before you run into the limit on how many postmasters you can have from this. The reason that those parameters are problematic now is that people tend to want *large* shmem segments and the typical defaults aren't friendly to that. > 2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.) You were losing this argument already, but ad hominem attacks are pretty much guaranteed to get people to tune you out. There are real, substantive, unfixable reasons to not trust fcntl locking completely. > I would encourage you to take a look at the patch. Just to be perfectly clear: I have not read your patch, and am not likely to before the next commitfest starts, because I have approximately forty times too many things to do already. I'm just going off your own abbreviated description of the patch. But from what I know about fcntl locking, it's not a sufficient substitute for the SysV shmem interlock, because it has failure modes that the SysV interlock doesn't, and those failure modes occur in real-world cases. Yeah, it'd be nice to also be able to lock against other postmasters on other NFS clients, but we hardly ever hear of somebody getting burnt by the lack of that (and fcntl wouldn't be a bulletproof defense anyway). On the other hand, accidentally trying to start a duplicate postmaster on the same machine is an everyday occurrence. regards, tom lane
* Tom Lane:

> Well, the fundamental point is that "ignoring NFS" is not the real
> world. We can't tell people not to put data directories on NFS,
> and even if we did tell them not to, they'd still do it. And NFS
> locking is not trustworthy, because the remote lock daemon can crash
> and restart (forgetting everything it ever knew) while your own machine
> and the postmaster remain blissfully awake.

Is this still the case with NFSv4? Does the local daemon still keep the lock state?

> None of this is to say that an fcntl lock might not be a useful addition
> to what we do already. It is to say that fcntl can't just replace what
> we do already, because there are real-world failure cases that the
> current solution handles and fcntl alone wouldn't.

If it requires NFS misbehavior (possibly in an older version), and you have to start postmasters on separate nodes (which you normally wouldn't do), doesn't this make it increasingly unlikely that it's going to be triggered in the wild?

--
Florian Weimer <fweimer@bfk.de>
BFK edv-consulting GmbH        http://www.bfk.de/
Kriegsstraße 100               tel: +49-721-96201-1
D-76133 Karlsruhe              fax: +49-721-96201-99
On Apr 13, 2011, at 9:30 PM, Robert Haas wrote:

> On Wed, Apr 13, 2011 at 6:11 PM, A.M. <agentm@themactionfaction.com> wrote:
>>> I don't see why we need to get rid of SysV shared memory; needing less
>>> of it seems just as good.
>>
>> 1. As long as one keeps SysV shared memory around, the postgresql project has to maintain the annoying platform-specific document on how to configure the poorly named kernel parameters. If the SysV region is very small, that means I can run more postgresql instances within the same kernel limits, but one can still hit the limits. My patch allows the postgresql project to delete that page and the hassles with it.
>>
>> 2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.)
>
> With all due respect, I think this is an unproductive conversation.
> Your patch proves that SysV is wholly unnecessary only if we also
> agree that fcntl() locking is just as reliable as the nattch
> interlock, and Tom and I are trying to explain why we don't believe
> that's the case. Saying that we're just wrong without responding to
> our points substantively doesn't move the conversation forward.

Sorry- it wasn't meant to be an attack- just a dumb pun. I am trying to argue that, even if the fcntl is unreliable, the startup procedure is just as reliable as it is now. The reasons being:

1) the SysV nattch method's primary purpose is to protect the shmem region. This is no longer necessary in my patch because the shared memory is unlinked immediately after creation, so only the initial postmaster and its children have access.

2) the standard postgresql lock file remains the same

Furthermore, there is indeed a case which the SysV nattch check cannot catch while the fcntl locking can: if two separate machines have a postgresql data directory mounted over NFS, postgresql will currently allow both machines to start a postmaster in that directory, because the SysV nattch check fails and then the pid in the lock file is the pid on the first machine, so postgresql will say "starting anyway". With fcntl locking, this can be fixed. SysV only has presence on one kernel.

>
> In case it's not clear, here again is what we're concerned about: A
> System V shm *cannot* be removed until nobody is attached to it. A
> lock file can be removed, or the lock can be accidentally released by
> the apparently innocuous operation of closing a file descriptor.
>
>> Both you and Tom have somehow assumed that the patch alters current postgresql behavior. In fact, the opposite is true. I haven't changed any of the existing behavior. The "robust" behavior remains. I merely added fcntl interlocking on top of the lock file to replace the SysV shmem check.
>
> This seems contradictory. If you replaced the SysV shmem check, then
> it's not there, which means you altered the behavior.

From what I understood, the primary purpose of the SysV check was to protect the shared memory from multiple stompers. The interlock was a neat side-effect.

The lock file contents are currently important to get the pid of a potential, conflicting postmaster. With the fcntl API, we can return a live conflicting PID (whether a postmaster or a stuck child), so that's an improvement. This could be used, for example, for STONITH, to reliably kill a dying replication clone- just loop on the pids returned from the lock.

Even if the fcntl check passes, the pid in the lock file is checked, so the lock file behavior remains the same.
If you were to implement a daemon with a shared data directory but no shared memory, how would you implement the interlock? Would you still insist on SysV shmem? Unix daemons generally rely on lock files alone. Perhaps there is a different API on which we can agree.

Cheers,
M
On Apr 14, 2011, at 8:22 AM, Florian Weimer wrote:

> * Tom Lane:
>
>> Well, the fundamental point is that "ignoring NFS" is not the real
>> world. We can't tell people not to put data directories on NFS,
>> and even if we did tell them not to, they'd still do it. And NFS
>> locking is not trustworthy, because the remote lock daemon can crash
>> and restart (forgetting everything it ever knew) while your own machine
>> and the postmaster remain blissfully awake.
>
> Is this still the case with NFSv4? Does the local daemon still keep
> the lock state?

The lock handling has been fixed in NFSv4.

http://nfs.sourceforge.net/
"NFS Version 4 introduces support for byte-range locking and share reservation. Locking in NFS Version 4 is lease-based, so an NFS Version 4 client must maintain contact with an NFS Version 4 server to continue extending its open and lock leases."

http://linux.die.net/man/2/flock
"flock(2) does not lock files over NFS. Use fcntl(2) instead: that does work over NFS, given a sufficiently recent version of Linux and a server which supports locking."

I would need some more time to dig up what "recent version of Linux" specifies, but NFSv4 is likely required.

>
>> None of this is to say that an fcntl lock might not be a useful addition
>> to what we do already. It is to say that fcntl can't just replace what
>> we do already, because there are real-world failure cases that the
>> current solution handles and fcntl alone wouldn't.
>
> If it requires NFS misbehavior (possibly in an older version), and you
> have to start postmasters on separate nodes (which you normally
> wouldn't do), doesn't this make it increasingly unlikely that it's
> going to be triggered in the wild?

With the patch I offer, it would be possible to use shared storage and failover postgresql nodes on different machines over NFS. (The second postmaster blocks and waits for the lock to be released.) Obviously, such a setup isn't as strong as using replication, but given a sufficiently fail-safe shared storage setup, it could be made reliable.

Cheers,
M
On Apr 13, 2011, at 11:37 PM, Tom Lane wrote:

> "A.M." <agentm@themactionfaction.com> writes:
>> 1. As long as one keeps SysV shared memory around, the postgresql project
>> has to maintain the annoying platform-specific document on how to
>> configure the poorly named kernel parameters.
>
> No, if it's just a small area, I don't see that that's an issue.
> You're going to max out on other things (like I/O bandwidth) long before
> you run into the limit on how many postmasters you can have from this.
> The reason that those parameters are problematic now is that people tend
> to want *large* shmem segments and the typical defaults aren't friendly
> to that.

That's assuming that no other processes on the system are using up the available segments (such as older postgresql instances).

>> 2. My patch proves that SysV is wholly unnecessary. Are you attached to it? (Pun intended.)
>
> You were losing this argument already, but ad hominem attacks are pretty
> much guaranteed to get people to tune you out.

I apologized to Robert Haas in another post- no offense was intended.

> There are real,
> substantive, unfixable reasons to not trust fcntl locking completely.

...on NFS, which the postgresql community doesn't recommend anyway. But even in that case, the existing lock file (even without the fcntl lock) can catch that case via the PID in the file contents. That is what I meant when I claimed that the behavior remains the same.

>
>> I would encourage you to take a look at the patch.
>
> Just to be perfectly clear: I have not read your patch, and am not
> likely to before the next commitfest starts, because I have
> approximately forty times too many things to do already. I'm just going
> off your own abbreviated description of the patch. But from what I know
> about fcntl locking, it's not a sufficient substitute for the SysV shmem
> interlock, because it has failure modes that the SysV interlock doesn't,
> and those failure modes occur in real-world cases. Yeah, it'd be nice
> to also be able to lock against other postmasters on other NFS clients,
> but we hardly ever hear of somebody getting burnt by the lack of that
> (and fcntl wouldn't be a bulletproof defense anyway). On the other
> hand, accidentally trying to start a duplicate postmaster on the same
> machine is an everyday occurrence.

I really do appreciate the time you have put into feedback. I posed this question also to Robert Haas: is there a different API which you would find acceptable? I chose fcntl because it seemed well-suited for this task, but the feedback has been regarding NFS v<4 concerns.

Cheers,
M
On Thu, Apr 14, 2011 at 10:26:33AM -0400, A.M. wrote:
> 1) the SysV nattch method's primary purpose is to protect the shmem
> region. This is no longer necessary in my patch because the shared
> memory is unlinked immediately after creation, so only the initial
> postmaster and its children have access.

Umm, you don't unlink SysV shared memory. All the flag does is make sure it goes away when the last user goes away. In the meantime people can still connect to it.

> The lock file contents are currently important to get the pid of a
> potential, conflicting postmaster. With the fcntl API, we can return
> a live conflicting PID (whether a postmaster or a stuck child), so
> that's an improvement. This could be used, for example, for STONITH,
> to reliably kill a dying replication clone- just loop on the pids
> returned from the lock.

SysV shared memory also gives you a PID, that's the point.

> Even if the fcntl check passes, the pid in the lock file is checked, so the lock file behavior remains the same.

The interlock is to make sure there are no living postmaster children. The lockfile won't tell you that. So the issue is that while fcntl can work, sysv can do better.

Also, I think you underestimate the value of the current interlock. Before this people did manage to trash their databases regularly this way. Lockfiles can be deleted and yes, people do it all the time.

Actually, it occurs to me you can solve the NFS problem by putting the lockfile in the socket dir. That can't possibly be on NFS.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Patriotism is when love of your own people comes first; nationalism,
> when hate for people other than your own comes first.
> - Charles de Gaulle
On Thu, Apr 14, 2011 at 7:26 AM, A.M. <agentm@themactionfaction.com> wrote:
> From what I understood, the primary purpose of the SysV check was to protect the shared memory from multiple stompers. The interlock was a neat side-effect.

Not really - the purpose of the interlock is to protect the underlying data files. The nattch interlock allows us to be very confident that there isn't another postmaster running on the same data directory on the same machine, and that is extremely important. You've just about convinced me that it might not be a bad idea to add the fcntl() interlock in addition because, as you say, that has a chance of working even over NFS. But the interlock we have now is *extremely* reliable, and I think we'd need to get some other amazingly compelling benefit to consider changing it (even if we were convinced that the alternate method was also reliable). I don't see that there is one.

Anyone running an existing version of PostgreSQL in an environment where they care *at all* about performance has already adjusted their SysV shm settings way up. The benefit of using POSIX shm is that in, say, PostgreSQL 9.2, it might be possible for shared buffers to have a somewhat higher default setting out of the box, and be further increased from there without kernel parameter changes. And there might be more benefits besides that, but certainly those by themselves seem pretty worthwhile.

SysV shm is extremely portable, so I don't think that we're losing anything by continuing to allocate a small amount of it (a few kilobytes, perhaps) and just push everything else out into POSIX shm.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
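To make the hybrid sketched in this last message concrete, a minimal illustration follows (all names and sizes are my own assumptions, not anyone's patch): a few bytes of SysV memory retained solely for the nattch interlock, with the bulk of shared memory in an unlinked POSIX shm object that is not subject to SHMMAX.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <unistd.h>

static void *
create_hybrid_shmem(key_t key, size_t size, int *sysv_id, int *posix_fd)
{
    char name[64];

    /* A tiny SysV segment: its nattch count is what lets a restarting
     * postmaster detect leftover backends from a crashed predecessor. */
    *sysv_id = shmget(key, 32, IPC_CREAT | IPC_EXCL | 0600);
    if (*sysv_id < 0 || shmat(*sysv_id, NULL, 0) == (void *) -1)
        return NULL;

    /* The bulk of shared memory lives in a POSIX shm object, unlinked
     * immediately so only this process tree can reach it, and sized
     * without regard to SHMMAX. */
    snprintf(name, sizeof(name), "/pgsql.%d", (int) getpid());
    *posix_fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (*posix_fd < 0)
        return NULL;
    shm_unlink(name);
    if (ftruncate(*posix_fd, (off_t) size) < 0)
        return NULL;
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                *posix_fd, 0);
}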