SOLVED: unexpected EIDRM on Linux - Mailing list pgsql-hackers
From | Tom Lane |
---|---|
Subject | SOLVED: unexpected EIDRM on Linux |
Date | |
Msg-id | 20773.1183400658@sss.pgh.pa.us |
List | pgsql-hackers |
It's a plain old Linux kernel bug: it returns EIDRM when it really ought
to say EINVAL, and apparently always has.  The surprising part is really
that we've not seen it many times before.

Kudos to Michael Fuhr for thinking to write a test program investigating
whether randomly-chosen IDs would yield EIDRM --- that was what led me
to study the kernel source code closely enough to realize it was just
wrong.

regards, tom lane

------- Forwarded Messages

Date: Mon, 2 Jul 2007 10:59:43 -0600
From: Michael Fuhr <mike@fuhr.org>
To: Tom Lane <tgl@sss.pgh.pa.us>
Subject: Re: [GENERAL] shmctl EIDRM preventing startup

I don't know if this is relevant but on both the box that rebooted
and on another box that's been up for several weeks I see a pattern
of shmid's for which shmctl() returns EIDRM (the EACCES errors are
for segments that are in use by another user; I'm not running as
root):

$ ./shmctl-test 0 1048576
shmctl(0 / 0): ERROR: Identifier removed
shmctl(1 / 0x1): ERROR: Identifier removed
shmctl(2 / 0x2): ERROR: Identifier removed
shmctl(32768 / 0x8000): ERROR: Identifier removed
shmctl(32769 / 0x8001): ERROR: Identifier removed
shmctl(32770 / 0x8002): ERROR: Identifier removed
shmctl(65536 / 0x10000): ERROR: Permission denied
shmctl(65537 / 0x10001): ERROR: Identifier removed
shmctl(65538 / 0x10002): ERROR: Identifier removed
shmctl(98304 / 0x18000): ERROR: Identifier removed
shmctl(98305 / 0x18001): ERROR: Permission denied
shmctl(98306 / 0x18002): ERROR: Identifier removed
shmctl(131072 / 0x20000): ERROR: Identifier removed
shmctl(131073 / 0x20001): ERROR: Identifier removed
shmctl(131074 / 0x20002): ERROR: Identifier removed
shmctl(163840 / 0x28000): ERROR: Identifier removed
shmctl(163841 / 0x28001): ERROR: Identifier removed
shmctl(163842 / 0x28002): ERROR: Permission denied
[...]
shmctl(983040 / 0xf0000): ERROR: Identifier removed
shmctl(983041 / 0xf0001): ERROR: Identifier removed
shmctl(983042 / 0xf0002): ERROR: Identifier removed
shmctl(1015808 / 0xf8000): ERROR: Identifier removed
shmctl(1015809 / 0xf8001): ERROR: Identifier removed
shmctl(1015810 / 0xf8002): ERROR: Identifier removed
shmctl(1048576 / 0x100000): ERROR: Identifier removed

-- 
Michael Fuhr

#include <sys/ipc.h>
#include <sys/shm.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int
main(int argc, char *argv[])
{
    int              shmid, min_shmid, max_shmid, tmp_shmid;
    struct shmid_ds  buf;

    if (argc != 3) {
        fprintf(stderr, "Usage: %s min_shmid max_shmid\n", argv[0]);
        return EXIT_FAILURE;
    }

    min_shmid = atoi(argv[1]);
    max_shmid = atoi(argv[2]);

    if (min_shmid > max_shmid) {
        tmp_shmid = min_shmid;
        min_shmid = max_shmid;
        max_shmid = tmp_shmid;
    }

    for (shmid = min_shmid; shmid <= max_shmid; shmid++) {
        if (shmctl(shmid, IPC_STAT, &buf) == -1 && errno != EINVAL) {
            printf("shmctl(%d / %#x): ERROR: %s\n", shmid, shmid, strerror(errno));
        }
    }

    return EXIT_SUCCESS;
}

------- Message 2

Date: Mon, 02 Jul 2007 14:17:05 -0400
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Michael Fuhr <mike@fuhr.org>
Subject: Re: [GENERAL] shmctl EIDRM preventing startup

Michael Fuhr <mike@fuhr.org> writes:
> On Mon, Jul 02, 2007 at 01:14:01PM -0400, Tom Lane wrote:
>> Oh, that's pretty durn interesting.  I get the same type of pattern on
>> my FC6 box, but not on HPUX.

> I don't get this pattern on FreeBSD 6.2 or Solaris 9 either.

Well, I've just traced through the Linux code, and I find:

1. The low-order 15 bits of the shmid are simply an index into an array
of valid shmem entries.  I'm not sure what is in index 0, but there's
apparently a live entry of some sort there.  Index 1 is the first actual
shmem segment allocated, and thereafter the first free slot is chosen
whenever you make a new shmem segment.

2. When you try to stat a segment, it takes the low-order 15 bits of
the supplied ID and indexes into this array.  If no such entry (out of
range, or NULL entry) you get EINVAL as expected.  If there's an entry
but its high-order ID bits don't match the supplied ID, you get EIDRM.

This is why the set of EIDRM IDs moves around as you create and delete
valid segments.

As near as I can tell, this is flat out a case of the kernel returning
the wrong error code.  It should say EINVAL when there's a mismatch.

It's a bit surprising that we have not seen a lot more reports of this
problem, because AFAICS the probability of a collision is extremely high
if there's more than one creator of shmem segments on a system.  I can
reproduce the bug as follows:

1. Start postmaster 1.
2. Start postmaster 2 (different data directory and port).
3. Manually kill -9 both postmasters.
4. Manually ipcrm both shmem segments.
5. Start postmaster 2.
6. (Try to) start postmaster 1 --- it will fail because of EIDRM,
   because its saved shmem id points at slot 1 which is now in use
   by postmaster 2.

I'm going to generate a smaller test program showing this and file a bug
report at Red Hat.  In the mean time, it looks like we should assume
EIDRM means EINVAL on Linux, because AFAICS there is not actually
anyplace in that code that should return EIDRM; their data structure
doesn't really have any state that would justify returning such a code.

regards, tom lane

------- End of Forwarded Messages
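
The ID layout and the proposed workaround from the second forwarded message can be sketched in C roughly as below. This is only an illustration under stated assumptions, not kernel or PostgreSQL code: the names SLOT_BITS, shmid_slot, shmid_high, errno_means_no_segment and segment_exists are all hypothetical, and the 15-bit split is taken from the description in the message rather than from any kernel header.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Assumption: 15 index bits, as described in the message; the actual
 * kernel constant is not named there. */
#define SLOT_BITS 15
#define SLOT_MASK ((1 << SLOT_BITS) - 1)

/* Low-order 15 bits: the index into the kernel's array of shmem entries. */
int shmid_slot(int shmid)
{
    return shmid & SLOT_MASK;
}

/* Remaining high-order bits, which must match those of the live entry. */
int shmid_high(int shmid)
{
    assert(shmid >= 0);          /* valid shmids are non-negative */
    return shmid >> SLOT_BITS;
}

/* Hypothetical helper: does this errno from shmctl(IPC_STAT) mean
 * "no such segment"?  On Linux, EIDRM is treated like EINVAL, per the
 * workaround suggested in the message. */
bool errno_means_no_segment(int err)
{
    if (err == EINVAL)
        return true;
#ifdef __linux__
    if (err == EIDRM)
        return true;
#endif
    return false;
}

/* Hypothetical wrapper: 1 if the segment exists, 0 if it does not,
 * -1 on any other error. */
int segment_exists(int shmid)
{
    struct shmid_ds buf;

    if (shmctl(shmid, IPC_STAT, &buf) == 0)
        return 1;
    if (errno_means_no_segment(errno))
        return 0;
    if (errno == EACCES)
        return 1;                /* exists, but belongs to another user */
    return -1;
}
```

For example, shmid_slot(98305) is 1 and shmid_high(98305) is 3, which lines up with the clusters at multiples of 32768 in the test output: the same few slots show up again under different high-order bits.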