Thread: postgresql-server exiting abnormally after an OS upgrade
Hi, I've been running postgresql-server for some time on an OpenBSD box, and the latest OS upgrade (which also upgraded from 9.3 to 9.4) seems to have broke= n it in some way. I tried downgrading back to 9.3.6, but the issue persists, so = it looks like something else is broken, and not postgresql per se. Of course, I did the proper upgrade procedure, pg_dumpall, move everything = out of the way, and reinit the db. I also tried *not* reimporting the db on my last wipeout, and keeping the completely empty database. It crashed too upon connection attempts (which f= ail to authenticate because I haven't even created the users), so I guess the d= ata is no problem. The database is pretty small, and very few concurrent users = (the *whole* backup is ~400kB). Here's the error which repeats ad infinitum: LOG: database system was shut down at 2015-02-12 14:41:29 GMT LOG: database system is ready to accept connections LOG: autovacuum launcher started LOG: server process (PID 21205) was terminated by signal 6: Abort trap LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server pro= cess DETAIL: The postmaster has commanded this server process to roll back = the current transaction and exit, because another server process exited abn= ormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and = repeat your command. LOG: all server processes terminated; reinitializing LOG: database system was interrupted; last known up at 2015-02-12 14:4= 2:04 GMT LOG: database system was not properly shut down; automatic recovery in= progress LOG: record with zero length at 0/1690A90 LOG: redo is not required LOG: database system is ready to accept connections LOG: autovacuum launcher started I've searchd all over, and found little or nothing of help. Nothing seemed = to change the situation. I believe that these settings *are* relevant, so: kern.seminfo.semmni=3D256 kern.seminfo.semmns=3D512 kern.seminfo.semmnu=3D256 kern.seminfo.semmap=3D256 kern.shminfo.shmmax=3D134217728 kern.shminfo.shmseg=3D256 I also tried setting shared_buffers =3D 16MB, which, while small, discards several potential configuration issues. Note that before upgrading to the latest OS version, pg worked fine. Also relevant, other users of the same release have confirmed not having this is= sue and pg working fine for them. Any pointer? Thanks, --=20 Hugo Osvaldo Barrera A: Because we read from top to bottom, left to right. Q: Why should I start my reply below the quoted text?
On 02/13/2015 12:42 AM, Hugo Osvaldo Barrera wrote: > Here's the error which repeats ad infinitum: > > LOG: database system was shut down at 2015-02-12 14:41:29 GMT > LOG: database system is ready to accept connections > LOG: autovacuum launcher started > LOG: server process (PID 21205) was terminated by signal 6: Abort trap > LOG: terminating any other active server processes > WARNING: terminating connection because of crash of another server process > DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because anotherserver process exited abnormally and possibly corrupted shared memory. > HINT: In a moment you should be able to reconnect to the database and repeat your command. Can you get a core dump and a stack trace from it? - Heikki
On 2015-02-13 15:29, Heikki Linnakangas wrote: > On 02/13/2015 12:42 AM, Hugo Osvaldo Barrera wrote: > >Here's the error which repeats ad infinitum: > > > > LOG: database system was shut down at 2015-02-12 14:41:29 GMT > > LOG: database system is ready to accept connections > > LOG: autovacuum launcher started > > LOG: server process (PID 21205) was terminated by signal 6: Abort = trap > > LOG: terminating any other active server processes > > WARNING: terminating connection because of crash of another server= process > > DETAIL: The postmaster has commanded this server process to roll b= ack the current transaction and exit, because another server process exited= abnormally and possibly corrupted shared memory. > > HINT: In a moment you should be able to reconnect to the database = and repeat your command. >=20 > Can you get a core dump and a stack trace from it? >=20 > - Heikki >=20 >=20 >=20 > --=20 > Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-bugs I'm slightly unsure how to do this - the docs give some references on how to select a single process and attempt to attach gdb to it, but they seem to be too short lived for me to do anything (<1s between errors), so I can't real= ly work with that. Do you have any suggestions for this? I'll try to enable core dumps in the meantime and get back to you on this, thank, --=20 Hugo Osvaldo Barrera A: Because we read from top to bottom, left to right. Q: Why should I start my reply below the quoted text?
On 2015-02-13 15:29, Heikki Linnakangas wrote: > On 02/13/2015 12:42 AM, Hugo Osvaldo Barrera wrote: > >Here's the error which repeats ad infinitum: > > > > LOG: database system was shut down at 2015-02-12 14:41:29 GMT > > LOG: database system is ready to accept connections > > LOG: autovacuum launcher started > > LOG: server process (PID 21205) was terminated by signal 6: Abort = trap > > LOG: terminating any other active server processes > > WARNING: terminating connection because of crash of another server= process > > DETAIL: The postmaster has commanded this server process to roll b= ack the current transaction and exit, because another server process exited= abnormally and possibly corrupted shared memory. > > HINT: In a moment you should be able to reconnect to the database = and repeat your command. >=20 > Can you get a core dump and a stack trace from it? >=20 > - Heikki >=20 It certainly took me a bit, but, with some help, I managed to get one: (gdb) bt #0 0x0000110a2815b92a in kill () at <stdin>:2 #1 0x0000110a28195119 in abort () at /usr/src/lib/libc/stdlib/abort.c:53 #2 0x0000110a2816a238 in memcpy (dst0=3D0xfb8d4, src0=3D0x6, length=3D0)= at /usr/src/lib/libc/string/memcpy.c:65 #3 0x000011080cf8d1b1 in check_ip (raddr=3D0x110a899f7918, addr=3D0x110a= 899f9058, mask=3D0x110a899f9158) at hba.c:704 #4 0x000011080cf90a04 in check_hba (port=3D0x110a899f7800) at hba.c:1718 #5 0x000011080cf91d34 in hba_getauthmethod (port=3D0x110a899f7800) at hb= a.c:2256 #6 0x000011080cf88eb3 in ClientAuthentication (port=3D0x110a899f7800) at= auth.c:307 #7 0x000011080d1edf5d in PerformAuthentication (port=3D0x110a899f7800) a= t postinit.c:223 #8 0x000011080d1eeae7 in InitPostgres (in_dbname=3D0x110af4508c00 "virts= tart-dev", dboid=3D0,=20 username=3D0x110af4508be0 "virtstart-dev", out_dbname=3D0x0) at posti= nit.c:688 #9 0x000011080d0a3eb1 in PostgresMain (argc=3D1, argv=3D0x110af4508c20, = dbname=3D0x110af4508c00 "virtstart-dev",=20 username=3D0x110af4508be0 "virtstart-dev") at postgres.c:3749 #10 0x000011080d033537 in BackendRun (port=3DCould not find the frame bas= e for "BackendRun". ) at postmaster.c:4155 #11 0x000011080d032be8 in BackendStartup (port=3D0x110a899f7800) at postm= aster.c:3829 #12 0x000011080d02f2d0 in ServerLoop () at postmaster.c:1597 #13 0x000011080d02e968 in PostmasterMain (argc=3D3, argv=3D0x7f7ffffd9658= ) at postmaster.c:1244 #14 0x000011080cf96dc8 in main (argc=3DCould not find the frame base for = "main". ) at main.c:228 Current language: auto; currently asm I stopped pg once it crashed and then bt'd the core. Does this help at all?= Is there something else I can provide? Thanks, --=20 Hugo Osvaldo Barrera A: Because we read from top to bottom, left to right. Q: Why should I start my reply below the quoted text?
Hugo Osvaldo Barrera <hugo@barrera.io> writes: > On 2015-02-13 15:29, Heikki Linnakangas wrote: >> Can you get a core dump and a stack trace from it? > It certainly took me a bit, but, with some help, I managed to get one: > (gdb) bt > #0 0x0000110a2815b92a in kill () at <stdin>:2 > #1 0x0000110a28195119 in abort () at /usr/src/lib/libc/stdlib/abort.c:53 > #2 0x0000110a2816a238 in memcpy (dst0=0xfb8d4, src0=0x6, length=0) at /usr/src/lib/libc/string/memcpy.c:65 > #3 0x000011080cf8d1b1 in check_ip (raddr=0x110a899f7918, addr=0x110a899f9058, mask=0x110a899f9158) at hba.c:704 > #4 0x000011080cf90a04 in check_hba (port=0x110a899f7800) at hba.c:1718 Huh. You didn't say whether this is 9.3 or 9.4, but it doesn't matter because that code is the same in both; the crash is happening here: static bool check_ip(SockAddr *raddr, struct sockaddr * addr, struct sockaddr * mask) { ... /* * If we're connected on IPv6 but the file specifies an IPv4 address * to match against, promote the latter to an IPv6 address before * trying to match the client's address. */ struct sockaddr_storage addrcopy, maskcopy; memcpy(&addrcopy, &addr, sizeof(addrcopy)); ---> memcpy(&maskcopy, &mask, sizeof(maskcopy)); Looking at this, the code seems wrong on its face: the memcpy sources should be "addr" and "mask", not "&addr" and "&mask". Apparently, no one has ever used or tested this case since it was put in :-(. I find that a bit astonishing given the steady creep of IPv6 support, but it's hard to see how the code could possibly work correctly as-is. Probably the reason why this was triggered by an OS upgrade is that your system is now choosing to make the DB connection over IPv6 not IPv4. Or maybe it was doing that right along but you now have a memcpy that is doing more checking of its arguments than before (this is evidently an argument sanity check and not a straight SIGSEGV). Anyway you could work around it for now by being sure to list IPv6 equivalent addresses before IPv4 addresses in your pg_hba.conf. We'll fix it properly in the next releases. regards, tom lane
On 2015-02-16 15:55, Tom Lane wrote: > Hugo Osvaldo Barrera <hugo@barrera.io> writes: > > On 2015-02-13 15:29, Heikki Linnakangas wrote: > >> Can you get a core dump and a stack trace from it? >=20 > > It certainly took me a bit, but, with some help, I managed to get one: >=20 > > (gdb) bt > > #0 0x0000110a2815b92a in kill () at <stdin>:2 > > #1 0x0000110a28195119 in abort () at /usr/src/lib/libc/stdlib/abort.= c:53 > > #2 0x0000110a2816a238 in memcpy (dst0=3D0xfb8d4, src0=3D0x6, length= =3D0) at /usr/src/lib/libc/string/memcpy.c:65 > > #3 0x000011080cf8d1b1 in check_ip (raddr=3D0x110a899f7918, addr=3D0x= 110a899f9058, mask=3D0x110a899f9158) at hba.c:704 > > #4 0x000011080cf90a04 in check_hba (port=3D0x110a899f7800) at hba.c:= 1718 >=20 > Huh. You didn't say whether this is 9.3 or 9.4, but it doesn't matter > because that code is the same in both; the crash is happening here: >=20 It's 9.4, the issue appeared after an OS upgrade which was accompanied by an upgrade to 9.3 I *did* try to downgrade back to 9.3 at that point, but got = the same issue. > static bool > check_ip(SockAddr *raddr, struct sockaddr * addr, struct sockaddr * mask) > { > ... > /* > * If we're connected on IPv6 but the file specifies an IPv4 addr= ess > * to match against, promote the latter to an IPv6 address before > * trying to match the client's address. > */ > struct sockaddr_storage addrcopy, > maskcopy; >=20 > memcpy(&addrcopy, &addr, sizeof(addrcopy)); > ---> memcpy(&maskcopy, &mask, sizeof(maskcopy)); >=20 > Looking at this, the code seems wrong on its face: the memcpy sources > should be "addr" and "mask", not "&addr" and "&mask". Apparently, no one > has ever used or tested this case since it was put in :-(. I find that > a bit astonishing given the steady creep of IPv6 support, but it's hard > to see how the code could possibly work correctly as-is. >=20 > Probably the reason why this was triggered by an OS upgrade is that your > system is now choosing to make the DB connection over IPv6 not IPv4. > Or maybe it was doing that right along but you now have a memcpy that is > doing more checking of its arguments than before (this is evidently an > argument sanity check and not a straight SIGSEGV). >=20 My host has been configured to use IPv6 by default for some time now, but it *is* possible that there has been some OS-level change during this upgrade = that affected this in some way. > Anyway you could work around it for now by being sure to list IPv6 > equivalent addresses before IPv4 addresses in your pg_hba.conf. > We'll fix it properly in the next releases. >=20 Do you mean IPv6 *before* IPv4? Wouldn't that add a preference to IPv6? I currently have the stock out-of-the-box pg_hba: # TYPE DATABASE USER ADDRESS METHOD # "local" is for Unix domain socket connections only local all all md5 # IPv4 local connections: host all all 127.0.0.1/32 md5 # IPv6 local connections: host all all ::1/128 md5 (I deleted commented out lines for readability). > regards, tom lane >=20 Thanks, --=20 Hugo Osvaldo Barrera A: Because we read from top to bottom, left to right. Q: Why should I start my reply below the quoted text?
Hugo Osvaldo Barrera <hugo@barrera.io> writes: > On 2015-02-16 15:55, Tom Lane wrote: >> Probably the reason why this was triggered by an OS upgrade is that your >> system is now choosing to make the DB connection over IPv6 not IPv4. >> Or maybe it was doing that right along but you now have a memcpy that is >> doing more checking of its arguments than before (this is evidently an >> argument sanity check and not a straight SIGSEGV). > My host has been configured to use IPv6 by default for some time now, but it > *is* possible that there has been some OS-level change during this upgrade that > affected this in some way. On further thought the answer almost certainly is that you now have a libc version in which memcpy() complains strenuously (ie, aborts) if given overlapping source and target areas. Such "helpfulness" is the new fad among libc maintainers :-( >> Anyway you could work around it for now by being sure to list IPv6 >> equivalent addresses before IPv4 addresses in your pg_hba.conf. >> We'll fix it properly in the next releases. > Do you mean IPv6 *before* IPv4? Wouldn't that add a preference to IPv6? No, not particularly, considering the cases are disjoint. An incoming connection is one or the other. > I currently have the stock out-of-the-box pg_hba: > # TYPE DATABASE USER ADDRESS METHOD > # "local" is for Unix domain socket connections only > local all all md5 > # IPv4 local connections: > host all all 127.0.0.1/32 md5 > # IPv6 local connections: > host all all ::1/128 md5 The problem comes when an incoming IPv6 connection is compared to that 127.0.0.1 line. If you're using only IPv6 anyway, you could just remove that line. Or you could do something like this: host all all ::1/128 md5 host all all ::0/0 reject host all all 127.0.0.1/32 md5 so that all IPv6 connections are either accepted or rejected before reaching the 127.0.0.1 line. regards, tom lane
Fwd: Re: postgresql-server exiting abnormally after upgrade to -snapshot
From
Hugo Osvaldo Barrera
Date:
Hi, This comes as a follow-up on a recent issue I mentioned and was discussed on this list. (See thread [1]). The included patch fixes the issue and postgres has been working fine for me with it. See this thread[2] for details. Cheers, ----- Forwarded message from J=E9r=E9mie Courr=E8ges-Anglas <jca@wxcvbn.org= > ----- > Date: Mon, 16 Feb 2015 23:21:05 +0100 > From: J=E9r=E9mie Courr=E8ges-Anglas <jca@wxcvbn.org> > Subject: Re: postgresql-server exiting abnormally after upgrade to -snaps= hot > To: Hugo Osvaldo Barrera <hugo@barrera.io> > Cc: misc@openbsd.org > Mail-Followup-To: Hugo Osvaldo Barrera <hugo@barrera.io>, misc@openbsd.org >=20 > jca@wxcvbn.org (J=E9r=E9mie Courr=E8ges-Anglas) writes: >=20 > > Please try the diff below. It fixes the "backwards memcpy" problem > > easily noticeable with psql -h ::1. >=20 > Updated diff. Thanks to Stuart for reminding me that netmasks sa_len > values can be much surprising. >=20 > $OpenBSD$ > --- src/backend/libpq/hba.c.orig Mon Feb 16 21:53:21 2015 > +++ src/backend/libpq/hba.c Mon Feb 16 23:08:38 2015 > @@ -700,8 +700,13 @@ check_ip(SockAddr *raddr, struct sockaddr * addr, st= ru > struct sockaddr_storage addrcopy, > maskcopy; > =20 > - memcpy(&addrcopy, &addr, sizeof(addrcopy)); > - memcpy(&maskcopy, &mask, sizeof(maskcopy)); > + memcpy(&addrcopy, addr, sizeof(struct sockaddr_in)); > + /* > + * On some OSes, if mask is obtained from eg. getifaddrs(3), sa_len > + * can vary wildly. We already know that addr->sa_family =3D=3D AF_INE= T, > + * so just use sizeof(struct sockaddr_in). > + */ > + memcpy(&maskcopy, mask, sizeof(struct sockaddr_in)); > pg_promote_v4_to_v6_addr(&addrcopy); > pg_promote_v4_to_v6_mask(&maskcopy); > =20 >=20 >=20 > --=20 > jca | PGP : 0x1524E7EE / 5135 92C1 AD36 5293 2BDF DDCC 0DFA 74AE 1524 E7= EE >=20 ----- End forwarded message ----- [1]: http://www.postgresql.org/message-id/20150216202138.GB20474@athena.bar= rera.io [2]: https://marc.info/?t=3D142367714200001&r=3D1&w=3D2 --=20 Hugo Osvaldo Barrera A: Because we read from top to bottom, left to right. Q: Why should I start my reply below the quoted text?
Hugo Osvaldo Barrera <hugo@barrera.io> writes: > The included patch fixes the issue and postgres has been working fine for me > with it. See this thread[2] for details. Hm, I was wondering whether copying sizeof(struct sockaddr_storage) might not be a risky choice. However, the conclusion on the -hackers thread http://www.postgresql.org/message-id/flat/23855.1424136475@sss.pgh.pa.us seems to be that we're going to eliminate the entire issue by just ripping out this code section, lock stock n barrel. Thanks for confirming that this fixes it for you, though --- at least we now have a certain diagnosis of the problem. regards, tom lane