Re: [ADMIN]openvz and shared memory trouble - Mailing list pgsql-general
From | Adrian Klaver |
---|---|
Subject | Re: [ADMIN]openvz and shared memory trouble |
Date | |
Msg-id | 53397DE7.2070903@aklaver.com |
In response to | Re: [ADMIN]openvz and shared memory trouble (Willy-Bas Loos <willybas@gmail.com>) |
Responses | Re: [ADMIN]openvz and shared memory trouble |
List | pgsql-general |
On 03/31/2014 04:12 AM, Willy-Bas Loos wrote:
>
> On Sat, Mar 29, 2014 at 6:17 PM, Adrian Klaver
> <adrian.klaver@aklaver.com> wrote:
>
>     On 03/29/2014 08:19 AM, Willy-Bas Loos wrote:
>
>         The error that shows up is a Bus error.
>         That's on the replication slave.
>         Here's the log about it:
>         2014-03-29 12:41:33 CET db: ip: us: FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
>         This probably means the server terminated abnormally before or while processing the request.
>
>         cp: cannot stat `/data/postgresql/9.1/main/__wal_archive/__00000001000000720000000A': No such file or directory
>         2014-03-29 12:41:33 CET db: ip: us: LOG: unexpected pageaddr 71/E9DA0000 in log file 114, segment 10, offset 14286848
>         cp: cannot stat `/data/postgresql/9.1/main/__wal_archive/__00000001000000720000000A': No such file or directory
>         2014-03-29 12:41:33 CET db: ip: us: LOG: streaming replication successfully connected to primary
>         2014-03-29 12:41:48 CET db: ip: us: LOG: startup process (PID 17452) was terminated by signal 7: Bus error
>         2014-03-29 12:41:48 CET db: ip: us: LOG: terminating any other active server processes
>         2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos WARNING: terminating connection because of crash of another server process
>         2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
>         2014-03-29 12:41:48 CET db:wbloos ip:[local] us:wbloos HINT: In a moment you should be able to reconnect to the database and repeat your command.
>
>
>     Well what I am seeing are WAL log errors. One saying no file is
>     present, the other pointing at a possible file corruption.
>
> Those are normal notices, nothing to worry about.

Well other than they cause the standby to reconnect to the primary, during which a crash occurs.

>
>     Shared memory problems are offered as a possible cause only. Right
>     now I would say we are seeing only half the picture. The Postgres
>     logs from the same time period for the primary server, as well as
>     the system logs for the openvz container would help fill in the
>     other half of the picture.
>
>
> Here's the log from the primary postgres server:
> 2014-03-29 12:41:29 CET db:wbloos ip:[local] us:wbloos NOTICE: ALTER TABLE will create implicit sequence "test_x_seq" for serial column "test.x"
> 2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: SSL renegotiation failure
> 2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: SSL error: unexpected record
> 2014-03-29 12:41:33 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: could not send data to client: Connection reset by peer
> 2014-03-29 12:41:48 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: could not receive data from client: Connection reset by peer
> 2014-03-29 12:41:48 CET db:[unknown] ip:xxx.xxx.xxx.xxx us:replication LOG: unexpected EOF on standby connection
>
> (the SSL renegotiation failure happens all the time, without the crash)
>
> And here's the syslog from the container:
> Mar 29 12:41:01 mycontainer snmpd[8819]: Connection from UDP: [xxx.xxx.xxx.xxx]:59090->[xxx.xxx.xxx.xxx]
> Mar 29 12:42:30 mycontainer snmpd[8819]: Connection from UDP: [xxx.xxx.xxx.xxx]:35949->[xxx.xxx.xxx.xxx]
>
> The log on the host doesn't say anything interesting either.
>
>     A cursory look at memory management in openvz shows it is different
>     from other virtualization software and physical machines. Whether
>     that is a problem would seem to be dependent on where you are on the
>     learning curve:)
>
> That sounds like "there is a solution to the problem, all you have to do
> is find out what it is".
> There doesn't seem to be a variable in the
> beancounters or anywhere else that can prevent the bus error from happening.
> There seems to be no separate way of guaranteeing shared memory.
> There's no OOM killer active either, nor is host or server running short
> of memory.

At this point I am not sure it is even obvious what is causing the error, so finding a solution would be a hit or miss affair at best.

>
> I'm still worried that it's like Tom Lane said in another discussion:"So
> basically, you've got a broken kernel here: it claimed to give PG circa
> (135MB) of memory, but what's actually there is only about (128MB). I
> don't see any connection between those numbers and the shmmax/shmall
> settings, either --- so I think this must be some busted implementation
> of a VM-level limitation."
> (here:
> http://www.postgresql.org/message-id/CAK3UJREBcyVBtr8D7vMfU=uDdkjXkrPnGcuy8erYB0tMfKe1LA@mail.gmail.com)
>
> And it makes me wonder what else may be issues that arise from that. But
> especially, what i can do about it.

I do not use openvz so I do not have a test bed to try out, but this page seems to be related to your problem:

http://openvz.org/Resource_shortage

or if you want more detail and a link to what looks to be a replacement for beancounters:

http://openvz.org/Setting_UBC_parameters

>
> Cheers,
>
> WBL
>
> --
> "Quality comes from focus and clarity of purpose" -- Mark Shuttleworth

--
Adrian Klaver
adrian.klaver@aklaver.com
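
As a rough illustration of the beancounters check discussed in this message: on an OpenVZ container the counters are normally readable from /proc/user_beancounters, and a non-zero value in the failcnt column means some request for that resource was refused. The sketch below is not from the thread. It assumes the usual column layout (uid, resource, held, maxheld, barrier, limit, failcnt), the function name failed_beancounters is made up for this example, and the SAMPLE values are invented, not taken from the poster's system.

```python
from typing import List, Tuple

# Invented sample in the assumed /proc/user_beancounters layout.
SAMPLE = """\
Version: 2.5
       uid  resource           held    maxheld    barrier      limit    failcnt
      101:  kmemsize        2619203    2867200   14372700   14790164          0
            privvmpages       53981      65536      65536      69632          3
            shmpages          21504      21504      21504      21504          1
"""


def failed_beancounters(text: str) -> List[Tuple[str, str, int]]:
    """Return (container id, resource, failcnt) for every row with failcnt > 0."""
    failures = []
    uid = "?"  # container id carried over from the last row that had one
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the version line and the column header.
        if not fields or fields[0] in ("Version:", "uid"):
            continue
        # The first row of each container starts with "<uid>:".
        if fields[0].endswith(":"):
            uid = fields[0].rstrip(":")
            fields = fields[1:]
        # Expect: resource held maxheld barrier limit failcnt
        if len(fields) != 6:
            continue
        resource, failcnt = fields[0], fields[-1]
        if failcnt.isdigit() and int(failcnt) > 0:
            failures.append((uid, resource, int(failcnt)))
    return failures


for uid, resource, failcnt in failed_beancounters(SAMPLE):
    print(f"container {uid}: {resource} failcnt={failcnt}")
```

On a real container the same function could be fed the contents of /proc/user_beancounters (reading it typically needs root). An all-zero failcnt column would not rule out a kernel-level problem of the kind Tom Lane describes, it would only show that no beancounter limit was hit.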