Re: performance tuning: shared_buffers, sort_mem; swap - Mailing list pgsql-admin
From | Thomas O'Connell
---|---
Subject | Re: performance tuning: shared_buffers, sort_mem; swap
Date |
Msg-id | tfo-650CFB.16232613082002@news.hub.org
In response to | Re: performance tuning: shared_buffers, sort_mem; swap (Tom Lane <tgl@sss.pgh.pa.us>)
Responses | Re: performance tuning: shared_buffers, sort_mem; swap
List | pgsql-admin
In article <1101.1029272567@sss.pgh.pa.us>,
tgl@sss.pgh.pa.us (Tom Lane) wrote:

> Hmm.  That's definitely a startup-time error.  The only way that code
> could be executed later than postmaster startup is if you suffer a
> database crash and the postmaster is trying to reinitialize the system
> with a fresh shared-memory arena.  That would say that this isn't your
> primary problem, but a consequence of a crash that'd already occurred.

Interesting. Particularly interesting because postgres actually
intelligently restarts itself after a crash under duress. We've gotten
this error every time, and postgres is always running properly after a
minute or two of downtime. I've always thought this message was why it
died in the first place, but I guess it's related to a startup failure
after the first crash, instead.

> I am curious why you'd get "Invalid argument" (EINVAL), as presumably
> these are the same arguments that the kernel accepted on the previous
> cycle of life.  But that's probably not the issue to focus on.

Right. I think this is related to your speculation, below.

> If it happens to select
> a database backend to kill, the postmaster will interpret the backend's
> unexpected exit as a crash, and will force a database restart.

I guess this is what we're seeing, then. Right before the IPC error,
there are usually several of these:

"NOTICE:  Message from PostgreSQL backend:
  The Postmaster has informed me that some other backend died abnormally
  and possibly corrupted shared memory.
  I have rolled back the current transaction and am going to terminate
  your database system connection and exit.
  Please reconnect to the database system and repeat your query."

I had always thought this just meant that postmaster children were
dying. Does it instead mean that the main backend server is dying
repeatedly? i.e., is this the forced database restart you mention above?
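For anyone else chasing a startup-time shmget failure, a rough sanity check is to compare the size of the segment postgres will request against the kernel limit. This is only a back-of-the-envelope sketch: it assumes 8 kB buffer pages and ignores the per-version fixed overhead, and the `sysctl` names shown in the comments are the Linux ones.

```shell
# Rough estimate of the SysV shared memory segment Postgres requests at
# startup (assumption: 8 kB pages; real overhead varies by version).
SHARED_BUFFERS=4096            # hypothetical value from postgresql.conf
PAGE_SIZE=8192                 # bytes per buffer page
REQUEST=$(( SHARED_BUFFERS * PAGE_SIZE ))
echo "approx shmget request: $REQUEST bytes"

# Compare against the kernel limit (Linux: sysctl kernel.shmmax;
# *BSD: sysctl kern.ipc.shmmax).  If the request exceeds the limit,
# shmget fails outright -- though with EINVAL appearing only at restart
# time, as discussed above, the static limit is probably not the cause.
```

With the hypothetical settings above the request works out to 32 MB, which is why a shmmax left at a small default is the usual first suspect for this class of error.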
> Perhaps when the postmaster tries to reallocate the shmem segment a few
> milliseconds later, the kernel still thinks it's under load and rejects
> a shmem request that it'd normally have accepted.  (That last bit is
> just speculation though.)

I think this is pretty good speculation, considering that after things
settle down a bit, it perks right up. Wow, this is all great stuff to
know.

> Possible solutions: (a) buy more RAM and/or increase available swap
> space (I'm not sure whether more swap, without more physical RAM,
> actually helps; anyone know?); (b) reduce peak load by reducing
> max_connections and/or scaling back your other servers; (c) switch to
> another OS --- I don't think the *BSD kernels have this brain-damaged
> idea about how to cope with low memory...

Well, our solution for the time being has been to add saner
rate-limiting so that the web server is not even able to pound the
database as much. In essence, we were experiencing DoS attacks, with
requests coming several times a minute from the same IP. We still accept
a reasonable number of requests for a public web application server, but
we've managed to stop the crashing, for now.

Still, all of this is great added knowledge in the quest for better
tuning. I was under the mistaken impression that my bad memory math was
somehow responsible for postgres being the point of failure during the
stress. Lucky me, as a DBA, to learn otherwise!

Thanks!

-tfo
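On the "memory math" point: a worst-case budget along the lines of option (b) can be sketched with shell arithmetic. The settings below are hypothetical, and note that sort_mem is per sort operation, so a single backend can consume it more than once; this is a lower bound on the worst case, not an exact figure.

```shell
# Back-of-the-envelope worst-case memory budget for a Postgres server.
# All settings here are hypothetical examples, not recommendations.
SHARED_BUFFERS=4096     # 8 kB pages, from postgresql.conf
SORT_MEM=1024           # kB per sort, from postgresql.conf
MAX_CONNECTIONS=64      # from postgresql.conf

SHM_KB=$(( SHARED_BUFFERS * 8 ))             # fixed shared segment
SORT_KB=$(( MAX_CONNECTIONS * SORT_MEM ))    # if every backend sorts at once
TOTAL_KB=$(( SHM_KB + SORT_KB ))

echo "shared memory:          ${SHM_KB} kB"
echo "worst-case sort memory: ${SORT_KB} kB"
echo "rough total:            ${TOTAL_KB} kB"
```

If that rough total approaches physical RAM, peak load can push the box into the kernel's low-memory behavior described above, so lowering max_connections or sort_mem buys headroom.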