Update on my 6.4.2 progress

From:    Wayne Piekarski
Subject: Update on my 6.4.2 progress
Msg-id:  199906190739.RAA00864@helpdesk.senet.com.au
List:    pgsql-hackers
Hi,

Just thought I'd drop an email again - I was the one having neverending trouble with 6.4.2 jamming with backends stuck waiting, among other problems. Although I'm still using 6.4.2, I hope this will still be helpful for the developers in case it impacts on things in 6.5.

We installed Tom Lane's shared memory patches, which I emailed about earlier, and they helped a bit, but unfortunately we still get backends stuck waiting even today.

The interesting thing is, we went and put in another 128 MB of RAM (from 256 to 384 now) and recompiled the kernel with more semaphores and shared memory, and the improvement was incredible! Before, we would get semget failures every so often once we had about 50 backends going, causing the whole thing to fall over. Now we instead get "fmgr_info: function 111257088: cache lookup failed" after 64 backends (which is what we compiled postgres for), which I assume isn't so fatal, and the whole system keeps running.

For three days after our little upgrade the whole thing ran smoothly, then we ran into the stuck-backend problem again. We thought the problem was gone, but it was still there. What happens is one backend gets stuck, causes others to get stuck, and the postgres processes just build up until they hit 64; then we have to kill them off and everything is ok again. At least the number of problems has decreased slightly. One interesting message we got during this problem was:

    NOTICE: LockRelease: locktable lookup failed, no lock

It seems as though the backends are waiting for a lock that got deleted accidentally, although I have no idea how the code works, so I can't offer any advice on where. Lately the problems are happening with higher frequency, and every so often we still get the BTP_CHAIN problems with tables (which I sent another email about fixing), so I need to fix this.

One thing I was disappointed with: after adding the extra 128 MB of RAM, I was hoping it would be used for disk caching, but when performing repeated select queries on a table, like:

    select sum(some_value) from some_table;

the result took the same amount of time to run each time and was not cached at all (the table is about 100 MB), and while doing the query our raid controller would just light up, which I wanted to avoid.

After seeing this, I read posts on the hackers list where people were talking about fsync'ing pg_log to note down whether things had been committed or not. The table I was testing is totally read only - no modifications being made - but another table gets almost continuous changes 24 hours per day, more than one per second. Would this cause the machine to continuously flush pg_log to disk, so that even my read-only tables still aren't cached?

I guess my next question is, can I comment out the fsync call? <grin> With the disks performing more efficient updates, the whole thing would run faster and run less risk of crashing. Currently, performance can be quite bad when the machine is doing lots of disk activity, because even the simplest read-only queries block since nothing is cached.

Would moving pg_log to a 2nd disk make a difference? Are there other important files like pg_log which should go onto separate disks as well? I have no problem with using multiple disks, but it was only recently that I discovered this fsyncing thing on pg_log. Is pg_log more speed and fsync critical than the actual data itself?
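Actually, on the fsync question above: from my reading of the docs, there may already be a supported way to do this without editing the source - the postmaster's -o option passes switches through to each backend, and -F is supposed to disable fsync entirely. I haven't tried it yet, so treat this as an untested sketch (the data directory path is just an example from our setup):

    # untested: pass -F through to every backend to disable fsync
    # (faster updates, but risks corruption if the machine crashes)
    postmaster -D /usr/local/pgsql/data -o '-F' &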
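And on moving pg_log: since it appears to be an ordinary file at the top of the data directory, I'm guessing it could be relocated with a symlink while the postmaster is shut down - again untested, and /fastraid is just a stand-in for wherever the faster controller is mounted:

    # my guess at the mechanics - stop the postmaster first!
    mv /usr/local/pgsql/data/pg_log /fastraid/pg_log
    ln -s /fastraid/pg_log /usr/local/pgsql/data/pg_log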
I have two raid controllers, a slow one and a fast one, and I want to move pg_log to one of them, but I'm not sure which.

So in summary, I've learned that if you are having troubles, put in more memory (even if you have some free) and increase your kernel's internal sizes for semaphores and shared memory to really large values, even when postgres isn't complaining. It makes a difference for some reason, and everything was a lot happier. The BTP_CHAIN and stuck-backend problems are still occurring, although I cannot build a test case for either of them; they are very much problems which occur accidentally and at random times.

thanks again,
Wayne

------------------------------------------------------------------------------
Wayne Piekarski                              Tel:   (08) 8221 5221
Research & Development Manager               Fax:   (08) 8221 5220
SE Network Access Pty Ltd                    Mob:   0407 395 889
222 Grote Street                             Email: wayne@senet.com.au
Adelaide SA 5000                             WWW:   http://www.senet.com.au