Re: Need help with 8.4 Performance Testing - Mailing list pgsql-performance
From | Scott Carey
---|---
Subject | Re: Need help with 8.4 Performance Testing
Date |
Msg-id | BDFBB77C9E07BE4A984DAAE981D19F961ACA17D9FC@EXVMBX018-1.exch018.msoutlookonline.net
In response to | Re: Need help with 8.4 Performance Testing (Jean-David Beyer <jeandavid8@verizon.net>)
Responses | Re: Need help with 8.4 Performance Testing (4 replies)
List | pgsql-performance
> ________________________________________
> From: pgsql-performance-owner@postgresql.org [pgsql-performance-owner@postgresql.org] On Behalf Of Jean-David Beyer [jeandavid8@verizon.net]
> Sent: Tuesday, December 09, 2008 5:08 AM
> To: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Need help with 8.4 Performance Testing
>
> But one thing that can matter is whether you want to improve just the
> performance of the dbms, or the performance of the entire system, on which
> the dbms runs. Because if you want to improve the whole system, you would
> want as much of the caching to take place in the system's buffers so the use
> of the memory could be optimized over the entire workload, not just the load
> of the dbms itself. I suppose on a dedicated system with only one dbms
> running with only one database open (at a time, anyway), this might be moot,
> but not otherwise.

Yes, the OS is in a better position to arbitrate between multiple things. Of course, we aren't talking about the highest-performance databases if we are talking about mixed-use systems. Additionally, the OS can never really get it right, with a DB or other apps. Any app can behave badly, grab too much RAM, and access it regularly enough for it not to be 'idle' much, while giving the OS VM fits trying to figure out whether it is important relative to other processes.

> Now I agree that it would be good to get the entire index (or at least the
> working set of the index) into the memory of the computer. But does it
> really matter if it is in the system's cache, or the postgres cache? Is it
> any more likely to be in postgres's cache than in the system cache if the
> system is hurting for memory? I would think the system would be equally
> likely to page out "idle" pages no matter where they are unless they are
> locked to memory, and I do not know if all operating systems can do this,
> and even if they can, I do not know if postgres uses that ability. I doubt
> it, since I believe (at least in Linux) a process can do that only if run as
> root, which I imagine few (if any) users do.

The problem is when none of them are really 'idle': when the OS has to decide which pages to evict, all of which have been accessed recently. Most OSes will make bad choices if the load mixes random and sequential access, as they treat all pages equally with respect to freshness versus eviction.

Another problem is that there IS a difference between being in postgres' cache and the OS cache: one is more expensive to retrieve than the other. Significantly. Accessing buffers in shared_buffers, in process, uses a good chunk less CPU (and data-copy and shared-buffer-eviction overhead) than going over the syscall boundary to the OS. And as far as I can tell, even after the 8.4 fadvise patch, all I/O is in block_size chunks (hopefully I am wrong).

My system is now CPU bound: the I/O can do sequential reads of more than 1.2GB/sec, but Postgres can't do a seqscan 30% as fast because it eats up CPU like crazy just reading and identifying tuples. It does seqscans ~25% faster if it's from shared_buffers than from the OS's page cache, though. Seqscans are between 250MB/sec and 400MB/sec peak, from memory or disk, typically showing no more than 35% iostat utilization of the array if off disk -- so we run a few concurrently where we can.

In addition to the fadvise patch, postgres needs to merge adjacent I/Os into larger ones to reduce the overhead. It only really needs to merge up to sizes of about 128k or 256k to gain an 8x to 16x drop in syscall overhead, and it would additionally save some trips down the shared buffer management code paths. At least, that's my guess; I haven't looked at any code and could be wrong.
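To make the merge point concrete, here's a minimal C sketch (mine, not PostgreSQL source; the file path, block count, and 256KB merge size are illustrative assumptions) of the difference between one block-sized pread per page and a single merged read:

```c
/* Sketch: 32 adjacent 8KB block reads as 32 syscalls, versus the same
 * 256KB in one merged pread. The path is hypothetical. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ   8192          /* postgres default block size */
#define MERGED   (32 * BLCKSZ) /* 256KB merge target discussed above */

int main(void)
{
    int fd = open("/var/lib/pgsql/data/base/somefile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(MERGED);

    /* Unmerged: one kernel crossing per 8KB block. */
    for (int i = 0; i < 32; i++)
        pread(fd, buf + (size_t)i * BLCKSZ, BLCKSZ, (off_t)i * BLCKSZ);

    /* Merged: the same 256KB of adjacent blocks in a single syscall. */
    pread(fd, buf, MERGED, 0);

    free(buf);
    close(fd);
    return 0;
}
```

At the default 8KB block size, a 256KB merged read covers 32 blocks, so the loop above crosses into the kernel 32 times where the merged read does it once.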
Additionally, about the "If your operating system has any reasonable caching itself" comment earlier in this conversation -- Linux (2.6.18, CentOS 5.2) does NOT. I can easily make it spend 100% CPU in system time for an hour trying to figure out what to do with the system cache. Just do large seqscans with memory pressure from work_mem or other forces the OS will not deem 'idle'. Once the requested memory is ~75% of the system total, it will freak out. Linux simply will not give up that last 25% or so of the RAM for anything but page cache, even though the disk subsystem is very fast and most of the access is sequential, which marginalizes the benefit of the cache. Depending on how you tune it, it will either spin system CPU or swap storm, though the system CPU spin times for the same workload are a lot shorter than an equivalent swap storm.

Mount the data drive with O_DIRECT and the problem vanishes. I've been told that this problem may be gone in some of the latest kernels. I have seriously considered bumping shared_buffers up a lot and mounting the thing direct -- but then we lose the useful scheduler and readahead algorithms. The other way around (small shared_buffers, let the OS do it) hurts performance overall quite a bit: randomly accessed pages get pushed out to the OS cache more often, and the OS tosses those out when a big seqscan occurs, resulting in a lot more random access from disk and more disk-bound periods of time. Great wonder, this operating system caching, eh?

In any event, don't hold up these OS page caches as if they're the best thing in the world for a database; they have serious flaws themselves and typically are difficult or impossible to tune to be ideal for a database.

It's one thing to propose that a database build its own file system (hard, and why bother?); it's another to have a database manage its own page cache intelligently and access the OS file system as optimally as it can. In either case, the DB knows much more than the OS about what data is really important (it could, for example, prioritize cache versus work_mem intelligently, which the OS can get horribly wrong in my experience, and it knows when a huge seqscan occurs and can make caching those results low priority). No matter how you do it, using the OS cache means you cache twice and copy twice. O_DIRECT isn't usually an option for other reasons: the OS disk scheduler, readahead, and other benefits of a file system are real and substantial. If you are caching twice, you might as well have the "closer" copy of that data be the larger, more efficient pool.
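For reference, a minimal Linux-specific C sketch (mine; the path and sizes are made-up assumptions) of what direct I/O amounts to at the file level -- it bypasses the page cache entirely, which is why the cache-thrash problem vanishes and why the kernel's readahead is lost:

```c
/* Sketch: open a file with O_DIRECT and read past the OS page cache.
 * O_DIRECT requires the buffer, offset, and length to be block-aligned;
 * 4096 is a safe alignment for most devices. Hypothetical path. */
#define _GNU_SOURCE            /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/data/pgdata/base/somefile", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 256 * 1024) != 0) { close(fd); return 1; }

    /* Reads go straight from the device to our buffer: no double caching,
     * no double copy -- and no kernel readahead either. */
    ssize_t n = pread(fd, buf, 256 * 1024, 0);
    printf("read %zd bytes, bypassing the OS page cache\n", n);

    free(buf);
    close(fd);
    return 0;
}
```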
As for tipping points and pgbench: it doesn't seem to reflect the kind of workload we use postgres for at all, though my workload does a lot of big hashes and seqscans, and I'm curious how much those may have improved due to the hash improvements. 32GB RAM and 3TB data (about 250GB scanned regularly) here. And yes, we are almost completely CPU bound now except for a few tasks. Iostat reports above 65% disk utilization for only about 5% of the workload duty cycle, and is regularly < 20%. COPY doesn't get anywhere near platter speeds on indexless bulk transfer. The highest disk usage spikes occur when some of our random-access data/indexes get shoved out of cache. These aren't too large, but a high enough seqscan load will cause postgres and the OS to dump them from cache. If we put these on some SSDs, the disk utilization % would drop a lot further.

I feel confident in saying that in about a year, I could spec out a medium-sized hardware budget ($25k) for almost any postgres setup and make it almost purely CPU bound. SSDs and hybrid tech such as ZFS L2ARC make this possible with easy access to 10k+ iops, and it will take no more than 12 SATA drives in RAID 10 next year (and a good controller or software RAID) to get 1GB/sec sequential reads.
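(Back-of-envelope, using per-drive numbers I'd assume rather than ones measured here: a current 7200rpm SATA drive sustains roughly 80-100MB/sec sequentially on its outer tracks, and a RAID 10 of 12 drives can stream sequential reads from all 12 spindles, so 12 x ~85MB/sec is about 1GB/sec aggregate before controller overhead.)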