Re: Bug: Buffer cache is not scan resistant - Mailing list pgsql-hackers
From: Sherry Moore
Subject: Re: Bug: Buffer cache is not scan resistant
Msg-id: 20070306053419.GC240523@sun.com
In response to: Re: Bug: Buffer cache is not scan resistant (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Bug: Buffer cache is not scan resistant
List: pgsql-hackers
Hi Tom,

Sorry about the delay. I have been away from computers all day.

In the current Solaris release in development (code name Nevada, available for download at http://opensolaris.org), I have implemented non-temporal access (NTA), which bypasses L2 for most writes, and for reads larger than copyout_max_cached (patchable; default 128K). The block size used by Postgres is 8KB. If I patch copyout_max_cached down to 4KB to trigger NTA for reads, the access times with a 16KB buffer and a 128MB buffer are very close.

I wrote readtest to simulate the access pattern of VACUUM (attached). tread is a 4-socket dual-core Opteron box.

<81 tread >./readtest -h
Usage: readtest [-v] [-N] -s <size> -n iter [-d delta] [-c count]
        -v: Verbose mode
        -N: Normalize results by number of reads
        -s <size>: Working set size (may specify K,M,G suffix)
        -n iter: Number of test iterations
        -f filename: Name of the file to read from
        -d [+|-]delta: Distance between subsequent reads
        -c count: Number of reads
        -h: Print this help

With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):

<82 tread >./readtest -s 16k -f boot_archive
46445262
<83 tread >./readtest -s 128M -f boot_archive
118294230
<84 tread >./readtest -s 16k -f boot_archive -n 100
4230210856
<85 tread >./readtest -s 128M -f boot_archive -n 100
6343619546

With copyout_max_cached at 4K (in nanoseconds, NTA triggered):

<89 tread >./readtest -s 16k -f boot_archive
43606882
<90 tread >./readtest -s 128M -f boot_archive
100547909
<91 tread >./readtest -s 16k -f boot_archive -n 100
4251823995
<92 tread >./readtest -s 128M -f boot_archive -n 100
4205491984

When the iteration count is 1 (the default), the timing difference between the 16K buffer and the 128M buffer is much bigger for both copyout_max_cached sizes, mostly due to the cost of TLB misses. When the iteration count is higher, most of the page tables are already in the Page Descriptor Cache for the later page accesses, so the overhead of TLB misses becomes smaller. As you can see, when we do bypass L2, the performance with either buffer size is comparable.

I am sure your next question is why the 128K limit for reads. Here are the main reasons:

- Based on many of the benchmarks and workloads I traced, the target buffer of a read operation is typically accessed again shortly after the read, while write buffers usually are not. Therefore, the default mode is to bypass L2 for writes, but not for reads.

- The Opteron's L1 cache size is 64K. A read larger than 128KB would displacement-flush the L1 anyway, so for large reads I also bypass L2. I am working on setting copyout_max_cached dynamically based on the L1 D-cache size of the system.

The above heuristic should have worked well in Luke's test case. However, because the read was done as 16,000 8KB reads rather than one 128MB read, the NTA code was not triggered. Since the OS code has to be general enough to handle most workloads, we have to pick defaults that might not work best for some specific operations. It is a calculated balance.

Thanks,
Sherry

On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
> "Luke Lonergan" <LLonergan@greenplum.com> writes:
> > Good info - it's the same in Solaris, the routine is uiomove (Sherry
> > wrote it).
>
> Cool. Maybe Sherry can comment on the question whether it's possible
> for a large-scale memcpy to not take a hit on filling a cache line
> that wasn't previously in cache?
>
> I looked a bit at the Linux code that's being used here, but it's all
> x86_64 assembler which is something I've never studied :-(.
>
> 			regards, tom lane

--
Sherry Moore, Solaris Kernel Development	http://blogs.sun.com/sherrym
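For illustration, below is a minimal userland sketch of the write side of non-temporal access, using SSE2 streaming-store intrinsics. This is not the Solaris uiomove/copyout code: the variable nta_threshold is a made-up stand-in for copyout_max_cached, the streaming path assumes an aligned destination, and the read side of NTA (non-temporal prefetch/load hints) is omitted.

#include <emmintrin.h>		/* SSE2: _mm_stream_si128, _mm_loadu_si128, _mm_sfence */
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for copyout_max_cached (default 128K above). */
static size_t nta_threshold = 128 * 1024;

/*
 * Copy n bytes from src to dst.  Large copies use non-temporal
 * (streaming) stores so the destination data does not displace the
 * contents of the cache hierarchy.  The streaming path assumes dst is
 * 16-byte aligned; a real implementation would handle misalignment.
 */
static void
nta_memcpy(void *dst, const void *src, size_t n)
{
	if (n < nta_threshold || ((uintptr_t) dst & 15) != 0)
	{
		memcpy(dst, src, n);	/* small copy: let it stay cached */
		return;
	}

	__m128i	   *d = (__m128i *) dst;
	const __m128i *s = (const __m128i *) src;
	size_t		nchunks = n / sizeof(__m128i);

	for (size_t i = 0; i < nchunks; i++)
		_mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
	_mm_sfence();				/* drain the write-combining buffers */

	/* tail bytes not covered by the 16-byte loop */
	memcpy((char *) dst + nchunks * sizeof(__m128i),
		   (const char *) src + nchunks * sizeof(__m128i),
		   n % sizeof(__m128i));
}

Streaming stores go through write-combining buffers and avoid the read-for-ownership of the destination cache line, which is exactly the hit on "filling a cache line that wasn't previously in cache" that Tom asks about above. On Linux/glibc, a userland threshold could be sized at run time with sysconf(_SC_LEVEL1_DCACHE_SIZE).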
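The readtest attachment itself is not inlined above. A rough stand-in for its basic mode (readtest -s <size> -n <iters> -f <file>) is sketched below, assuming a simple pread loop timed with clock_gettime; the K/M/G size suffixes and the -d, -c, -N, and -v options are left out.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/*
 * Read 'size' bytes from the start of 'file', 'iters' times, and print
 * the total elapsed nanoseconds -- roughly what
 * "readtest -s <size> -n <iters> -f <file>" reports.
 */
int
main(int argc, char **argv)
{
	size_t		size = (argc > 1) ? strtoull(argv[1], NULL, 0) : 16 * 1024;
	int			iters = (argc > 2) ? atoi(argv[2]) : 1;
	const char *file = (argc > 3) ? argv[3] : "boot_archive";
	char	   *buf = malloc(size);
	int			fd = open(file, O_RDONLY);
	struct timespec t0, t1;

	if (buf == NULL || fd < 0)
	{
		perror("setup");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++)
	{
		if (pread(fd, buf, size, 0) != (ssize_t) size)
		{
			perror("pread");
			return 1;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%lld\n",
		   (t1.tv_sec - t0.tv_sec) * 1000000000LL +
		   (t1.tv_nsec - t0.tv_nsec));
	free(buf);
	close(fd);
	return 0;
}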