Re: Sequential Scan Read-Ahead - Mailing list pgsql-hackers
From | Curt Sampson |
---|---|
Subject | Re: Sequential Scan Read-Ahead |
Date | |
Msg-id | Pine.NEB.4.43.0204251534590.3111-100000@angelic.cynic.net |
In response to | Re: Sequential Scan Read-Ahead (Tom Lane <tgl@sss.pgh.pa.us>) |
Responses | Re: Sequential Scan Read-Ahead |
| Re: Sequential Scan Read-Ahead |
| Re: Sequential Scan Read-Ahead |
List | pgsql-hackers |
On Wed, 24 Apr 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > Grabbing bigger chunks is always optimal, AFAICT, if they're not
> > *too* big and you use the data. A single 64K read takes very little
> > longer than a single 8K read.
>
> Proof?

Well, there are various sorts of "proof" for this assertion. What sort
do you want? Here are a few samples; if you're looking for something
different to satisfy you, let's discuss it.

1. Theoretical proof: two components of the delay in retrieving a block
from disk are the disk arm movement and the wait for the right block to
rotate under the head. When retrieving, say, eight adjacent blocks,
these will be spread across no more than two cylinders (with luck, only
one). The worst-case access time for a single block is the disk arm
movement plus the full rotational wait; this is the same as the worst
case for eight blocks if they're all on one cylinder. If they're not on
one cylinder, they're still on adjacent cylinders, requiring only a very
short seek.

2. Proof by others using it: SQL Server uses 64K reads when doing table
scans, because their research indicates that the major limitation is
usually the number of I/O requests, not the I/O capacity of the disk.
BSD's FFS explicitly separates the optimum allocation size for storage
(1K fragments) from the optimum read size (8K blocks) because they
found performance to be much better when a larger block was read. Most
file system vendors do read-ahead for this very reason, too.

3. Proof by testing: I wrote a little Ruby program to seek to a random
point in the first 2 GB of my raw disk partition and read 1-16 8K
blocks of data, issued as a single I/O request. (Using the raw disk
partition avoids any filesystem buffering.)
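Roughly, the program looks like this (a sketch reconstructed from the
description above; the device path, the sysseek/sysread calls, and the
read counts are illustrative assumptions, not the original code):

    # Sketch of the benchmark: seek to a random block-aligned offset in
    # the first 2 GB of a raw partition and read n adjacent 8K blocks
    # with a single sysread() call. The device path is illustrative.
    BLOCK  = 8192
    SPAN   = 2 * 1024 * 1024 * 1024    # first 2 GB of the device
    DEVICE = ARGV[0] || '/dev/rwd0d'   # raw partition (NetBSD-style name)

    [16, 8, 4, 2, 1].each do |nblocks|
      reads = 2000 / nblocks           # constant 16 MB total per run
      bytes = nblocks * BLOCK
      File.open(DEVICE, 'rb') do |f|
        start = Time.now
        reads.times do
          # random block-aligned offset that leaves room for the read
          f.sysseek(rand((SPAN - bytes) / BLOCK) * BLOCK)
          f.sysread(bytes)             # one request for all n blocks
        end
        elapsed = Time.now - start
        printf("%d reads of %dx8K blocks: %.1f sec, %.2f req/sec. " \
               "%.2f ms/req, %.3f ms/block\n",
               reads, nblocks, elapsed, reads / elapsed,
               1000 * elapsed / reads, 1000 * elapsed / (reads * nblocks))
      end
    end

sysseek and sysread bypass Ruby's internal buffering, so each iteration
issues exactly one read(2) for the full chunk against the raw device.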
Here are typical results:

     125 reads of 16x8K blocks: 1.9 sec,  66.04 req/sec. 15.1 ms/req, 0.946 ms/block
     250 reads of  8x8K blocks: 1.9 sec, 132.3 req/sec.   7.56 ms/req, 0.945 ms/block
     500 reads of  4x8K blocks: 2.5 sec, 199 req/sec.     5.03 ms/req, 1.26 ms/block
    1000 reads of  2x8K blocks: 3.8 sec, 261.6 req/sec.   3.82 ms/req, 1.91 ms/block
    2000 reads of  1x8K blocks: 6.4 sec, 310.4 req/sec.   3.22 ms/req, 3.22 ms/block

The ratios of data retrieval cost per read for groups of adjacent 8K
blocks, assuming a single 8K block reads in 1 time unit (each ratio is
the ms/req figure above divided by the 3.22 ms/req cost of a
single-block read), are:

     1 block   1.00
     2 blocks  1.18
     4 blocks  1.56
     8 blocks  2.34
    16 blocks  4.68

At less than 20% more expensive, two-block read requests can certainly
be considered to cost "very little more" than one-block read requests.
Even four-block read requests are only half again as expensive. And if
you know you're really going to be using the data, read in 8-block
chunks and your cost per block (in terms of time) drops to less than a
third of the cost of single-block reads.

Let me put paid to comments about multiple simultaneous readers making
this invalid. Here's a typical result I get with four instances of the
program running simultaneously:

     125 reads of 16x8K blocks: 4.4 sec, 28.21 req/sec. 35.4 ms/req, 2.22 ms/block
     250 reads of  8x8K blocks: 3.9 sec, 64.88 req/sec. 15.4 ms/req, 1.93 ms/block
     500 reads of  4x8K blocks: 5.8 sec, 86.52 req/sec. 11.6 ms/req, 2.89 ms/block
    1000 reads of  2x8K blocks: 10 sec, 100.2 req/sec.   9.98 ms/req, 4.99 ms/block
    2000 reads of  1x8K blocks: 18 sec, 110 req/sec.     9.09 ms/req, 9.09 ms/block

Here's the ratio table again, with another column comparing the
aggregate number of requests per second for one process and for four
processes:

     1 block   1.00    310 : 440
     2 blocks  1.10    262 : 401
     4 blocks  1.28    199 : 346
     8 blocks  1.69    132 : 260
    16 blocks  3.89     66 : 113

Note that here the relative cost increase for larger reads is even
*better* until we get past 64K chunks. The overall throughput is
higher, of course, because with more requests per second coming in,
the disk seek-ordering code has more to work with, and the average
time spent seeking relative to reading will be reduced.

You know, this is not rocket science; I'm sure there must be papers all
over the place about this. If anybody still disagrees that it's a good
thing to read chunks of up to 64K or so when the blocks are adjacent
and you know you'll need the data, I'd like to see some tangible
evidence to support that.

cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light. --XTC