Re: Enabling Checksums - Mailing list pgsql-hackers
From | Ants Aasma |
---|---|
Subject | Re: Enabling Checksums |
Date | |
Msg-id | CA+CSw_su1fopLNBz1NAfkSNw4_=gv+5pf0KdLQmpvuKW1Q4v+Q@mail.gmail.com Whole thread Raw |
In response to | Re: Enabling Checksums (Jeff Davis <pgsql@j-davis.com>) |
Responses |
Re: Enabling Checksums
Re: Enabling Checksums Re: Enabling Checksums |
List | pgsql-hackers |
On Fri, Mar 22, 2013 at 3:04 AM, Jeff Davis <pgsql@j-davis.com> wrote: > I've been following your analysis and testing, and it looks like there > are still at least three viable approaches: > > 1. Some variant of Fletcher > 2. Some variant of CRC32 > 3. Some SIMD-based checksum > > Each of those has some open implementation questions, as well. If we > settle on one of those approaches, we don't necessarily need the fastest > implementation right away. I might even argue that the first patch to be > committed should be a simple implementation of whatever algorithm we > choose, and then optimization should be done in a separate patch (if it > is tricky to get right). +1 on correct first, fast second. > Of course, it's hard to settle on the general algorithm to use without > knowing the final performance numbers. So right now I'm in somewhat of a > holding pattern until we settle on something. For performance the K8 results gave me confidence that we have a reasonably good overview what the performance is like for the class of CPU's that PostgreSQL is likely to run on. I don't think there is anything left to optimize there, all algorithms are pretty close to maximum theoretical performance. Still, benchmarks on AMD's Bulldozer arch and maybe on some non-x86 machines (Power, Itanium, Sparc) would be very welcome to ensure that I haven't missed anything. To see real world performance numbers I dumped the algorithms on top of the checksums patch. I set up postgres with 32MB shared buffers, and ran with concurrency 4 select only pgbench and a worst case workload, results are median of 5 1-minute runs. I used fletcher as it was in the checksums patch without unrolling. Unrolling would cut the performance hit by a third or so. The worst case workload is set up using CREATE TABLE sparse (id serial primary key, v text) WITH (fillfactor=10); INSERT INTO sparse (v) SELECT REPEAT('x', 1000) FROM generate_series(1,100000); VACUUM ANALYZE sparse; The test query itself is a simple SELECT count(v) FROM sparse; Results for the worst case workload: No checksums: tps = 14.710519 Fletcher checksums: tps = 10.825564 (1.359x slowdown) CRC checksums: tps = 5.844995 (2.517x slowdown) SIMD checksums: tps = 14.062388 (1.046x slowdown) Results for pgbench scale 100: No checksums: tps = 56623.819783 Fletcher checksums: tps = 55282.222687 (1.024x slowdown) CRC Checksums: tps = 50571.324795 (1.120x slowdown) SIMD Checksums: tps = 56608.888985 (1.000x slowdown) So to conclude, the 3 approaches: CRC: Time to checksum 8192 bytes: 12'000 - 16'000 cycles best case without special hardware 1'200 cycles with hardware (newIntel only) Code size: 131 bytes * Can calculate arbitrary number of bytes per invocation, state is 4 bytes. Implementation can be shared with WAL. * Quite slow without hardware acceleration. * Software implementation requires a 8kB table for calculation or it will be even slower. Quite likely to fall out of cache. * If we wish to use hardware acceleration then the polynomial should be switched to Castagnoli. I think the old polynomial needs to stay as the values seem to be stored in indexes by tsvector compression and multibyte trigrams. (not 100% sure, just skimmed the code) * Error detection of 32bit Castagnoli CRC is known to be good, the effect of truncating to 16 bits is not analyzed yet. Fletcher: Time to checksum 8192 bytes: 2'600 cycles +- 100 Code size: 170 bytes unrolled * Very simple implementation for optimal speed. * Needs to calculate 4 bytes at a time, requires 8 bytes of state. Implementation that can work for WAL would be tricky but not impossible. Probably wouldn't share code. * Should give good enough error detection with suitable choice for final recombination. SIMD Checksums: Time to checksum 8192 bytes: 730 cycles for processors with 128bit SIMD units 1830 cycles for processors with 64bitSIMD units Code size: 436 bytes * Requires vectorization, intrinsics or ASM for decent performance. * Needs to calculate 128bytes at a time, requires 128 bytes of state. Using for anything other than summing fixed size blocks looks tricky. * Loosely based on Fowler-Noll-Vo and should have reasonably good error detection capabilities. Ants Aasma -- Cybertec Schönig & Schönig GmbH Gröhrmühlgasse 26 A-2700 Wiener Neustadt Web: http://www.postgresql-support.de
pgsql-hackers by date: