Re: Enabling Checksums - Mailing list pgsql-hackers
From: Ants Aasma
Subject: Re: Enabling Checksums
Date:
Msg-id: CA+CSw_vGGpbomdXOnXdkPaAbxmF4RXJF4Mo3L-AkFi9ZHSUqhQ@mail.gmail.com
In response to: Re: Enabling Checksums (Jeff Davis <pgsql@j-davis.com>)
Responses: Re: Enabling Checksums
List: pgsql-hackers
On Fri, Mar 22, 2013 at 7:35 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2013-03-22 at 17:09 +0200, Ants Aasma wrote:
>> For performance, the K8 results gave me confidence that we have a
>> reasonably good overview of what the performance is like for the class
>> of CPUs that PostgreSQL is likely to run on. I don't think there is
>> anything left to optimize there; all algorithms are pretty close to
>> maximum theoretical performance.
>
> Great work!

Thanks.

>> The worst case workload is set up using:
>>
>> CREATE TABLE sparse (id serial primary key, v text) WITH (fillfactor=10);
>> INSERT INTO sparse (v) SELECT REPEAT('x', 1000) FROM generate_series(1,100000);
>> VACUUM ANALYZE sparse;
>>
>> The test query itself is a simple SELECT count(v) FROM sparse;
>>
>> Results for the worst case workload:
>> No checksums: tps = 14.710519
>> Fletcher checksums: tps = 10.825564 (1.359x slowdown)
>> CRC checksums: tps = 5.844995 (2.517x slowdown)
>> SIMD checksums: tps = 14.062388 (1.046x slowdown)
>
> I assume this is in the "bad region" identified by Greg, where there is
> no disk activity, but shared_buffers is small, leading to a lot of
> movement between the OS cache and shared buffers?
>
> What do you mean by TPS exactly? If the select query is writing hint
> bits, then you wouldn't be able to repeat it because they are already
> set. So are you repeating the creation/loading of the table, as well?

The table is created once; its size is 800MB, with one hinted tuple per
page. Shared buffers is set to 32MB, and the machine is an Intel Core
i5-2500K with 16GB of memory (2 memory channels at 1333MHz; the
overheads are likely to be larger with faster memory). This is the worst
case for an in-memory workload that doesn't fit into shared_buffers, as
almost no work is done other than swapping buffer pages in. I think
things like bitmap heap scans might show similar characteristics.
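For readers unfamiliar with the Fletcher approach being benchmarked
above, a minimal scalar sketch is below. This is the classic byte-wise
Fletcher-16, not the word-at-a-time C variant in the actual patch; it
only illustrates the two-running-sums structure that makes Fletcher
cheaper than CRC.

```python
def fletcher16(data: bytes) -> int:
    """Classic Fletcher-16: two running sums modulo 255.

    Illustrative only; the patch under discussion uses a wider,
    word-at-a-time variant written in C for speed.
    """
    sum1 = sum2 = 0
    for b in data:
        sum1 = (sum1 + b) % 255       # plain sum of the bytes
        sum2 = (sum2 + sum1) % 255    # position-weighted sum
    return (sum2 << 8) | sum1
```

Because sum2 weights each byte by its position, Fletcher catches
transpositions that a plain additive checksum would miss, while still
needing only adds (no table lookups or carry-less multiplies like CRC).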
>> Results for pgbench scale 100:
>> No checksums: tps = 56623.819783
>> Fletcher checksums: tps = 55282.222687 (1.024x slowdown)
>> CRC checksums: tps = 50571.324795 (1.120x slowdown)
>> SIMD checksums: tps = 56608.888985 (1.000x slowdown)
>>
>> So to conclude, the 3 approaches:
>
> Great analysis. Still a tough choice.
>
> One thing that might be interesting is to look at doing SIMD for both
> data and WAL. I wonder if that would be a noticeable speedup for WAL
> full-page writes? That would give greater justification for the extra
> work it will take (intrinsics/ASM), and it would be a nice win for
> non-checksum users.

Andres showed that switching out the existing CRC for zlib's would
result in an 8-30% increase in INSERT-SELECT speed
(http://www.postgresql.org/message-id/201005202227.49990.andres@anarazel.de),
with the sped-up CRC still showing up as 10% of the profile. So I'd
guess another 5% speedup from doing the CRC 8 bytes at a time instead of
the current 4, and another couple of percent from using Fletcher or
SIMD.

> I also notice that
> http://en.wikipedia.org/wiki/Fowler%E2%80%93Noll%E2%80%93Vo_hash_function
> explicitly mentions adapting FNV to a smaller size. That gives me a
> little more confidence. Do you have other links we should read about
> this approach, or possible weaknesses?

It mentions that one should use 32-bit FNV and fold it down to 16 bits
via xor. This doesn't work here because SSE2 doesn't have pmulld (that
arrived with SSE4.1). I have taken some liberties by instead doing 64
16-bit FNV-like operations in parallel and then doing an FNV-like
combination of them at the end. However, the choices in the reference
are concerned with good hashing performance, while for checksums it
shouldn't matter much even if the average error detection rate goes from
99.998% to 99.99%, as long as common error scenarios don't line up with
the collisions.
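The xor-folding suggestion from the FNV reference mentioned above can be
sketched in a few lines. This is standard 32-bit FNV-1a with the final
32-to-16-bit xor-fold, not the parallel 16-bit-lane variant from the
patch:

```python
FNV_OFFSET = 2166136261  # 32-bit FNV-1a offset basis
FNV_PRIME = 16777619     # 32-bit FNV prime

def fnv1a_16(data: bytes) -> int:
    """32-bit FNV-1a, folded to 16 bits by xor as the FNV reference
    suggests for hash sizes below 32 bits."""
    h = FNV_OFFSET
    for b in data:
        h = ((h ^ b) * FNV_PRIME) & 0xFFFFFFFF  # xor byte, then multiply
    return (h >> 16) ^ (h & 0xFFFF)             # xor-fold 32 -> 16 bits
```

The 32-bit multiply per byte is exactly what SSE2 lacks a packed form
of (pmulld is SSE4.1), which is why the patch works in 16-bit lanes
instead.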
If we decide to go this route, we should definitely research what the
error-detection consequences are and what good choices for the prime
values would be. On the face of it, multiply-by-prime plus add/xor looks
like it provides pretty good mixing, and resists transposed sequences
and zeroed-out blocks. The worst case seems to be bit errors. As far as
I can see, this implementation should detect all single-bit errors, but
if one of the bit errors is in the MSB, a second single-bit error in the
MSB will cancel it out. I haven't done the math, but it should still
work out to better than a 99% chance of detecting random 2-bit errors.

On Fri, Mar 22, 2013 at 8:00 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> On Fri, 2013-03-22 at 17:09 +0200, Ants Aasma wrote:
>> So to conclude, the 3 approaches:
>
> One other question: assuming that the algorithms use the full 16-bit
> space, is there a good way to avoid zero without skewing the result? Can
> we do something like un-finalize (after we figure out that it's zero),
> compute in an extra salt value, and then re-finalize? That might work
> for Fletcher; but I don't think that works for CRC or Fowler-Noll-Vo
> because the final value is the same as the state.

Taking the Fletcher or CRC32 result modulo 65521 (the largest prime
below 2^16) only gives a very slight skew that shouldn't matter for any
practical purpose. For the SIMD FNV implementation we can just reduce
the 64 16-bit values down to 4, concatenate them into a single 64-bit
number (by skipping the last two reduction steps), and take the modulo
of that.

Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
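A scalar emulation of the scheme described in this thread might look
roughly as follows: 64 independent 16-bit lanes, each updated as
(lane xor input) * prime, then an FNV-like pairwise reduction down to 4
lanes, concatenated to 64 bits and taken modulo 65521. To be clear, the
lane prime (497) and the reduction order here are placeholder
assumptions for illustration, not the constants from the actual patch.

```python
PRIME = 497  # hypothetical odd 16-bit multiplier (odd, so invertible mod 2^16)

def simd_fnv_sketch(page: bytes) -> int:
    """Scalar sketch of the parallel 16-bit FNV-like checksum idea."""
    assert len(page) % 128 == 0           # 64 lanes * 2 bytes per round
    lanes = [0] * 64
    for off in range(0, len(page), 128):  # one "SIMD" round per 128 bytes
        for i in range(64):
            word = int.from_bytes(page[off + 2*i : off + 2*i + 2], "little")
            lanes[i] = ((lanes[i] ^ word) * PRIME) & 0xFFFF
    # FNV-like pairwise reduction: 64 -> 32 -> 16 -> 8 -> 4 lanes
    while len(lanes) > 4:
        half = len(lanes) // 2
        lanes = [((lanes[i] ^ lanes[i + half]) * PRIME) & 0xFFFF
                 for i in range(half)]
    # concatenate the 4 survivors into one 64-bit value (lane 0 lowest)
    acc = lanes[0] | (lanes[1] << 16) | (lanes[2] << 32) | (lanes[3] << 48)
    return acc % 65521                    # largest prime below 2^16
```

Since the multiplier is odd and hence invertible modulo 2^16, a single
corrupted input word keeps its lane nonzero through every subsequent
round, which is the property behind the single-bit-error claim above.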