Re: Review [was Re: MD5 aggregate] - Mailing list pgsql-hackers
From | David Fetter |
---|---|
Subject | Re: Review [was Re: MD5 aggregate] |
Date | |
Msg-id | 20130621200428.GJ5395@fetter.org Whole thread Raw |
In response to | Review [was Re: MD5 aggregate] (David Fetter <david@fetter.org>) |
Responses |
Re: Review [was Re: MD5 aggregate]
|
List | pgsql-hackers |
On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote: > On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: > > On 15 June 2013 10:22, Dean Rasheed <dean.a.rasheed@gmail.com> wrote: > > > There seem to be 2 separate directions that this could go, which > > > really meet different requirements: > > > > > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of > > > the order in which they are scanned. A possible approach to this might > > > be something like an aggregate > > > > > > md5_total(text/bytea) returns text > > > > > > that returns the sum of the md5 values of each input value, treating > > > each md5 value as an unsigned 128-bit integer, and then producing the > > > hexadecimal representation of the final sum. This should out-perform a > > > solution based on numeric addition, and in typical cases, the result > > > wouldn't be much longer than a regular md5 sum, and so would be easy > > > to eyeball for differences. > > > > > > > I've been playing around with the idea of an aggregate that computes > > the sum of the md5 hashes of each of its inputs, which I've called > > md5_total() for now, although I'm not particularly wedded to that > > name. Comparing it with md5_agg() on a 100M row table (see attached > > test script) produces interesting results: > > > > SELECT md5_agg(foo.*::text) > > FROM (SELECT * FROM foo ORDER BY id) foo; > > > > 50bc42127fb9b028c9708248f835ed8f > > > > Time: 92960.021 ms > > > > SELECT md5_total(foo.*::text) FROM foo; > > > > 02faea7fafee4d253fc94cfae031afc43c03479c > > > > Time: 96190.343 ms > > > > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its > > result is longer) but it seems like it would be very useful for > > quickly comparing data in SQL, since its value is not dependent on the > > row-order making it easier to use and better performing if there is no > > usable index for ordering. > > > > Note, however, that if there is an index that can be used for > > ordering, the performance is not necessarily better than md5_agg(), as > > this example shows. There is a small additional overhead per row for > > initialising the MD5 sums, and adding the results to the total, but I > > think the biggest factor is that md5_total() is processing more data. > > The reason is that MD5 works on 64-byte blocks, so the total amount of > > data going through the core MD5 algorithm is each row's size is > > rounded up to a multiple of 64. In this simple case it ends up > > processing around 1.5 times as much data: > > > > SELECT sum(length(foo.*::text)) AS md5_agg, > > sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo; > > > > md5_agg | md5_total > > ------------+------------- > > 8103815438 | 12799909248 > > > > although of course that overhead won't be as large on wider tables, > > and even in this case the overall performance is still on a par with > > md5_agg(). > > > > ISTM that both aggregates are potentially useful in different > > situations. I would probably typically use md5_total() because of its > > simplicity/order-independence and consistent performance, but > > md5_agg() might also be useful when comparing with external data. > > > > Regards, > > Dean > Performance review (skills needed: ability to time performance) > > Does the patch slow down simple tests? > > Yes. For an MD5 checksum of some random data, here are > results from master: > > shackle@postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)), > postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000)) > postgres-# select md5(a::text) FROM t2; > shackle@postgres:5493=# \timing > Timing is on. > shackle@postgres:5493=# \g > Time: 955.393 ms > shackle@postgres:5493=# \g > Time: 950.909 ms > shackle@postgres:5493=# \g > Time: 947.955 ms > shackle@postgres:5493=# \g > Time: 946.193 ms > shackle@postgres:5493=# \g > Time: 950.591 ms > shackle@postgres:5493=# \g > Time: 952.346 ms > shackle@postgres:5493=# \g > Time: 948.623 ms > shackle@postgres:5493=# \g > Time: 939.819 ms > > and here from master + the patch: > > WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)), > t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000)) > select md5(a::text) FROM t2; > Time: 1129.178 ms > shackle@postgres:5494=# \g > Time: 1134.013 ms > shackle@postgres:5494=# \g > Time: 1124.387 ms > shackle@postgres:5494=# \g > Time: 1119.733 ms > shackle@postgres:5494=# \g > Time: 1104.906 ms > shackle@postgres:5494=# \g > Time: 1137.055 ms > shackle@postgres:5494=# \g > Time: 1128.977 ms > shackle@postgres:5494=# \g > Time: 1143.619 ms > > Avg, StddevPopulation without patch: 948.97 ms, 4.353 ms > Avg, StddevPopulation with patch: 1127.73 ms, 11.06 ms The above was with -O0. Please find below with vanilla ./configure (-O2): Master: shackle@postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)), t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000)) select md5(a::text) FROM t2; Time: 469.101 ms shackle@postgres:5493=# \g Time: 464.122 ms shackle@postgres:5493=# \g Time: 461.411 ms shackle@postgres:5493=# \g Time: 458.222 ms shackle@postgres:5493=# \g Time: 463.616 ms shackle@postgres:5493=# \g Time: 463.983 ms shackle@postgres:5493=# \g Time: 453.770 ms shackle@postgres:5493=# \g Time: 456.729 ms MD5: WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)), t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000)) select md5(a::text) FROM t2; Time: 436.862 ms shackle@postgres:5494=# \t Showing only tuples. shackle@postgres:5494=# \t Tuples only is off. shackle@postgres:5494=# \g Time: 438.686 ms shackle@postgres:5494=# \g Time: 443.789 ms shackle@postgres:5494=# \g Time: 452.522 ms shackle@postgres:5494=# \g Time: 447.824 ms shackle@postgres:5494=# \g Time: 448.547 ms shackle@postgres:5494=# \g Time: 446.901 ms shackle@postgres:5494=# \g Time: 448.537 ms Average, stddev population for master: 461.36925, 4.5883631545 Average, stddev population for md5: 445.4585, 4.9892286729 so slightly faster on MD5, as expected. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
pgsql-hackers by date: