Home > mailing lists
Re: Review [was Re: MD5 aggregate] - Mailing list pgsql-hackers

From	David Fetter
Subject	Re: Review [was Re: MD5 aggregate]
Date	June 21, 2013 20:04:32
Msg-id	20130621200428.GJ5395@fetter.org Whole thread Raw
In response to	Review [was Re: MD5 aggregate] (David Fetter <david@fetter.org>)
Responses	Re: Review [was Re: MD5 aggregate]
List	pgsql-hackers
Tree view
On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote:
> On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote:
> > On 15 June 2013 10:22, Dean Rasheed <dean.a.rasheed@gmail.com> wrote:
> > > There seem to be 2 separate directions that this could go, which
> > > really meet different requirements:
> > >
> > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of
> > > the order in which they are scanned. A possible approach to this might
> > > be something like an aggregate
> > >
> > > md5_total(text/bytea) returns text
> > >
> > > that returns the sum of the md5 values of each input value, treating
> > > each md5 value as an unsigned 128-bit integer, and then producing the
> > > hexadecimal representation of the final sum. This should out-perform a
> > > solution based on numeric addition, and in typical cases, the result
> > > wouldn't be much longer than a regular md5 sum, and so would be easy
> > > to eyeball for differences.
> > >
> > 
> > I've been playing around with the idea of an aggregate that computes
> > the sum of the md5 hashes of each of its inputs, which I've called
> > md5_total() for now, although I'm not particularly wedded to that
> > name. Comparing it with md5_agg() on a 100M row table (see attached
> > test script) produces interesting results:
> > 
> > SELECT md5_agg(foo.*::text)
> >   FROM (SELECT * FROM foo ORDER BY id) foo;
> > 
> >  50bc42127fb9b028c9708248f835ed8f
> > 
> > Time: 92960.021 ms
> > 
> > SELECT md5_total(foo.*::text) FROM foo;
> > 
> >  02faea7fafee4d253fc94cfae031afc43c03479c
> > 
> > Time: 96190.343 ms
> > 
> > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its
> > result is longer) but it seems like it would be very useful for
> > quickly comparing data in SQL, since its value is not dependent on the
> > row-order making it easier to use and better performing if there is no
> > usable index for ordering.
> > 
> > Note, however, that if there is an index that can be used for
> > ordering, the performance is not necessarily better than md5_agg(), as
> > this example shows. There is a small additional overhead per row for
> > initialising the MD5 sums, and adding the results to the total, but I
> > think the biggest factor is that md5_total() is processing more data.
> > The reason is that MD5 works on 64-byte blocks, so the total amount of
> > data going through the core MD5 algorithm is each row's size is
> > rounded up to a multiple of 64. In this simple case it ends up
> > processing around 1.5 times as much data:
> > 
> > SELECT sum(length(foo.*::text)) AS md5_agg,
> >        sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo;
> > 
> >   md5_agg   |  md5_total
> > ------------+-------------
> >  8103815438 | 12799909248
> > 
> > although of course that overhead won't be as large on wider tables,
> > and even in this case the overall performance is still on a par with
> > md5_agg().
> > 
> > ISTM that both aggregates are potentially useful in different
> > situations. I would probably typically use md5_total() because of its
> > simplicity/order-independence and consistent performance, but
> > md5_agg() might also be useful when comparing with external data.
> > 
> > Regards,
> > Dean

> Performance review (skills needed: ability to time performance)
> 
>     Does the patch slow down simple tests?
> 
>         Yes.  For an MD5 checksum of some random data, here are
>         results from master:
> 
>             shackle@postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM
generate_series(1,10000)),
>             postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
>             postgres-# select md5(a::text) FROM t2;
>             shackle@postgres:5493=# \timing 
>             Timing is on.
>             shackle@postgres:5493=# \g
>             Time: 955.393 ms
>             shackle@postgres:5493=# \g
>             Time: 950.909 ms
>             shackle@postgres:5493=# \g
>             Time: 947.955 ms
>             shackle@postgres:5493=# \g
>             Time: 946.193 ms
>             shackle@postgres:5493=# \g
>             Time: 950.591 ms
>             shackle@postgres:5493=# \g
>             Time: 952.346 ms
>             shackle@postgres:5493=# \g
>             Time: 948.623 ms
>             shackle@postgres:5493=# \g
>             Time: 939.819 ms
> 
>         and here from master + the patch:
> 
>             WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,10000)),
>             t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
>             select md5(a::text) FROM t2;
>             Time: 1129.178 ms
>             shackle@postgres:5494=# \g
>             Time: 1134.013 ms
>             shackle@postgres:5494=# \g
>             Time: 1124.387 ms
>             shackle@postgres:5494=# \g
>             Time: 1119.733 ms
>             shackle@postgres:5494=# \g
>             Time: 1104.906 ms
>             shackle@postgres:5494=# \g
>             Time: 1137.055 ms
>             shackle@postgres:5494=# \g
>             Time: 1128.977 ms
>             shackle@postgres:5494=# \g
>             Time: 1143.619 ms
> 
>             Avg, StddevPopulation without patch: 948.97 ms, 4.353 ms
>             Avg, StddevPopulation with patch: 1127.73 ms, 11.06 ms

The above was with -O0.  Please find below with vanilla ./configure (-O2):

Master:
shackle@postgres:5493=# WITH t1 AS (SELECT
string_agg(chr(floor(255*random()+1)::int),'') AS a FROM
generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 469.101 ms
shackle@postgres:5493=# \g
Time: 464.122 ms
shackle@postgres:5493=# \g
Time: 461.411 ms
shackle@postgres:5493=# \g
Time: 458.222 ms
shackle@postgres:5493=# \g
Time: 463.616 ms
shackle@postgres:5493=# \g
Time: 463.983 ms
shackle@postgres:5493=# \g
Time: 453.770 ms
shackle@postgres:5493=# \g
Time: 456.729 ms

MD5:
WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a
FROM generate_series(1,10000)),
t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,10000))
select md5(a::text) FROM t2;
Time: 436.862 ms
shackle@postgres:5494=# \t
Showing only tuples.
shackle@postgres:5494=# \t
Tuples only is off.
shackle@postgres:5494=# \g
Time: 438.686 ms
shackle@postgres:5494=# \g
Time: 443.789 ms
shackle@postgres:5494=# \g
Time: 452.522 ms
shackle@postgres:5494=# \g
Time: 447.824 ms
shackle@postgres:5494=# \g
Time: 448.547 ms
shackle@postgres:5494=# \g
Time: 446.901 ms
shackle@postgres:5494=# \g
Time: 448.537 ms

Average, stddev population for master: 461.36925, 4.5883631545    
Average, stddev population for md5:    445.4585,  4.9892286729

so slightly faster on MD5, as expected.

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
pgsql-hackers by date:
From: Heikki Linnakangas
Date: 21 June 2013, 19:43:16
Subject: Re: GIN improvements part2: fast scan
From: Robert Haas
Date: 21 June 2013, 20:40:00
Subject: Re: backend hangs at immediate shutdown (Re: Back-branch update releases coming in a couple weeks)
Re: Review [was Re: MD5 aggregate] - Mailing list pgsql-hackers

Previous

Next