Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
| From | Thomas Munro |
|---|---|
| Subject | Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) |
| Date | |
| Msg-id | CAEepm=3A3srTeZUy_iQQyPpzV6S7M1dq+0rOZYgYr6JaQUN+hQ@mail.gmail.com |
| In response to | Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) (Peter Geoghegan <pg@bowt.ie>) |
| Responses | Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) |
| List | pgsql-hackers |
On Sat, Feb 4, 2017 at 2:45 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> It might just have been that the table was too small to be an
> effective target for parallel sequential scan with so many workers,
> and so a presorted best case CREATE INDEX, which isn't that different,
> also fails to see much benefit (compared to what you'd see with a
> similar case involving a larger table). In other words, I might have
> jumped the gun in emphasizing issues with hardware and I/O bandwidth
> over issues around data volume (that I/O parallelism is inherently not
> very helpful with these relatively small tables).
>
> As I've pointed out a couple of times before, bigger sorts will be
> more CPU bound because sorting itself has costs that grow
> linearithmically, whereas writing out runs has costs that grow
> linearly. The relative cost of the I/O can be expected to go down as
> input goes up for this reason. At the same time, a larger input might
> make better use of I/O parallelism, which reduces the cost paid in
> latency to write out runs in absolute terms.

Here are some results with your latest patch, using the same test as
before but this time with SCALE=100 (= 100,000,000 rows). The table
sizes are:

                             List of relations
 Schema |         Name         | Type  |    Owner     | Size  | Description
--------+----------------------+-------+--------------+-------+-------------
 public | million_words        | table | thomas.munro | 42 MB |
 public | some_words           | table | thomas.munro | 19 MB |
 public | test_intwide_u_asc   | table | thomas.munro | 18 GB |
 public | test_intwide_u_desc  | table | thomas.munro | 18 GB |
 public | test_intwide_u_rand  | table | thomas.munro | 18 GB |
 public | test_textwide_u_asc  | table | thomas.munro | 19 GB |
 public | test_textwide_u_desc | table | thomas.munro | 19 GB |
 public | test_textwide_u_rand | table | thomas.munro | 19 GB |

To reduce the number of combinations I did only unique data and built
only non-unique indexes with only 'wide' tuples (= key plus a text
column that holds a 151-character wide string, rather than just the
key), and also didn't bother with the 1MB memory size as suggested.
Here are the results up to 4 workers (a results table going up to 8
workers is attached, since it wouldn't format nicely if I pasted it
here). Again, the w = 0 time is seconds, the rest show relative
speed-up. This data was all in the OS page cache because of a dummy
run done first, and I verified with 'sar' that there was exactly 0
reading from the block device. The CPU was pegged on leader + workers
during sort runs, and then the leader's CPU hovered around 93-98%
during the merge/btree build. I had some technical problems getting a
cold-cache read-from-actual-disk-each-time test run to work properly,
but can go back and do that again if anyone thinks that would be
interesting data to see.
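For reference, each cell in the results below boils down to a timed
CREATE INDEX along roughly these lines (the key column name and the
worker-count knob shown here are illustrative stand-ins rather than
the exact commands from the driver script):

SET maintenance_work_mem = '64MB';            -- the "mem" column: 64, 256 or 512
SET max_parallel_maintenance_workers = 4;     -- the "w" column: 0..8 (illustrative knob)
\timing on
CREATE INDEX ON test_textwide_u_rand (key);   -- non-unique index over the 'wide' tuples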
   tab    | ord  | mem |  w = 0  | w = 1 | w = 2 | w = 3 | w = 4
----------+------+-----+---------+-------+-------+-------+-------
 intwide  | asc  |  64 |   67.91 | 1.26x | 1.46x | 1.62x | 1.73x
 intwide  | asc  | 256 |   67.84 | 1.23x | 1.48x | 1.63x | 1.79x
 intwide  | asc  | 512 |   69.01 | 1.25x | 1.50x | 1.63x | 1.80x
 intwide  | desc |  64 |   98.08 | 1.48x | 1.83x | 2.03x | 2.25x
 intwide  | desc | 256 |   99.87 | 1.43x | 1.80x | 2.03x | 2.29x
 intwide  | desc | 512 |  104.09 | 1.44x | 1.85x | 2.09x | 2.33x
 intwide  | rand |  64 |  138.03 | 1.56x | 2.04x | 2.42x | 2.58x
 intwide  | rand | 256 |  139.44 | 1.61x | 2.04x | 2.38x | 2.56x
 intwide  | rand | 512 |  138.96 | 1.52x | 2.03x | 2.28x | 2.57x
 textwide | asc  |  64 |  207.10 | 1.20x | 1.07x | 1.09x | 1.11x
 textwide | asc  | 256 |  200.62 | 1.19x | 1.06x | 1.04x | 0.99x
 textwide | asc  | 512 |  191.42 | 1.16x | 0.97x | 1.01x | 0.94x
 textwide | desc |  64 | 1382.48 | 1.89x | 2.37x | 3.18x | 3.87x
 textwide | desc | 256 | 1427.99 | 1.89x | 2.42x | 3.24x | 4.00x
 textwide | desc | 512 | 1453.21 | 1.86x | 2.39x | 3.23x | 3.75x
 textwide | rand |  64 | 1587.28 | 1.89x | 2.37x | 2.66x | 2.75x
 textwide | rand | 256 | 1557.90 | 1.85x | 2.34x | 2.64x | 2.73x
 textwide | rand | 512 | 1547.97 | 1.87x | 2.32x | 2.64x | 2.71x

"textwide" "asc" is nearly an order of magnitude faster than other
initial orders without parallelism, but then parallelism doesn't seem
to help it much. Also, using more than 64MB doesn't ever seem to help
very much; in the "desc" case it hinders.

I was curious to understand how performance changes if we become just
a bit less correlated (rather than completely uncorrelated or
perfectly inversely correlated), so I tried out a 'banana skin' case:
I took the contents of the textwide asc table and copied it to a new
table, and then moved the 900 words matching 'banana%' to the physical
end of the heap by deleting and reinserting them in one transaction
(roughly as sketched at the end of this mail). I guess if we were to
use this technology for CLUSTER, this might be representative of a
situation where you regularly recluster a growing table. The results
were pretty much like "asc":

   tab    |  ord   | mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4
----------+--------+-----+--------+-------+-------+-------+-------
 textwide | banana |  64 | 213.39 | 1.17x | 1.11x | 1.15x | 1.09x

It's hard to speculate about this, but I guess that a significant
number of indexes in real world databases might be uncorrelated to
insert order. A newly imported or insert-only table might have one
highly correlated index for a surrogate primary key or time column,
but other indexes might tend to be uncorrelated. But really, who
knows... in a kind of textbook perfectly correlated case such as a
time series table with an append-only time or sequence based key, you
might want to use BRIN rather than B-Tree anyway.

--
Thomas Munro
http://www.enterprisedb.com
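For concreteness, the 'banana skin' reshuffle mentioned above was
roughly along these lines; the table and column names here are
illustrative guesses rather than the exact statements I ran:

CREATE TABLE test_textwide_u_banana AS
  SELECT * FROM test_textwide_u_asc;

BEGIN;
-- Set aside the 900 'banana%' rows, then delete and reinsert them so the
-- new row versions land on fresh pages at the physical end of the heap.
CREATE TEMP TABLE banana_rows AS
  SELECT * FROM test_textwide_u_banana WHERE key LIKE 'banana%';
DELETE FROM test_textwide_u_banana WHERE key LIKE 'banana%';
INSERT INTO test_textwide_u_banana SELECT * FROM banana_rows;
COMMIT;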