Home > mailing lists

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

From	Thomas Munro
Subject	Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
Date	February 3, 2017 16:04:55
Msg-id	CAEepm=1_LTOeGGJr2BquEwUJmBnT6PtEQx5GmD=i42GvVS3bcQ@mail.gmail.com Whole thread Raw
In response to	Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) (Peter Geoghegan <pg@bowt.ie>)
Responses	Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation)
List	pgsql-hackers

Tree view

On Wed, Feb 1, 2017 at 8:46 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Jan 31, 2017 at 11:23 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> So I'm in favour of this patch, which is relatively simple and give us
>> faster index builds soon.  Eventually we might also be able to have
>> approach 1.  From what I gather, it's entirely possible that we might
>> still need 2 to fall back on in some cases.
>
> Right. And it can form the basis of an implementation of 1, which in
> any case seems to be much more compelling for parallel query, when a
> great deal more can be pushed down, and we are not particularly likely
> to be I/O bound (usually not much writing to the heap, or WAL
> logging).

I ran some tests today.  First I created test tables representing the
permutations of these choices:

Table structure:

  int = Integer key only
  intwide = Integer key + wide row
  text = Text key only (using dictionary words)
  textwide = Text key + wide row

Uniqueness:

  u = each value unique
  d = 10 duplicates of each value

Heap physical order:

  rand = Random
  asc = Ascending order (already sorted)
  desc = Descending order (sorted backwards)

I used 10 million rows for this test run, so that gave me 24 tables of
the following sizes as reported in "\d+":

  int tables = 346MB each
  intwide tables = 1817MB each
  text tables = 441MB each
  textwide tables = 1953MB each

It'd be interesting to test larger tables of course but I had a lot of
permutations to get through.

For each of those tables I ran tests corresponding to the permutations
of these three variables:

Index type:

  uniq = CREATE UNIQUE INDEX ("u" tables only, ie no duplicates)
  nonu = CREATE INDEX ("u" and "d" tables)

Maintenance memory: 1M, 64MB, 256MB, 512MB

Workers: from 0 up to 8

Environment:  EDB test machine "cthulhu", Intel(R) Xeon(R) CPU E7-
8830  @ 2.13GHz, 8 socket, 8 cores (16 threads) per socket, CentOS
7.2, Linux kernel 3.10.0-229.7.2.el7.x86_64, 512GB RAM, pgdata on SSD.
Database initialised with en_US.utf-8 collation, all defaults except
max_wal_size increased to 4GB (otherwise warnings about too frequent
checkpoints) and max_parallel_workers_maintenance = 8.  Testing done
with warm OS cache.

I applied your v2 patch on top of
7ac4a389a7dbddaa8b19deb228f0a988e79c5795^ to avoid a conflict.  It
still had a couple of harmless conflicts that I was able to deal with
(not code, just some header stuff moving around).

See full results from all permutations attached, but I wanted to
highlight the measurements from 'textwide', 'u', 'nonu' which show
interesting 'asc' numbers (data already sorted).  The 'mem' column is
maintenance_work_mem in megabytes.  The 'w = 0' column shows the time
in seconds for parallel_workers = 0.  The other 'w = N' columns show
times with higher parallel_workers settings, represented as speed-up
relative to the 'w = 0' time.

1. 'asc' = pre-sorted data (w = 0 shows time in seconds, other columns
show speed-up relative to that time):

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 119.97 | 4.61x | 4.83x | 5.32x | 5.61x | 5.88x | 6.10x | 6.18x | 6.09x
  64 |  19.42 | 1.18x | 1.10x | 1.23x | 1.23x | 1.16x | 1.19x | 1.20x | 1.21x
 256 |  18.35 | 1.02x | 0.92x | 0.98x | 1.02x | 1.06x | 1.07x | 1.08x | 1.10x
 512 |  17.75 | 1.01x | 0.89x | 0.95x | 0.99x | 1.02x | 1.05x | 1.06x | 1.07x

2. 'rand' = randomised data:

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 130.25 | 1.82x | 2.19x | 2.52x | 2.58x | 2.72x | 2.72x | 2.83x | 2.89x
  64 | 117.36 | 1.80x | 2.20x | 2.43x | 2.47x | 2.55x | 2.51x | 2.59x | 2.69x
 256 | 124.68 | 1.87x | 2.20x | 2.49x | 2.52x | 2.64x | 2.70x | 2.72x | 2.75x
 512 | 115.77 | 1.51x | 1.72x | 2.14x | 2.08x | 2.19x | 2.31x | 2.44x | 2.48x

3. 'desc' = reverse-sorted data:

 mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+--------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 115.19 | 1.88x | 2.39x | 2.78x | 3.50x | 3.62x | 4.20x | 4.19x | 4.39x
  64 | 112.17 | 1.85x | 2.25x | 2.99x | 3.63x | 3.65x | 4.01x | 4.31x | 4.62x
 256 | 119.55 | 1.76x | 2.21x | 2.85x | 3.43x | 3.37x | 3.77x | 4.24x | 4.28x
 512 | 119.50 | 1.85x | 2.19x | 2.87x | 3.26x | 3.28x | 3.74x | 4.24x | 3.93x

The 'asc' effects are much less pronounced when the key is an int.
Here is the equivalent data for 'intwide', 'u', 'nonu':

1.  'asc'

 mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 12.19 | 1.55x | 1.93x | 2.21x | 2.44x | 2.64x | 2.76x | 2.91x | 2.83x
  64 |  7.35 | 1.29x | 1.53x | 1.69x | 1.86x | 1.98x | 2.04x | 2.07x | 2.09x
 256 |  7.34 | 1.26x | 1.47x | 1.64x | 1.79x | 1.92x | 1.96x | 1.98x | 2.02x
 512 |  7.24 | 1.24x | 1.46x | 1.65x | 1.80x | 1.91x | 1.97x | 2.00x | 1.92x

2. 'rand'

 mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 15.16 | 1.56x | 2.01x | 2.32x | 2.57x | 2.73x | 2.87x | 2.95x | 2.91x
  64 | 12.97 | 1.55x | 1.97x | 2.25x | 2.44x | 2.58x | 2.70x | 2.74x | 2.71x
 256 | 13.14 | 1.47x | 1.86x | 2.12x | 2.31x | 2.50x | 2.62x | 2.58x | 2.69x
 512 | 13.61 | 1.48x | 1.91x | 2.22x | 2.37x | 2.55x | 2.65x | 2.73x | 2.73x

3. 'desc'

 mem | w = 0 | w = 1 | w = 2 | w = 3 | w = 4 | w = 5 | w = 6 | w = 7 | w = 8
-----+-------+-------+-------+-------+-------+-------+-------+-------+-------
   1 | 13.45 | 1.51x | 1.94x | 2.31x | 2.56x | 2.75x | 2.95x | 3.05x | 3.00x
  64 | 10.27 | 1.42x | 1.82x | 2.05x | 2.30x | 2.46x | 2.59x | 2.64x | 2.65x
 256 | 10.52 | 1.39x | 1.70x | 2.02x | 2.24x | 2.34x | 2.39x | 2.48x | 2.56x
 512 | 10.62 | 1.43x | 1.82x | 2.06x | 2.32x | 2.51x | 2.61x | 2.68x | 2.69x

Full result summary and scripts used for testing attached.

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

pgsql-hackers by date:

From: Heikki Linnakangas
Date: 03 February 2017, 15:52:52
Subject: Re: [HACKERS] Password identifiers, protocol aging and SCRAM protocol

From: Fabien COELHO
Date: 03 February 2017, 16:24:33
Subject: Re: \if, \elseif, \else, \endif (was Re: [HACKERS] PSQL commands:\quit_if, \quit_unless)

Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers

Attachment

Previous

Next