Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
| From | Thomas Munro |
|---|---|
| Subject | Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) |
| Date | |
| Msg-id | CAEepm=3A3srTeZUy_iQQyPpzV6S7M1dq+0rOZYgYr6JaQUN+hQ@mail.gmail.com |
| In response to | Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) (Peter Geoghegan <pg@bowt.ie>) |
| Responses | Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) |
| List | pgsql-hackers |
On Sat, Feb 4, 2017 at 2:45 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> It might just have been that the table was too small to be an
> effective target for parallel sequential scan with so many workers,
> and so a presorted best case CREATE INDEX, which isn't that different,
> also fails to see much benefit (compared to what you'd see with a
> similar case involving a larger table). In other words, I might have
> jumped the gun in emphasizing issues with hardware and I/O bandwidth
> over issues around data volume (that I/O parallelism is inherently not
> very helpful with these relatively small tables).
>
> As I've pointed out a couple of times before, bigger sorts will be
> more CPU bound because sorting itself has costs that grow
> linearithmically, whereas writing out runs has costs that grow
> linearly. The relative cost of the I/O can be expected to go down as
> input goes up for this reason. At the same time, a larger input might
> make better use of I/O parallelism, which reduces the cost paid in
> latency to write out runs in absolute terms.

Here are some results with your latest patch, using the same test as
before but this time with SCALE=100 (= 100,000,000 rows). The table
sizes are:

                             List of relations
 Schema |         Name         | Type  |    Owner     | Size  | Description
--------+----------------------+-------+--------------+-------+-------------
 public | million_words        | table | thomas.munro | 42 MB |
 public | some_words           | table | thomas.munro | 19 MB |
 public | test_intwide_u_asc   | table | thomas.munro | 18 GB |
 public | test_intwide_u_desc  | table | thomas.munro | 18 GB |
 public | test_intwide_u_rand  | table | thomas.munro | 18 GB |
 public | test_textwide_u_asc  | table | thomas.munro | 19 GB |
 public | test_textwide_u_desc | table | thomas.munro | 19 GB |
 public | test_textwide_u_rand | table | thomas.munro | 19 GB |

To reduce the number of combinations I did only unique data and built
only non-unique indexes with only 'wide' tuples (= key plus a text
column that holds a 151-character wide string, rather than just the
key), and also didn't bother with the 1MB memory size as suggested.
Here are the results up to 4 workers (a results table going up to 8
workers is attached, since it wouldn't format nicely if I pasted it
here). Again, the w = 0 time is seconds, the rest show relative
speed-up. This data was all in the OS page cache because of a dummy
run done first, and I verified with 'sar' that there was exactly 0
reading from the block device. The CPU was pegged on leader + workers
during sort runs, and then the leader's CPU hovered around 93-98%
during the merge/btree build. I had some technical problems getting a
cold-cache read-from-actual-disk-each-time test run to work properly,
but can go back and do that again if anyone thinks that would be
interesting data to see.
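For reference, each cell in the results below boils down to a timed
CREATE INDEX along roughly these lines (the key column name and the
worker-count knob shown here are illustrative stand-ins rather than
the exact commands from the driver script):

SET maintenance_work_mem = '64MB';            -- the "mem" column: 64, 256 or 512
SET max_parallel_maintenance_workers = 4;     -- the "w" column: 0..8 (illustrative knob)
\timing on
CREATE INDEX ON test_textwide_u_rand (key);   -- non-unique index over the 'wide' tuples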
   tab    | ord  | mem |  w = 0  | w = 1 | w = 2 | w = 3 | w = 4
----------+------+-----+---------+-------+-------+-------+-------
 intwide  | asc  |  64 |   67.91 | 1.26x | 1.46x | 1.62x | 1.73x
 intwide  | asc  | 256 |   67.84 | 1.23x | 1.48x | 1.63x | 1.79x
 intwide  | asc  | 512 |   69.01 | 1.25x | 1.50x | 1.63x | 1.80x
 intwide  | desc |  64 |   98.08 | 1.48x | 1.83x | 2.03x | 2.25x
 intwide  | desc | 256 |   99.87 | 1.43x | 1.80x | 2.03x | 2.29x
 intwide  | desc | 512 |  104.09 | 1.44x | 1.85x | 2.09x | 2.33x
 intwide  | rand |  64 |  138.03 | 1.56x | 2.04x | 2.42x | 2.58x
 intwide  | rand | 256 |  139.44 | 1.61x | 2.04x | 2.38x | 2.56x
 intwide  | rand | 512 |  138.96 | 1.52x | 2.03x | 2.28x | 2.57x
 textwide | asc  |  64 |  207.10 | 1.20x | 1.07x | 1.09x | 1.11x
 textwide | asc  | 256 |  200.62 | 1.19x | 1.06x | 1.04x | 0.99x
 textwide | asc  | 512 |  191.42 | 1.16x | 0.97x | 1.01x | 0.94x
 textwide | desc |  64 | 1382.48 | 1.89x | 2.37x | 3.18x | 3.87x
 textwide | desc | 256 | 1427.99 | 1.89x | 2.42x | 3.24x | 4.00x
 textwide | desc | 512 | 1453.21 | 1.86x | 2.39x | 3.23x | 3.75x
 textwide | rand |  64 | 1587.28 | 1.89x | 2.37x | 2.66x | 2.75x
 textwide | rand | 256 | 1557.90 | 1.85x | 2.34x | 2.64x | 2.73x
 textwide | rand | 512 | 1547.97 | 1.87x | 2.32x | 2.64x | 2.71x

"textwide" "asc" is nearly an order of magnitude faster than other
initial orders without parallelism, but then parallelism doesn't seem
to help it much. Also, using more than 64MB doesn't ever seem to help
very much; in the "desc" case it hinders.

I was curious to understand how performance changes if we become just
a bit less correlated (rather than completely uncorrelated or
perfectly inversely correlated), so I tried out a 'banana skin' case:
I took the contents of the textwide asc table and copied it to a new
table, and then moved the 900 words matching 'banana%' to the physical
end of the heap by deleting and reinserting them in one transaction
(roughly as sketched at the end of this mail). I guess if we were to
use this technology for CLUSTER, this might be representative of a
situation where you regularly recluster a growing table. The results
were pretty much like "asc":

   tab    |  ord   | mem | w = 0  | w = 1 | w = 2 | w = 3 | w = 4
----------+--------+-----+--------+-------+-------+-------+-------
 textwide | banana |  64 | 213.39 | 1.17x | 1.11x | 1.15x | 1.09x

It's hard to speculate about this, but I guess that a significant
number of indexes in real world databases might be uncorrelated to
insert order. A newly imported or insert-only table might have one
highly correlated index for a surrogate primary key or time column,
but other indexes might tend to be uncorrelated. But really, who
knows... in a kind of textbook perfectly correlated case such as a
time series table with an append-only time or sequence based key, you
might want to use BRIN rather than B-Tree anyway.

--
Thomas Munro
http://www.enterprisedb.com
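For concreteness, the 'banana skin' reshuffle mentioned above was
roughly along these lines; the table and column names here are
illustrative guesses rather than the exact statements I ran:

CREATE TABLE test_textwide_u_banana AS
  SELECT * FROM test_textwide_u_asc;

BEGIN;
-- Set aside the 900 'banana%' rows, then delete and reinsert them so the
-- new row versions land on fresh pages at the physical end of the heap.
CREATE TEMP TABLE banana_rows AS
  SELECT * FROM test_textwide_u_banana WHERE key LIKE 'banana%';
DELETE FROM test_textwide_u_banana WHERE key LIKE 'banana%';
INSERT INTO test_textwide_u_banana SELECT * FROM banana_rows;
COMMIT;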