Thread: parallel index creation: maintenance_work_mem not honored?
Hi all,
I noticed that after an upgrade from 9.6 to 14.x, Postgres started producing temp files.
I narrowed down the behavior to parallel operations.
I was able to reproduce the following on a test machine running latest Postgres 14.4 on Ubuntu 18.04
CREATE TABLE test_parallel (i int);
INSERT INTO test_parallel VALUES (generate_series(0,10000000));
table size: 346 MB
maintenance_work_mem = '5GB' (but the same would happen if we set much higher values)
CREATE INDEX test_parallel_i_idx ON test_parallel (i);
... LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp4611.1.sharedfileset/2.0", size 37879808
... STATEMENT: CREATE INDEX test_parallel_i_idx ON test_parallel (i);
... LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp4611.1.sharedfileset/0.0", size 59326464
... STATEMENT: CREATE INDEX test_parallel_i_idx ON test_parallel (i);
... LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp4611.1.sharedfileset/1.0", size 103194624
... STATEMENT: CREATE INDEX test_parallel_i_idx ON test_parallel (i);
After disabling parallelism with:
ALTER TABLE test_parallel SET (parallel_workers = 0);
The same 'CREATE INDEX' statement does not produce temp files.
Is this behavior expected? I cannot find mentions on the documentation nor on the literature available to me.
regards,
fabio pardi
I noticed that after an upgrade from 9.6 to 14.x, Postgres started producing temp files.
I narrowed down the behavior to parallel operations.
I was able to reproduce the following on a test machine running latest Postgres 14.4 on Ubuntu 18.04
CREATE TABLE test_parallel (i int);
INSERT INTO test_parallel VALUES (generate_series(0,10000000));
table size: 346 MB
maintenance_work_mem = '5GB' (but the same would happen if we set much higher values)
CREATE INDEX test_parallel_i_idx ON test_parallel (i);
... LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp4611.1.sharedfileset/2.0", size 37879808
... STATEMENT: CREATE INDEX test_parallel_i_idx ON test_parallel (i);
... LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp4611.1.sharedfileset/0.0", size 59326464
... STATEMENT: CREATE INDEX test_parallel_i_idx ON test_parallel (i);
... LOG: temporary file: path "base/pgsql_tmp/pgsql_tmp4611.1.sharedfileset/1.0", size 103194624
... STATEMENT: CREATE INDEX test_parallel_i_idx ON test_parallel (i);
After disabling parallelism with:
ALTER TABLE test_parallel SET (parallel_workers = 0);
The same 'CREATE INDEX' statement does not produce temp files.
Is this behavior expected? I cannot find mentions on the documentation nor on the literature available to me.
regards,
fabio pardi
On Mon, Jun 20, 2022 at 2:42 AM Fabio Pardi <f.pardi@portavita.eu> wrote: > Is this behavior expected? I cannot find mentions on the documentation nor on the literature available to me. Yes, this is expected. Parallel CREATE INDEX performs its parallel sort by having workers generate sorted runs, which are written to temp files, and then having the leader process merge the sorted runs together as the index is built. In the case where there is amble maintenance_work_mem, all writes to and reads from temp files will be sequential. -- Peter Geoghegan
Thanks Peter for your answer,
Just to make sure we are on the same page: every parallel 'CREATE INDEX' writes to temp files, ok.
But why the temp files are always on disk and not in RAM as other operations do?
On 20/06/2022 17:47, Peter Geoghegan wrote:
Yes, this is expected. Parallel CREATE INDEX performs its parallel sort by having workers generate sorted runs, which are written to temp files, and then having the leader process merge the sorted runs together as the index is built.
Just to make sure we are on the same page: every parallel 'CREATE INDEX' writes to temp files, ok.
But why the temp files are always on disk and not in RAM as other operations do?
In the case where there is amble maintenance_work_mem, all writes to and reads from temp files will be sequential.
On Mon, Jun 20, 2022 at 10:38 PM Fabio Pardi <f.pardi@portavita.eu> wrote: > Just to make sure we are on the same page: every parallel 'CREATE INDEX' writes to temp files, ok. Yes. > But why the temp files are always on disk and not in RAM as other operations do? Why not? It is probably true that it would be better to use shared memory in the case where there is sufficient memory, but it's not all that significant compared to everything else that must happen at the same time. It's rarely the bottleneck. External sorts are often faster than internal sorts. That happens because they can be more CPU cache friendly, for reasons that aren't fundamental. > In the case where there is amble maintenance_work_mem, all writes to > and reads from temp files will be sequential. Yes. -- Peter Geoghegan
Thanks for the clarification Peter, On 21/06/2022 08:46, Peter Geoghegan wrote: > >> But why the temp files are always on disk and not in RAM as other operations do? > Why not? Systems running on slow disks will probably suffer from workers writing temp files to disk. > It is probably true that it would be better to use shared memory in > the case where there is sufficient memory, but it's not all that > significant compared to everything else that must happen at the same > time. It's rarely the bottleneck. External sorts are often faster than > internal sorts. That happens because they can be more CPU cache > friendly, for reasons that aren't fundamental. > I would expect this behavior to be mentioned in the documentation. regards, fabio pardi
Hi all,
Not quite following this line of logic... If "External sorts are often faster than internal sorts", why bother with increasing work_mem to avoid sorting on disk?
Regards,
Michael Vitale
Fabio Pardi wrote on 6/22/2022 2:15 AM:

Not quite following this line of logic... If "External sorts are often faster than internal sorts", why bother with increasing work_mem to avoid sorting on disk?
Regards,
Michael Vitale
Fabio Pardi wrote on 6/22/2022 2:15 AM:
Thanks for the clarification Peter,
On 21/06/2022 08:46, Peter Geoghegan wrote:But why the temp files are always on disk and not in RAM as other operations do?Why not?
Systems running on slow disks will probably suffer from workers writing temp files to disk.It is probably true that it would be better to use shared memory in
the case where there is sufficient memory, but it's not all that
significant compared to everything else that must happen at the same
time. It's rarely the bottleneck. External sorts are often faster than
internal sorts. That happens because they can be more CPU cache
friendly, for reasons that aren't fundamental.
I would expect this behavior to be mentioned in the documentation.
regards,
fabio pardi
Regards,
Michael Vitale
703-600-9343

Attachment
On Sat, Jun 25, 2022 at 9:37 AM MichaelDBA <MichaelDBA@sqlexec.com> wrote: > If "External sorts are often faster than internal sorts", why bother with increasing work_mem to avoid sorting on disk? They're faster despite spilling to disk -- not because of it. This effect is quite visible at times, though usually only with large sorts. It's noticeable, but not huge. And it depends on things like the datatypes being sorted, to a degree. (My original point was that external sorts shouldn't be assumed to be significantly slower or significantly less efficient than equivalent large internal sorts, where CPU costs tend to dominate either way.) Only an external sort can do a final on-the-fly merge, where tuplesort outputs tuples in their final sorted order before the entire input has been fully sorted. It's not uncommon for that to more than make up for any latency from writing out temp files. The added latency is usually not noticeable because we also need to read from random locations in memory to gather tuples to output to temp files. That can totally hide any I/O latency in many important cases. In principle you could get the same benefit with an internal sort, without any I/O for temp files -- nothing makes it fundamentally impossible to teach tuplesort how to do a similar final on-the-fly strategy. But that's tricky for a variety of reasons, mostly related to the fact that the final size of the input cannot easily be predicted. That's why it hasn't been implemented. -- Peter Geoghegan