Re: pg_dump additional options for performance - Mailing list pgsql-hackers
From: Simon Riggs
Subject: Re: pg_dump additional options for performance
Msg-id: 1204023962.4252.231.camel@ebony.site
In response to: Re: pg_dump additional options for performance (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: pg_dump additional options for performance
List: pgsql-hackers
On Tue, 2008-02-26 at 00:39 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > ... So it would be good if we could dump objects in 3 groups
> > 1. all commands required to re-create table
> > 2. data
> > 3. all commands required to complete table after data load
>
> [ much subsequent discussion snipped ]
>
> BTW, what exactly was the use-case for this? The recent discussions
> about parallelizing pg_restore make it clear that the all-in-one
> dump file format still has lots to recommend it. So I'm just wondering
> what the actual advantage of splitting the dump into multiple files
> will be. It clearly makes life more complicated; what are we buying?

One file is convenient, but not fast. Convenience should be the default, with speed as an option.

A current pg_dump spends 99% of its time in the unload phase, and the unload of each table can potentially be parallelised.

A current pg_dump file has many potentially parallelisable tasks during the *reload* process:

* the individual COPY statements, with respect to each other
* the data within a COPY statement for a single table
* the addition of indexes
* some FK checks can be run in parallel, say 25-75% of them, typically all of the very largest

Also, writing data to multiple drives can increase the performance of both unload and reload.

ISTM the easiest way to gain the most benefit from parallelisation is to allow the COPY TO (unload) operation to be parallelised in pg_dump, splitting the output into one file per table. The COPY FROM (load) operation can then be parallelised fairly easily to match.

So that would mean we would run an unload like this

  pg_dump --pre-schema-file=f1 --save-snapshot --snapshot-id=X
  pg_dump -t bigtable --data-file=f2.1 --snapshot-id=X
  pg_dump -t bigtable2 --data-file=f2.2 --snapshot-id=X
  pg_dump -T bigtable -T bigtable2 --data-file=f2.3 --snapshot-id=X
  ... (add other tables to be run in parallel)
  pg_dump --post-schema-file=f3 --snapshot-id=X

and then run the reload like this

  psql -f f1
  psql -f f2.1 &
  psql -f f2.2 &
  psql -f f2.3 &
  wait
  psql -f f3

using shell notation for parallel operations just to simplify things.

This idea does *not* rely on saved snapshots, but the two play well together. I hope to write both (as agreed on the other thread) and to discuss all the syntax for the snapshot stuff separately.

We could go to even greater lengths by expressing each of the reload dependencies explicitly in the pg_dump output, rather than just implicitly via output ordering. It would then be possible for a concurrent psql/pg_restore to run multiple threads and execute tasks in an order that respects the dependencies. That would be cute, but it sounds way over-cooked to me: about 10 times more work to extract the maximum possible parallelisability, when we can get 2-3 times the performance with some much simpler changes. Note that another approach might be to attempt to parallelise each load individually, but that again is a lot of work and often not required.

So speeding up the data dump/reload portion seems quick and easy.

(Plus, I've pretty much finished writing the pg_dump changes and was going to publish them today... but that doesn't make it right, I know.)

--
Simon Riggs
2ndQuadrant  http://www.2ndQuadrant.com
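
As a minimal sketch only, the reload half of the example above could be written as a small shell script. The database name "mydb" and the loop over f2.* are assumptions introduced here for illustration; the pg_dump options in the unload example (--pre-schema-file, --data-file, --post-schema-file, --save-snapshot, --snapshot-id) are proposed new options rather than existing ones, so only the psql side is shown.

  #!/bin/sh
  # Sketch of the proposed reload sequence, using the file names from the
  # example above (f1 = pre-data schema, f2.* = per-table data, f3 = post-data).
  set -e
  DB=mydb        # target database; placeholder, not part of the proposal

  # Step 1: re-create the tables (serial; everything else depends on it).
  psql -d "$DB" -f f1

  # Step 2: load the per-table data files in parallel.
  for f in f2.*; do
      psql -d "$DB" -f "$f" &
  done
  wait           # block until every parallel load has finished

  # Step 3: indexes, constraints and FK checks, after the data is in place.
  psql -d "$DB" -f f3

The wait between step 2 and step 3 is what enforces the ordering described in the message: the data files can load concurrently, but index builds and FK checks start only after every COPY has completed.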