Re: Adding pg_dump flag for parallel export to pipes - Mailing list pgsql-hackers
From | Hannu Krosing |
---|---|
Subject | Re: Adding pg_dump flag for parallel export to pipes |
Date | |
Msg-id | CAMT0RQQ1U3gLfeBhmk0Lef-1=wm707Ce9HuugMx_X5AXoG0Rbg@mail.gmail.com Whole thread Raw |
In response to | Re: Adding pg_dump flag for parallel export to pipes (Hannu Krosing <hannuk@google.com>) |
List | pgsql-hackers |
If there are no objections we will add this to the commitfest On Mon, Apr 7, 2025 at 9:48 PM Hannu Krosing <hannuk@google.com> wrote: > > > Just to bring this out separately : Does anybody have any idea why pipe commands close inside tests ? > > Re: 003-pg_dump_basic_tests has a few basic validation tests for > correctmflag combinations. We need to write more automated tests in > 002_pg_dump.pl but have been running into some issues with environment > setup due to which certain pipe commands result in the shell process > becoming defunct. These same commands are working fine in manual > testing. We are still looking into this. > > ---- > Hannu > > > On Mon, Apr 7, 2025 at 7:17 PM Nitin Motiani <nitinmotiani@google.com> wrote: >> >> Hi Hackers, >> >> We are proposing the ability to specify a pipe command to pg_dump by a >> flag. And attaching the patch set. >> >> Why : Currently it is quite simple to pipe the output of pg_dump for >> text format to a pipe at command line and do any manipulations >> necessary. Following is an example : >> >> pg_dump <flags> <dbname> | lz4 | pv -L 10k | ssh remote.host >> "cat - > remote.dump.lz4" >> >> Here we first compress the stream using lz4 and then send it over ssh >> to a remote host to be saved as a file while rate-limiting the network >> usage to 10KB/s. >> >> Something like this is not possible for format=directory (-Fd) since >> all you can provide is the directory name to store the individual >> files. Note it is not possible to do this irrespective of the usage of >> the parallel dump option ('--jobs' flag). >> >> While the directory format supports compression using a flag, the rest >> of the operations in the above example are not possible. And a pipe >> command provides more flexibility in what compression algorithm one >> wants to use. >> >> This patch set provides pg_dump the ability to pipe the data in the >> directory mode by using a new flag '--pipe-command' (in both parallel >> and non-parallel mode). >> >> We also add a similar option to pg_restore. >> >> The following can be the major use cases of these changes : >> 1. Stream pg_dump output to a cloud storage >> 2. SSH the data to a remote host (with or without throttling) >> 3. Custom compression options >> >> >> Usage Examples : Here is an example of how the pipe-command will look like. >> >> pg_dump -Fd mydb --pipe-command="cat > dumpdir/%f" (dumpdir >> should exist beforehand.) >> >> This is equivalent to >> >> pg_dump -Fd mydb --file=dumpdir >> >> (Please note that the flags '--file' or '--pipe-command' can't be used >> together.) >> >> For the more complex scenario as mentioned above, the command will be >> (with the parallelism of 5) : >> >> pg_dump -Fd mydb -j 5 --pipe-command="lz4 | pv -L 10k | ssh >> remote.host "cat > dumpdir/%f"" >> >> Please note the use of %f in the above examples. As a user would >> almost always want to write the post-processing output to a file (or >> perhaps a cloud location), we provide a format specifier %f in the >> command. The implementation of pipe-command replaces these format >> specifiers with the corresponding file names. These file names are the >> same as they would be in the current usage of directory format with >> '--file' flag (<dump_id>.dat, toc.dat, blob_NNN.toc, >> blob_<blob_id>.dat). >> >> The usage of this flag with pg_restore will also be similar. Here is >> an example of restoring from a gzip compressed dump directory. >> >> pg_restore -C -Fd -d postgres --pipe-commnad="cat >> dumpdir/%f.gz | gunzip" >> >> The new flag in pg_restore also works with '-l' and '-L' options >> >> pg_restore -C -Fd -d postgres --pipe-commnad="cat dumpdir/%f" -L db.list >> >> >> Implementation Details : Here are the major changes : >> 1. We reuse the same variables which store the file name to store >> the pipe command. And add a new bool fSpecIsPipe in _archiveHandle >> (similar bools in pg_dump.c and pg_restore.c) to specify if it's a >> pipe command. >> 2. In the cases when the above bool is set to true, we use popen >> and pclose instead of fopen and fclose. >> 3. To enable the format specifier %f in the pipe-command, we make >> changes to the file name creation logic in a few places. Currently the >> file name (corresponding to a table or large object) is appended to >> the directory name provided by '--file' command. In case of >> '--pipe-command', we use 'replace_percent_placeholders' to replace %f >> with the corresponding file name. This change is made for both table >> files and LO TOC files. >> >> With these core changes, the rest of the code continues working as-is. >> >> We are attaching 4 patches for this change : >> >> 001-pg_dump_pipe has the pg_dump pipe support code. >> 002-pg_restore_pipe has the pg_restore pipe support. >> 003-pg_dump_basic_tests has a few basic validation tests for >> correctmflag combinations. We need to write more automated tests in >> 002_pg_dump.pl but have been running into some issues with environment >> setup due to which certain pipe commands result in the shell process >> becoming defunct. These same commands are working fine in manual >> testing. We are still looking into this. >> 004-pg_dump_documentation has the proposed documentation changes. >> >> We are working on the above test issues and cleanup of the patches. >> >> Open Questions : There are a couple of open questions in the implementation : >> >> 1. Currently the LO TOC file (blob_NNN.toc) is opened in the >> append mode. This is not possible with popen for the pipe command. >> From reading the code, it seems to us that this file doesn't need to >> be opened in the append mode. As '_StartLOs' is called once per >> archive entry in WriteDataChunksForToCEntry followed by the dumper >> function and then '_EndLOs', it should be okay to change this to 'w' >> mode. But this code has been there since the start so we haven't made >> that change yet. In the patch, we have changed it to 'w' pipe-command >> only and added the ideas for potential solutions in the comments. >> 2. We are also not sure yet on how to handle the environment >> issues when trying to add new tests to 002_pg_dump.pl. >> >> Please let us know what you think. >> >> Thanks & Regards, >> Nitin Motiani >> Google
pgsql-hackers by date: