Re: parallelizing the archiver - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: parallelizing the archiver |
Date | |
Msg-id | CA+TgmoZUd6zBNb+boukVXrGAVgLyU-fPY+6yfiKj6abmNUCvWA@mail.gmail.com |
In response to | Re: parallelizing the archiver (Julien Rouhaud <rjuju123@gmail.com>) |
Responses | Re: parallelizing the archiver; Re: parallelizing the archiver |
List | pgsql-hackers |
On Fri, Sep 10, 2021 at 10:19 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> Those approaches don't really seem mutually exclusive? In both cases
> you will need to internally track the status of each WAL file and
> handle non-contiguous file sequences. In the case of parallel commands
> you only need the additional knowledge that some command is already
> working on a file. Wouldn't it be even better to eventually be able
> to launch multiple batches of multiple files rather than a single batch?

Well, I guess I'm not convinced. People with more knowledge of this than
I have may already know why it's beneficial, but in my experience commands
like 'cp' and 'scp' are usually limited by the speed of I/O, not by the
fact that you only have one of them running at once. Running several at
once, again in my experience, is typically not much faster. On the other
hand, scp has a LOT of startup overhead, so it's easy to see the benefit
of batching:

[rhaas pgsql]$ touch x y z
[rhaas pgsql]$ time sh -c 'scp x cthulhu: && scp y cthulhu: && scp z cthulhu:'
x                             100%  207KB  78.8KB/s   00:02
y                             100%    0     0.0KB/s   00:00
z                             100%    0     0.0KB/s   00:00

real    0m9.418s
user    0m0.045s
sys     0m0.071s
[rhaas pgsql]$ time sh -c 'scp x y z cthulhu:'
x                             100%  207KB 273.1KB/s   00:00
y                             100%    0     0.0KB/s   00:00
z                             100%    0     0.0KB/s   00:00

real    0m3.216s
user    0m0.017s
sys     0m0.020s

> If we start with parallelism first, the whole ecosystem could
> immediately benefit from it as is. To be able to handle multiple
> files in a single command, we would need some way to let the server
> know which files were successfully archived and which files weren't,
> so it requires a different communication approach than the command
> return code.

That is possibly true. I think it might work to just assume that you
have to retry everything if it exits non-zero, but that requires the
archive command to be smart enough to do something sensible if an
identical file is already present in the archive.

> But as I said, I'm not convinced that using the archive_command
> approach for that is the best approach. If I understand correctly,
> most of the backup solutions would prefer to have a daemon being
> launched and to use it as a queuing system. Wouldn't it be better to
> have a new archive_mode, e.g. "daemon", and have postgres responsible
> for (re)starting it, and pass information through the daemon's
> stdin/stdout or something like that?

Sure. Actually, I think a background worker would be better than a
separate daemon. Then it could just talk to shared memory directly.

--
Robert Haas
EDB: http://www.enterprisedb.com
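[Editor's note: as an illustration of that last point, here is a minimal, hypothetical sketch of how such an archiver could be registered as a background worker instead of being driven through archive_command. The module name archive_worker and the function archive_worker_main are invented for the example; the bgworker machinery itself (RegisterBackgroundWorker, WaitLatch, and so on) is the real PostgreSQL API, but nothing below is code proposed in this thread.]

```c
/*
 * Hypothetical "archive worker" sketch.  Built as a loadable module named
 * archive_worker and listed in shared_preload_libraries, so _PG_init() runs
 * in the postmaster and can register the worker.
 */
#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgworker.h"
#include "storage/latch.h"

PG_MODULE_MAGIC;

void		_PG_init(void);
PGDLLEXPORT void archive_worker_main(Datum main_arg);

void
archive_worker_main(Datum main_arg)
{
	/* No custom handlers installed; rely on the bgworker defaults. */
	BackgroundWorkerUnblockSignals();

	for (;;)
	{
		/*
		 * Here the worker would scan archive_status/ for .ready files,
		 * ship them (possibly several at a time), and report per-file
		 * results through shared memory rather than a command exit code.
		 */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 1000L,
						 PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
}

void
_PG_init(void)
{
	BackgroundWorker worker;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS;	/* direct shared-memory access */
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	worker.bgw_restart_time = 5;	/* restart 5s after a crash */
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "archive_worker");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "archive_worker_main");
	snprintf(worker.bgw_name, BGW_MAXLEN, "archive worker (sketch)");
	RegisterBackgroundWorker(&worker);
}
```

Because such a worker runs inside the server and has shared-memory access, it could read an archive queue and record which WAL files were archived successfully without round-tripping that information through a per-file command's return code, which is the communication problem the quoted message raises.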