Re: parallelizing the archiver - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: parallelizing the archiver |
Date | |
Msg-id | CA+TgmoZUd6zBNb+boukVXrGAVgLyU-fPY+6yfiKj6abmNUCvWA@mail.gmail.com |
In response to | Re: parallelizing the archiver (Julien Rouhaud <rjuju123@gmail.com>) |
Responses | Re: parallelizing the archiver; Re: parallelizing the archiver |
List | pgsql-hackers |
On Fri, Sep 10, 2021 at 10:19 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> Those approaches don't really seem mutually exclusive? In both cases
> you will need to internally track the status of each WAL file and
> handle non-contiguous file sequences. In the case of parallel commands
> you only need the additional knowledge that some command is already
> working on a file. Wouldn't it be even better to eventually be able
> to launch multiple batches of multiple files rather than a single batch?

Well, I guess I'm not convinced. People with more knowledge of this than
I have may already know why it's beneficial, but in my experience commands
like 'cp' and 'scp' are usually limited by the speed of I/O, not by the
fact that you only have one of them running at once. Running several at
once, again in my experience, is typically not much faster. On the other
hand, scp has a LOT of startup overhead, so it's easy to see the benefit
of batching:

[rhaas pgsql]$ touch x y z
[rhaas pgsql]$ time sh -c 'scp x cthulhu: && scp y cthulhu: && scp z cthulhu:'
x                             100%  207KB  78.8KB/s   00:02
y                             100%    0     0.0KB/s   00:00
z                             100%    0     0.0KB/s   00:00

real    0m9.418s
user    0m0.045s
sys     0m0.071s
[rhaas pgsql]$ time sh -c 'scp x y z cthulhu:'
x                             100%  207KB 273.1KB/s   00:00
y                             100%    0     0.0KB/s   00:00
z                             100%    0     0.0KB/s   00:00

real    0m3.216s
user    0m0.017s
sys     0m0.020s

> If we start with parallelism first, the whole ecosystem could
> immediately benefit from it as is. To be able to handle multiple
> files in a single command, we would need some way to let the server
> know which files were successfully archived and which files weren't,
> so it requires a different communication approach than the command
> return code.

That is possibly true. I think it might work to just assume that you
have to retry everything if it exits non-zero, but that requires the
archive command to be smart enough to do something sensible if an
identical file is already present in the archive.

> But as I said, I'm not convinced that using the archive_command
> approach for that is the best approach. If I understand correctly,
> most of the backup solutions would prefer to have a daemon being
> launched and to use it as a queuing system. Wouldn't it be better to
> have a new archive_mode, e.g. "daemon", and have postgres responsible
> for (re)starting it, and pass information through the daemon's
> stdin/stdout or something like that?

Sure. Actually, I think a background worker would be better than a
separate daemon. Then it could just talk to shared memory directly.

--
Robert Haas
EDB: http://www.enterprisedb.com
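[Editor's note: as an illustration of that last point, here is a minimal, hypothetical sketch of how such an archiver could be registered as a background worker instead of being driven through archive_command. The module name archive_worker and the function archive_worker_main are invented for the example; the bgworker machinery itself (RegisterBackgroundWorker, WaitLatch, and so on) is the real PostgreSQL API, but nothing below is code proposed in this thread.]

```c
/*
 * Hypothetical "archive worker" sketch.  Built as a loadable module named
 * archive_worker and listed in shared_preload_libraries, so _PG_init() runs
 * in the postmaster and can register the worker.
 */
#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgworker.h"
#include "storage/latch.h"

PG_MODULE_MAGIC;

void		_PG_init(void);
PGDLLEXPORT void archive_worker_main(Datum main_arg);

void
archive_worker_main(Datum main_arg)
{
	/* No custom handlers installed; rely on the bgworker defaults. */
	BackgroundWorkerUnblockSignals();

	for (;;)
	{
		/*
		 * Here the worker would scan archive_status/ for .ready files,
		 * ship them (possibly several at a time), and report per-file
		 * results through shared memory rather than a command exit code.
		 */
		(void) WaitLatch(MyLatch,
						 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
						 1000L,
						 PG_WAIT_EXTENSION);
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS();
	}
}

void
_PG_init(void)
{
	BackgroundWorker worker;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS;	/* direct shared-memory access */
	worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
	worker.bgw_restart_time = 5;	/* restart 5s after a crash */
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "archive_worker");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "archive_worker_main");
	snprintf(worker.bgw_name, BGW_MAXLEN, "archive worker (sketch)");
	RegisterBackgroundWorker(&worker);
}
```

Because such a worker runs inside the server and has shared-memory access, it could read an archive queue and record which WAL files were archived successfully without round-tripping that information through a per-file command's return code, which is the communication problem the quoted message raises.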