Re: pg_waldump: support decoding of WAL inside tarfile - Mailing list pgsql-hackers

From Amul Sul
Subject Re: pg_waldump: support decoding of WAL inside tarfile
Date
Msg-id CAAJ_b95PZp+XnATM=rhV2MEHRa4mL9HS8_JAMh=n-COHUP_uEg@mail.gmail.com
Whole thread Raw
In response to Re: pg_waldump: support decoding of WAL inside tarfile  (Jakub Wartak <jakub.wartak@enterprisedb.com>)
List pgsql-hackers
On Mon, Sep 8, 2025 at 7:07 PM Jakub Wartak
<jakub.wartak@enterprisedb.com> wrote:
>
> On Tue, Aug 26, 2025 at 1:53 PM Amul Sul <sulamul@gmail.com> wrote:
> >
> [..patch]
>
> Hi Amul!
>

Thanks for your review. I'm replying to a few of your comments now,
but for the rest, I need to think about them. I'm kind of in agreement
with some of them for the fix, but I won't be able to spend time on
that next week due to official travel. I'll try to get back as soon as
possible after that.

>     a.  Why should it be necessary to provide startLSN (-s) ? Couldn't
> it autodetect the first WAL (tar file) inside and just use that with
> some info message?
>     $ /usr/pgsql19/bin/pg_waldump --path=/tmp/base/pg_wal.tar
>     pg_waldump: error: no start WAL location given
>

There are two reasons. First,  existing pg_waldump
--path=some_directory would result in the same error. Second, it would
force us to re-read the archive twice just to locate the first WAL
segment, which is inefficient.

>     c. It doesnt work when using SEGSTART, but it's there:
>     $ /usr/pgsql19/bin/pg_waldump --path=/tmp/base/pg_wal.tar
> 000000010000000000000059
>     pg_waldump: error: could not open file "000000010000000000000059":
> Not a directory
>     $ tar tf /tmp/base/pg_wal.tar | head -1
>     000000010000000000000059
>

I don't believe this is the correct use case. The WAL files are inside
a tar archive, and the requirement is to use a starting LSN and a
timeline (if not the default).

>     d.     I've later noticed that follow-up patches seem to use the
> -s switch and there it seems to work OK. The above SEGSTART issue was
> not detected, probably because tests need to be extended cover  of
> segment name rather than just --start LSN (see test_pg_waldump):
>     $ /usr/pgsql19/bin/pg_waldump --path=/tmp/base/pg_wal.tar --stats
> -s 0/59000358
>     pg_waldump: first record is after 0/59000358, at 0/590003E8,
> skipping over 144 bytes
>     WAL statistics between 0/590003E8 and 0/61000000:
>     [..]
>

Hope previous reasoning makes sense to you.

>     e. Code around`if (walpath == NULL && directory != NULL)` needs
> some comments.
>

I think this is an existing one.

>     f. Code around `if (fname != NULL && is_tar_file(fname,
> &compression))` , so if fname is WAL segment here
> (00000001000000000000005A) and we do check again if that has been
> tar-ed (is_tar_file())? Why?
>

Again, how?

>     g. Just a question: the commit message says `Note that this patch
> requires that the WAL files within the archive be in sequential order;
> an error will be reported otherwise`. I'm wondering if such
> occurrences are known to be happening in the wild? Or is it just an
> assumption that if someone would modify the tar somehow? (either way
> we could just add a reason why we need to handle such a case if we
> know -- is manual alternation the only source of such state?). For the
> record, I've tested crafting custom archives with out of sequence WAL
> archives and the code seems to work (it was done using: tar --append
> -f pg_wal.tar --format=ustar ..)
>

This is an almost nonexistent occurrence. While pg_basebackup archives
WAL files in sequential order, we don't have an explicit code to
enforce that order within it. Furthermore, since we can't control how
external tools might handle the files, this extra precaution is
necessary.


> Another open question I have is this: shouldn't backup_manifest come
> with CRC checksum for the archived WALs? Or does that guarantee that
> backup_manifest WAL-Ranges are present in pg_wal.tar is good enough
> because individual WAL files are CRC-protected itself?
>

I don't know, I have to check pg_verifybackup.

Regards,
Amul



pgsql-hackers by date:

Previous
From: Amit Kapila
Date:
Subject: Re: How can end users know the cause of LR slot sync delays?
Next
From: Sergey Shinderuk
Date:
Subject: Re: Error with DEFAULT VALUE in temp table