Re: pg_waldump: support decoding of WAL inside tarfile - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: pg_waldump: support decoding of WAL inside tarfile
Msg-id: CA+TgmoayDY5b+bP1vRRN7A3xOP-=+tK13B2C1g-Xm1j4WTrT9Q@mail.gmail.com
In response to: Re: pg_waldump: support decoding of WAL inside tarfile (Amul Sul <sulamul@gmail.com>)
List: pgsql-hackers
On Mon, Sep 29, 2025 at 12:17 PM Amul Sul <sulamul@gmail.com> wrote:
> While reusing the buffered data from the first iteration is
> technically possible, that only works if the desired start LSN is at
> the absolute beginning of the archive, or later in the sequence,
> which cannot be reliably guaranteed.

I spent a bunch of time studying this code today and I think that the problem you're talking about here is evidence of a design problem with astreamer_wal_read() and some of the other code in astreamer_waldump.c. Your code calls astreamer_wal_read() when it wants to peek at the first xlog block to determine the WAL segment size, and it also calls astreamer_wal_read() when it wants to read WAL sequentially, beginning at the start LSN and continuing until it reaches the end LSN. However, these two cases have very different requirements. verify_tar_archive(), which is misleadingly named and really exists to determine the WAL segment size, just wants to read the first xlog block that physically appears in the archive. Every xlog block will have the same WAL segment size, so it does not matter which one we read. On the other hand, TarWALDumpReadPage() wants to read WAL in sequential order. In other words, one call to astreamer_wal_read() really wants to read a block without any block reordering, and the other call wants to read a block with block reordering.

To me, it looks like the problem here is that the block reordering functionality should live on top of the astreamer, not inside of it. Imagine that the astreamer just spits out the bytes in the order in which they physically appear in the archive, and then there's another component that consumes and reorders those bytes. So, you read data and push it into the astreamer until the number of bytes in the output buffer is at least XLOG_BLCKSZ, and then from there you extract the WAL segment size. Then, you call XLogReaderAllocate() and enter the main loop. The reordering logic lives inside of TarWALDumpReadPage().
Each time it gets data from the astreamer's buffer, it either returns it to the caller if it's in order or buffers it using temporary files if not.

I found it's actually quite easy to write a patch that avoids reopening the file. Here it is, on top of your v4:

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 2c42df46d43..c4346a5e211 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -368,17 +368,8 @@ init_tar_archive_reader(XLogDumpPrivate *private, const char *waldir,
 						XLogRecPtr startptr, XLogRecPtr endptr,
 						pg_compress_algorithm compression)
 {
-	int			fd;
 	astreamer  *streamer;

-	/* Open tar archive and store its file descriptor */
-	fd = open_file_in_directory(waldir, private->archive_name);
-
-	if (fd < 0)
-		pg_fatal("could not open file \"%s\"", private->archive_name);
-
-	private->archive_fd = fd;
-
 	/*
 	 * Create an appropriate chain of archive streamers for reading the given
 	 * tar archive.
@@ -1416,12 +1407,22 @@ main(int argc, char **argv)
 	/* we have everything we need, start reading */
 	if (is_tar)
 	{
+		/* Open tar archive and store its file descriptor */
+		private.archive_fd =
+			open_file_in_directory(waldir, private.archive_name);
+		if (private.archive_fd < 0)
+			pg_fatal("could not open file \"%s\"", private.archive_name);
+
 		/* Verify that the archive contains valid WAL files */
 		waldir = waldir ? pg_strdup(waldir) : pg_strdup(".");
 		init_tar_archive_reader(&private, waldir, InvalidXLogRecPtr,
 								InvalidXLogRecPtr, compression);
 		verify_tar_archive(&private);
-		free_tar_archive_reader(&private);
+		astreamer_free(private.archive_streamer);
+
+		if (lseek(private.archive_fd, 0, SEEK_SET) != 0)
+			pg_log_error("could not seek in file \"%s\": %m",
+						 private.archive_name);

 		/* Set up for reading tar file */
 		init_tar_archive_reader(&private, waldir, private.startptr,

Of course, this is not really what we want to do: it avoids reopening the file, but because we can't back up the archive streamer once it's been created, we have to lseek back to the beginning of the file. But notice how silly this looks: with this patch, we free the archive reader and immediately create a new archive reader that is exactly the same in every way, except that we call astreamer_waldump_new(startptr, endptr, private) instead of astreamer_waldump_new(InvalidXLogRecPtr, InvalidXLogRecPtr, private). We could arrange to update the original archive streamer with new values of startSegNo and endSegNo after verify_tar_archive(), but that's still not quite good enough, because we might have already made some decisions about what to do with the data we read that it's too late to reverse. But what that means is that the astreamer_waldump machinery is not smart enough to read one block of data without making irreversible decisions from which we can't recover without recreating the entire object. I think we can, and should, try to do better.

It's also worth noting that the unfortunate layering doesn't just require us to re-read the first block of the file: it also complicates the code in various places. The fact that astreamer_wal_read() needs a special case for XLogRecPtrIsInvalid(recptr) is a direct result of this problem, and the READ_ANY_WAL() macro and both the places that test it are also direct results of this problem.
In other words, I'm arguing that astreamer_wal_read() is incorrectly defined, and that error creates ugliness in the code both above and below astreamer_wal_read().

While I'm on the topic of astreamer_wal_read(), here are a few other problems I noticed:

* The return value is not documented, and it seems to always be count, in which case it might as well return void. The caller already has the value they passed for count.

* It seems like it would be more appropriate to assert that endPtr >= len and just set startPtr = endPtr - len. I don't see how len > endPtr can ever happen, and I bet bad things will happen if it does.

* "pg_waldump never ask the same" -> "pg_waldump never asks for the same"

Also, this is absolutely not OK with me:

	/* Fetch more data */
	if (astreamer_archive_read(privateInfo) == 0)
	{
		char		fname[MAXFNAMELEN];
		XLogSegNo	segno;

		XLByteToSeg(targetPagePtr, segno, WalSegSz);
		XLogFileName(fname, privateInfo->timeline, segno, WalSegSz);

		pg_fatal("could not find file \"%s\" in \"%s\" archive",
				 fname, privateInfo->archive_name);
	}

astreamer_archive_read() will return 0 if we reach the end of the tar file, so this is saying that if we reach the end of the tar file without finding the range of bytes for which we're looking, the explanation must be that the relevant WAL file is missing from the archive. But that is way too much action at a distance. I was able to easily construct a counterexample by copying the first 81920 bytes of a valid WAL file and then doing this:

[robert.haas pgsql-meson]$ tar tf pg_wal.tar
000000010000000000000005
[robert.haas pgsql-meson]$ pg_waldump -s 0/050008D8 -e 0/05FFED98 pg_wal.tar >/dev/null
pg_waldump: error: could not find file "000000010000000000000005" in "pg_wal.tar" archive

Without the redirection to /dev/null, what happened was that pg_waldump printed out a bunch of records from 000000010000000000000005 and then said that 000000010000000000000005 could not be found, which is obviously silly.
But the fact that I found a specific counterexample here isn't even really the point. The point is that there's a big gap between what we actually know at this point (which is that we've read the whole input file) and what the message is claiming (which is that the reason must be that the file is missing from the archive). Even if the counterexample above didn't exist and that really were the only way for this to happen as of today, that's very fragile. Maybe some future code change will make it so that there's a second reason this could happen. How would somebody realize that they had created a second condition by means of which this code could be reached? If they did realize it, how would they get the correct error to be reported? I'm not quite sure how this should be fixed, but I strongly suspect that the error report here needs to move closer to the code that is doing the file reordering.

Aside from the possibility of the file being missing and the possibility of the file being too short, a third possibility is that targetPagePtr retreats between one call and the next. That really shouldn't happen, but there are no asserts here verifying that it doesn't.

I also don't like the fact that one call to astreamer_archive_read() checks the return value (but only whether it's zero; the specific return value apparently doesn't matter, so why doesn't it return bool?) and the other doesn't. That kind of coding pattern is very rarely correct. The code says:

	/* Continue reading from the open WAL segment, if any */
	if (state->seg.ws_file >= 0)
	{
		/*
		 * To prevent a race condition where the archive streamer is still
		 * exporting a file that we are trying to read, we invoke the streamer
		 * to ensure enough data is available.
		 */
		if (private->curSegNo == state->seg.ws_segno)
			astreamer_archive_read(private);

		return WALDumpReadPage(state, targetPagePtr, reqLen, targetPtr,
							   readBuff);
	}

But it's unclear why this should be good enough to ensure that enough data is available.
astreamer_archive_read() might read zero bytes and return 0, so this doesn't really guarantee anything at all. On the other hand, even if astreamer_archive_read() returns a non-zero value, it's only going to read READ_CHUNK_SIZE bytes from the underlying file, so if more than that needs to be read in order for us to have enough data, we won't have it. I think it's very hard to imagine a situation in which you can call astreamer_archive_read() without using some loop. That's what astreamer_wal_read() does: it calls astreamer_archive_read() until it either returns 0 -- in which case we know we've failed -- or until we have enough data. Here we just hope that calling it once is enough, and that checking for errors is unimportant.

I also don't understand the reference to a race condition, because there's only one process with one thread here, I believe, so what would it be racing against? Another thing I noticed is that astreamer_archive_read() makes reference to decrypting, but there's no cryptography involved in any of this.

--
Robert Haas
EDB: http://www.enterprisedb.com