Re: pg_waldump: support decoding of WAL inside tarfile - Mailing list pgsql-hackers

From Amul Sul
Subject Re: pg_waldump: support decoding of WAL inside tarfile
Msg-id CAAJ_b959x5VjmLJFmN78r_QohQuuj=fde11VbbAOHn5TzgEzng@mail.gmail.com
In response to Re: pg_waldump: support decoding of WAL inside tarfile  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Mon, Oct 20, 2025 at 8:05 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Oct 16, 2025 at 7:49 AM Amul Sul <sulamul@gmail.com> wrote:
>
> > So, if we put the reordering logic outside the streamer, we’d
> > sometimes be receiving buffers containing mixed data from two WAL
> > files. The caller would then need to correctly identify WAL file
> > boundaries within those buffers. This would require passing extra
> > metadata -- like segment numbers for the WAL files in the buffer, plus
> > start and end offsets of those segments within the buffer. While not
> > impossible, it feels a bit hacky and I'm unsure if that’s the best
> > approach.
>
> I agree that we need that kind of metadata, but I don't see why our
> need for it depends on where we do the reordering. That is, if we do
> the reordering above the astreamer layer, we need to keep track of the
> origin of each chunk of WAL bytes, and if we do the reordering within
> the astreamer layer, we still need to keep track of the origin of the
> WAL bytes. Doing the ordering properly requires that tracking, but it
> doesn't say anything about where that tracking has to be performed.
>
> I think it might be better if we didn't write to the astreamer's
> buffer at all. For example, suppose we create a struct that looks
> approximately like this:
>
> struct ChunkOfDecodedWAL
> {
>      XLogSegNo segno; // could also be XLogRecPtr start_lsn or
>                       // char *walfilename or whatever
>      StringInfoData buffer;
>      char *spillfilename; // or whatever we use to identify the temporary files
>      bool already_removed;
>      // potentially other metadata
> };
>
> Then, create a hash table and key it on the segno or whatever. Have the
> astreamer write to the hash table: when it gets a chunk of WAL, it
> looks up or creates the relevant hash table entry and appends the data
> to the buffer. At any convenient point in the code, you can decide to
> write the data from the buffer to a spill file, after which you
> resetStringInfo() on the buffer and populate the spill file name. When
> you've used up the data, you remove the spill file and set the
> already_removed flag.
>
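
To make the idea above concrete, here is a minimal standalone sketch of
a per-segment table. It is not the code from the patch: the names
WalSegEntry, WalSegTable, seg_lookup_or_create(), seg_append(),
seg_spill() and seg_release() are hypothetical, a fixed-size
linear-probing table stands in for a real hash table, and a plain
malloc'd buffer stands in for StringInfoData.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SEG_TABLE_SIZE 64       /* hypothetical fixed capacity */

/* Hypothetical per-segment entry; one exists for every WAL file seen. */
typedef struct WalSegEntry
{
    bool        in_use;
    uint64_t    segno;          /* hash key */
    char       *buf;            /* bytes not yet spilled or consumed */
    size_t      len;
    size_t      cap;
    uint64_t    total_bytes;    /* bytes seen for this segment so far */
    char        spill_path[256];    /* empty until data is spilled */
    bool        already_removed;
} WalSegEntry;

typedef struct WalSegTable
{
    WalSegEntry slots[SEG_TABLE_SIZE];  /* must start zeroed */
} WalSegTable;

/* Find the entry for segno, creating it if needed; NULL if table is full. */
static WalSegEntry *
seg_lookup_or_create(WalSegTable *t, uint64_t segno)
{
    for (size_t i = 0; i < SEG_TABLE_SIZE; i++)
    {
        WalSegEntry *e = &t->slots[(segno + i) % SEG_TABLE_SIZE];

        if (e->in_use && e->segno == segno)
            return e;
        if (!e->in_use)
        {
            memset(e, 0, sizeof(*e));
            e->in_use = true;
            e->segno = segno;
            return e;
        }
    }
    return NULL;
}

/* Append a chunk of WAL bytes to the entry's in-memory buffer. */
static bool
seg_append(WalSegEntry *e, const void *data, size_t n)
{
    if (n == 0)
        return true;
    if (e->len + n > e->cap)
    {
        size_t      newcap = e->cap ? e->cap : 8192;
        char       *p;

        while (newcap < e->len + n)
            newcap *= 2;
        p = realloc(e->buf, newcap);
        if (p == NULL)
            return false;
        e->buf = p;
        e->cap = newcap;
    }
    memcpy(e->buf + e->len, data, n);
    e->len += n;
    e->total_bytes += n;
    return true;
}

/* Move buffered bytes to a spill file, then empty the buffer. */
static bool
seg_spill(WalSegEntry *e, const char *path)
{
    FILE       *fp = fopen(path, "ab");

    if (fp == NULL)
        return false;
    if (e->len > 0 && fwrite(e->buf, 1, e->len, fp) != e->len)
    {
        fclose(fp);
        return false;
    }
    if (fclose(fp) != 0)
        return false;
    snprintf(e->spill_path, sizeof(e->spill_path), "%s", path);
    e->len = 0;                 /* keep the allocation, drop the contents */
    return true;
}

/* Data fully consumed: drop buffer and spill file, keep the bookkeeping. */
static void
seg_release(WalSegEntry *e)
{
    if (e->spill_path[0] != '\0')
        remove(e->spill_path);
    free(e->buf);
    e->buf = NULL;
    e->len = e->cap = 0;
    e->already_removed = true;
}
```

Entries are never deleted in this sketch, so total_bytes and
already_removed stay available for diagnostics after the data is gone.
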
> I think this could also help with the error reporting stuff. When you
> get to the end of the file, you'll know all the files you saw and how
> much data you read from each of them. So you could possibly do
> something like
>
> ERROR: LSN %08X/%08X not found in archive "%s"
> DETAIL: WAL segment %s is not present in the archive
> -or-
> DETAIL: WAL segment %s was expected to be %u bytes, but was only %u bytes
> -or-
> DETAIL: whatever else can go wrong
>
> The point is that every file you've ever seen has a hash table entry,
> and in that hash table entry you can store everything about that file
> that you need to know, whether that's the file data, the disk file
> that contains the file data, the fact that we already threw the data
> away, or any other fact that you can imagine wanting to know.
>
> Said differently, the astreamer buffer is not really a great place to
> write data. It exists because when we're just forwarding data from one
> astreamer to the next, we will often need to buffer a small amount of
> data to avoid terrible performance. However, it's only there to be
> used when we don't have something better. I don't think any astreamer
> that is intended to be the last one in the chain currently writes to
> the buffer -- they write to the output file, or whatever, because
> using an in-memory buffer as your final output destination is not a
> real good plan.
>

Makes sense. I implemented this approach in the attached version, but
with a different structure name and a slightly different error
message: the error output uses the WAL file name instead of the LSN.
This is because the LSN at that point may differ from the
user-provided one (xlogreader might have adjusted it to the start of a
WAL page). This follows the same style used in the routine that reads
the WAL file. The user-provided LSN values are used only in error
messages generated at the very beginning, specifically in the main()
function of pg_waldump.
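
As a rough illustration of reporting by WAL file name, the standalone
sketch below derives the segment file name from the timeline and
segment number using the same "%08X%08X%08X" layout as XLogFileName(),
and then picks a DETAIL line. The helper names wal_file_name() and
segment_problem_detail() are hypothetical, and the DETAIL wording is
the one proposed upthread, not necessarily what the patch prints.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Same layout as XLogFileName(): timeline, then high and low segment parts. */
static void
wal_file_name(char *dst, size_t dstlen, uint32_t tli, uint64_t segno,
              uint32_t wal_segsz_bytes)
{
    uint64_t    segs_per_xlogid;

    assert(wal_segsz_bytes > 0);
    segs_per_xlogid = UINT64_C(0x100000000) / wal_segsz_bytes;

    snprintf(dst, dstlen, "%08X%08X%08X",
             (unsigned) tli,
             (unsigned) (segno / segs_per_xlogid),
             (unsigned) (segno % segs_per_xlogid));
}

/*
 * Hypothetical helper: build a DETAIL line from what was recorded for a
 * segment. Leaves dst empty when the segment looks complete.
 */
static void
segment_problem_detail(char *dst, size_t dstlen, const char *fname,
                       bool seen, uint64_t bytes_seen,
                       uint32_t wal_segsz_bytes)
{
    if (!seen)
        snprintf(dst, dstlen,
                 "WAL segment %s is not present in the archive", fname);
    else if (bytes_seen < wal_segsz_bytes)
        snprintf(dst, dstlen,
                 "WAL segment %s was expected to be %u bytes, but was only %llu bytes",
                 fname, (unsigned) wal_segsz_bytes,
                 (unsigned long long) bytes_seen);
    else if (dstlen > 0)
        dst[0] = '\0';
}
```

With 16MB segments there are 256 segments per xlogid, so timeline 1 and
segment number 257 map to 000000010000000100000001.
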

I have also restructured the code by moving most of the tar file
reading logic out of pg_waldump.c into astreamer_waldump.c, which has
now been renamed to archive_waldump.c.

Kindly have a look at the attached version. Thank you!

Regards,
Amul

