Re: pg_waldump: support decoding of WAL inside tarfile - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: pg_waldump: support decoding of WAL inside tarfile
Msg-id: CA+TgmoayDY5b+bP1vRRN7A3xOP-=+tK13B2C1g-Xm1j4WTrT9Q@mail.gmail.com
In response to: Re: pg_waldump: support decoding of WAL inside tarfile (Amul Sul <sulamul@gmail.com>)
List: pgsql-hackers
On Mon, Sep 29, 2025 at 12:17 PM Amul Sul <sulamul@gmail.com> wrote:
> While reusing the buffered data from the first iteration is
> technically possible, that only works if the desired start LSN is at
> the absolute beginning of the archive, or later in the sequence,
> which cannot be reliably guaranteed.

I spent a bunch of time studying this code today and I think that the problem you're talking about here is evidence of a design problem with astreamer_wal_read() and some of the other code in astreamer_waldump.c. Your code calls astreamer_wal_read() when it wants to peek at the first xlog block to determine the WAL segment size, and it also calls astreamer_wal_read() when it wants to read WAL sequentially, beginning at the start LSN and continuing until it reaches the end LSN. However, these two cases have very different requirements. verify_tar_archive(), which is misleadingly named and really exists to determine the WAL segment size, just wants to read the first xlog block that physically appears in the archive. Every xlog block will have the same WAL segment size, so it does not matter which one we read. On the other hand, TarWALDumpReadPage() wants to read WAL in sequential order. In other words, one call to astreamer_wal_read() really wants to read a block without any block reordering, and the other call wants to read a block with block reordering.

To me, it looks like the problem here is that the block reordering functionality should live on top of the astreamer, not inside of it. Imagine that the astreamer just spits out the bytes in the order in which they physically appear in the archive, and then there's another component that consumes and reorders those bytes. So, you read data and push it into the astreamer until the number of bytes in the output buffer is at least XLOG_BLCKSZ, and then from there you extract the WAL segment size. Then, you call XLogReaderAllocate() and enter the main loop. The reordering logic lives inside of TarWALDumpReadPage().
Each time it gets data from the astreamer's buffer, it either returns it to the caller if it's in order or buffers it using temporary files if not.

I found it's actually quite easy to write a patch that avoids reopening the file. Here it is, on top of your v4:

diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 2c42df46d43..c4346a5e211 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -368,17 +368,8 @@ init_tar_archive_reader(XLogDumpPrivate *private, const char *waldir,
 						XLogRecPtr startptr, XLogRecPtr endptr,
 						pg_compress_algorithm compression)
 {
-	int			fd;
 	astreamer  *streamer;

-	/* Open tar archive and store its file descriptor */
-	fd = open_file_in_directory(waldir, private->archive_name);
-
-	if (fd < 0)
-		pg_fatal("could not open file \"%s\"", private->archive_name);
-
-	private->archive_fd = fd;
-
 	/*
 	 * Create an appropriate chain of archive streamers for reading the given
 	 * tar archive.
@@ -1416,12 +1407,22 @@ main(int argc, char **argv)
 	/* we have everything we need, start reading */
 	if (is_tar)
 	{
+		/* Open tar archive and store its file descriptor */
+		private.archive_fd =
+			open_file_in_directory(waldir, private.archive_name);
+		if (private.archive_fd < 0)
+			pg_fatal("could not open file \"%s\"", private.archive_name);
+
 		/* Verify that the archive contains valid WAL files */
 		waldir = waldir ? pg_strdup(waldir) : pg_strdup(".");
 		init_tar_archive_reader(&private, waldir, InvalidXLogRecPtr,
 								InvalidXLogRecPtr, compression);
 		verify_tar_archive(&private);
-		free_tar_archive_reader(&private);
+		astreamer_free(private.archive_streamer);
+
+		if (lseek(private.archive_fd, 0, SEEK_SET) != 0)
+			pg_log_error("could not seek in file \"%s\": %m",
+						 private.archive_name);

 		/* Set up for reading tar file */
 		init_tar_archive_reader(&private, waldir, private.startptr,

Of course, this is not really what we want to do: it avoids reopening the file, but because we can't back up the archive streamer once it's been created, we have to lseek back to the beginning of the file. But notice how silly this looks: with this patch, we free the archive reader and immediately create a new archive reader that is exactly the same in every way, except that we call astreamer_waldump_new(startptr, endptr, private) instead of astreamer_waldump_new(InvalidXLogRecPtr, InvalidXLogRecPtr, private). We could arrange to update the original archive streamer with new values of startSegNo and endSegNo after verify_tar_archive(), but that's still not quite good enough, because we might have already made some decisions about what to do with the data we read that it's too late to reverse. But what that means is that the astreamer_waldump machinery is not smart enough to read one block of data without making irreversible decisions from which we can't recover without recreating the entire object. I think we can, and should, try to do better.

It's also worth noting that the unfortunate layering doesn't just require us to re-read the first block of the file: it also complicates the code in various places. The fact that astreamer_wal_read() needs a special case for XLogRecPtrIsInvalid(recptr) is a direct result of this problem, and the READ_ANY_WAL() macro and both the places that test it are also direct results of this problem.
In other words, I'm arguing that astreamer_wal_read() is incorrectly defined, and that error creates ugliness in the code both above and below astreamer_wal_read().

While I'm on the topic of astreamer_wal_read(), here are a few other problems I noticed:

* The return value is not documented, and it seems to always be count, in which case it might as well return void. The caller already has the value they passed for count.

* It seems like it would be more appropriate to assert that endPtr >= len and just set startPtr = endPtr - len. I don't see how len > endPtr can ever happen, and I bet bad things will happen if it does.

* "pg_waldump never ask the same" -> "pg_waldump never asks for the same"

Also, this is absolutely not OK with me:

	/* Fetch more data */
	if (astreamer_archive_read(privateInfo) == 0)
	{
		char		fname[MAXFNAMELEN];
		XLogSegNo	segno;

		XLByteToSeg(targetPagePtr, segno, WalSegSz);
		XLogFileName(fname, privateInfo->timeline, segno, WalSegSz);

		pg_fatal("could not find file \"%s\" in \"%s\" archive",
				 fname, privateInfo->archive_name);
	}

astreamer_archive_read() will return 0 if we reach the end of the tar file, so this is saying that if we reach the end of the tar file without finding the range of bytes for which we're looking, the explanation must be that the relevant WAL file is missing from the archive. But that is way too much action at a distance. I was able to easily construct a counterexample by copying the first 81920 bytes of a valid WAL file and then doing this:

[robert.haas pgsql-meson]$ tar tf pg_wal.tar
000000010000000000000005
[robert.haas pgsql-meson]$ pg_waldump -s 0/050008D8 -e 0/05FFED98 pg_wal.tar >/dev/null
pg_waldump: error: could not find file "000000010000000000000005" in "pg_wal.tar" archive

Without the redirection to /dev/null, what happened was that pg_waldump printed out a bunch of records from 000000010000000000000005 and then said that 000000010000000000000005 could not be found, which is obviously silly.
But the fact that I found a specific counterexample here isn't even really the point. The point is that there's a big gap between what we actually know at this point (which is that we've read the whole input file) and what the message is claiming (which is that the reason must be that the file is missing from the archive). Even if the counterexample above didn't exist and that really were the only way for this to happen as of today, that's very fragile. Maybe some future code change will make it so that there's a second reason this could happen. How would somebody realize that they had created a second condition by means of which this code could be reached? If they did realize it, how would they get the correct error to be reported? I'm not quite sure how this should be fixed, but I strongly suspect that the error report here needs to move closer to the code that is doing the file reordering.

Aside from the possibility of the file being missing and the possibility of the file being too short, a third possibility is that targetPagePtr retreats between one call and the next. That really shouldn't happen, but there are no asserts here verifying that it doesn't.

I also don't like the fact that one call to astreamer_archive_read() checks the return value (but only whether it's zero; the specific return value apparently doesn't matter, so why doesn't it return bool?) and the other doesn't. That kind of coding pattern is very rarely correct. The code says:

	/* Continue reading from the open WAL segment, if any */
	if (state->seg.ws_file >= 0)
	{
		/*
		 * To prevent a race condition where the archive streamer is still
		 * exporting a file that we are trying to read, we invoke the streamer
		 * to ensure enough data is available.
		 */
		if (private->curSegNo == state->seg.ws_segno)
			astreamer_archive_read(private);

		return WALDumpReadPage(state, targetPagePtr, reqLen, targetPtr,
							   readBuff);
	}

But it's unclear why this should be good enough to ensure that enough data is available.
astreamer_archive_read() might read zero bytes and return 0, so this doesn't really guarantee anything at all. On the other hand, even if astreamer_archive_read() returns a non-zero value, it's only going to read READ_CHUNK_SIZE bytes from the underlying file, so if more than that needs to be read in order for us to have enough data, we won't have it. I think it's very hard to imagine a situation in which you can call astreamer_archive_read() without using some loop. That's what astreamer_wal_read() does: it calls astreamer_archive_read() until it either returns 0 -- in which case we know we've failed -- or until we have enough data. Here we just hope that calling it once is enough, and that checking for errors is unimportant.

I also don't understand the reference to a race condition, because there's only one process with one thread here, I believe, so what would it be racing against? Another thing I noticed is that astreamer_archive_read() makes reference to decrypting, but there's no cryptography involved in any of this.

--
Robert Haas
EDB: http://www.enterprisedb.com