Thread: pg_verifybackup: TAR format backup verification
Hi,

Currently, pg_verifybackup works only with plain (directory) format backups. This proposal aims to support tar-format backups as well. We will read the tar files from start to finish and verify each file inside against the backup_manifest information, similar to how plain files are verified.

We are introducing new options to pg_verifybackup: -F, --format=p|t and -Z, --compress=METHOD, which allow users to specify the backup format and compression type, similar to the options available in pg_basebackup. If these options are not provided, the backup format and compression type are detected automatically. To determine the format, we look for a PG_VERSION file in the backup directory: if found, it indicates a plain backup; otherwise, it is a tar-format backup. For the compression type, we check the extension of the base.tar.xxx file of the tar-format backup. Refer to patch 0008 for the details.

The main challenge is structuring the code neatly. For plain-format backups, we verify bytes directly from the files. For tar-format backups, we read bytes from the tar file for the specific file we care about. We need an abstraction that handles both formats smoothly, without many if statements or special cases. To achieve this, we reuse existing infrastructure rather than duplicating code, so the major work involved here is refactoring. Here is a breakdown of the work:

1. BBSTREAMER rename and relocation: BBSTREAMER, currently used in pg_basebackup for reading and decompressing tar files, can also be used by pg_verifybackup. In the future, it could support other tools such as pg_combinebackup, for merging tar backups without extraction, and pg_waldump, for verifying WAL files from a tar backup. To make it accessible to all of them, BBSTREAMER needs to be relocated to a shared directory. Moreover, renaming BBSTREAMER to ASTREAMER (short for "archive streamer") better indicates its general application across multiple tools. Moving it to the src/fe_utils directory is appropriate, given that it is frontend infrastructure.

2. pg_verifybackup code refactoring: The existing code for plain backup verification is split into separate files and functions so that it can also be reused for tar backup verification.

3. Adding tar backup verification: Finally, patches implement tar backup verification, along with tests and documentation.

Patches 0001-0003 focus on renaming and relocating BBSTREAMER, patches 0004-0007 on splitting the existing verification code, and patches 0008-0010 on adding tar backup verification capabilities, tests, and documentation. The last set could be a single patch but is split to make review easier.

Please take a look at the attached patches and share your comments, suggestions, or any ways to enhance them. Your feedback is greatly appreciated.

Thank you!

--
Regards,
Amul Sul
EDB: http://www.enterprisedb.com
Attachment
- v1-0006-Refactor-split-verify_file_checksum-function.patch
- v1-0008-pg_verifybackup-Add-backup-format-and-compression.patch
- v1-0010-pg_verifybackup-Tests-and-document.patch
- v1-0007-Refactor-split-verify_control_file.patch
- v1-0009-pg_verifybackup-Read-tar-files-and-verify-its-con.patch
- v1-0005-Refactor-split-verify_backup_file-function.patch
- v1-0004-Refactor-move-some-part-of-pg_verifybackup.c-to-p.patch
- v1-0002-Refactor-Add-astreamer_inject.h-and-move-related-.patch
- v1-0001-Refactor-Rename-all-bbstreamer-references-to-astr.patch
- v1-0003-Refactor-move-astreamer-files-to-fe_utils-to-make.patch
Hi Amul,
thanks for working on this.
+ file_name_len = strlen(relpath);
+ if (file_name_len < file_extn_len ||
+     strcmp(relpath + file_name_len - file_extn_len, file_extn) != 0)
+ {
+     if (compress_algorithm == PG_COMPRESSION_NONE)
+         report_backup_error(context,
+                             "\"%s\" is not a valid file, expecting tar file",
+                             relpath);
+     else
+         report_backup_error(context,
+                             "\"%s\" is not a valid file, expecting \"%s\" compressed tar file",
+                             relpath,
+                             get_compress_algorithm_name(compress_algorithm));
+     return;
+ }
I believe pg_verifybackup needs to exit after reporting a failure here since it could not figure out a streamer to allocate.
Also, v1-0002 removes #include "pqexpbuffer.h" from astreamer.h and adds it to the new .h file, and then v1-0004 reverts that change. So this churn can be avoided altogether.
On Tue, Jul 9, 2024 at 3:24 PM Amul Sul <sulamul@gmail.com> wrote:
Thanks & Regards,
Sravan Velagandula
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jul 22, 2024 at 8:29 AM Sravan Kumar <sravanvcybage@gmail.com> wrote:
>
> Hi Amul,
> thanks for working on this.
>

Thanks for your review.

>> + file_name_len = strlen(relpath);
>> + if (file_name_len < file_extn_len ||
>> +     strcmp(relpath + file_name_len - file_extn_len, file_extn) != 0)
>> + {
>> +     if (compress_algorithm == PG_COMPRESSION_NONE)
>> +         report_backup_error(context,
>> +                             "\"%s\" is not a valid file, expecting tar file",
>> +                             relpath);
>> +     else
>> +         report_backup_error(context,
>> +                             "\"%s\" is not a valid file, expecting \"%s\" compressed tar file",
>> +                             relpath,
>> +                             get_compress_algorithm_name(compress_algorithm));
>> +     return;
>> + }
>
> I believe pg_verifybackup needs to exit after reporting a failure here since it could not figure out a streamer to allocate.
>

The intention here is to continue verifying the remaining tar files instead of exiting immediately in case of an error. If the user prefers an immediate exit, they can use the --exit-on-error option of pg_verifybackup.

> Also, v1-0002 removes #include "pqexpbuffer.h" from astreamer.h and adds it to the new .h file and in v1-0004 it
> reverts the change. So this can be avoided altogether.
>

Fix in the attached version.

Regards,
Amul
Attachment
- v2-0007-Refactor-split-verify_control_file.patch
- v2-0008-pg_verifybackup-Add-backup-format-and-compression.patch
- v2-0006-Refactor-split-verify_file_checksum-function.patch
- v2-0010-pg_verifybackup-Tests-and-document.patch
- v2-0009-pg_verifybackup-Read-tar-files-and-verify-its-con.patch
- v2-0003-Refactor-move-astreamer-files-to-fe_utils-to-make.patch
- v2-0004-Refactor-move-some-part-of-pg_verifybackup.c-to-p.patch
- v2-0002-Refactor-Add-astreamer_inject.h-and-move-related-.patch
- v2-0005-Refactor-split-verify_backup_file-function.patch
- v2-0001-Refactor-Rename-all-bbstreamer-references-to-astr.patch
On Mon, Jul 22, 2024 at 7:53 AM Amul Sul <sulamul@gmail.com> wrote:
> Fix in the attached version.

First of all, in the interest of full disclosure, I suggested this project to Amul, so I'm +1 on the concept. I think making more of our backup-related tools work with tar and compressed tar formats -- and perhaps eventually data not stored locally -- will make them a lot more usable. If, for example, you take a full backup and an incremental backup, each in tar format, store them in the cloud someplace, and then want to verify and afterwards restore the incremental backup, you would need to download the tar files from the cloud, then extract all the tar files, then run pg_verifybackup and pg_combinebackup over the results. With this patch set, and similar work for pg_combinebackup, you could skip the step where you need to extract the tar files, saving significant amounts of time and disk space. If the tools also had the ability to access data remotely, you could save even more, but that's a much harder project, so it makes sense to me to start with this.

Second, I think this patch set is quite well-organized and easy to read. That's not to say there is nothing in these patches to which someone might object, but it seems to me that it should at least be simple for anyone who wants to review to find the things to which they object in the patch set without having to spend too much time on it, which is fantastic.

Third, I think the general approach that these patches take to the problem - namely, renaming bbstreamer to astreamer and moving it somewhere that permits it to be reused - makes a lot of sense. To be honest, I don't think I had it in mind that bbstreamer would be a reusable component when I wrote it, or if I did have it in mind, it was off in some dusty corner of my mind that doesn't get visited very often. I was imagining that you would need to build new infrastructure to deal with reading the tar file, but I think what you've done here is better. Reusing the bbstreamer stuff gives you tar file parsing, and decompression if necessary, basically for free, and IMHO the result looks rather elegant.

However, I'm not very convinced by 0003. The handling between the meson and make-based build systems doesn't seem consistent. On the meson side, you just add the objects to the same list that contains all of the other files (but not in alphabetical order, which should be fixed). But on the make side, you for some reason invent a separate AOBJS list instead of just adding the files to OBJS. I don't think it should be necessary to treat these objects any differently from any other objects, so they should be able to just go in OBJS; but if it were necessary, then I feel like the meson side would need something similar.

Also, I'm not so sure about this change to src/fe_utils/meson.build:

- dependencies: frontend_common_code,
+ dependencies: [frontend_common_code, lz4, zlib, zstd],

frontend_common_code already includes dependencies on zlib and zstd, so we probably don't need to add those again here. I checked the result of otool -L src/bin/pg_controldata/pg_controldata from the meson build directory, and I find that currently it links against libz and libzstd but not liblz4. However, if I either make this line say dependencies: [frontend_common_code, lz4] or if I just update frontend_common_code to include lz4, then it starts linking against liblz4 as well. I'm not entirely sure if there's any reason to do one or the other of those things, but I think I'd be inclined to make frontend_common_code just include lz4 since it already includes zlib and zstd anyway, and then you don't need this change.

Alternatively, we could take the position that we need to avoid having random front-end tools that don't do anything with compression at all, like pg_controldata for example, link with compression libraries at all. But then we'd need to rethink a bunch of things that have not much to do with this patch.

Regarding 0004, I would rather not move show_progress and skip_checksums to the new header file. I suppose I was a bit lazy in making these file-level global variables instead of passing them down using function arguments and/or a context object, but at least right now they're only global within a single file. Can we think of inserting a preparatory patch that moves these into verifier_context?

Regarding 0005, the comment /* Check whether there's an entry in the manifest hash. */ should move inside verify_manifest_entry, where manifest_files_lookup is called. The call to the new function verify_manifest_entry() needs its own, proper comment. Also, I think there's a null-pointer dereference hazard here, because verify_manifest_entry() can return NULL but the "Validate the manifest system identifier" chunk assumes it isn't. I think you could hit this - and presumably seg fault - if pg_control is on disk but not in the manifest. Seems like just adding an m != NULL test is easiest, but see also below comments about 0006.

Regarding 0006, suppose that the member file within the tar archive is longer than expected. With the unpatched code, we'll feed all of the data to the checksum context, but then, after the read-loop terminates, we'll complain about the file being the wrong length. With the patched code, we'll complain about the checksum mismatch before returning from verify_content_checksum(). I think that's an unintended behavior change, and I think the new behavior is worse than the old behavior. But also, I think that in the case of a tar file, the desired behavior is quite different. In that case, we know the length of the file from the member header, so we can check whether the length is as expected before we read any of the data bytes. If we discover that the size is wrong, we can complain about that and need not feed the checksum bytes to the checksum context at all -- we can just skip them, which will be faster. That approach doesn't really make sense for a file, because even if we were to stat() the file before we started reading it, the length could theoretically change as we are reading it, if someone is concurrently modifying it, but for a tar file I think it does.

I would suggest abandoning this refactoring. There's very little logic in verify_file_checksum() that you can actually reuse. I think you should just duplicate the code. If you really want, you could arrange to reuse the error-reporting code that checks for checksumlen != m->checksum_length and memcmp(checksumbuf, m->checksum_payload, checksumlen) != 0, but even that I think is little enough that it's fine to just duplicate it. The rest is either (1) OS calls like open(), read(), etc. which won't be applicable to the read-from-archive case or (2) calls to pg_checksum_WHATEVER, which are fine to just duplicate, IMHO.

My eyes are starting to glaze over a bit here, so expect comments below this point to be only a partial review of the corresponding patch.

Regarding 0007, I think that the should_verify_sysid terminology is problematic. I made all the code and identifier names talk only about the control file, not the specific thing in the control file that we are going to verify, in case in the future we want to verify additional things. This breaks that abstraction.

Regarding 0009, I feel like astreamer_verify_content() might want to grow some subroutines. One idea could be to move the ASTREAMER_MEMBER_HEADER case and likewise the ASTREAMER_MEMBER_CONTENTS case into a new function for each; another idea could be to move smaller chunks of logic, e.g. under the ASTREAMER_MEMBER_CONTENTS case, the verify_checksums stuff could be one subroutine and the ill-named verify_sysid stuff could be another. I'm not certain exactly what's best here, but some of this code is as deeply as six levels nested, which is not such a terrible thing that nobody should ever do it, but it is bad enough that we should at least look around for a better way.

--
Robert Haas
EDB: http://www.enterprisedb.com
On Tue, Jul 30, 2024 at 9:04 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jul 22, 2024 at 7:53 AM Amul Sul <sulamul@gmail.com> wrote:
> > Fix in the attached version.
>
> First of all, in the interest of full disclosure, I suggested this project to Amul, so I'm +1 on the concept. I think making more of our backup-related tools work with tar and compressed tar formats -- and perhaps eventually data not stored locally -- will make them a lot more usable. [...]
>
> Second, I think this patch set is quite well-organized and easy to read. [...]
>
> Third, I think the general approach that these patches take to the problem - namely, renaming bbstreamer to astreamer and moving it somewhere that permits it to be reused - makes a lot of sense. [...] Reusing the bbstreamer stuff gives you tar file parsing, and decompression if necessary, basically for free, and IMHO the result looks rather elegant.
>

Thank you so much for the summary and the review.

> However, I'm not very convinced by 0003. The handling between the meson and make-based build systems doesn't seem consistent. On the meson side, you just add the objects to the same list that contains all of the other files (but not in alphabetical order, which should be fixed). But on the make side, you for some reason invent a separate AOBJS list instead of just adding the files to OBJS. I don't think it should be necessary to treat these objects any differently from any other objects, so they should be able to just go in OBJS: but if it were necessary, then I feel like the meson side would need something similar.
>

Fixed -- I did that because it was part of a separate group in pg_basebackup.

> Also, I'm not so sure about this change to src/fe_utils/meson.build:
>
> - dependencies: frontend_common_code,
> + dependencies: [frontend_common_code, lz4, zlib, zstd],
>
> frontend_common_code already includes dependencies on zlib and zstd, so we probably don't need to add those again here. I checked the result of otool -L src/bin/pg_controldata/pg_controldata from the meson build directory, and I find that currently it links against libz and libzstd but not liblz4. However, if I either make this line say dependencies: [frontend_common_code, lz4] or if I just update frontend_common_code to include lz4, then it starts linking against liblz4 as well. I'm not entirely sure if there's any reason to do one or the other of those things, but I think I'd be inclined to make frontend_common_code just include lz4 since it already includes zlib and zstd anyway, and then you don't need this change.
>

Fixed -- frontend_common_code now includes lz4 as well.

> Alternatively, we could take the position that we need to avoid having random front-end tools that don't do anything with compression at all, like pg_controldata for example, to link with compression libraries at all. But then we'd need to rethink a bunch of things that have not much to do with this patch.
>

Noted. I might give it a try another day, unless someone else beats me to it, perhaps in a separate thread.

> Regarding 0004, I would rather not move show_progress and skip_checksums to the new header file. I suppose I was a bit lazy in making these file-level global variables instead of passing them down using function arguments and/or a context object, but at least right now they're only global within a single file. Can we think of inserting a preparatory patch that moves these into verifier_context?
>

Done -- added a new patch as 0004, and the subsequent patch numbers have been incremented accordingly.

> Regarding 0005, the comment /* Check whether there's an entry in the manifest hash. */ should move inside verify_manifest_entry, where manifest_files_lookup is called. The call to the new function verify_manifest_entry() needs its own, proper comment. Also, I think there's a null-pointer dereference hazard here, because verify_manifest_entry() can return NULL but the "Validate the manifest system identifier" chunk assumes it isn't. I think you could hit this - and presumably seg fault - if pg_control is on disk but not in the manifest. Seems like just adding an m != NULL test is easiest, but see also below comments about 0006.
>

Fixed -- I did the NULL check in the earlier 0007 patch, but it should have been done in this patch.

> Regarding 0006, suppose that the member file within the tar archive is longer than expected. With the unpatched code, we'll feed all of the data to the checksum context, but then, after the read-loop terminates, we'll complain about the file being the wrong length. With the patched code, we'll complain about the checksum mismatch before returning from verify_content_checksum(). I think that's an unintended behavior change, and I think the new behavior is worse than the old behavior. But also, I think that in the case of a tar file, the desired behavior is quite different. In that case, we know the length of the file from the member header, so we can check whether the length is as expected before we read any of the data bytes. If we discover that the size is wrong, we can complain about that and need not feed the checksum bytes to the checksum context at all -- we can just skip them, which will be faster. That approach doesn't really make sense for a file, because even if we were to stat() the file before we started reading it, the length could theoretically change as we are reading it, if someone is concurrently modifying it, but for a tar file I think it does.
>

In the case of a file size mismatch, we never reach the point where the checksum calculation is performed, because verify_manifest_entry() encounters an error and sets manifest_file->bad to true, which causes skip_checksum to be set to false. For that reason, I didn't include the size check again in the checksum calculation part. This behavior is the same for plain backups, but the additional file size check was added as a precaution (per the comment in verify_file_checksum()), possibly for the same reasons you mentioned.

I agree, changing the order of errors could create confusion. Previously, a file size mismatch was a clear and appropriate error that was reported before the checksum failure error. However, this can be fixed by delaying the checksum calculation until the expected file content size has been received: specifically, return from verify_content_checksum() if (*computed_len != m->size). If the file size is incorrect, the checksum calculation won't be performed, and the caller's file-reading loop (I mean in verify_file_checksum()) will exit at some point and later encounter the size mismatch error.

> I would suggest abandoning this refactoring. There's very little logic in verify_file_checksum() that you can actually reuse. I think you should just duplicate the code. If you really want, you could arrange to reuse the error-reporting code that checks for checksumlen != m->checksum_length and memcmp(checksumbuf, m->checksum_payload, checksumlen) != 0, but even that I think is little enough that it's fine to just duplicate it. The rest is either (1) OS calls like open(), read(), etc. which won't be applicable to the read-from-archive case or (2) calls to pg_checksum_WHATEVER, which are fine to just duplicate, IMHO.
>

I kept the refactoring as it is, fixing verify_content_checksum() as described in the previous paragraph. Please let me know if this fix and the explanation make sense to you. I'm okay with abandoning this refactoring patch if you think that's better.

> My eyes are starting to glaze over a bit here so expect comments below this point to be only a partial review of the corresponding patch.
>
> Regarding 0007, I think that the should_verify_sysid terminology is problematic. I made all the code and identifier names talk only about the control file, not the specific thing in the control file that we are going to verify, in case in the future we want to verify additional things. This breaks that abstraction.
>

Agreed, changed to should_verify_control_data.

> Regarding 0009, I feel like astreamer_verify_content() might want to grow some subroutines. One idea could be to move the ASTREAMER_MEMBER_HEADER case and likewise ASTREAMER_MEMBER_CONTENTS cases into a new function for each; another idea could be to move smaller chunks of logic, e.g. under the ASTREAMER_MEMBER_CONTENTS case, the verify_checksums could be one subroutine and the ill-named verify_sysid stuff could be another. I'm not certain exactly what's best here, but some of this code is as deeply as six levels nested, which is not such a terrible thing that nobody should ever do it, but it is bad enough that we should at least look around for a better way.
>

Okay, I added the verify_checksums() and verify_controldata() functions to the astreamer_verify.c file. I also renamed related variables that clashed with these function names: verify_checksums has been renamed to verifyChecksums, and verify_sysid to verifyControlData.

Thanks again for the review comments. Please have a look at the attached version.

Regards,
Amul
Attachment
- v3-0011-pg_verifybackup-Tests-and-document.patch
- v3-0007-Refactor-split-verify_file_checksum-function.patch
- v3-0009-pg_verifybackup-Add-backup-format-and-compression.patch
- v3-0008-Refactor-split-verify_control_file.patch
- v3-0010-pg_verifybackup-Read-tar-files-and-verify-its-con.patch
- v3-0006-Refactor-split-verify_backup_file-function.patch
- v3-0005-Refactor-move-some-part-of-pg_verifybackup.c-to-p.patch
- v3-0004-Refactor-move-show_progress-and-skip_checksums-to.patch
- v3-0003-Refactor-move-astreamer-files-to-fe_utils-to-make.patch
- v3-0002-Refactor-Add-astreamer_inject.h-and-move-related-.patch
- v3-0001-Refactor-Rename-all-bbstreamer-references-to-astr.patch
On Wed, Jul 31, 2024 at 9:28 AM Amul Sul <sulamul@gmail.com> wrote: > Fixed -- I did that because it was part of a separate group in pg_basebackup. Well, that's because pg_basebackup builds multiple executables, and these files needed to be linked with some but not others. It looks like when Andres added meson support, instead of linking each object file into the binaries that need it, he had it just build a static library and link every executable to that. That puts the linker in charge of sorting out which binaries need which files, instead of having the makefile do it. In any case, this consideration doesn't apply when we're putting the object files into a library, so there was no need to preserve the separate makefile variable. I think this looks good now. > Fixed -- frontend_common_code now includes lz4 as well. Cool. 0003 overall looks good to me now, unless Andres wants to object. > Noted. I might give it a try another day, unless someone else beats > me, perhaps in a separate thread. Probably not too important, since nobody has complained. > Done -- added a new patch as 0004, and the subsequent patch numbers > have been incremented accordingly. I think I would have made this pass context->show_progress to progress_report() instead of the whole verifier_context, but that's an arguable stylistic choice, so I'll defer to you if you prefer it the way you have it. Other than that, this LGTM. However, what is now 0005 does something rather evil. The commit message claims that it's just rearranging code, and that's almost entirely true, except that it also changes manifest_file's pathname member to be char * instead of const char *. I do not think that is a good idea, and I definitely do not think you should do it in a patch that purports to just be doing code movement, and I even more definitely think that you should not do it without even mentioning that you did it, and why you did it. 
> Fixed -- I did the NULL check in the earlier 0007 patch, but it should > have been done in this patch. This is now 0006. struct stat's st_size is of type off_t -- or maybe ssize_t on some platforms? - not type size_t. I suggest making the filesize argument use int64 as we do in some other places. size_t is, I believe, defined to be the right width to hold the size of an object in memory, not the size of a file on disk, so it isn't really relevant here. Other than that, my only comment on this patch is that I think I would find it more natural to write the check in verify_backup_file() in a different order: I'd put context->manifest->version != 1 && m != NULL && m->matched && !m->bad && strcmp() because (a) that way the most expensive test is last and (b) it feels weird to think about whether we have the right pathname if we don't even have a valid manifest entry. But this is minor and just a stylistic preference, so it's also OK as you have it if you prefer. > I agree, changing the order of errors could create confusion. > Previously, a file size mismatch was a clear and appropriate error > that was reported before the checksum failure error. In my opinion, this patch (currently 0007) creates a rather confusing situation that I can't easily reason about. Post-patch, verify_content_checksum() is a mix of two different things: it ends up containing all of the logic that needs to be performed on every chunk of bytes read from the file plus some but not all of the end-of-file error-checks from verify_file_checksum(). That's really weird. I'm not very convinced that the test for whether we've reached the end of the file is 100% correct, but even if it is, the stuff before that point is stuff that is supposed to happen many times and the stuff after that is only supposed to happen once, and I don't see any good reason to smush those two categories of things into a single function. 
Plus, changing the order in which those end-of-file checks happen doesn't seem like the right idea either: the current ordering is good the way it is. Maybe you want to think of refactoring to create TWO new functions, one to do the per-hunk work and a second to do the end-of-file "is the checksum OK?" stuff, or maybe you can just open code it, but I'm not willing to commit this the way it is. Regarding 0008, I don't really see a reason why the m != NULL shouldn't also move inside should_verify_control_data(). Yeah, the caller added in 0010 might not need the check, but it won't really cost anything. Also, it seems to me that the logic in 0010 is actually wrong. If m == NULL, we'll keep the values of verifyChecksums and verifyControlData from the previous iteration, whereas what we should do is make them both false. How about removing the if m == NULL guard here and making both should_verify_checksum() and should_verify_control_data() test m != NULL internally? Then it all works out nicely, I think. Or alternatively you need an else clause that resets both values to false when m == NULL. > Okay, I added the verify_checksums() and verify_controldata() > functions to the astreamer_verify.c file. I also updated related > variables that were clashing with these function names: > verify_checksums has been renamed to verifyChecksums, and verify_sysid > has been renamed to verifyControlData. Maybe think of doing something with the ASTREAMER_MEMBER_HEADER case also. Out of time for today, will look again soon. I think the first few of these are probably pretty much ready for commit already, and with a little more adjustment they'll probably be ready up through about 0006. -- Robert Haas EDB: http://www.enterprisedb.com
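The two-function split Robert suggests -- one function for the per-hunk work, a second for the end-of-file checks -- might look roughly like the sketch below. This is only an illustration: the toy additive checksum, struct fields, and the names checksum_update/checksum_finish are stand-ins for PostgreSQL's real pg_checksum_context machinery and manifest_file struct. It preserves the existing error order (size mismatch reported before checksum mismatch).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy additive checksum standing in for CRC32C/SHA via pg_checksum_context. */
typedef struct
{
	uint32_t	sum;
} checksum_ctx;

/* Simplified stand-in for the relevant manifest_file fields. */
typedef struct
{
	int64_t		size;		/* expected file size from the manifest */
	uint32_t	checksum;	/* expected checksum from the manifest */
} manifest_entry;

/* Per-hunk work: run once for every chunk read from the file. */
static void
checksum_update(checksum_ctx *ctx, const uint8_t *buf, size_t len,
				int64_t *done_size)
{
	for (size_t i = 0; i < len; i++)
		ctx->sum += buf[i];
	*done_size += (int64_t) len;	/* progress accounting per chunk */
}

/*
 * End-of-file work: run exactly once, after all chunks are consumed.
 * Returns NULL on success, else an error string; the size check comes
 * first, matching the existing order of error reports.
 */
static const char *
checksum_finish(const checksum_ctx *ctx, const manifest_entry *m,
				int64_t bytes_read)
{
	if (bytes_read != m->size)
		return "file size mismatch";
	if (ctx->sum != m->checksum)
		return "checksum mismatch";
	return NULL;
}
```

With this shape, plain-format verification calls checksum_update() in its read loop and the tar streamer calls it per member chunk, while checksum_finish() runs once per file in both cases.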
Hi, On 2024-07-31 16:07:03 -0400, Robert Haas wrote: > On Wed, Jul 31, 2024 at 9:28 AM Amul Sul <sulamul@gmail.com> wrote: > > Fixed -- I did that because it was part of a separate group in pg_basebackup. > > Well, that's because pg_basebackup builds multiple executables, and > these files needed to be linked with some but not others. It looks > like when Andres added meson support, instead of linking each object > file into the binaries that need it, he had it just build a static > library and link every executable to that. That puts the linker in > charge of sorting out which binaries need which files, instead of > having the makefile do it. Right. Meson supports using the same file with different compilation flags, depending on the context its used (i.e. as part of an executable or a shared library). But that also ends up compiling files multiple times when using the same file in multiple binaries. Which wasn't desirable here -> hence moving it to a static lib. > > Fixed -- frontend_common_code now includes lz4 as well. > > Cool. 0003 overall looks good to me now, unless Andres wants to object. Nope. Greetings, Andres Freund
On Thu, Aug 1, 2024 at 1:37 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Jul 31, 2024 at 9:28 AM Amul Sul <sulamul@gmail.com> wrote: > > Fixed -- I did that because it was part of a separate group in pg_basebackup. > > Well, that's because pg_basebackup builds multiple executables, and > these files needed to be linked with some but not others. It looks > like when Andres added meson support, instead of linking each object > file into the binaries that need it, he had it just build a static > library and link every executable to that. That puts the linker in > charge of sorting out which binaries need which files, instead of > having the makefile do it. In any case, this consideration doesn't > apply when we're putting the object files into a library, so there was > no need to preserve the separate makefile variable. I think this looks > good now. > Understood. > > Fixed -- frontend_common_code now includes lz4 as well. > > Cool. 0003 overall looks good to me now, unless Andres wants to object. > > > Noted. I might give it a try another day, unless someone else beats > > me, perhaps in a separate thread. > > Probably not too important, since nobody has complained. > > > Done -- added a new patch as 0004, and the subsequent patch numbers > > have been incremented accordingly. > > I think I would have made this pass context->show_progress to > progress_report() instead of the whole verifier_context, but that's an > arguable stylistic choice, so I'll defer to you if you prefer it the > way you have it. Other than that, this LGTM. > Additionally, I moved total_size and done_size to verifier_context because done_size needs to be accessed in astreamer_verify.c. With this change, verifier_context is now more suitable. > However, what is now 0005 does something rather evil. 
The commit > message claims that it's just rearranging code, and that's almost > entirely true, except that it also changes manifest_file's pathname > member to be char * instead of const char *. I do not think that is a > good idea, and I definitely do not think you should do it in a patch > that purports to just be doing code movement, and I even more > definitely think that you should not do it without even mentioning > that you did it, and why you did it. > True, that was a mistake on my part during the rebase. Fixed in the attached version. > > Fixed -- I did the NULL check in the earlier 0007 patch, but it should > > have been done in this patch. > > This is now 0006. struct stat's st_size is of type off_t -- or maybe > ssize_t on some platforms? - not type size_t. I suggest making the > filesize argument use int64 as we do in some other places. size_t is, > I believe, defined to be the right width to hold the size of an object > in memory, not the size of a file on disk, so it isn't really relevant > here. > Ok, used int64. > Other than that, my only comment on this patch is that I think I would > find it more natural to write the check in verify_backup_file() in a > different order: I'd put context->manifest->version != 1 && m != NULL > && m->matched && !m->bad && strcmp() because (a) that way the most > expensive test is last and (b) it feels weird to think about whether > we have the right pathname if we don't even have a valid manifest > entry. But this is minor and just a stylistic preference, so it's also > OK as you have it if you prefer. > I used to do it that way (a) -- keeping the expensive check for last. I did the same thing while adding should_verify_control_data() in the later patch. Somehow, I missed it here, maybe I didn't pay enough attention to this patch :( > > I agree, changing the order of errors could create confusion. 
> > Previously, a file size mismatch was a clear and appropriate error > > that was reported before the checksum failure error. > > In my opinion, this patch (currently 0007) creates a rather confusing > situation that I can't easily reason about. Post-patch, > verify_content_checksum() is a mix of two different things: it ends up > containing all of the logic that needs to be performed on every chunk > of bytes read from the file plus some but not all of the end-of-file > error-checks from verify_file_checksum(). That's really weird. I'm not > very convinced that the test for whether we've reached the end of the > file is 100% correct, but even if it is, the stuff before that point > is stuff that is supposed to happen many times and the stuff after > that is only supposed to happen once, and I don't see any good reason > to smush those two categories of things into a single function. Plus, > changing the order in which those end-of-file checks happen doesn't > seem like the right idea either: the current ordering is good the way > it is. Maybe you want to think of refactoring to create TWO new > functions, one to do the per-hunk work and a second to do the > end-of-file "is the checksum OK?" stuff, or maybe you can just open > code it, but I'm not willing to commit this the way it is. > Understood. At the start of working on the v3 review, I thought of completely discarding the 0007 patch and copying most of verify_file_checksum() to a new function in astreamer_verify.c. However, I later realized we could deduplicate some parts, so I split verify_file_checksum() and moved the reusable part to a separate function. Please have a look at v4-0007. > Regarding 0008, I don't really see a reason why the m != NULL > shouldn't also move inside should_verify_control_data(). Yeah, the > caller added in 0010 might not need the check, but it won't really > cost anything. Also, it seems to me that the logic in 0010 is actually > wrong. 
If m == NULL, we'll keep the values of verifyChecksums and > verifyControlData from the previous iteration, whereas what we should > do is make them both false. How about removing the if m == NULL guard > here and making both should_verify_checksum() and > should_verify_control_data() test m != NULL internally? Then it all > works out nicely, I think. Or alternatively you need an else clause > that resets both values to false when m == NULL. > I had the same thought about checking for NULL inside should_verify_control_data(), but I wanted to maintain the structure similar to should_verify_checksum(). Making this change would have also required altering should_verify_checksum(), I wasn’t sure if I should make that change before. Now, I did that in the attached version -- 0008 patch. > > Okay, I added the verify_checksums() and verify_controldata() > > functions to the astreamer_verify.c file. I also updated related > > variables that were clashing with these function names: > > verify_checksums has been renamed to verifyChecksums, and verify_sysid > > has been renamed to verifyControlData. > > Maybe think of doing something with the ASTREAMER_MEMBER_HEADER case also. > Done. > Out of time for today, will look again soon. I think the first few of > these are probably pretty much ready for commit already, and with a > little more adjustment they'll probably be ready up through about > 0006. > Sure, thank you. Regards, Amul
Attachment
- v4-0008-Refactor-split-verify_control_file.patch
- v4-0007-Refactor-split-verify_file_checksum-function.patch
- v4-0009-pg_verifybackup-Add-backup-format-and-compression.patch
- v4-0010-pg_verifybackup-Read-tar-files-and-verify-its-con.patch
- v4-0011-pg_verifybackup-Tests-and-document.patch
- v4-0006-Refactor-split-verify_backup_file-function.patch
- v4-0005-Refactor-move-some-part-of-pg_verifybackup.c-to-p.patch
- v4-0003-Refactor-move-astreamer-files-to-fe_utils-to-make.patch
- v4-0004-Refactor-move-few-global-variable-to-verifier_con.patch
- v4-0002-Refactor-Add-astreamer_inject.h-and-move-related-.patch
- v4-0001-Refactor-Rename-all-bbstreamer-references-to-astr.patch
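The m != NULL handling discussed in this exchange -- testing it inside should_verify_checksum() and should_verify_control_data() themselves -- can be sketched as below. The struct and field names are simplified stand-ins for pg_verifybackup's manifest_file (the real checksum predicate also checks checksum_type); the point is only that callers need no separate NULL guard, so stale values cannot survive a loop iteration whose manifest lookup fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for pg_verifybackup's manifest_file. */
typedef struct
{
	bool		matched;		/* file present on disk */
	bool		bad;			/* already failed some earlier check */
	bool		has_checksum;	/* manifest carries a checksum for it */
} manifest_file;

/* NULL folded into the predicate: a failed lookup simply verifies nothing. */
static bool
should_verify_checksum(const manifest_file *m)
{
	return m != NULL && m->matched && !m->bad && m->has_checksum;
}

static bool
should_verify_control_data(const manifest_file *m)
{
	return m != NULL && m->matched && !m->bad;
}
```

A caller can then write verifyChecksums = should_verify_checksum(m) unconditionally, with both flags correctly false when m == NULL.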
On Thu, Aug 1, 2024 at 6:48 PM Amul Sul <sulamul@gmail.com> wrote: > > On Thu, Aug 1, 2024 at 1:37 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Wed, Jul 31, 2024 at 9:28 AM Amul Sul <sulamul@gmail.com> wrote: > > > Fixed -- I did that because it was part of a separate group in pg_basebackup. > > [...] > > Out of time for today, will look again soon. I think the first few of > > these are probably pretty much ready for commit already, and with a > > little more adjustment they'll probably be ready up through about > > 0006. > > > > Sure, thank you. > The v4 version isn't handling the progress report correctly because the total_size calculation was done in verify_manifest_entry(), and done_size was updated during the checksum verification. This worked well for the plain backup but failed for the tar backup, where checksum verification occurs right after verify_manifest_entry(), leading to incorrect total_size in the progress report output. Additionally, the patch missed the final progress_report(true) call for persistent output, which is called from verify_backup_checksums() for the plain backup but never for tar backup verification. To address this, I moved the first and last progress_report() calls to the main function. Although this is a small change, I placed it in a separate patch, 0009, in the attached version. In addition to these changes, the attached version includes improvements in code comments, function names, and their arrangements in astreamer_verify.c. Please consider the attached version for the review. Regards, Amul
Attachment
- v5-0010-pg_verifybackup-Add-backup-format-and-compression.patch
- v5-0011-pg_verifybackup-Read-tar-files-and-verify-its-con.patch
- v5-0012-pg_verifybackup-Tests-and-document.patch
- v5-0009-Refactor-move-first-and-last-progress_report-call.patch
- v5-0008-Refactor-split-verify_control_file.patch
- v5-0007-Refactor-split-verify_file_checksum-function.patch
- v5-0005-Refactor-move-some-part-of-pg_verifybackup.c-to-p.patch
- v5-0004-Refactor-move-few-global-variable-to-verifier_con.patch
- v5-0006-Refactor-split-verify_backup_file-function.patch
- v5-0003-Refactor-move-astreamer-files-to-fe_utils-to-make.patch
- v5-0002-Refactor-Add-astreamer_inject.h-and-move-related-.patch
- v5-0001-Refactor-Rename-all-bbstreamer-references-to-astr.patch
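To illustrate the progress-report concern above: total_size must be accumulated under the same conditions that later feed done_size, or the percentage drifts. A minimal sketch of the arithmetic, with a clamp so a missing or bad file that never contributes to done_size cannot push the figure past 100 (helper name hypothetical, not pg_verifybackup's actual code):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Percentage for a progress report: guard against division by zero and
 * clamp at 100. Hypothetical helper illustrating the accounting, not
 * the actual progress_report() implementation.
 */
static int
progress_percent(int64_t done_size, int64_t total_size)
{
	if (total_size <= 0)
		return 100;				/* nothing to verify counts as done */
	if (done_size >= total_size)
		return 100;
	return (int) ((done_size * 100) / total_size);
}
```

The begin/end calls then belong in main(): progress_report(false) before verification starts and progress_report(true) once at the end, for plain and tar backups alike.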
On Fri, Aug 2, 2024 at 7:43 AM Amul Sul <sulamul@gmail.com> wrote: > Please consider the attached version for the review. Thanks. I committed 0001-0003. The only thing that I changed was that in 0001, you forgot to pgindent, which actually mattered quite a bit, because astreamer is one character shorter than bbstreamer. Before we proceed with the rest of this patch series, I think we should fix up the comments for some of the astreamer files. Proposed patch for that attached; please review. I also noticed that cfbot was unhappy about this patch set: [10:37:55.075] pg_verifybackup.c:100:7: error: no previous extern declaration for non-static variable 'format' [-Werror,-Wmissing-variable-declarations] [10:37:55.075] char format = '\0'; /* p(lain)/t(ar) */ [10:37:55.075] ^ [10:37:55.075] pg_verifybackup.c:100:1: note: declare 'static' if the variable is not intended to be used outside of this translation unit [10:37:55.075] char format = '\0'; /* p(lain)/t(ar) */ [10:37:55.075] ^ [10:37:55.075] pg_verifybackup.c:101:23: error: no previous extern declaration for non-static variable 'compress_algorithm' [-Werror,-Wmissing-variable-declarations] [10:37:55.075] pg_compress_algorithm compress_algorithm = PG_COMPRESSION_NONE; [10:37:55.075] ^ [10:37:55.075] pg_verifybackup.c:101:1: note: declare 'static' if the variable is not intended to be used outside of this translation unit [10:37:55.075] pg_compress_algorithm compress_algorithm = PG_COMPRESSION_NONE; [10:37:55.075] ^ [10:37:55.075] 2 errors generated. Please fix and, after posting future versions of the patch set, try to remember to check http://cfbot.cputube.org/amul-sul.html -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
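The cfbot complaint is from -Wmissing-variable-declarations: a file-scope variable with external linkage needs a prior extern declaration in a header, and a variable used by only one translation unit should simply be static. A minimal illustration (the enum and accessor names here are stand-ins; the real code uses pg_compress_algorithm):

```c
#include <assert.h>

/* Stand-in for pg_compress_algorithm. */
typedef enum
{
	COMPRESSION_NONE,
	COMPRESSION_GZIP
} compress_alg;

/*
 * File-scope state used only within this translation unit: marking it
 * static satisfies -Wmissing-variable-declarations, which otherwise
 * demands a matching extern declaration in a header.
 */
static char format = '\0';		/* p(lain)/t(ar) */
static compress_alg compress_algorithm = COMPRESSION_NONE;

static char
get_format(void)
{
	return format == '\0' ? 'p' : format;	/* default to plain */
}

static compress_alg
get_compression(void)
{
	return compress_algorithm;
}
```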
On Mon, Aug 5, 2024 at 10:29 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Aug 2, 2024 at 7:43 AM Amul Sul <sulamul@gmail.com> wrote: > > Please consider the attached version for the review. > > Thanks. I committed 0001-0003. The only thing that I changed was that > in 0001, you forgot to pgindent, which actually mattered quite a bit, > because astreamer is one character shorter than bbstreamer. > Understood. Thanks for tidying up and committing the patches. > Before we proceed with the rest of this patch series, I think we > should fix up the comments for some of the astreamer files. Proposed > patch for that attached; please review. > Looks good to me, except for the following typo that I have fixed in the attached version: s/astreamer_plan_writer/astreamer_plain_writer/ > I also noticed that cfbot was unhappy about this patch set: > > [10:37:55.075] pg_verifybackup.c:100:7: error: no previous extern > declaration for non-static variable 'format' > [-Werror,-Wmissing-variable-declarations] > [10:37:55.075] char format = '\0'; /* p(lain)/t(ar) */ > [10:37:55.075] ^ > [10:37:55.075] pg_verifybackup.c:100:1: note: declare 'static' if the > variable is not intended to be used outside of this translation unit > [10:37:55.075] char format = '\0'; /* p(lain)/t(ar) */ > [10:37:55.075] ^ > [10:37:55.075] pg_verifybackup.c:101:23: error: no previous extern > declaration for non-static variable 'compress_algorithm' > [-Werror,-Wmissing-variable-declarations] > [10:37:55.075] pg_compress_algorithm compress_algorithm = PG_COMPRESSION_NONE; > [10:37:55.075] ^ > [10:37:55.075] pg_verifybackup.c:101:1: note: declare 'static' if the > variable is not intended to be used outside of this translation unit > [10:37:55.075] pg_compress_algorithm compress_algorithm = PG_COMPRESSION_NONE; > [10:37:55.075] ^ > [10:37:55.075] 2 errors generated. > Fixed in the attached version. 
> Please fix and, after posting future versions of the patch set, try to > remember to check http://cfbot.cputube.org/amul-sul.html Sure. I used to rely on that earlier, but after enabling Cirrus CI on my GitHub repo, I assumed its workflow would be the same as cfbot's and started overlooking it. However, cfbot reported a warning that didn't appear in my GitHub run. From now on, I'll make sure to check cfbot as well. Regards, Amul
Attachment
- v7-0010-pg_verifybackup-Add-backup-format-and-compression.patch
- v7-0012-pg_verifybackup-Tests-and-document.patch
- v7-0011-pg_verifybackup-Read-tar-files-and-verify-its-con.patch
- v7-0008-Refactor-split-verify_control_file.patch
- v7-0009-Refactor-move-first-and-last-progress_report-call.patch
- v7-0006-Refactor-split-verify_backup_file-function.patch
- v7-0007-Refactor-split-verify_file_checksum-function.patch
- v7-0005-Refactor-move-some-part-of-pg_verifybackup.c-to-p.patch
- v7-0001-Improve-file-header-comments-for-astramer-code.patch
- v7-0004-Refactor-move-few-global-variable-to-verifier_con.patch
On Thu, Aug 1, 2024 at 9:19 AM Amul Sul <sulamul@gmail.com> wrote: > > I think I would have made this pass context->show_progress to > > progress_report() instead of the whole verifier_context, but that's an > > arguable stylistic choice, so I'll defer to you if you prefer it the > > way you have it. Other than that, this LGTM. > > Additionally, I moved total_size and done_size to verifier_context > because done_size needs to be accessed in astreamer_verify.c. > With this change, verifier_context is now more suitable. But it seems like 0006 now changes the logic for computing total_size. Prepatch, the condition is: - if (context->show_progress && !context->skip_checksums && - should_verify_checksum(m)) - context->total_size += m->size; where should_verify_checksum(m) checks (((m)->matched) && !((m)->bad) && (((m)->checksum_type) != CHECKSUM_TYPE_NONE)). But post-patch the condition is: + if (!context.skip_checksums) ... + if (!should_ignore_relpath(context, m->pathname)) + total_size += m->size; The old logic was reached from verify_backup_directory() which does check should_ignore_relpath(), so the new condition hasn't added anything. But it seems to have lost the show_progress condition, and the m->checksum_type != CHECKSUM_TYPE_NONE condition. I think this means that we'll sum the sizes even when not displaying progress, and that if some of the files in the manifest had no checksums, our progress reporting would compute wrong percentages after the patch. > Understood. At the start of working on the v3 review, I thought of > completely discarding the 0007 patch and copying most of > verify_file_checksum() to a new function in astreamer_verify.c. > However, I later realized we could deduplicate some parts, so I split > verify_file_checksum() and moved the reusable part to a separate > function. Please have a look at v4-0007. Yeah, that seems OK. The fact that these patches don't have commit messages is making life more difficult for me than it needs to be. 
In particular, I'm looking at 0009 and there's no hint about why you want to do this. In fact that's the case for all of these refactoring patches. Instead of saying something like "tar format verification will want to verify the control file, but will not be able to read the file directly from disk, so separate the logic that reads the control file from the logic that verifies it" you just say what code you moved. Then I have to guess why you moved it, or flip back and forth between the refactoring patch and 0011 to try to figure it out. It would be nice if each of these refactoring patches contained a clear indication about the purpose of the refactoring in the commit message. > I had the same thought about checking for NULL inside > should_verify_control_data(), but I wanted to maintain the structure > similar to should_verify_checksum(). Making this change would have > also required altering should_verify_checksum(), I wasn’t sure if I > should make that change before. Now, I did that in the attached > version -- 0008 patch. I believe there is no reason for this change to be part of 0008 at all, and that this should be part of whatever later patch needs it. > > Maybe think of doing something with the ASTREAMER_MEMBER_HEADER case also. > > Done. OK, the formatting of 0011 looks much better now. It seems to me that 0011 is arranging to palloc the checksum context for every file and then pfree it at the end. It seems like it would be considerably more efficient if astreamer_verify contained a pg_checksum_context instead of a pointer to a pg_checksum_context. If you need a flag to indicate whether we've reinitialized the checksum for the current file, it's better to add that than to have all of these unnecessary allocate/free cycles. Existing astreamer code uses struct member names_like_this. For the new one, you mostly used namesLikeThis except when you used names_like_this or namesLkThs. 
It seems to me that instead of adding a global variable verify_backup_file_cb, it would be better to move the 'format' variable into verifier_context. Then you can do something like if (context->format == 'p') verify_plain_backup_file() else verify_tar_backup_file(). It's pretty common for .tar.gz to be abbreviated to .tgz. I think we should support that. Let's suppose that I have a backup which, for some reason, does not use the same compression for all files (base.tar, 16384.tgz, 16385.tar.gz, 16366.tar.lz4). With this patch, that will fail. Now, that's not really a problem, because having a backup with mixed compression algorithms like that is strange and you probably wouldn't try to do it. But on the other hand, it looks to me like making the code support that would be more elegant than what you have now. Because, right now, you have code to detect what type of backup you've got by looking at base.WHATEVER_EXTENSION ... but then you have to also have code that complains if some later file doesn't have the same extension. But you could just detect the type of every file individually. In fact, I wonder if we even need -Z. What value is that actually providing? Why not just always auto-detect? find_backup_format() ignores the possibility of stat() throwing an error. That's not good. Suppose that the backup directory contains main.tar, 16385.tar, and snuffleupagus.tar. It looks to me like what will happen here is that we'll verify main.tar with tblspc_oid = InvalidOid, 16385.tar with tblspc_oid = 16385, and snuffleupagus.tar with tblspc_oid = InvalidOid. That doesn't sound right. I think we should either completely ignore snuffleupagus.tar just as if it were completely imaginary, or perhaps there's an argument for emitting a warning saying that we weren't expecting a snuffleupagus to exist. In general, I think all unexpected files in a tar-format backup directory should get the same treatment, regardless of whether the problem is with the extension or the file itself. 
We should either silently ignore everything that isn't expected to be present, or we should emit a complaint saying that the file isn't expected to be present. Right now, you say that it's "not a valid file" if the extension isn't what you expect (which doesn't seem like a good error message, because the file may be perfectly valid for what it is, it's just not a file we're expecting to see) and say nothing if the extension is right but the part of the filename preceding the extension is unexpected. A related issue is that it's a little unclear what --ignore is supposed to do for tar-format backups. Does that ignore files in the backup directory, or files inside of the tar files in the backup directory? If we decide that --ignore ignores files in the backup directory, then we should complain about any unexpected files that are present there unless they've been ignored. If we decide that --ignore ignores files inside of the tar files, then I suggest we just silently skip any files in the backup directory that don't seem to have file names in the correct format. I think I prefer the latter approach, but I'm not 100% sure what's best. -- Robert Haas EDB: http://www.enterprisedb.com
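Per-file auto-detection along the lines Robert describes, including the .tgz abbreviation, could look like the sketch below. The function and enum names are hypothetical, not the patch's actual code (the real tool would map onto pg_compress_algorithm), but it shows how each archive's method falls out of its own name, so mixed-compression backups need no -Z option at all.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for pg_compress_algorithm. */
typedef enum
{
	COMPR_UNKNOWN,				/* not a recognized archive name */
	COMPR_NONE,					/* plain .tar */
	COMPR_GZIP,
	COMPR_LZ4,
	COMPR_ZSTD
} compr_method;

static int
ends_with(const char *s, const char *suffix)
{
	size_t		slen = strlen(s);
	size_t		xlen = strlen(suffix);

	return slen >= xlen && strcmp(s + slen - xlen, suffix) == 0;
}

/*
 * Detect the compression method of each archive individually from its
 * extension, so a backup mixing methods (base.tar, 16384.tgz,
 * 16385.tar.lz4) verifies without complaint.
 */
static compr_method
detect_compression(const char *filename)
{
	if (ends_with(filename, ".tar"))
		return COMPR_NONE;
	if (ends_with(filename, ".tar.gz") || ends_with(filename, ".tgz"))
		return COMPR_GZIP;
	if (ends_with(filename, ".tar.lz4"))
		return COMPR_LZ4;
	if (ends_with(filename, ".tar.zst"))
		return COMPR_ZSTD;
	return COMPR_UNKNOWN;
}
```

COMPR_UNKNOWN then marks exactly the unexpected files (the snuffleupagus.tar problem is separate: its name has a valid extension but no valid tablespace OID prefix, so it needs its own check).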
On Tue, Aug 6, 2024 at 10:39 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Aug 1, 2024 at 9:19 AM Amul Sul <sulamul@gmail.com> wrote: > > > I think I would have made this pass context->show_progress to > > > progress_report() instead of the whole verifier_context, but that's an > > > arguable stylistic choice, so I'll defer to you if you prefer it the > > > way you have it. Other than that, this LGTM. > > > > Additionally, I moved total_size and done_size to verifier_context > > because done_size needs to be accessed in astreamer_verify.c. > > With this change, verifier_context is now more suitable. > > But it seems like 0006 now changes the logic for computing total_size. > Prepatch, the condition is: > > - if (context->show_progress && !context->skip_checksums && > - should_verify_checksum(m)) > - context->total_size += m->size; > > where should_verify_checksum(m) checks (((m)->matched) && !((m)->bad) > && (((m)->checksum_type) != CHECKSUM_TYPE_NONE)). But post-patch the > condition is: > > + if (!context.skip_checksums) > ... > + if (!should_ignore_relpath(context, m->pathname)) > + total_size += m->size; > > The old logic was reached from verify_backup_directory() which does > check should_ignore_relpath(), so the new condition hasn't added > anything. But it seems to have lost the show_progress condition, and > the m->checksum_type != CHECKSUM_TYPE_NONE condition. I think this > means that we'll sum the sizes even when not displaying progress, and > that if some of the files in the manifest had no checksums, our > progress reporting would compute wrong percentages after the patch. > That is not the case: compute_total_size() does nothing when progress is not being displayed -- its first if condition returns early, the same way progress_report() does. I omitted should_verify_checksum() since we don't have the matched and bad flag information at the start, and we won't have that for TAR files at all. 
However, I missed the checksum_type check, which is necessary, and have added it now. With the patch, I am concerned that we won't be able to give an accurate progress report as before. We add all the file sizes in the backup manifest to the total_size without checking if they exist on disk. Therefore, sometimes the reported progress completion might not show 100% when we encounter files where m->bad or m->match == false at a later stage. However, I think this should be acceptable since there will be an error for the respective missing or bad file, and it can be understood that verification is complete even if the progress isn't 100% in that case. Thoughts/Comments? > > Understood. At the start of working on the v3 review, I thought of > > completely discarding the 0007 patch and copying most of > > verify_file_checksum() to a new function in astreamer_verify.c. > > However, I later realized we could deduplicate some parts, so I split > > verify_file_checksum() and moved the reusable part to a separate > > function. Please have a look at v4-0007. > > Yeah, that seems OK. > > The fact that these patches don't have commit messages is making life > more difficult for me than it needs to be. In particular, I'm looking > at 0009 and there's no hint about why you want to do this. In fact > that's the case for all of these refactoring patches. Instead of > saying something like "tar format verification will want to verify the > control file, but will not be able to read the file directly from > disk, so separate the logic that reads the control file from the logic > that verifies it" you just say what code you moved. Then I have to > guess why you moved it, or flip back and forth between the refactoring > patch and 0011 to try to figure it out. It would be nice if each of > these refactoring patches contained a clear indication about the > purpose of the refactoring in the commit message. > Sorry, I was a bit lazy there, assuming you'd handle the review :). 
I can understand the frustration -- added some description. > > I had the same thought about checking for NULL inside > > should_verify_control_data(), but I wanted to maintain the structure > > similar to should_verify_checksum(). Making this change would have > > also required altering should_verify_checksum(), I wasn’t sure if I > > should make that change before. Now, I did that in the attached > > version -- 0008 patch. > > I believe there is no reason for this change to be part of 0008 at > all, and that this should be part of whatever later patch needs it. > Ok > > > Maybe think of doing something with the ASTREAMER_MEMBER_HEADER case also. > > > > Done. > > OK, the formatting of 0011 looks much better now. > > It seems to me that 0011 is arranging to palloc the checksum context > for every file and then pfree it at the end. It seems like it would be > considerably more efficient if astreamer_verify contained a > pg_checksum_context instead of a pointer to a pg_checksum_context. If > you need a flag to indicate whether we've reinitialized the checksum > for the current file, it's better to add that than to have all of > these unnecessary allocate/free cycles. > I tried in the attached version, and it’s a good improvement. We don’t need any flags; we can allocate that during astreamer creation. Later, in the ASTREAMER_MEMBER_HEADER case while reading, we can (re)initialize the context for each file as needed. > Existing astreamer code uses struct member names_like_this. For the > new one, you mostly used namesLikeThis except when you used > names_like_this or namesLkThs. > Yeah, in my patch, I ended up using the same name for both the variable and the function. To avoid that, I made this change. This could be a minor inconvenience for someone using ctags/cscope to find the definition of the function or variable, as they might be directed to the wrong place. However, I think it’s still okay since there are ways to find the correct definition. 
I reverted those changes in the attached version. > It seems to me that instead of adding a global variable > verify_backup_file_cb, it would be better to move the 'format' > variable into verifier_context. Then you can do something like if > (context->format == 'p') verify_plain_backup_file() else > verify_tar_backup_file(). > Done. > It's pretty common for .tar.gz to be abbreviated to .tgz. I think we > should support that. > Done. > Let's suppose that I have a backup which, for some reason, does not > use the same compression for all files (base.tar, 16384.tgz, > 16385.tar.gz, 16366.tar.lz4). With this patch, that will fail. Now, > that's not really a problem, because having a backup with mixed > compression algorithms like that is strange and you probably wouldn't > try to do it. But on the other hand, it looks to me like making the > code support that would be more elegant than what you have now. > Because, right now, you have code to detect what type of backup you've > got by looking at base.WHATEVER_EXTENSION ... but then you have to > also have code that complains if some later file doesn't have the same > extension. But you could just detect the type of every file > individually. > > In fact, I wonder if we even need -Z. What value is that actually > providing? Why not just always auto-detect? > +1, removed -Z option. > find_backup_format() ignores the possibility of stat() throwing an > error. That's not good. > I wasn't sure about that before -- I tried it in the attached version. See if it looks good to you. > Suppose that the backup directory contains main.tar, 16385.tar, and > snuffleupagus.tar. It looks to me like what will happen here is that > we'll verify main.tar with tblspc_oid = InvalidOid, 16385.tar with > tblspc_oid = 16385, and snuffleupagus.tar with tblspc_oid = > InvalidOid. That doesn't sound right. 
I think we should either > completely ignore snuffleupagus.tar just as if it were completely > imaginary, or perhaps there's an argument for emitting a warning > saying that we weren't expecting a snuffleupagus to exist. > > In general, I think all unexpected files in a tar-format backup > directory should get the same treatment, regardless of whether the > problem is with the extension or the file itself. We should either > silently ignore everything that isn't expected to be present, or we > should emit a complaint saying that the file isn't expected to be > present. Right now, you say that it's "not a valid file" if the > extension isn't what you expect (which doesn't seem like a good error > message, because the file may be perfectly valid for what it is, it's > just not a file we're expecting to see) and say nothing if the > extension is right but the part of the filename preceding the > extension is unexpected. >

I added an error for files other than base.tar and <tablespaceoid>.tar. I think the error message could be improved.

> A related issue is that it's a little unclear what --ignore is > supposed to do for tar-format backups. Does that ignore files in the > backup directory, or files inside of the tar files in the backup > directory? If we decide that --ignore ignores files in the > backup directory, then we should complain about any unexpected files > that are present there unless they've been ignored. If we decide that > --ignore ignores files inside of the tar files, then I suggest we just > silently skip any files in the backup directory that don't seem to > have file names in the correct format. I think I prefer the latter > approach, but I'm not 100% sure what's best. >

I am interested in having that feature be as useful as possible -- I mean, allowing the option to ignore files from the backup directory and from the archive file as well. I don't see any major drawbacks, apart from spending extra CPU cycles to browse the ignore list. 
Regards, Amul
Attachment
- v8-0010-pg_verifybackup-Add-backup-format-and-compression.patch
- v8-0009-Refactor-move-first-and-last-progress_report-call.patch
- v8-0008-Refactor-split-verify_control_file.patch
- v8-0011-pg_verifybackup-Read-tar-files-and-verify-its-con.patch
- v8-0012-pg_verifybackup-Tests-and-document.patch
- v8-0006-Refactor-split-verify_backup_file-function.patch
- v8-0007-Refactor-split-verify_file_checksum-function.patch
- v8-0004-Refactor-move-few-global-variable-to-verifier_con.patch
- v8-0005-Refactor-move-some-part-of-pg_verifybackup.c-to-p.patch
- v8-0001-Improve-file-header-comments-for-astramer-code.patch
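The per-file compression detection discussed above (each archive's method inferred from its own extension, with .tgz accepted as an abbreviation of .tar.gz) could look roughly like this. This is only a sketch: detect_compression() and the method strings are illustrative names, not the actual patch code.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical sketch: detect the compression method of a single tar
 * archive from its own file name, instead of assuming every archive
 * uses the same method as base.WHATEVER_EXTENSION.
 */
static const char *
detect_compression(const char *filename)
{
	static const struct
	{
		const char *suffix;
		const char *method;
	} map[] = {
		{".tar", "none"},
		{".tar.gz", "gzip"},
		{".tgz", "gzip"},		/* common abbreviation of .tar.gz */
		{".tar.lz4", "lz4"},
		{".tar.zst", "zstd"},
	};
	size_t		len = strlen(filename);

	for (int i = 0; i < (int) (sizeof(map) / sizeof(map[0])); i++)
	{
		size_t		slen = strlen(map[i].suffix);

		/* the suffix must match at the end of the name, not anywhere */
		if (len > slen && strcmp(filename + len - slen, map[i].suffix) == 0)
			return map[i].method;
	}
	return NULL;				/* not a recognized archive name */
}
```

Matching only at the end of the name also sidesteps the usual pitfall of substring searches matching in the middle of a filename.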
[ I committed 0001, then noticed I had a typo in the subject line of the commit message. Argh. ]

On Wed, Aug 7, 2024 at 9:41 AM Amul Sul <sulamul@gmail.com> wrote: > With the patch, I am concerned that we won't be able to give an > accurate progress report as before. We add all the file sizes in the > backup manifest to the total_size without checking if they exist on > disk. Therefore, sometimes the reported progress completion might not > show 100% when we encounter files where m->bad or m->match == false at > a later stage. However, I think this should be acceptable since there > will be an error for the respective missing or bad file, and it can be > understood that verification is complete even if the progress isn't > 100% in that case. Thoughts/Comments?

When somebody says that something is a refactoring commit, my assumption is that there should be no behavior change. If the behavior is changing, it's not purely a refactoring, and it shouldn't be labelled as a refactoring (or at least there should be a prominent disclaimer identifying whatever behavior has changed, if a small change was deemed acceptable and unavoidable).

I am very reluctant to accept a functional regression of the type that you describe here (or the type that I postulated might occur, even if I was wrong and it doesn't). The point here is that we're trying to reuse the code, and I support that goal, because code reuse is good. But it's not such a good thing that we should do it if it has negative consequences. We should either figure out some other way of refactoring it that doesn't have those negative side-effects, or we should leave the existing code alone and have separate code for the new stuff we want to do.

I do realize that the type of side effect you describe here is quite minor. I could live with it if it were unavoidable. But I really don't see why we can't avoid it. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Aug 7, 2024 at 9:12 PM Robert Haas <robertmhaas@gmail.com> wrote: > > [ I committed 0001, then noticed I had a type in the subject line of > the commit message. Argh. ] > > On Wed, Aug 7, 2024 at 9:41 AM Amul Sul <sulamul@gmail.com> wrote: > > With the patch, I am concerned that we won't be able to give an > > accurate progress report as before. We add all the file sizes in the > > backup manifest to the total_size without checking if they exist on > > disk. Therefore, sometimes the reported progress completion might not > > show 100% when we encounter files where m->bad or m->match == false at > > a later stage. However, I think this should be acceptable since there > > will be an error for the respective missing or bad file, and it can be > > understood that verification is complete even if the progress isn't > > 100% in that case. Thoughts/Comments? > > When somebody says that something is a refactoring commit, my > assumption is that there should be no behavior change. If the behavior > is changing, it's not purely a refactoring, and it shouldn't be > labelled as a refactoring (or at least there should be a prominent > disclaimer identifying whatever behavior has changed, if a small > change was deemed acceptable and unavoidable). > I agree; I'll be more careful next time. > I am very reluctant to accept a functional regression of the type that > you describe here (or the type that I postulated might occur, even if > I was wrong and it doesn't). The point here is that we're trying to > reuse the code, and I support that goal, because code reuse is good. > But it's not such a good thing that we should do it if it has negative > consequences. We should either figure out some other way of > refactoring it that doesn't have those negative side-effects, or we > should leave the existing code alone and have separate code for the > new stuff we want to do. > > I do realize that the type of side effect you describe here is quite > minor. 
I could live with it if it were unavoidable. But I really don't > see why we can't avoid it. >

The main issue I have is computing the total_size of valid files that will be checksummed and that exist in both the manifests and the backup, in the case of a tar backup. This cannot be done in the same way as with a plain backup.

Another consideration is that, in the case of a tar backup, since we're reading all tar files from the backup rather than individual files to be checksummed, should we consider total_size as the sum of all valid tar files in the backup, regardless of checksum verification? If so, we would need to perform an initial pass to calculate the total_size in the directory, similar to what verify_backup_directory() does, but doing so makes me a bit uncomfortable, and I am unsure if we should do that pass.

An alternative idea is to provide progress reports per file instead of for the entire backup directory. We could report the size of each file and keep track of done_size as we read each tar header and content. For example, the report could be:

109032/109032 kB (100%) base.tar file verified
123444/123444 kB (100%) 16245.tar file verified
23478/23478 kB (100%) 16246.tar file verified
.....
<total_done_size>/<total_size> (NNN%) verified.

I think this type of reporting can be implemented with minimal changes, and for plain backups, we can keep the reporting as it is. Thoughts?

Regards, Amul
On Wed, Aug 7, 2024 at 1:05 PM Amul Sul <sulamul@gmail.com> wrote: > The main issue I have is computing the total_size of valid files that > will be checksummed and that exist in both the manifests and the > backup, in the case of a tar backup. This cannot be done in the same > way as with a plain backup. I think you should compute and sum the sizes of the tar files themselves. Suppose you readdir(), make a list of files that look relevant, and stat() each one. total_size is the sum of the file sizes. Then you work your way through the list of files and read each one. done_size is the total size of all files you've read completely plus the number of bytes you've read from the current file so far. -- Robert Haas EDB: http://www.enterprisedb.com
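A minimal sketch of the accounting Robert describes above, under the assumption that any regular file whose name contains a ".tar" component counts as relevant; sum_tar_sizes() is a made-up name, and the real patch would reuse the list of files it already builds rather than rescanning:

```c
#include <assert.h>
#include <dirent.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Hypothetical sketch: total_size is the sum of the on-disk sizes of the
 * relevant archives, gathered up front with readdir()/stat(); done_size
 * would then advance as bytes are consumed from each archive.
 */
static uint64_t
sum_tar_sizes(const char *dir)
{
	DIR		   *d = opendir(dir);
	struct dirent *de;
	uint64_t	total = 0;

	if (d == NULL)
		return 0;
	while ((de = readdir(d)) != NULL)
	{
		char		path[4096];
		struct stat sb;

		/* crude relevance check for the sketch: name contains ".tar" */
		if (strstr(de->d_name, ".tar") == NULL)
			continue;
		snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);
		if (stat(path, &sb) == 0 && S_ISREG(sb.st_mode))
			total += (uint64_t) sb.st_size;
	}
	closedir(d);
	return total;
}
```

With this, the reported percentage reflects bytes of archive actually read, so verification of a complete backup always reaches 100%.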
On Wed, Aug 7, 2024 at 11:28 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Aug 7, 2024 at 1:05 PM Amul Sul <sulamul@gmail.com> wrote: > > The main issue I have is computing the total_size of valid files that > > will be checksummed and that exist in both the manifests and the > > backup, in the case of a tar backup. This cannot be done in the same > > way as with a plain backup. > > I think you should compute and sum the sizes of the tar files > themselves. Suppose you readdir(), make a list of files that look > relevant, and stat() each one. total_size is the sum of the file > sizes. Then you work your way through the list of files and read each > one. done_size is the total size of all files you've read completely > plus the number of bytes you've read from the current file so far. >

I tried this in the attached version and made a few additional changes based on Sravan's off-list comments regarding function names and descriptions.

Now, verification happens in two passes. The first pass simply verifies the file names, determines their compression types, and returns a list of valid tar files whose contents need to be verified in the second pass. The second pass is called at the end of verify_backup_directory() after all files in that directory have been scanned. I named the functions for pass 1 and pass 2 as verify_tar_file_name() and verify_tar_file_contents(), respectively. The rest of the code flow is the same as in the previous version.

In the attached patch set, I abandoned the changes touching the progress reporting code of plain backups by dropping the previous 0009 patch. The new 0009 patch adds missing APIs to simple_list.c to destroy SimplePtrList. The rest of the patch numbers remain unchanged.

Regards, Amul
Attachment
- v9-0011-pg_verifybackup-Read-tar-files-and-verify-its-con.patch
- v9-0008-Refactor-split-verify_control_file.patch
- v9-0012-pg_verifybackup-Tests-and-document.patch
- v9-0010-pg_verifybackup-Add-backup-format-and-compression.patch
- v9-0009-Add-simple_ptr_list_destroy-and-simple_ptr_list_d.patch
- v9-0005-Refactor-move-some-part-of-pg_verifybackup.c-to-p.patch
- v9-0007-Refactor-split-verify_file_checksum-function.patch
- v9-0004-Refactor-move-skip_checksums-global-variable-to-v.patch
- v9-0006-Refactor-split-verify_backup_file-function.patch
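The two-pass flow described in the email above could be skeletonized like this; the list type and the stubbed validity check are stand-ins (the actual patch uses SimplePtrList and real archive-name parsing):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical skeleton of two-pass tar verification: pass 1 records the
 * archives worth verifying while the directory is scanned; pass 2 walks
 * the collected list once the scan is complete.
 */
typedef struct tar_file
{
	char		name[64];
	struct tar_file *next;
} tar_file;

/* pass 1: remember the file if it looks like an archive we should verify */
static void
verify_tar_file_name(tar_file **head, const char *name)
{
	size_t		len = strlen(name);
	tar_file   *tf;

	if (len < 4 || strcmp(name + len - 4, ".tar") != 0)
		return;					/* stub validity check for the sketch */

	tf = calloc(1, sizeof(tar_file));
	strncpy(tf->name, name, sizeof(tf->name) - 1);
	tf->next = *head;
	*head = tf;
}

/* pass 2: verify the contents of everything collected in pass 1 */
static int
verify_tar_file_contents(tar_file *head)
{
	int			verified = 0;

	for (tar_file *tf = head; tf != NULL; tf = tf->next)
		verified++;				/* real code streams the archive here */
	return verified;
}
```

Deferring pass 2 until the directory scan finishes is what allows the total archive size to be known before any contents are read.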
On Mon, Aug 12, 2024 at 5:13 AM Amul Sul <sulamul@gmail.com> wrote: > I tried this in the attached version and made a few additional changes > based on Sravan's off-list comments regarding function names and > descriptions. > > Now, verification happens in two passes. The first pass simply > verifies the file names, determines their compression types, and > returns a list of valid tar files whose contents need to be verified > in the second pass. The second pass is called at the end of > verify_backup_directory() after all files in that directory have been > scanned. I named the functions for pass 1 and pass 2 as > verify_tar_file_name() and verify_tar_file_contents(), respectively. > The rest of the code flow is similar as in the previous version. > > In the attached patch set, I abandoned the changes, touching the > progress reporting code of plain backups by dropping the previous 0009 > patch. The new 0009 patch adds missing APIs to simple_list.c to > destroy SimplePtrList. The rest of the patch numbers remain unchanged. I think you've entangled the code paths here for plain-format backup and tar-format backup in a way that is not very nice. I suggest refactoring things so that verify_backup_directory() is only called for plain-format backups, and you have some completely separate function (perhaps verify_tar_backup) that is called for tar-format backups. I don't think verify_backup_file() should be shared between tar-format and plain-format backups either. Let that be just for plain-format backups, and have separate logic for tar-format backups. Right now you've got "if" statements in various places trying to get all the cases correct, but I think you've missed some (and there's also the issue of updating all the comments). For instance, verify_backup_file() recurses into subdirectories, but that behavior is inappropriate for a tar format backup, where subdirectories should instead be treated like stray files: complain that they exist. 
pg_verify_backup() does this:

    /* If it's a directory, just recurse. */
    if (S_ISDIR(sb.st_mode))
    {
        verify_backup_directory(context, relpath, fullpath);
        return;
    }

    /* If it's not a directory, it should be a plain file. */
    if (!S_ISREG(sb.st_mode))
    {
        report_backup_error(context,
                            "\"%s\" is not a file or directory",
                            relpath);
        return;
    }

For a tar format backup, this whole thing should be just:

    /* In a tar format backup, we expect only plain files. */
    if (!S_ISREG(sb.st_mode))
    {
        report_backup_error(context,
                            "\"%s\" is not a plain file",
                            relpath);
        return;
    }

Also, immediately above, you do simple_string_list_append(&context->ignore_list, relpath), but that is pointless in the tar-file case, and arguably wrong, if -i is going to ignore both pathnames in the base directory and also pathnames inside the tar files, because we could add something to the ignore list here -- accomplishing nothing useful -- and then that ignore-list entry could cause us to disregard a stray file with the same name present inside one of the tar files -- which is silly.

Note that the actual point of this logic is to make sure that if we can't stat() a certain directory, we don't go off and issue a million complaints about all the files in that directory being missing. But this doesn't accomplish that goal for a tar-format backup. For a tar-format backup, you'd want to figure out which files in the manifest we don't expect to see based on this file being inaccessible, and then arrange to suppress future complaints about all of those files. But you can't implement that here, because you haven't parsed the file name yet. That happens later, in verify_tar_file_name(). You could add a whole bunch more if statements here and try to work around these issues, but it seems pretty obviously a dead end to me. Almost the entire function is going to end up being inside of an if-statement. 
Essentially the only thing in verify_backup_file() that should actually be the same in the plain and tar-format cases is that you should call stat() either way and check whether it throws an error. But that is not enough justification for trying to share the whole function.

I find the logic in verify_tar_file_name() to be somewhat tortured as well. The strstr() calls will match those strings anywhere in the filename, not just at the end. But also, even if you fixed that, why work backward from the end of the filename toward the beginning? It would seem like it would make more sense to (1) first check if the string starts with "base" and set suffix equal to pathname+4, (2) if not, strtol(pathname, &suffix, 10) and complain if we didn't eat at least one character or got 0 or something too big to be an OID, (3) check whether suffix is .tar, .tar.gz, etc.

In verify_member_checksum(), you set mystreamer->verify_checksum = false. That would be correct if there could only ever be one call to verify_member_checksum() per member file, but there is no such rule. There can be (and, I think, normally will be) more than one ASTREAMER_MEMBER_CONTENTS chunk. I'm a little confused as to how this code passes any kind of testing.

Also in verify_member_checksum(), the mystreamer->received_bytes < m->size check seems strange. I don't think this is the right way to do something when you reach the end of an archive member. The right way to do that is to do it when the ASTREAMER_MEMBER_TRAILER chunk shows up.

In verify_member_control_data(), you use astreamer_buffer_until(). But that's the same buffer that is being used by verify_member_checksum(), so I don't really understand how you expect this to work. If this code path were ever taken, verify_member_checksum() would see the same data more than once.

The call to pg_log_debug() in this function also seems quite random. In a plain-format backup, we'd actually be doing something different for pg_controldata vs. 
other files, namely reading it during the initial directory scan. But here we're reading the file in exactly the same sense as we're reading every other file, neither more nor less, so why mention this file and not all of the others? And why at this exact spot in the code? I suspect that the report_fatal_error("%s: could not read control file: read %d of %zu", ...) call is unreachable. I agree that you need to handle the case where the control file inside the tar file is not the expected length, and in fact I think you should probably write a TAP test for that exact scenario to make sure it works. I bet this doesn't. Even if it did, the error message makes no sense in context. In the plain-format backup, this error would come from code reading the actual bytes off the disk -- i.e. the complaint about not being able to read the control file would come from the read() system call. Here it doesn't. -- Robert Haas EDB: http://www.enterprisedb.com
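For the control-file length concern raised above, the check itself is small; the sketch below uses a stand-in struct, since ControlFileData comes from PostgreSQL headers, and the function name is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PostgreSQL's ControlFileData; only its size matters here. */
typedef struct
{
	unsigned long long system_identifier;
	unsigned int pg_control_version;
	char		filler[280];	/* padding out to a plausible size */
} SketchControlFileData;

/*
 * Hypothetical length check: before copying the buffered archive member
 * into the struct, confirm the member really had the expected size.  The
 * message is phrased around the archive member rather than a read() call,
 * since no system call failed here.
 */
static const char *
check_control_file_length(size_t buffered)
{
	if (buffered != sizeof(SketchControlFileData))
		return "control file has unexpected size in archive";
	return NULL;				/* safe to interpret the buffer now */
}
```

A TAP test exercising a truncated control file inside base.tar would confirm that whichever message the patch finally uses is actually reachable.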
On Wed, Aug 14, 2024 at 9:20 AM Amul Sul <sulamul@gmail.com> wrote: > I agree with keeping verify_backup_file() separate, but I'm hesitant > about doing the same for verify_backup_directory().

I don't have time today to go through your whole email or re-review the code, but I plan to circle back to that at a later time. However, I want to respond to this point in the meantime.

There are two big things that are different for a tar-format backup vs. a directory-format backup as far as verify_backup_directory() is concerned. One is that, for a directory format backup, we need to be able to recurse down through subdirectories; for tar-format backups we don't. So a version of this function that only handled tar-format backups would be somewhat shorter and simpler, and would need one fewer argument. The second difference is that for the tar-format backup, you need to make a list of the files you see and then go back and visit each one a second time, and for a directory-format backup you don't need to do that.

It seems to me that those differences are significant enough to warrant having two separate functions. If you unify them, I think that less than half of the resulting function is going to be common to both cases. Yeah, a few bits of logic will be duplicated, like the error handling for closedir(), the logic to skip "." and "..", and the psprintf() to construct a full pathname for the directory entry. But that isn't really very much code, and it's code that is pretty straightforward and also present in various other places in the PostgreSQL source tree, perhaps not in precisely the same form. The fact that two functions both call readdir() and do something with each file in the directory isn't enough to say that they should be the same function, IMHO. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Aug 14, 2024 at 12:44 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Aug 14, 2024 at 9:20 AM Amul Sul <sulamul@gmail.com> wrote: > > I agree with keeping verify_backup_file() separate, but I'm hesitant > > about doing the same for verify_backup_directory(). > > I don't have time today to go through your whole email or re-review > the code, but I plan to circle back to that at a later time. However, > I want to respond to this point in the meantime.

I have committed 0004 (except that I changed a comment) and 0005 (except that I didn't move READ_CHUNK_SIZE).

Looking at the issue mentioned above again, I agree that the changes in verify_backup_directory() don't look overly invasive in this version. I'm still not 100% convinced it's the right call, but it doesn't seem bad.

You have a spurious comment change to the header of verify_plain_backup_file(). I feel like the naming of tarFile and tarFiles is not consistent with the overall style of this file.

I don't like this:

[robert.haas ~]$ pg_verifybackup btar
pg_verifybackup: error: pg_waldump does not support parsing WAL files from a tar archive.
pg_verifybackup: hint: Try "pg_verifybackup --help" to skip parse WAL files option.

The hint seems like it isn't really correct grammar, and I also don't see why we can't be more specific. How about "You must use -n, --no-parse-wal when verifying a tar-format backup."? The primary message seems a bit redundant, because parsing WAL files is the only thing pg_waldump does. How about "pg_waldump cannot read from a tar archive"? Note that primary error messages should not end in a period (while hint and detail messages should).

+ int64 num = strtoi64(relpath, &suffix, 10);
+ if (suffix == NULL || (num <= 0) || (num > OID_MAX))

Seems like too many parentheses. -- Robert Haas EDB: http://www.enterprisedb.com
On Fri, Aug 16, 2024 at 3:53 PM Robert Haas <robertmhaas@gmail.com> wrote: > + int64 num = strtoi64(relpath, &suffix, 10);

Hit send too early. Here, seems like this should be strtoul(), not strtoi64().

The documentation of --format seems to be cut-and-pasted from pg_basebackup and the language isn't really appropriate here. e.g. "The main data directory's contents will be written to a file named..." but pg_verifybackup writes nothing.

+ simple_string_list_append(&context.ignore_list, "pg_wal.tar");
+ simple_string_list_append(&context.ignore_list, "pg_wal.tar.gz");
+ simple_string_list_append(&context.ignore_list, "pg_wal.tar.lz4");
+ simple_string_list_append(&context.ignore_list, "pg_wal.tar.zst");

Why not make the same logic that recognizes base or an OID also recognize pg_wal as a prefix, and identify that as the WAL archive? For now we'll have to skip it, but if you do it that way then if we add future support for more suffixes, it'll just work, whereas this way won't. And you'd need that code anyway if we ever can run pg_waldump on a tarfile, because you would need to identify the compression method. Note that the danger of the list of suffixes getting out of sync here is not hypothetical: you added .tgz elsewhere but not here.

There's probably more to look at here but I'm running out of energy for today. -- Robert Haas EDB: http://www.enterprisedb.com
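The parsing order Robert suggests (peel off a "base", "pg_wal", or OID prefix first, then check the remainder against the known archive suffixes) might look like this; classify_archive_name() and SKETCH_OID_MAX are illustrative names, and the real code would use PostgreSQL's Oid type and error reporting:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_OID_MAX 4294967295UL	/* stand-in for OID_MAX */

/*
 * Hypothetical sketch: classify a file in the backup directory by prefix,
 * then by suffix.  Returns the recognized suffix, or NULL if the name is
 * not a plausible archive name.
 */
static const char *
classify_archive_name(const char *name)
{
	const char *suffix;

	if (strncmp(name, "base", 4) == 0)
		suffix = name + 4;
	else if (strncmp(name, "pg_wal", 6) == 0)
		suffix = name + 6;		/* WAL archive: skipped for now */
	else
	{
		char	   *end;
		unsigned long num;

		errno = 0;
		num = strtoul(name, &end, 10);
		if (end == name || errno != 0 || num == 0 || num > SKETCH_OID_MAX)
			return NULL;		/* did not eat a plausible OID */
		suffix = end;
	}

	/* suffix must be exactly one of the recognized archive extensions */
	if (strcmp(suffix, ".tar") == 0 ||
		strcmp(suffix, ".tar.gz") == 0 ||
		strcmp(suffix, ".tgz") == 0 ||
		strcmp(suffix, ".tar.lz4") == 0 ||
		strcmp(suffix, ".tar.zst") == 0)
		return suffix;
	return NULL;
}
```

Because pg_wal goes through the same prefix logic, a new suffix added to the single list above is automatically recognized for WAL archives too, which is the out-of-sync hazard the review points out.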
On Wed, Aug 21, 2024 at 7:08 AM Amul Sul <sulamul@gmail.com> wrote: > I have reworked a few comments, revised error messages, and made some > minor tweaks in the attached version. Thanks. > Additionally, I would like to discuss a couple of concerns regarding > error placement and function names to gather your suggestions. > > 0007 patch: Regarding error placement: > > 1. I'm a bit unsure about the (bytes_read != m->size) check that I > placed in verify_checksum() and whether it's in the right place. Per > our previous discussion, this check is applicable to plain backup > files since they can change while being read, but not for files > belonging to tar backups. For consistency, I included the check for > tar backups as well, as it doesn't cause any harm. Is it okay to keep > this check in verify_checksum(), or should I move it back to > verify_file_checksum() and apply it only to the plain backup format?

I think it's a good sanity check. For a long time I thought it was triggerable until I eventually realized that you just get this message if the file size is wrong:

pg_verifybackup: error: "pg_xact/0000" has size 8203 on disk but size 8192 in the manifest

After realizing that, I agree with you that this doesn't really seem reachable for tar backups, but I don't think it hurts anything either. While I was investigating this question, I discovered this problem:

$ pg_basebackup -cfast -Ft -Dx
$ pg_verifybackup -n x
backup successfully verified
$ mkdir x/tmpdir
$ tar -C x/tmpdir -xf x/base.tar
$ rm x/base.tar
$ tar -C x/tmpdir -cf x/base.tar .
$ pg_verifybackup -n x
<lots of errors>

It appears that the reason why this fails is that the paths in the original base.tar from the server do not include "./" at the beginning, and the ones that I get when I create my own tarfile have that. But that shouldn't matter. Anyway, I was able to work around it like this:

$ tar -C x/tmpdir -cf x/base.tar `(cd x/tmpdir; echo *)`

Then the result verifies. 
But I feel like we should have some test cases that do this kind of stuff so that there is automated verification. In fact, the current patch seems to have no negative test cases at all. I think we should test all the cases in 003_corruption.pl with tar format backups as well as with plain format backups, which we could do by untarring one of the archives, messing something up, and then retarring it. I also think we should have some negative test case specific to tar-format backup. I suggest running a coverage analysis and trying to craft test cases that hit as much of the code as possible. There will probably be some errors you can't hit, but you should try to hit the ones you can. > 2. For the verify_checksum() function, I kept the argument name as > bytes_read. Should we rename it to something more meaningful like > computed_bytes, computed_size, or checksum_computed_size? I think it's fine the way you have it. > 0011 patch: Regarding function names: > > 1. named the function verify_tar_backup_file() to align with > verify_plain_backup_file(), but it does not perform the complete > verification as verify_plain_backup_file does. Not sure if it is the > right name. I was thinking of something like precheck_tar_backup_file(). > 2. verify_tar_file_contents() is the second and final part of tar > backup verification. Should its name be aligned with > verify_tar_backup_file()? I’m unsure what the best name would be. > Perhaps verify_tar_backup_file_final(), but then > verify_tar_backup_file() would need to be renamed to something like > verify_tar_backup_file_initial(), which might be too lengthy. verify_tar_file_contents() actually verifies the contents of all the tar files, not just one, so the name is a bit confusing. Maybe verify_all_tar_files(). > 3. verify_tar_contents() is the core of verify_tar_file_contents() > that handles the actual verification. I’m unsure about the current > naming. Should we rename it to something like > verify_tar_contents_core()? 
> It wouldn’t be an issue if we renamed verify_tar_file_contents() as pointed out in point #2.

verify_one_tar_file()? But with those renames, I think you really start to see why I'm not very comfortable with verify_backup_directory(). The tar and plain format cases aren't really doing the same thing. We're just gluing them into a single function anyway.

I am also still uncomfortable with the way you've refactored some of this so that we end up with very small amounts of code far away from other code that they influence. Like you end up with this:

/* Check the backup manifest entry for this file. */
m = verify_manifest_entry(context, relpath, sb.st_size);

/* Validate the pg_control information */
if (should_verify_control_data(context->manifest, m)) ...

if (show_progress && !context->skip_checksums &&
    should_verify_checksum(m))

But verify_manifest_entry can return NULL or it can set m->bad, and either of those changes the result of should_verify_control_data() and should_verify_checksum(), but none of that is obvious when you just look at this. Admittedly, the code in master isn't brilliant in terms of making it obvious what's happening either, but I think this is worse. I'm not really sure what I think we should do about that yet, but I'm uncomfortable with it. -- Robert Haas EDB: http://www.enterprisedb.com
I would rather that you didn't add simple_ptr_list_destroy_deep() given that you don't need it for this patch series.

+ "\"%s\" unexpected file in the tar format backup",

This doesn't seem grammatical to me. Perhaps change this to: file \"%s\" is not expected in a tar format backup

+ /* We are only interested in files that are not in the ignore list. */
+ if (member->is_directory || member->is_link ||
+     should_ignore_relpath(mystreamer->context, member->pathname))
+     return;

Doesn't this need to happen after we add pg_tblspc/$OID to the path, rather than before? I bet this doesn't work correctly for files in user-defined tablespaces, compared to the way it works for a directory-format backup.

+ char temp[MAXPGPATH];
+
+ /* Copy original name at temporary space */
+ memcpy(temp, member->pathname, MAXPGPATH);
+
+ snprintf(member->pathname, MAXPGPATH, "%s/%d/%s",
+          "pg_tblspc", mystreamer->tblspc_oid, temp);

I don't like this at all. This function doesn't have any business modifying the astreamer_member, and it doesn't need to. It can just do char *pathname; char tmppathbuf[MAXPGPATH] and then set pathname to either member->pathname or tmppathbuf depending on OidIsValid(tblspcoid). Also, shouldn't this be using %u, not %d?

+ mystreamer->mfile = (void *) m;

Either the cast to void * isn't necessary, or it indicates that there's a type mismatch that should be fixed.

+ * We could have these checks while receiving contents. However, since
+ * contents are received in multiple iterations, this would result in
+ * these lengthy checks being performed multiple times. Instead, setting
+ * flags here and using them before proceeding with verification will be
+ * more efficient.

Seems unnecessary to explain this.

+ Assert(mystreamer->verify_checksum);
+
+ /* Should have come for the right file */
+ Assert(strcmp(member->pathname, m->pathname) == 0);
+
+ /*
+  * The checksum context should match the type noted in the backup
+  * manifest.
+  */
+ Assert(checksum_ctx->type == m->checksum_type);

What do you think about:

Assert(m != NULL && !m->bad);
Assert(checksum_ctx->type == m->checksum_type);
Assert(strcmp(member->pathname, m->pathname) == 0);

Or possibly change the first one to Assert(should_verify_checksum(m))?

+ memcpy(&control_file, streamer->bbs_buffer.data, sizeof(ControlFileData));

This probably doesn't really hurt anything, but it's a bit ugly. You first use astreamer_buffer_until() to force the entire file into a buffer. And then here, you copy the entire file into a second buffer which is exactly the same except that it's guaranteed to be properly aligned. It would be possible to include a ControlFileData in astreamer_verify and copy the bytes directly into it (you'd need a second astreamer_verify field for the number of bytes already copied into that structure). I'm not 100% sure that's worth the code, but it seems like it wouldn't take more than a few lines, so perhaps it is.

+/*
+ * Verify plain backup.
+ */
+static void
+verify_plain_backup(verifier_context *context)
+{
+     Assert(context->format == 'p');
+     verify_backup_directory(context, NULL, context->backup_directory);
+}
+

This seems like a waste of space.

+verify_tar_backup(verifier_context *context)

I like this a lot better now! I'm still not quite sure about the decision to have the ignore list apply to both the backup directory and the tar file contents -- but given the low participation on this thread, I don't think we have much chance of getting feedback from anyone else right now, so let's just do it the way you have it and we can change it later if someone shows up to complain.

+verify_all_tar_files(verifier_context *context, SimplePtrList *tarfiles)

I think this code could be moved into its only caller instead of having a separate function. And then if you do that, maybe verify_one_tar_file could be renamed to just verify_tar_file. Or perhaps that function could also be removed and just move the code into the caller. 
It's not much code and not very deeply nested. Similarly create_archive_verifier() could be considered for this treatment. Maybe inlining all of these is too much and the result will look messy, but I think we should at least try to combine some of them. ...Robert
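The local-buffer approach suggested above could be sketched roughly like this; the helper name, the simplified Oid typedef, and the MAXPGPATH value are stand-ins for illustration, not the actual patch code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAXPGPATH 1024

typedef unsigned int Oid;       /* stand-in for PostgreSQL's Oid */
#define InvalidOid ((Oid) 0)
#define OidIsValid(oid) ((oid) != InvalidOid)

/*
 * Choose the path to verify without modifying the archive member itself:
 * return the member's own pathname, unless it belongs to a user-defined
 * tablespace, in which case build "pg_tblspc/<oid>/<name>" in the
 * caller-supplied buffer.  Note %u, not %d: Oid is unsigned.
 */
static const char *
resolve_member_path(const char *member_pathname, Oid tblspc_oid,
                    char *tmppathbuf, size_t buflen)
{
    if (!OidIsValid(tblspc_oid))
        return member_pathname;

    snprintf(tmppathbuf, buflen, "pg_tblspc/%u/%s",
             tblspc_oid, member_pathname);
    return tmppathbuf;
}
```

The point of the sketch is that the temporary buffer lives in the caller and the astreamer_member stays untouched.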
+ pg_log_error("pg_waldump cannot read from a tar");

"tar" isn't normally used as a noun as you do here, so I think this should say "pg_waldump cannot read tar files".

Technically, the position of this check could lead to an unnecessary failure, if -n wasn't specified but pg_wal.tar(.whatever) also doesn't exist on disk. But I think it's OK to ignore that case. However, I also notice this change to the TAP tests in a few places:

- [ 'pg_verifybackup', $tempdir ],
+ [ 'pg_verifybackup', '-Fp', $tempdir ],

It's not the end of the world to have to make a change like this, but it seems easy to do better. Just postpone the call to find_backup_format() until right after we call parse_manifest_file(). That also means postponing the check mentioned above until right after that, but that's fine: after parse_manifest_file() and then find_backup_format(), you can do if (!no_parse_wal && context.format == 't') { bail out }.

+ if (stat(path, &sb) == 0)
+     result = 'p';
+ else
+ {
+     if (errno != ENOENT)
+     {
+         pg_log_error("cannot determine backup format : %m");
+         pg_log_error_hint("Try \"%s --help\" for more information.", progname);
+         exit(1);
+     }
+
+     /* Otherwise, it is assumed to be a tar format backup. */
+     result = 't';
+ }

This doesn't look good, for a few reasons:

1. It would be clearer to structure this as if (stat(...) == 0) result = 'p'; else if (errno == ENOENT) result = 't'; else { report an error; } instead of the way you have it.

2. "cannot determine backup format" is not an appropriate way to report the failure of stat(). The appropriate message is "could not stat file \"%s\"".

3. It is almost never correct to put a space before a colon in an error message.

4. The hint doesn't look helpful, or necessary. I think you can just delete that.

Regarding both point #2 and point #4, I think we should ask ourselves how stat() could realistically fail here. On my system (macOS), the documented failure modes for stat() are EACCES (i.e. permission denied), EFAULT (i.e. we have a bug in pg_verifybackup), EIO (I/O error), ELOOP (symlink loop), ENAMETOOLONG, ENOENT, ENOTDIR, and EOVERFLOW. In none of those cases does it seem likely that specifying the format manually will help anything. Thus, suggesting that the user look at the help, presumably to find --format, is unlikely to solve anything, and telling them that the error happened while trying to determine the backup format isn't really helping anything, either. What the user needs to know is that it was stat() that failed, and the pathname for which it failed. Then they need to sort out whatever problem is causing them to get one of those really weird errors.

Aside from the above, 0010 looks pretty reasonable, although I'll probably want to do some wordsmithing on some of the comments at some point.

- of the backup. The backup must be stored in the "plain"
- format; a "tar" format backup can be checked after extracting it.
+ of the backup. The backup must be stored in the "plain" or "tar"
+ format. Verification is supported for <literal>gzip</literal>,
+ <literal>lz4</literal>, and <literal>zstd</literal> compressed tar backup;
+ any other compressed format backups can be checked after decompressing them.

I don't think that we need to say that the backup must be stored in the plain or tar format, because those are the only backup formats pg_basebackup knows about. Similarly, it doesn't seem helpful to me to enumerate all the compression algorithms that pg_basebackup supports and say we only support those; what else would a user expect? What I would do is replace the original sentence ("The backup must be stored...") with:

The backup may be stored either in the "plain" or the "tar" format; this includes "tar" backups compressed with any algorithm supported by pg_basebackup. However, at present, WAL verification is supported only for plain-format backups. Therefore, if the backup is stored in "tar" format, the <literal>-n, --no-parse-wal</literal> option should be used.

+ # Add tar backup format option
+ push @backup, ('-F', 't');
+ # Add switch to skip WAL verification.
+ push @verify, ('-n');

Say why, not what. The second comment should say something like "WAL verification not yet supported for tar-format backups".

+ "$format backup fails with algorithm \"$algorithm\"");
+ $primary->command_ok(\@backup, "$format backup ok with algorithm \"$algorithm\"");
+ ok(-f "$backup_path/backup_manifest", "$format backup manifest exists");
+ "verify $format backup with algorithm \"$algorithm\"");

Personally I would change "$format" to "$format format" in all of these places, so that we talk about a "tar format backup" or a "plain format backup" instead of a "tar backup" or a "plain backup".

+ 'skip_on_windows' => 1

I don't understand why 4 of the 7 new tests are skipped on Windows. The existing "skip" message for this parameter says "unix-style permissions not supported on Windows" but that doesn't seem applicable for any of the new cases, and I couldn't find a comment about it, either.

+ my @files = glob("*");
+ system_or_bail($tar, '-cf', "$backup_path/$archive", @files);

Why not just system_or_bail($tar, '-cf', "$backup_path/$archive", '.')?

Also, instead of having separate entries in the test array to do basically the same thing on Windows, could we just iterate through the test array twice and do everything once for plain format and then a second time for tar format, and do the tests once for each? I don't think that idea QUITE works, because the open_file_fails, open_directory_fails, and search_directory_fails tests are really not applicable to tar format. But we could rename skip_on_windows to tests_file_permissions and skip those both on Windows and for tar format. But aside from that, I don't quite see why it makes sense to, for example, test extra_file for both formats but not extra_tablespace_file, and indeed it seems like an important bit of test coverage.

I also feel like we should have tests someplace that add extra files to a tar-format backup in the backup directory (e.g. 1234567.tar, or wug.tar, or 123456.wug) or remove entire files.

...Robert
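Points 1 and 2 above amount to something like the following shape; this is a sketch with plain fprintf/exit standing in for pg_log_error and friends, and the function name is illustrative:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Detect the backup format from the presence of the given file (PG_VERSION
 * in the real tool): 'p' (plain) if it exists, 't' (tar) if it does not.
 * Any other stat() failure is reported as a stat() failure naming the
 * path, not as an inability to determine the format.
 */
static char
find_backup_format_sketch(const char *path)
{
    struct stat sb;

    if (stat(path, &sb) == 0)
        return 'p';
    else if (errno == ENOENT)
        return 't';

    fprintf(stderr, "could not stat file \"%s\": %s\n",
            path, strerror(errno));
    exit(1);
}
```

The flat if / else if / else shape makes the three outcomes (plain, tar, genuine failure) read in one pass.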
On Thu, Sep 26, 2024 at 12:18 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Sep 12, 2024 at 7:05 AM Amul Sul <sulamul@gmail.com> wrote:
> > The updated version attached. Thank you for the review !
>
> I have spent a bunch of time on this and have made numerous revisions.
> I hope to commit the result, after seeing what you and the
> buildfarm think (and anyone else who wishes to offer an opinion).
> Changes:

Thank you, Robert. The code changes look much better now. A few minor comments:

+ each tablespace, named after the tablespace's OID. If the backup
+ is compressed, the relevant compression extension is added to the
+ end of each file name.

I am a bit unsure about the last line, especially the use of the word "added." I feel like it's implying that we're adding something, which isn't true.

--

Typo: futher

--

The addition of simple_ptr_list_destroy will be part of a separate commit, correct?

Regards,
Amul
On Fri, Sep 27, 2024 at 2:07 AM Amul Sul <sulamul@gmail.com> wrote: > Thank you, Robert. The code changes look much better now. Cool. > A few minor comments: > > + each tablespace, named after the tablespace's OID. If the backup > + is compressed, the relevant compression extension is added to the > + end of each file name. > > I am a bit unsure about the last line, especially the use of the word > "added." I feel like it's implying that we're adding something, which > isn't true. If you add .gz to the end of 16904.tar, you get 16904.tar.gz. This seems like correct English to me. > Typo: futher OK, thanks. > The addition of simple_ptr_list_destroy will be part of a separate > commit, correct? To me, it doesn't seem worth splitting that out into a separate commit. -- Robert Haas EDB: http://www.enterprisedb.com
The 32-bit buildfarm animals don't like this too much [1]:

astreamer_verify.c: In function 'member_verify_checksum':
astreamer_verify.c:297:68: error: format '%zu' expects argument of type 'size_t', but argument 6 has type 'uint64' {aka 'long long unsigned int'} [-Werror=format=]
  297 |   "file \"%s\" in \"%s\" should contain %zu bytes, but read %zu bytes",
  298 |   m->pathname, mystreamer->archive_name,
  299 |   m->size, mystreamer->checksum_bytes);

Now, manifest_file.size is in fact a size_t, so %zu is the correct format spec for it. But astreamer_verify.checksum_bytes is declared uint64. This leads me to question whether size_t was the correct choice for manifest_file.size. If it is, is it OK to use size_t for checksum_bytes as well? If not, your best bet is likely to write %lld and add an explicit cast to long long, as we do in dozens of other places. I see that these messages are intended to be translatable, so INT64_FORMAT is not usable here.

Aside from that, I'm unimpressed with expending a five-line comment at line 376 to justify casting control_file_bytes to int, when you could similarly cast it to long long, avoiding the need to justify something that's not even in tune with project style.

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-09-28%2006%3A03%3A02
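The cast-to-long-long convention mentioned here looks like this in practice; a generic sketch with a made-up helper, not the actual astreamer_verify code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * %zu only matches size_t, and INT64_FORMAT varies by platform (which
 * breaks translatable strings), so the portable spelling for a 64-bit
 * count is an explicit cast to (unsigned) long long with %llu/%lld.
 */
static void
format_size_mismatch(char *buf, size_t buflen, size_t expected, uint64_t actual)
{
    snprintf(buf, buflen, "should contain %llu bytes, but read %llu bytes",
             (unsigned long long) expected,
             (unsigned long long) actual);
}
```

This compiles cleanly whether size_t and uint64 are 32 or 64 bits wide, which is exactly what the 32-bit animals were complaining about.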
Piling on a bit ... Coverity reported the following issues in this new code. I have not analyzed them to see if they're real problems.

________________________________________________________________________________________________________
*** CID 1620458: Resource leaks (RESOURCE_LEAK)
/srv/coverity/git/pgsql-git/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c: 1025 in verify_tar_file()
1019                         relpath);
1020
1021         /* Close the file. */
1022         if (close(fd) != 0)
1023                 report_backup_error(context, "could not close file \"%s\": %m",
1024                         relpath);
>>> CID 1620458: Resource leaks (RESOURCE_LEAK)
>>> Variable "buffer" going out of scope leaks the storage it points to.
1025 }
1026
1027 /*
1028  * Scan the hash table for entries where the 'matched' flag is not set; report
1029  * that such files are present in the manifest but not on disk.
1030  */

________________________________________________________________________________________________________
*** CID 1620457: Memory - illegal accesses (OVERRUN)
/srv/coverity/git/pgsql-git/postgresql/src/bin/pg_verifybackup/astreamer_verify.c: 349 in member_copy_control_data()
343          */
344         if (mystreamer->control_file_bytes <= sizeof(ControlFileData))
345         {
346                 int     remaining;
347
348                 remaining = sizeof(ControlFileData) - mystreamer->control_file_bytes;
>>> CID 1620457: Memory - illegal accesses (OVERRUN)
>>> Overrunning array of 296 bytes at byte offset 296 by dereferencing pointer "(char *)&mystreamer->control_file + mystreamer->control_file_bytes".
349                 memcpy(((char *) &mystreamer->control_file)
350                        + mystreamer->control_file_bytes,
351                        data, Min(len, remaining));
352         }
353
354         /* Remember how many bytes we saw, even if we didn't buffer them.
355          */

________________________________________________________________________________________________________
*** CID 1620456: Null pointer dereferences (FORWARD_NULL)
/srv/coverity/git/pgsql-git/postgresql/src/bin/pg_verifybackup/pg_verifybackup.c: 939 in precheck_tar_backup_file()
933                         "file \"%s\" is not expected in a tar format backup",
934                         relpath);
935                 tblspc_oid = (Oid) num;
936         }
937
938         /* Now, check the compression type of the tar */
>>> CID 1620456: Null pointer dereferences (FORWARD_NULL)
>>> Passing null pointer "suffix" to "strcmp", which dereferences it.
939         if (strcmp(suffix, ".tar") == 0)
940                 compress_algorithm = PG_COMPRESSION_NONE;
941         else if (strcmp(suffix, ".tgz") == 0)
942                 compress_algorithm = PG_COMPRESSION_GZIP;
943         else if (strcmp(suffix, ".tar.gz") == 0)
944                 compress_algorithm = PG_COMPRESSION_GZIP;

regards, tom lane
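The FORWARD_NULL report boils down to a missing NULL guard before the strcmp() chain. A minimal version of the pattern, with a plain int standing in for pg_compress_algorithm and an invented helper name:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Because report_backup_error() returns rather than exiting, code after it
 * must not assume "suffix" is non-NULL.  Checking for NULL first and
 * returning a sentinel keeps strcmp() off a null pointer.
 */
static int
compression_from_suffix(const char *suffix)
{
    if (suffix == NULL)
        return -1;              /* caller already reported the bad name */
    if (strcmp(suffix, ".tar") == 0)
        return 0;               /* no compression */
    if (strcmp(suffix, ".tgz") == 0 || strcmp(suffix, ".tar.gz") == 0)
        return 1;               /* gzip */
    return -1;                  /* unrecognized suffix */
}
```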
On Sat, Sep 28, 2024 at 6:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Now, manifest_file.size is in fact a size_t, so %zu is the correct > format spec for it. But astreamer_verify.checksum_bytes is declared > uint64. This leads me to question whether size_t was the correct > choice for manifest_file.size. If it is, is it OK to use size_t > for checksum_bytes as well? If not, your best bet is likely > to write %lld and add an explicit cast to long long, as we do in > dozens of other places. I see that these messages are intended to be > translatable, so INT64_FORMAT is not usable here. It seems that manifest_file.size is size_t because parse_manifest.h is using size_t for json_manifest_per_file_callback. What's happening is that json_manifest_finalize_file() is parsing the file size information out of the manifest. It uses strtoul to do that and assigns the result to a size_t. I don't think I had any principled reason for making that decision; size_t is, I think, the size of an object in memory, and this is not that. This is just a string in a JSON file that represents an integer which will hopefully turn out to be the size of the file on disk. I guess I don't know what type I should be using here. Most things in PostgreSQL use a type like uint32 or uint64, but technically what we're going to be comparing against in the end is probably an off_t produced by stat(), but the return value of strtoul() or strtoull() is unsigned long or unsigned long long, which is not any of those things. If you have a suggestion here, I'm all ears. > Aside from that, I'm unimpressed with expending a five-line comment > at line 376 to justify casting control_file_bytes to int, when you > could similarly cast it to long long, avoiding the need to justify > something that's not even in tune with project style. I don't know what you mean by this. The comment explains that I used %d here because that's what pg_rewind does, and %d corresponds to int, not long long. 
If you think it should be some other way, you can change it, and perhaps you'd like to change pg_rewind to match while you're at it. But the fact that there's a comment here explaining the reasoning is a feature, not a bug. It's weird to me to get criticized for failing to follow project style when I literally copied something that already exists. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Sat, Sep 28, 2024 at 6:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Now, manifest_file.size is in fact a size_t, so %zu is the correct >> format spec for it. But astreamer_verify.checksum_bytes is declared >> uint64. This leads me to question whether size_t was the correct >> choice for manifest_file.size. > It seems that manifest_file.size is size_t because parse_manifest.h is > using size_t for json_manifest_per_file_callback. What's happening is > that json_manifest_finalize_file() is parsing the file size > information out of the manifest. It uses strtoul to do that and > assigns the result to a size_t. I don't think I had any principled > reason for making that decision; size_t is, I think, the size of an > object in memory, and this is not that. Correct, size_t is defined to measure in-memory object sizes. It's the argument type of malloc(), the result type of sizeof(), etc. It does not restrict the size of disk files. > This is just a string in a > JSON file that represents an integer which will hopefully turn out to > be the size of the file on disk. I guess I don't know what type I > should be using here. Most things in PostgreSQL use a type like uint32 > or uint64, but technically what we're going to be comparing against in > the end is probably an off_t produced by stat(), but the return value > of strtoul() or strtoull() is unsigned long or unsigned long long, > which is not any of those things. If you have a suggestion here, I'm > all ears. I don't know if it's realistic to expect that this code might be used to process JSON blobs exceeding 4GB. But if it is, I'd be inclined to use uint64 and strtoull for these purposes, if only to avoid cross-platform hazards with varying sizeof(long) and sizeof(size_t). Um, wait ... we do have strtou64(), so you should use that. 
>> Aside from that, I'm unimpressed with expending a five-line comment >> at line 376 to justify casting control_file_bytes to int, > I don't know what you mean by this. I mean that we have a widely-used, better solution. If you copied this from someplace else, the someplace else could stand to be improved too. regards, tom lane
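Tom's strtou64() suggestion amounts to parsing the manifest size into a fixed-width 64-bit value. A rough standalone equivalent using strtoull (PostgreSQL's strtou64() wraps roughly this; the helper name is mine):

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a decimal file-size string into a uint64_t, rejecting empty
 * strings, trailing junk, and overflow.  A fixed-width 64-bit type
 * sidesteps the varying sizeof(long)/sizeof(size_t) across platforms.
 */
static int
parse_manifest_size(const char *s, uint64_t *size)
{
    char *end;

    errno = 0;
    *size = (uint64_t) strtoull(s, &end, 10);
    return errno == 0 && end != s && *end == '\0';
}
```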
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Sep 29, 2024 at 1:03 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> CID 1620458: Resource leaks (RESOURCE_LEAK)
>>> Variable "buffer" going out of scope leaks the storage it points to.

> This looks like a real leak. It can only happen once per tarfile when
> verifying a tar backup so it can never add up to much, but it makes
> sense to fix it.

+1

>>> CID 1620457: Memory - illegal accesses (OVERRUN)
>>> Overrunning array of 296 bytes at byte offset 296 by dereferencing pointer "(char *)&mystreamer->control_file + mystreamer->control_file_bytes".

> I think this might be complaining about a potential zero-length copy.
> Seems like perhaps the <= sizeof(ControlFileData) test should actually
> be < sizeof(ControlFileData).

That's clearly an improvement, but I was wondering if we should also change "len" and "remaining" to be unsigned (probably size_t). Coverity might be unhappy about the off-the-end array reference, but perhaps it's also concerned about what happens if len is negative.

>>> CID 1620456: Null pointer dereferences (FORWARD_NULL)
>>> Passing null pointer "suffix" to "strcmp", which dereferences it.

> This one is happening, I believe, because report_backup_error()
> doesn't perform a non-local exit, but we have a bit of code here that
> acts like it does.

Check.

> Patch attached.

WFM, modulo the suggestion about changing data types.

regards, tom lane
On Mon, Sep 30, 2024 at 11:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > This is just a string in a > > JSON file that represents an integer which will hopefully turn out to > > be the size of the file on disk. I guess I don't know what type I > > should be using here. Most things in PostgreSQL use a type like uint32 > > or uint64, but technically what we're going to be comparing against in > > the end is probably an off_t produced by stat(), but the return value > > of strtoul() or strtoull() is unsigned long or unsigned long long, > > which is not any of those things. If you have a suggestion here, I'm > > all ears. > > I don't know if it's realistic to expect that this code might be used > to process JSON blobs exceeding 4GB. But if it is, I'd be inclined to > use uint64 and strtoull for these purposes, if only to avoid > cross-platform hazards with varying sizeof(long) and sizeof(size_t). > > Um, wait ... we do have strtou64(), so you should use that. The thing we should be worried about is not how large a JSON blob might be, but rather how large any file that appears in the data directory might be. So uint32 is out; and I think I hear you voting for uint64 over size_t. But then how do you think we should print that? Cast to unsigned long long and use %llu? > >> Aside from that, I'm unimpressed with expending a five-line comment > >> at line 376 to justify casting control_file_bytes to int, > > > I don't know what you mean by this. > > I mean that we have a widely-used, better solution. If you copied > this from someplace else, the someplace else could stand to be > improved too. I don't understand what you think the widely-used, better solution is here. As far as I can see, there are two goods here, between which one must choose. One can decide to use the same error message string, and I hope we can agree that's good, because I've been criticized in the past when I have done otherwise, as have many others. 
The other good is to use the most appropriate data type. One cannot have both of those things in this instance, unless one goes and fixes the other code also, but such a change had no business being part of this patch. If the issue had been serious and likely to occur in real life, I would have probably fixed it in a preparatory patch, but it isn't, so I settled for adding a comment. -- Robert Haas EDB: http://www.enterprisedb.com
On Mon, Sep 30, 2024 at 11:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > WFM, modulo the suggestion about changing data types. I would prefer not to make the data type change here because it has quite a few tentacles. If I change member_copy_control_data() then I have to change astreamer_verify_content() which means changing the astreamer interface which means adjusting all of the other astreamers. That can certainly be done, but it's quite possible it might provoke some other Coverity warning. Since this is a length, it might've been better to use an unsigned data type, but there's no reason that I can see why it should be size_t specifically: the origin of the value could be either the return value of read(), which is ssize_t not size_t, or the number of bytes returned by a decompression library or the number of bytes present in a protocol message. Trying to make things fit better here is just likely to make them fit worse someplace else. "You are in a maze of twisty little data types, all alike." -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Sep 30, 2024 at 11:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Um, wait ... we do have strtou64(), so you should use that. > The thing we should be worried about is not how large a JSON blob > might be, but rather how large any file that appears in the data > directory might be. So uint32 is out; and I think I hear you voting > for uint64 over size_t. Yes. size_t might only be 32 bits. > But then how do you think we should print > that? Cast to unsigned long long and use %llu? Our two standard solutions are to do that or to use UINT64_FORMAT. But UINT64_FORMAT is problematic in translatable strings because then the .po files would become platform-specific, so long long is what to use in that case. For a non-translated format string you can do either. > I don't understand what you think the widely-used, better solution is > here. What we just said above. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Sep 30, 2024 at 11:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> WFM, modulo the suggestion about changing data types.

> I would prefer not to make the data type change here because it has
> quite a few tentacles.

I see your point for the function's "len" argument, but perhaps it's worth doing

-        int         remaining;
+        size_t      remaining;

         remaining = sizeof(ControlFileData) - mystreamer->control_file_bytes;
         memcpy(((char *) &mystreamer->control_file)
                + mystreamer->control_file_bytes,
-               data, Min(len, remaining));
+               data, Min((size_t) len, remaining));

This is enough to ensure the Min() remains safe.

regards, tom lane
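The hazard behind the suggested cast is the usual arithmetic conversions inside Min(): with one signed and one unsigned operand, a negative len is silently converted to a huge unsigned value. A small illustration, with Min defined as in PostgreSQL's c.h and an invented wrapper:

```c
#include <assert.h>
#include <stddef.h>

#define Min(x, y) ((x) < (y) ? (x) : (y))

/*
 * Mixing a signed int with a size_t in Min() converts the int to size_t,
 * so len = -1 compares as SIZE_MAX and Min() returns "remaining" instead
 * of flagging the bogus length.  Validating len first, then casting,
 * keeps the comparison well-defined.
 */
static size_t
clamped_copy_len(int len, size_t remaining)
{
    if (len < 0)
        return 0;               /* reject bogus lengths up front */
    return Min((size_t) len, remaining);
}
```

With the explicit cast both operands of Min() have the same type, which is the property Tom's one-line diff restores.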
On Mon, Sep 30, 2024 at 6:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Mon, Sep 30, 2024 at 11:31 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> WFM, modulo the suggestion about changing data types.
>
> > I would prefer not to make the data type change here because it has
> > quite a few tentacles.
>
> I see your point for the function's "len" argument, but perhaps it's
> worth doing
>
> -        int         remaining;
> +        size_t      remaining;
>
>          remaining = sizeof(ControlFileData) - mystreamer->control_file_bytes;
>          memcpy(((char *) &mystreamer->control_file)
>                 + mystreamer->control_file_bytes,
> -               data, Min(len, remaining));
> +               data, Min((size_t) len, remaining));
>
> This is enough to ensure the Min() remains safe.

OK, done!

-- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > Here is an attempt at cleaning this up. I'm far from convinced that > it's fully correct; my local compiler (clang version 15.0.7) doesn't > seem fussed about conflating size_t with uint64, not even with -Wall > -Werror. I don't suppose you have a fussier compiler locally that you > can use to test this? Sure, I can try it on mamba's host. It's slow though ... > Respectfully, if you'd just said in your first email about this "I > understand that you were trying to be consistent with a format string > somewhere else, but I don't think that's a good reason to do it this > way, so please use %llu and insert a cast," I would have just said > "fine, no problem" and I wouldn't have been irritated at all. I apologize for rubbing you the wrong way on this. It was not my intent. (But, in fact, I had not realized you copied that code from somewhere else.) regards, tom lane
On Tue, Oct 1, 2024 at 10:48 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Sure, I can try it on mamba's host. It's slow though ... OK, thanks. > I apologize for rubbing you the wrong way on this. It was not my > intent. (But, in fact, I had not realized you copied that code > from somewhere else.) That's good to hear, but I'm still slightly puzzled because you started by complaining about the comment and the comment, as I read it, says that it's copied. So either the comment isn't as clear as I think it is, or you didn't read it before complaining about it. We don't have to keep going back and forth on this. I'm happy to have you change it in any way that you feel suitable, with or without adjusting pg_rewind to match. There wouldn't have been a lengthy comment here if I'd been certain what the right thing to do was; and I wasn't. If you are, great! -- Robert Haas EDB: http://www.enterprisedb.com
I wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Here is an attempt at cleaning this up. I'm far from convinced that >> it's fully correct; my local compiler (clang version 15.0.7) doesn't >> seem fussed about conflating size_t with uint64, not even with -Wall >> -Werror. I don't suppose you have a fussier compiler locally that you >> can use to test this? > Sure, I can try it on mamba's host. It's slow though ... Yes, mamba thinks this is OK. regards, tom lane
On Tue, Oct 1, 2024 at 1:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Yes, mamba thinks this is OK. Committed. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Oct 1, 2024 at 1:32 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Yes, mamba thinks this is OK.

> Committed.

Sadly, it seems adder[1] is even pickier than mamba:

../pgsql/src/backend/backup/basebackup_incremental.c: In function 'CreateIncrementalBackupInfo':
../pgsql/src/backend/backup/basebackup_incremental.c:179:30: error: assignment to 'json_manifest_per_file_callback' {aka 'void (*)(JsonManifestParseContext *, const char *, long long unsigned int, pg_checksum_type, int, unsigned char *)'} from incompatible pointer type 'void (*)(JsonManifestParseContext *, const char *, size_t, pg_checksum_type, int, uint8 *)' {aka 'void (*)(JsonManifestParseContext *, const char *, unsigned int, pg_checksum_type, int, unsigned char *)'} [-Wincompatible-pointer-types]
  179 | context->per_file_cb = manifest_process_file;
      |                      ^

regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-10-02%2014%3A09%3A58
I wrote: > Sadly, it seems adder[1] is even pickier than mamba: Nope, it was my testing error: I supposed that this patch only affected pg_combinebackup and pg_verifybackup, so I only recompiled those modules not the whole tree. But there's one more place with a json_manifest_per_file_callback callback :-(. I pushed a quick-hack fix. regards, tom lane
On Wed, Oct 2, 2024 at 8:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Nope, it was my testing error: I supposed that this patch only > affected pg_combinebackup and pg_verifybackup, so I only > recompiled those modules not the whole tree. But there's one > more place with a json_manifest_per_file_callback callback :-(. > > I pushed a quick-hack fix. I should have realized that was there, considering that it was I who added it and not very long ago. Thanks for fixing it. -- Robert Haas EDB: http://www.enterprisedb.com