Re: pg_combinebackup --copy-file-range - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: pg_combinebackup --copy-file-range
Msg-id: 0e27835d-dab5-49cd-a3ea-52cf6d9ef59e@enterprisedb.com
In response to: Re: pg_combinebackup --copy-file-range (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: pg_combinebackup --copy-file-range
List: pgsql-hackers
On 3/29/24 15:23, Robert Haas wrote:
> On Tue, Mar 26, 2024 at 2:03 PM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
>> [ new patches ]
>
> Tomas, thanks for pointing me to this email; as you speculated, gmail
> breaks threading if the subject line changes.
>
> The documentation still needs work here:
>
> - It refers to --link mode, which is not a thing.
>
> - It should talk about the fact that in some cases block-by-block
> copying will still be needed, possibly mentioning that it specifically
> happens when the old backup manifest is not available or does not
> contain checksums or does not contain checksums of the right type, or
> maybe just being a bit vague.
>
> In copy_file.c:
>
> - You added an unnecessary blank line to the beginning of a comment block.

Thanks, should be all cleaned up now, I think.

> - You could keep the strategy_implementation variable here. I don't
> think that's 100% necessary, but now that you've got the code
> structured this way, there's no compelling reason to remove it.

Yeah, I think you're right. The strategy_implementation seemed a bit
weird to me because we now have 4 functions with different signatures.
Most only take src/dst, but copy_file_blocks() also takes a checksum
context. It seemed better to handle everything the same way, rather
than treating copy_file_blocks() as an exception. But it's not that
bad, so 0001 has strategy_implementation again. I'll get back to this
in a minute.

> - I don't know what the +/* XXX maybe this should do the check
> internally, same as the other functions? */ comment is talking about.

I think this is stale. The XXX is about how the various functions
detect/report support. In most of them we have the ifdefs/pg_fatal()
inside the function, but CopyFile() has nothing like that, because the
detection happens earlier. I wasn't sure if maybe we should make all
these functions more alike, but I don't think we should.

> - Maybe these functions should have header comments.

Right, added.
I was thinking about the comment [1] from a couple of days ago, where
Thomas suggested that maybe we should try doing the CoW stuff
(clone/copy_file_range) even in cases when we need to read the block,
say to calculate a checksum, or even to reconstruct it from incremental
backups.

I wasn't sure how big the benefits of the patches shared so far might
be, so I decided to do some tests. I did a fairly simple thing:

1) initialize a cluster with pgbench scale 5000 (~75GB)

2) create a full backup

3) do a run that updates ~1%, 10% and 20% of the blocks

4) create an incremental backup after each run

5) do pg_combinebackup for each of the increments, with block-by-block
copy and copy_file_range, and measure how long it takes and how much
disk space it consumes

I did this on xfs and btrfs, and it quickly became obvious that there's
very little benefit unless --no-manifest is used. Which makes perfect
sense, because pgbench does uniform updates, so all segments need to be
reconstructed from increments (so copy_file.c is mostly irrelevant),
and write_reconstructed_file only uses copy_file_range() without
checksums.

I don't know how common --no-manifest is going to be, but I guess
people will want to keep manifests in at least some backup schemes
(e.g. to rebuild a full backup instead of having to take a full backup
regularly).

So I decided to take a stab at Thomas' idea, i.e. reading the data to
calculate checksums, but then using copy_file_range instead of just
writing the data onto disk. This is in 0003, which relaxes the
conditions in the 0002 shared a couple days ago. And this helped a lot.

The attached PDF shows results for xfs/btrfs. Charts on the left are
disk space occupied by the reconstructed backup, measured as the
difference between "df" before and after running pg_combinebackup. The
duration of the pg_combinebackup execution is on the right. The first
row is without manifest (i.e. --no-manifest), the second row is with
manifest.
The 1%, 10% and 20% groups are for the various increments, updating
different fractions of the database. The database is ~75GB, so that's
what we expect a plain copy to use. If copy_file_range provides some
CoW benefit, allowing the fs to reuse or share some of the space, the
disk space should be reduced. And similarly, there could/should be some
improvement in pg_combinebackup duration.

Each bar is a different copy method and patch:

* copy on master/0001/0002/0003 - we don't expect any difference
between these, it should all perform the same and use the "full" space

* copy_file_range on 0001/0002/0003 - 0001 should perform the same as
copy, because it's only about full-segment copies, and we don't have
any of those here; 0002/0003 should help, depending on --no-manifest

And indeed, this is what we see. 0002/0003 use only a fraction of the
disk space, roughly the same as the updated fraction (which matches the
size of the increment). This is nice.

For duration, the benefits seem to depend on the file system. On btrfs
it actually is faster, as expected. 0002/0003 saves maybe 30-50% of
time, compared to block-by-block copy. On XFS it's not that great, the
copy_file_range is actually slower by up to about 50%. And this is not
about the extra read - this affects the 0002/no-manifest case too,
where the read is not necessary.

I think this is fine. It's a tradeoff, where on some filesystems you
can save time or space, and on other filesystems you can save both.
That's a tradeoff for the users to decide, I think. I'll see how this
works on EXT4/ZFS next ...

But thinking about this a bit more, I realized there's no reason not to
apply the same logic to the copy_file part. I mean, if we need to copy
a file but also calculate a checksum, we can simply do the clone using
clone/copy_file_range, but then also read the file and calculate the
checksum ...
0004 does this, by simply passing the checksum_cxt around, which also
has the nice consequence that all the functions now have the same
signature, which makes the strategy_implementation work for all cases.
I need to do more testing of this, but I like how this looks. Of
course, maybe there's no agreement that this is the right way to
approach this, and we should do the regular block-by-block copy?

There's one more change in 0003 - the checks whether
clone/copy_file_range are supported by the platform now happen right at
the beginning, when parsing the arguments, so that when a user
specifies one of those options the error happens right away, instead of
sometime later when we happen to hit one of those pg_fatal() places. I
think this is the right place to do these checks, as it makes
write_reconstructed_file much easier to understand (without all the
ifdefs etc.). But there's an argument whether this should fail with
pg_fatal() or just fall back to the default copy method.

BTW I wonder if it makes sense to only allow one of those methods? At
the moment the user can specify both --clone and --copy-file-range, and
which one "wins" depends on the order in which they are specified.
Seems confusing at best. But maybe it'd make sense to allow both, and
e.g. use clone() to copy whole segments and copy_file_range() for other
places?

regards

[1] https://www.postgresql.org/message-id/CA%2BhUKG%2B8KDk%2BpM6vZHWT6XtZzh-sdieUDohcjj0fia6aqK3Oxg%40mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company