Re: Large files for relations - Mailing list pgsql-hackers
From | David Steele |
---|---|
Subject | Re: Large files for relations |
Date | |
Msg-id | 05cc050b-af2c-7504-c628-63bdcf104702@pgmasters.net Whole thread Raw |
In response to | Re: Large files for relations (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: Large files for relations
|
List | pgsql-hackers |
On 5/28/23 08:48, Thomas Munro wrote: > > Alright, since I had some time to kill in an airport, here is a > starter patch for initdb --rel-segsize. I've gone through this patch and it looks pretty good to me. A few things: + * rel_setment_size, we will truncate the K+1st segment to 0 length rel_setment_size -> rel_segment_size + * We used a phony GUC with a custome show function, because we don't custome -> custom + if (strcmp(endptr, "kB") == 0) Why kB here instead of KB to match MB, GB, TB below? + int64 relseg_size; /* blocks per segment of large relation */ This will require PG_CONTROL_VERSION to be bumped -- but you are probably waiting until commit time to avoid annoying conflicts, though I don't think it is as likely as with CATALOG_VERSION_NO. > Some random thoughts: > > Another potential option name would be --segsize, if we think we're > going to use this for temp files too eventually. I feel like temp file segsize should be separately configurable for the same reason that we are leaving it as 1GB for now. > Maybe it's not so beautiful to have that global variable > rel_segment_size (which replaces REL_SEGSIZE everywhere). Maybe not, but it is the way these things are done in general, .e.g. wal_segment_size, so I don't think it will be too controversial. > Another > idea would be to make it static in md.c and call smgrsetsegmentsize(), > or something like that. That could be a nice place to compute the > "shift" value up front, instead of computing it each time in > blockno_to_segno(), but that's probably not worth bothering with (?). > BSR/LZCNT/CLZ instructions are pretty fast on modern chips. That's > about the only place where someone could say that this change makes > things worse for people not interested in the new feature, so I was > careful to get rid of / and % operations with no-longer-constant RHS. Right -- not sure we should be troubling ourselves with trying to optimize away ops that are very fast, unless they are computed trillions of times. > I had to promote segment size to int64 (global variable, field in > control file), because otherwise it couldn't represent > --rel-segsize=32TB (it'd be too big by one). Other ideas would be to > store the shift value instead of the size, or store the max block > number, eg subtract one, or use InvalidBlockNumber to mean "no limit" > (with more branches to test for it). The only problem I ran into with > the larger type was that 'SHOW segment_size' now needs a custom show > function because we don't have int64 GUCs. A custom show function seems like a reasonable solution here. > A C type confusion problem that I noticed: some code uses BlockNumber > and some code uses int for segment numbers. It's not really a > reachable problem for practical reasons (you'd need over 2 billion > directories and VFDs to reach it), but it's wrong to use int if > segment size can be set as low as BLCKSZ (one file per block); you > could have more segments than an int can represent. We could go for > uint32, BlockNumber or create SegmentNumber (which I think I've > proposed before, and lost track of...). We can address that > separately (perhaps by finding my old patch...) I think addressing this separately is fine, though maybe enforcing some reasonable minimum in initdb would be a good idea for this patch. For my 2c SEGSIZE == BLOCKSZ just makes very little sense. Lastly, I think the blockno_to_segno(), blockno_within_segment(), and blockno_to_seekpos() functions add enough readability that they should be committed regardless of how this patch proceeds. Regards, -David
pgsql-hackers by date: