Home > mailing lists

Re: Large files for relations - Mailing list pgsql-hackers

From	David Steele
Subject	Re: Large files for relations
Date	June 12, 2023 08:52:56
Msg-id	05cc050b-af2c-7504-c628-63bdcf104702@pgmasters.net Whole thread Raw
In response to	Re: Large files for relations (Thomas Munro <thomas.munro@gmail.com>)
Responses	Re: Large files for relations
List	pgsql-hackers

Tree view

On 5/28/23 08:48, Thomas Munro wrote:
> 
> Alright, since I had some time to kill in an airport, here is a
> starter patch for initdb --rel-segsize.  

I've gone through this patch and it looks pretty good to me. A few things:

+             * rel_setment_size, we will truncate the K+1st segment to 0 length

rel_setment_size -> rel_segment_size

+     * We used a phony GUC with a custome show function, because we don't

custome -> custom

+        if (strcmp(endptr, "kB") == 0)

Why kB here instead of KB to match MB, GB, TB below?

+    int64        relseg_size;    /* blocks per segment of large relation */

This will require PG_CONTROL_VERSION to be bumped -- but you are 
probably waiting until commit time to avoid annoying conflicts, though I 
don't think it is as likely as with CATALOG_VERSION_NO.

> Some random thoughts:
> 
> Another potential option name would be --segsize, if we think we're
> going to use this for temp files too eventually.

I feel like temp file segsize should be separately configurable for the 
same reason that we are leaving it as 1GB for now.

> Maybe it's not so beautiful to have that global variable
> rel_segment_size (which replaces REL_SEGSIZE everywhere).  

Maybe not, but it is the way these things are done in general, .e.g. 
wal_segment_size, so I don't think it will be too controversial.

> Another
> idea would be to make it static in md.c and call smgrsetsegmentsize(),
> or something like that.  That could be a nice place to compute the
> "shift" value up front, instead of computing it each time in
> blockno_to_segno(), but that's probably not worth bothering with (?).
> BSR/LZCNT/CLZ instructions are pretty fast on modern chips.  That's
> about the only place where someone could say that this change makes
> things worse for people not interested in the new feature, so I was
> careful to get rid of / and % operations with no-longer-constant RHS.

Right -- not sure we should be troubling ourselves with trying to 
optimize away ops that are very fast, unless they are computed trillions 
of times.

> I had to promote segment size to int64 (global variable, field in
> control file), because otherwise it couldn't represent
> --rel-segsize=32TB (it'd be too big by one).  Other ideas would be to
> store the shift value instead of the size, or store the max block
> number, eg subtract one, or use InvalidBlockNumber to mean "no limit"
> (with more branches to test for it).  The only problem I ran into with
> the larger type was that 'SHOW segment_size' now needs a custom show
> function because we don't have int64 GUCs.

A custom show function seems like a reasonable solution here.

> A C type confusion problem that I noticed: some code uses BlockNumber
> and some code uses int for segment numbers.  It's not really a
> reachable problem for practical reasons (you'd need over 2 billion
> directories and VFDs to reach it), but it's wrong to use int if
> segment size can be set as low as BLCKSZ (one file per block); you
> could have more segments than an int can represent.  We could go for
> uint32, BlockNumber or create SegmentNumber (which I think I've
> proposed before, and lost track of...).  We can address that
> separately (perhaps by finding my old patch...)

I think addressing this separately is fine, though maybe enforcing some 
reasonable minimum in initdb would be a good idea for this patch. For my 
2c SEGSIZE == BLOCKSZ just makes very little sense.

Lastly, I think the blockno_to_segno(), blockno_within_segment(), and 
blockno_to_seekpos() functions add enough readability that they should 
be committed regardless of how this patch proceeds.

Regards,
-David

pgsql-hackers by date:

From: "Fujii.Yuki@df.MitsubishiElectric.co.jp"
Date: 12 June 2023, 08:51:30
Subject: RE: Partial aggregates pushdown

From: "Daniel Verite"
Date: 12 June 2023, 09:37:58
Subject: Re: Order changes in PG16 since ICU introduction

Re: Large files for relations - Mailing list pgsql-hackers

Previous

Next