Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem) - Mailing list pgsql-hackers
From | Hannu Krosing |
---|---|
Subject | Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem) |
Date | |
Msg-id | CAMT0RQQvki5v5FPZMXSR3cn4Y2gMUUtEYb+Hf_OQmqUbeBrUeQ@mail.gmail.com Whole thread Raw |
In response to | Support for 8-byte TOAST values (aka the TOAST infinite loop problem) (Michael Paquier <michael@paquier.xyz>) |
Responses |
Re: Support for 8-byte TOAST values (aka the TOAST infinite loop problem)
|
List | pgsql-hackers |
Hi Michael I'll take a look at the patch set. While digging around in the TOAST code did you have any ideas on how one could extract the TOAST APIs in a way that they can be added in Table Access Method definition ? Not all TAMs need TOAST, but the ones that do could also be the ones that still like to do something different when materializing toasted values. And TOAST is actually a nice abstraction which could be used as basis for both offloading more columns into separate forks and files as well as implementing some kinds of vectored, columnar and compressed storages. ---- Hannu On Thu, Jun 19, 2025 at 7:59 AM Michael Paquier <michael@paquier.xyz> wrote: > > Hi all, > > I have been looking at $subject and the many past reviews recently, > also related to some of the work related to the potential support for > zstandard compression in TOAST values, and found myself pondering > about the following message from Tom, to be reminded that nothing has > been done regarding the fact that the backend may finish in an > infinite loop once a TOAST table reaches 4 billion values: > https://www.postgresql.org/message-id/764273.1669674269%40sss.pgh.pa.us > > Spoiler: I have heard of users that are in this case, and the best > thing we can do currently except raising shoulders is to use > workarounds with data externalization AFAIK, which is not nice, > usually, and users notice the problem once they see some backends > stuck in the infinite loop. I have spent some time looking at the > problem, and looked at all the proposals in this area like these ones > (I hope so at least): > https://commitfest.postgresql.org/patch/4296/ > https://www.postgresql.org/message-id/flat/224711f9-83b7-a307-b17f-4457ab73aa0a@sigaev.ru > > Anyway, it seems like nothing goes in a direction that I think would > be suited to fix the two following problems (some of the proposed > patches broke backward-compatibility, as well, and that's critical): > - The limit of TOAST values to 4 billions, because external TOAST > pointers want OIDs. > - New compression methods, see the recent proposal about zstandard at > [1]. ToastCompressionId is currently limited to 4 values because the > extinfo field of varatt_external has only this much data remaining. > Spoiler: I want to propose a new varatt_external dedicated to > zstandard-compressed external pointers, but that's not for this > thread. > > Please find attached a patch set I have finished with while poking at > the problem, to address points 1) and 2) in the first email mentioned > at the top of this message. It is not yet ready for prime day yet > (there are a couple of things that would need adjustments), but I have > reached the point where I am going to need a consensus about what > people would be OK to have in terms of design to be able to support > multiple types of varatt_external to address these issues. And I'm OK > to consume time on that for the v19 cycle. > > While hacking at (playing with?) the whole toasting and detoasting > code to understand the blast radius that this would involve, I have > quickly found that it is very annoying to have to touch at many places > of varatt.h to make variants of the existing varatt_external structure > (what we store on-disk as varlena Datum for external TOAST pointers). > Spoiler: it's my first time touching the internals of this code area > so deeply. Most of the code churns happen because we need to make the > [de]toast code aware of what to do depending on the vartags of the > external varlenas. It would be simple to hardcode a bunch of new > VARATT_IS_EXTERNAL_ONDISK() variants to plug in the new structures. > While it is efficient, this has a cost for out-of-core code and in > core because all the places that touch external TOAST pointers need to > be adjusted. Point is: it can be done. But if we introduce more > types of external TOAST pointers we need to always patch all these > areas, and there's a cost in that each time one or more new vartags > are added. > > So, I have invented a new interface aimed at manipulating on-disk > external TOAST pointers, called toast_external, that is an on-memory > structure that services as an interface between on-disk external TOAST > pointers and what the backend wants to look at when retrieving chunks > of data from the TOAST relations. That's the main proposal of this > patch set, with a structure looking like that: > typedef struct toast_external_data > { > /* Original data size (includes header) */ > int32 rawsize; > /* External saved size (without header) */ > uint32 extsize; > /* compression method */ > ToastCompressionId compression_method; > /* Relation OID of TOAST table containing the value */ > Oid toastrelid; > /* > * Unique ID of value within TOAST table. This could be an OID or an > * int8 value. This field is large enough to be able to store any of > * them. > */ > uint64 value; > } toast_external_data; > > This is a bit similar to what the code does for R/W and R/O vartags, > only applying to the on-disk external pointers. Then, the [de]toast > code and extension code is updated so as varlenas are changed into > this structure if we need to retrieve some of its data, and these > areas of the code do not need to know about the details of the > external TOAST pointers. When saving an external set of chunks, this > structure is filled with information depending on what > toast_save_datum() deals with, be it a short varlena, a non-compressed > external value, or a compressed external value, then builds a varlena > with the vartag we want. > > External TOAST pointers have three properties that are hardcoded in > the tree, bringing some challenges of their own: > - The maximum size of a chunk, TOAST_MAX_CHUNK_SIZE, tweaked at close > to 2k to make 4 chunks fit on a page. This depends on the size of the > external pointer. This one was actually easy to refactor. > - The varlena header size, based on VARTAG_SIZE(), which is kind of > tricky to refactor out in the new toast_external.c, but that seems OK > even if this knowledge stays in varatt.h. > - The toast pointer size, aka TOAST_POINTER_SIZE. This one is > actually very interesting (tricky): we use it in one place, > toast_tuple_find_biggest_attribute(), as a lower bound to decide if an > attribute should be toastable or not. I've refactored the code to use > a "best" guess depending on the value type in the TOAST relation, but > that's not 100% waterproof. That needs more thoughts. > > Anyway, the patch set is able to demonstrate how much needs to be done > in the tree to support multiple chunk_id types, and the CI is happy > with the attached. Some of the things done: > - Introduction of a user-settable GUC called default_toast_type, that > can be switched between two modes "oid" and "int8", to force the > creation of a TOAST relation using one type or the other. > - Dump, restore and upgrade support are integrated, relying on a GUC > makes the logic a breeze. > - 64b values are retrieved from a single counter in the control file, > named a "TOAST counter", which has the same reliability and properties > as an OID, with checkpoint support, WAL records, etc. > - Rewrites are soft, so I have kicked the can down the toast on this > point to not make the proposal more complicated than it should be: a > VACUUM FULL retains the same TOAST value type as the original. We > could extend rewrites so as the type of TOAST value is changed. It is > possible to setup a new cluster with default_toast_type = int8 set > after an upgrade, with the existing tables still using the OID mode. > This relates to the recent proposal with a concurrent VACUUM FULL > (REPACK discussion). > > The patch set keeps the existing vartag_external with OID values for > backward-compatibility, and adds a second vartag_external that can > store 8-byte values. This model is the simplest one, and > optimizations are possible, where the Datum TOAST pointer could be > shorter depending on the ID type (OID or int8), the compression method > and the actual value to divide in chunks. For example, if you know > that a chunk of data to save has a value less than UINT32_MAX, we > could store 4 bytes worth of data instead of 8. This design has the > advantage to allow plugging in new TOAST external structures easily. > Now I've not spent extra time in this tuning, because there's no point > in spending more time without an actual agreement about three things, > and *that's what I'm looking for* as feedback for this upcoming CF: > - The structures of the external TOAST pointers. Variable-sized > pointers could be one possibility, across multiple vartags. Ideas are > welcome. > - How many vartag_external types we want. > - If people are actually OK with this translation layer or not, and I > don't disagree that there may be some paths hot enough where the > translation between the on-disk varlenas and this on-memory > toast_external_data hurts. Again, it is possible to hardcode more > types of vartags in the tree, or just bypass the translation in the > paths that are too hot. That's doable still brutal, but if that's the > final consensus reached I'm OK with that as well. (See for example > the changes in amcheck to see how simpler things get.) > > The patch set has been divided into multiple pieces to ease its > review. Again, I'm not completely happy with everything in it, but > it's a start. Each patch has its own commit message, so feel free to > refer to them for more details: > - 0001 introduces the GUC default_toast_type. It is just defined, not > used in the tree at this stage. > - 0002 adds support for catcache lookups for int8 values, required to > allow TOAST values with int8 and its indexes. Potentially useful for > extensions. > - 0003 introduces the "TOAST counter", 8 bytes in the control file to > allocate values for the int8 chunk_id. That's cheap, reliable. > - 0004 is a mechanical change, that enlarges a couple of TOAST > interfaces to use values of uint64 instead of OID. > - 0005, again a mechanical change, reducing a bit the footprint of > TOAST_MAX_CHUNK_SIZE because OID and int8 values need different > values. > - 0006 tweaks pg_column_toast_chunk_id() to use int8 as return type. > > Then comes the "main" patches: > - 0007 adds support for int8 chunk_id in TOAST tables. This is mostly > a mechanical change. If applying the patches up to this point, > external Datums are applied to both OID and int8 values. Note that > there is one tweak I'm unhappy with: the toast counter generation > would need to be smarter to avoid concurrent values because we don't > cross-check the TOAST index for existing values. (Sorry, got slightly > lazy here). > - 0008 adds tests for external compressed and uncompressed TOAST > values for int8 TOAST types. > - 0009 adds support for dump, restore, upgrades of the TOAST table > types. > - 0010 is the main dish: refactoring of the TOAST code to use > toast_external_data, with OID vartags as the only type defined. > - 0011 adds a second vartag_external: the one with int8 values stored > in the external TOAST pointer. > - 0012 is a bonus for amcheck: what needs to be done in its TAP tests > to allow the corruption cases to work when supporting a new vartag. > > That was a long message. Thank you for reading if you have reached > this point. > > Regards, > > [1]: https://www.postgresql.org/message-id/CAFAfj_HX84EK4hyRYw50AOHOcdVi-%2BFFwAAPo7JHx4aShCvunQ%40mail.gmail.com > -- > Michael
pgsql-hackers by date: