Re: New Table Access Methods for Multi and Single Inserts - Mailing list pgsql-hackers
From | Jeff Davis |
---|---|
Subject | Re: New Table Access Methods for Multi and Single Inserts |
Date | |
Msg-id | 8633171cb034aafc260fdf37df04b6c779aa1e2f.camel@j-davis.com Whole thread Raw |
In response to | Re: New Table Access Methods for Multi and Single Inserts (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>) |
Responses |
Re: New Table Access Methods for Multi and Single Inserts
Introduce new multi insert Table AM and improve performance of various SQL commands with it for Heap AM |
List | pgsql-hackers |
On Sun, 2024-03-31 at 21:18 +0530, Bharath Rupireddy wrote: > if (table_modify_buffer_insert() is defined) > table_modify_buffer_insert(...); > else > { > myState->bistate = GetBulkInsertState(); > table_tuple_insert(...); > } We can't alloc/free the bulk insert state for every insert call. I see two options: * Each caller needs to support two code paths: if the buffered insert APIs are defined, then use those; otherwise the caller needs to manage the bulk insert state itself and call the plain insert API. * Have default implementation for the new API methods, so that the default for the begin method would allocate the bulk insert state, and the default for the buffered insert method would be to call plain insert using the bulk insert state. I'd prefer the latter, at least in the long term. But I haven't really thought through the details, so perhaps we'd need to use the former. > > > > After we have these new APIs fully in place and used by COPY, what > > will > > happen to those other APIs? Will they be deprecated or will there > > be a > > reason to keep them? > > Deprecated perhaps? Including Alexander on this thread, because he's making changes to the multi-insert API. We need some consensus on where we are going with these APIs before we make more changes, and what incremental steps make sense in v17. Here's where I think this API should go: 1. Have table_modify_begin/end and table_modify_buffer_insert, like those that are implemented in your patch. 2. Add some kind of flush callback that will be called either while the tuples are being flushed or after the tuples are flushed (but before they are freed by the AM). (Aside: do we need to call it while the tuples are being flushed to get the right visibility semantics for after-row triggers?) 3. Add table_modify_buffer_{update|delete} APIs. 4. Some kind of API tweaks to help manage memory when modifying pertitioned tables, so that the buffering doesn't get out of control. Perhaps just reporting memory usage and allowing the caller to force flushes would be enough. 5. Use these new methods for CREATE/REFRESH MATERIALIZED VIEW. This is fairly straightforward, I believe, and handled by your patch. Indexes are (re)built afterward, and no triggers are possible. 6. Use these new methods for CREATE TABLE ... AS. This is fairly straightforward, I believe, and handled by your patch. No indexes or triggers are possible. 7. Use these new methods for COPY. We have to be careful to avoid regressions for the heap method, because it's already managing its own buffers. If the AM manages the buffering, then it may require additional copying of slots, which could be a disadvantage. To solve this, we may need some minor API tweaks to avoid copying when the caller guarantees that the memory will not be freed to early, or perhaps expose the AM's memory context to copyfrom.c. Another thing to consider is that the buffering in copyfrom.c is also used for FDWs, so that buffering code path needs to be preserved in copyfrom.c even if not used for AMs. 8. Use these new methods for INSERT INTO ... SELECT. One potential challenge here is that execution nodes are not always run to completion, so we need to be sure that the flush isn't forgotten in that case. 9. Use these new methods for DELETE, UPDATE, and MERGE. MERGE can use the buffer_insert/update/delete APIs; we don't need a separate merge method. This probably requires that the AM maintain 3 separate buffers to distinguish different kinds of changes at flush time (obviously these can be initialized lazily to avoid overhead when not being used). 10. Use these new methods for logical apply. 11. Deprecate the multi_insert API. Thoughts on this plan? Does your patch make sense in v17 as a stepping stone, or should we try to make all of these API changes together in v18? Also, a sample AM code would be a huge benefit here. Writing a real AM is hard, but perhaps we can at least have an example one to demonstrate how to use these APIs? Regards, Jeff Davis
pgsql-hackers by date: