Thread: Undo logs
Hello hackers,

As announced elsewhere[1][2][3], at EnterpriseDB we are working on a proposal to add in-place updates with undo logs to PostgreSQL. The goal is to improve performance and resource usage by recycling space better.

The lowest level piece of this work is a physical undo log manager, which I've personally been working on. Later patches will build on top, adding record-oriented access and then the main "zheap" access manager and related infrastructure. My colleagues will write about those.

The README files[4][5] explain in more detail, but here is a bullet-point description of what the attached patch set gives you:

1. Efficient appending of new undo data from many concurrent backends. Like logs.
2. Efficient discarding of old undo data that isn't needed anymore. Like queues.
3. Efficient buffered random reading of undo data. Like relations.

A test module is provided that can be used to exercise the undo log code paths without needing any of the later zheap patches.

This is work in progress. A few aspects are under active development and liable to change, as indicated by comments, and there are no doubt bugs and room for improvement. The code is also available at github.com/EnterpriseDB/zheap (these patches are from the undo-log-storage branch; see also the master branch, which has the full zheap feature). We'd be grateful for any questions, feedback or ideas.

[1] https://amitkapila16.blogspot.com/2018/03/zheap-storage-engine-to-provide-better.html
[2] https://rhaas.blogspot.com/2018/01/do-or-undo-there-is-no-vacuum.html
[3] https://www.pgcon.org/2018/schedule/events/1190.en.html
[4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
[5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr

--
Thomas Munro
http://www.enterprisedb.com
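The three access patterns above (append like a log, discard like a queue, random-read like a relation) can be pictured with a toy single-backend model. This is purely illustrative: the names are made up, and the real patch deals in shared buffers, WAL and many concurrent backends, none of which appears here.

```c
/* Toy sketch of the three undo-log operations. Illustrative names only. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LOG_CAPACITY 4096

typedef struct ToyUndoLog
{
    uint64_t discard;           /* everything before this has been discarded */
    uint64_t insert;            /* the next append goes here */
    char     data[LOG_CAPACITY];
} ToyUndoLog;

/* 1. Append new undo data at the insert pointer. Like logs. */
static uint64_t
toy_append(ToyUndoLog *log, const void *rec, size_t size)
{
    uint64_t at = log->insert;

    assert(at + size <= LOG_CAPACITY);
    memcpy(log->data + at, rec, size);
    log->insert += size;
    return at;                  /* the caller keeps this as a record pointer */
}

/* 2. Discard old undo data up to a given pointer. Like queues. */
static void
toy_discard(ToyUndoLog *log, uint64_t upto)
{
    assert(upto <= log->insert);
    if (upto > log->discard)
        log->discard = upto;
}

/* 3. Random read of undo data by record pointer. Like relations. */
static void
toy_read(const ToyUndoLog *log, uint64_t at, void *out, size_t size)
{
    assert(at >= log->discard && at + size <= log->insert);
    memcpy(out, log->data + at, size);
}
```

A backend would append records during a transaction, readers would fetch them by pointer for rollback or MVC-style lookups, and a background process would advance the discard pointer once no one can need the data any more.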
Exciting stuff! Really looking forward to having a play with this.
James Sewell,
Chief Architect

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
On 25 May 2018 at 08:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
Hello hackers,
As announced elsewhere[1][2][3], at EnterpriseDB we are working on a
proposal to add in-place updates with undo logs to PostgreSQL. The
goal is to improve performance and resource usage by recycling space
better.
The lowest level piece of this work is a physical undo log manager,
which I've personally been working on. Later patches will build on
top, adding record-oriented access and then the main "zheap" access
manager and related infrastructure. My colleagues will write about
those.
The README files[4][5] explain in more detail, but here is a
bullet-point description of what the attached patch set gives you:
1. Efficient appending of new undo data from many concurrent
backends. Like logs.
2. Efficient discarding of old undo data that isn't needed anymore.
Like queues.
3. Efficient buffered random reading of undo data. Like relations.
A test module is provided that can be used to exercise the undo log
code paths without needing any of the later zheap patches.
This is work in progress. A few aspects are under active development
and liable to change, as indicated by comments, and there are no doubt
bugs and room for improvement. The code is also available at
github.com/EnterpriseDB/zheap (these patches are from the
undo-log-storage branch, see also the master branch which has the full
zheap feature). We'd be grateful for any questions, feedback or
ideas.
[1] https://amitkapila16.blogspot.com/2018/03/zheap-storage-engine-to-provide-better.html
[2] https://rhaas.blogspot.com/2018/01/do-or-undo-there-is-no-vacuum.html
[3] https://www.pgcon.org/2018/schedule/events/1190.en.html
[4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
[5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr
--
Thomas Munro
http://www.enterprisedb.com
On 24 May 2018 at 23:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> As announced elsewhere[1][2][3], at EnterpriseDB we are working on a
> proposal to add in-place updates with undo logs to PostgreSQL. The
> goal is to improve performance and resource usage by recycling space
> better.

Cool

> The lowest level piece of this work is a physical undo log manager,

> 1. Efficient appending of new undo data from many concurrent
> backends. Like logs.
> 2. Efficient discarding of old undo data that isn't needed anymore.
> Like queues.
> 3. Efficient buffered random reading of undo data. Like relations.

Like an SLRU?

> [4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
> [5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr

I think there are quite a few design decisions there that need to be discussed, so let's crack on and discuss them please.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Simon,

On Mon, May 28, 2018 at 11:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 24 May 2018 at 23:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> The lowest level piece of this work is a physical undo log manager,
>
>> 1. Efficient appending of new undo data from many concurrent
>> backends. Like logs.
>> 2. Efficient discarding of old undo data that isn't needed anymore.
>> Like queues.
>> 3. Efficient buffered random reading of undo data. Like relations.
>
> Like an SLRU?

Yes, but with some differences:

1. There is a variable number of undo logs. Each one corresponds to a range of the 64 bit address space, and has its own head and tail pointers, so that concurrent writers don't contend for buffers when appending data. (Unlike SLRUs, which are statically defined: one for clog.c, one for commit_ts.c, ...)
2. Undo logs use regular buffers instead of having their own mini buffer pool, ad hoc search and reclamation algorithm, etc.
3. Undo logs support temporary, unlogged and permanent storage (= local buffers and reset-on-crash-restart, for undo data relating to relations of those persistence levels).
4. Undo log storage files are preallocated (rather than being extended block by block), and the oldest file is renamed to become the newest file in common cases, like WAL.

>> [4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
>> [5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr
>
> I think there are quite a few design decisions there that need to be
> discussed, so lets crack on and discuss them please.

What do you think about using the main buffer pool?
Best case: pgbench-type workload, discard pointer following closely behind insert pointer; we never write anything out to disk (except for checkpoints, when we write a few pages), never advance the buffer pool clock hand, and we use and constantly recycle 1-2 pages per connection via the free list (as can be seen by monitoring insert - discard in the pg_stat_undo_logs view).

Worst case: someone opens a snapshot and goes out to lunch, so we can't discard old undo data, and then we start to compete with other stuff for buffers, and we hope the buffer reclamation algorithm is good at its job (or can be improved).

I just talked about this proposal at a pgcon unconference session. Here's some of the feedback I got:

1. Jeff Davis pointed out that I'm probably wrong about not needing FPI, and there must at least be checksum problems with torn pages. He also gave me an idea on how to fix that very cheaply, and I'm still processing that feedback.
2. Andres Freund thought it seemed OK if we have smgr.c routing to md.c for relations and undofile.c for undo, but if we're going to generalise this technique to put other things into shared buffers eventually too (like the SLRUs, as proposed by Shawn Debnath in another unconf session) then it might be worth investigating how to get md.c to handle all of their needs. They'd all just use fd.c files, after all, so it'd be weird if we had to maintain several different similar things.
3. Andres also suggested that high frequency free page list access might be quite contended in the "best case" described above. I'll look into that.
4. Someone said that segment sizes probably shouldn't be hard coded (cf WAL experience).

I also learned in other sessions that there are other access managers in development that need undo logs. I'm hoping to find out more about that.

--
Thomas Munro
http://www.enterprisedb.com
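Point 1 of the SLRU comparison above — each undo log owning a range of a single 64 bit address space — amounts to packing a log number into the high bits of a record pointer and a byte offset into the low bits. The exact 24/40 bit split below is an assumption for illustration (a 40 bit offset matches the 1TB-per-log figure mentioned later in the thread), not necessarily what the patch uses:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical split of the 64 bit undo address space: the high 24 bits
 * select an undo log, the low 40 bits address up to 1TB within that log.
 */
#define UNDO_OFFSET_BITS 40
#define UNDO_OFFSET_MASK ((UINT64_C(1) << UNDO_OFFSET_BITS) - 1)

typedef uint64_t UndoRecPtr;

/* Build a record pointer from a log number and an offset within it. */
static inline UndoRecPtr
MakeUndoRecPtr(uint32_t logno, uint64_t offset)
{
    return ((uint64_t) logno << UNDO_OFFSET_BITS) | (offset & UNDO_OFFSET_MASK);
}

/* Which undo log does this pointer live in? */
static inline uint32_t
UndoRecPtrGetLogNo(UndoRecPtr urp)
{
    return (uint32_t) (urp >> UNDO_OFFSET_BITS);
}

/* Byte position within that log. */
static inline uint64_t
UndoRecPtrGetOffset(UndoRecPtr urp)
{
    return urp & UNDO_OFFSET_MASK;
}
```

Because each backend appends into its own log (its own slice of the address space), concurrent writers never touch the same insertion point, which is what avoids the buffer contention an SLRU-style single sequence would have.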
Hello hackers,

As Thomas has already mentioned upthread, we are working on an undo-log-based storage, and he has posted the patch sets for the lowest layer, called undo-log-storage.

This is the next layer, which sits on top of the undo log storage and provides an interface to prepare, insert, or fetch undo records. This layer uses undo-log-storage to reserve space for the undo records, and the buffer management routines to write and read them.

To prepare an undo record, first, it will allocate the required space using the undo_log_storage module. Next, it will pin and lock the required buffers and return an undo record pointer where it will insert the record. Finally, it calls the insert routine for final insertion of the prepared record. Additionally, there is a mechanism for multi-insert, wherein multiple records are prepared and inserted at a time.

To fetch an undo record, a caller must provide a valid undo record pointer. Optionally, the caller can provide a callback function with the information of the block and offset, which will help in faster retrieval of the undo record; otherwise, it has to traverse the undo-chain.

These patch sets will apply on top of the undo-log-storage branch [1], commit id fa3803a048955c4961581e8757fe7263a98fe6e6.

[1] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/

undo_interface_v1.patch is the main patch for providing the undo interface.
undo_interface_test_v1.patch is a simple test module to test the undo interface layer.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
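The prepare/insert protocol described in this layer — reserve space and get back a record pointer first, copy the record in later — can be sketched as a toy two-phase API. All names and the in-memory "pool" are hypothetical stand-ins; the real layer pins and locks shared buffers between the two phases and supports multi-insert, which this sketch omits:

```c
/* Toy model of a prepare-then-insert undo record protocol. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define POOL_SIZE 1024

static char   pool[POOL_SIZE];   /* stand-in for undo log storage */
static size_t insert_ptr;        /* next free byte */
static size_t reserved_at;       /* space reserved by the last prepare */
static size_t reserved_len;
static int    prepared;

/* Phase 1: reserve space (the real code also pins/locks buffers here)
 * and return the undo record pointer where the record will land. */
static uint64_t
toy_prepare(size_t size)
{
    assert(!prepared && insert_ptr + size <= POOL_SIZE);
    reserved_at = insert_ptr;
    reserved_len = size;
    insert_ptr += size;
    prepared = 1;
    return reserved_at;
}

/* Phase 2: write the record into the space reserved by toy_prepare(). */
static void
toy_insert(const void *rec)
{
    assert(prepared);
    memcpy(pool + reserved_at, rec, reserved_len);
    prepared = 0;
}

/* Fetch a record back, given a valid undo record pointer. */
static void
toy_fetch(uint64_t urp, void *out, size_t size)
{
    assert(urp + size <= insert_ptr);
    memcpy(pool + urp, out, 0);  /* no-op; real code would locate buffers */
    memcpy(out, pool + urp, size);
}
```

Splitting allocation from insertion lets the caller gather everything it needs (and take all buffer locks) before entering the critical section in which the record and the corresponding WAL must be written together.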
On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Hello hackers,
>
> As Thomas has already mentioned upthread that we are working on an
> undo-log based storage and he has posted the patch sets for the lowest
> layer called undo-log-storage.
>
> undo_interface_v1.patch is the main patch for providing the undo interface.
> undo_interface_test_v1.patch is a simple test module to test the undo
> interface layer.

Thanks to Robert Haas for designing an early prototype for forming undo records. Later, I've completed the remaining parts of the code, including undo record prepare, insert, fetch and other related APIs, with help from Rafia Sabih. Thanks to Amit Kapila for providing valuable design inputs.
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 31, 2018 at 10:24 PM Dilip Kumar <dilip.kumar@enterprisedb.com> wrote:
> On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > As Thomas has already mentioned upthread that we are working on an
> > undo-log based storage and he has posted the patch sets for the lowest
> > layer called undo-log-storage.
> >
> > This is the next layer which sits on top of the undo log storage,
> > which will provide an interface for prepare, insert, or fetch the undo
> > records. This layer will use undo-log-storage to reserve the space for
> > the undo records and buffer management routine to write and read the
> > undo records.

I have also pushed a new WIP version of the lower level undo log storage layer patch set to a public branch[1]. I'll leave the earlier branch[2] there because the record-level patch posted by Dilip depends on it for now.

The changes are mostly internal: it doesn't use DSM segments any more. Originally I wanted to use DSM because I didn't want arbitrary limits, but in fact DSM slots can run out in unpredictable ways, and unlike parallel query the undo log subsystem doesn't have a plan B for when it can't get the space it needs due to concurrent queries. Instead, this version uses a pool of size 4 * max_connections, fixed at startup in regular shared memory. This creates an arbitrary limit on transaction size, but it's large, at 1TB per slot; it can be increased, doesn't disappear unpredictably, is easy to monitor (pg_stat_undo_logs), and is probably a useful brake on a system in trouble.

More soon.

[1] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage-v2
[2] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage
--
Thomas Munro
http://www.enterprisedb.com
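The capacity arithmetic behind that trade-off is easy to check. Taking the figures from the mail above (a pool of 4 * max_connections slots, 1TB per slot) and PostgreSQL's default max_connections of 100, a quick sketch:

```c
#include <stdint.h>

/* Slot count from the mail above: a fixed pool of 4 * max_connections. */
static int
toy_undo_slots(int max_connections)
{
    return 4 * max_connections;
}

/* Total addressable undo, assuming the stated 1TB (2^40 bytes) per slot. */
static uint64_t
toy_total_undo_capacity(int max_connections)
{
    return (uint64_t) toy_undo_slots(max_connections) * (UINT64_C(1) << 40);
}
```

With the default of 100 connections that is 400 slots, and a single transaction attached to one slot is capped at 1TB of undo data, which is the "arbitrary but large" limit the mail describes.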
On Sun, Sep 2, 2018 at 12:18 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Fri, Aug 31, 2018 at 10:24 PM Dilip Kumar
> <dilip.kumar@enterprisedb.com> wrote:
>> On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> > As Thomas has already mentioned upthread that we are working on an
>> > undo-log based storage and he has posted the patch sets for the lowest
>> > layer called undo-log-storage.
>
> I have also pushed a new WIP version of the lower level undo log
> storage layer patch set to a public branch[1]. I'll leave the earlier
> branch[2] there because the record-level patch posted by Dilip depends
> on it for now.

Rebased the undo_interface patches on top of the new undo-log-storage branch[1].

[1] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage-v2

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I have also pushed a new WIP version of the lower level undo log
> storage layer patch set to a public branch[1]. I'll leave the earlier
> branch[2] there because the record-level patch posted by Dilip depends
> on it for now.

I have started reading the patch and have a few assorted comments, which are mentioned below. I have been involved in the high-level design of this module and I have also given some suggestions during development, but this is mainly Thomas's work with some help from Dilip. It would be good if other members of the community also review the design or participate in the discussion.

Comments
------------------

undo/README
-----------------------

1.
+The undo log subsystem provides a way to store data that is needed for
+a limited time. Undo data is generated whenever zheap relations are
+modified, but it is only useful until (1) the generating transaction
+is committed or rolled back and (2) there is no snapshot that might
+need it for MVCC purposes.

I think the snapshots need it for MVCC purposes, and we need it till the transaction is committed and all-visible.

2.
+* the tablespace that holds its segment files
+* the persistence level (permanent, unlogged, temporary)
+* the "discard" pointer; data before this point has been discarded
+* the "insert" pointer: new data will be written here
+* the "end" pointer: a new undo segment file will be needed at this point
+
+The three pointers discard, insert and end move strictly forwards
+until the whole undo log has been exhausted. At all times discard <=
+insert <= end. When discard == insert, the undo log is empty
+(everything that has ever been inserted has since been discarded).
+The insert pointer advances when regular backends allocate new space,
+and the discard pointer usually advances when an undo worker process
+determines that no session could need the data either for rollback or
+for finding old versions of tuples to satisfy a snapshot. In some
+special cases including single-user mode and temporary undo logs the
+discard pointer might also be advanced synchronously by a foreground
+session.

Here, the use of the insert and discard pointers is explained nicely. Can you elaborate on the usage of the end pointer as well?

3.
+UndoLogControl objects corresponding to the current set of active undo
+logs are held in a fixed-sized pool in shared memory. The size of
+the array is a multiple of max_connections, and limits the total size of
+transactions.

Here, it is mentioned that the array is a multiple of max_connections, but the code uses MaxBackends. Can you sync them?

4.
+The meta-data for all undo logs is written to disk at every
+checkpoint. It is stored in files under PGDATA/pg_undo/, using the
+checkpoint's redo point (a WAL LSN) as its filename. At startup time,
+the redo point's file can be used to restore all undo logs' meta-data
+as of the moment of the redo point into shared memory. Changes to the
+discard pointer and end pointer are WAL-logged by undolog.c and will
+bring the in-memory meta-data up to date in the event of recovery
+after a crash. Changes to insert pointers are included in other WAL
+records (see below).

I see one inconvenience in using the checkpoint's redo point for the meta file name: what if someone uses pg_resetxlog to truncate the redo? Is there any reason we can't keep a different name for the meta file?

5.
+stabilize on one undo log per active writing backend (or more if
+different tablespaces are persistence levels are used).

/tablespaces are persistence levels/tablespaces and persistence levels

I think due to the above design, we can now reach the maximum number of undo logs quickly, as the patch now uses fixed shared memory to represent them. I am not sure if there is an easy way to avoid that. Can we try to expose a GUC for the maximum number of undo slots, such that instead of MaxBackends * 4, it could be MaxBackends * <new_guc>?

undo-log-manager patch
------------------------------------

6.
@@ -127,6 +128,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
  size = add_size(size, ProcGlobalShmemSize());
  size = add_size(size, XLOGShmemSize());
  size = add_size(size, CLOGShmemSize());
+ size = add_size(size, UndoLogShmemSize());
  size = add_size(size, CommitTsShmemSize());
  size = add_size(size, SUBTRANSShmemSize());
  size = add_size(size, TwoPhaseShmemSize());
@@ -219,6 +221,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
  */
  XLOGShmemInit();
  CLOGShmemInit();
+ UndoLogShmemInit();

It seems to me that we will always allocate shared memory for undo logs, irrespective of whether someone wants to use them or not. Am I right? If so, isn't it better if we find some way that this memory is allocated only when someone has a need for it?

7.
+/*
+ * How many undo logs can be active at a time? This creates a theoretical
+ * maximum transaction size, but it we set it to a factor the maximum number
+ * of backends it will be a very high limit. Alternative designs involving
+ * demand paging or dynamic shared memory could remove this limit but
+ * introduce other problems.
+ */
+static inline size_t
+UndoLogNumSlots(void)
+{
+ return MaxBackends * 4;
+}

Seems like typos in the above comment:
/but it we/but if we
/factor the maximum number -- the sentence is not completely clear.

8.
+ * Extra shared memory will be managed using DSM segments.
+ */
+Size
+UndoLogShmemSize(void)

You told in the email that the patch doesn't use DSM segments anymore, but the comment seems to indicate otherwise.

9.
/*
+ * TODO: Should this function be usable in a critical section?
+ * Woudl it make sense to detect that we are in a critical

Typo /Woudl/Would

10.
+static void
+undolog_xlog_attach(XLogReaderState *record)
+{
+ xl_undolog_attach *xlrec = (xl_undolog_attach *) XLogRecGetData(record);
+ UndoLogControl *log;
+
+ undolog_xid_map_add(xlrec->xid, xlrec->logno);
+
+ /*
+ * Whatever follows is the first record for this transaction. Zheap will
+ * use this to add UREC_INFO_TRANSACTION.
+ */
+ log = get_undo_log(xlrec->logno, false);
+ /* TODO */

There are a lot of TODO's in the code; among them, the above is not at all clear.

11.
+ UndoRecPtr oldest_data;
+
+} UndoLogControl;

Extra space after the last member looks odd.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
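The pointer discipline quoted from the README above (discard <= insert <= end, all moving strictly forward, with discard == insert meaning the log is empty) can be expressed as a couple of small helpers. This is just an illustration of the stated rule, not code from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct ToyUndoLogMeta
{
    uint64_t discard;   /* data before this point has been discarded */
    uint64_t insert;    /* new data will be written here */
    uint64_t end;       /* a new segment file will be needed at this point */
} ToyUndoLogMeta;

/* The invariant the README states: at all times discard <= insert <= end. */
static bool
undo_meta_valid(const ToyUndoLogMeta *m)
{
    return m->discard <= m->insert && m->insert <= m->end;
}

/* When discard == insert, everything ever inserted has been discarded. */
static bool
undo_log_is_empty(const ToyUndoLogMeta *m)
{
    return m->discard == m->insert;
}
```

Anything that advances a pointer (backends allocating space move insert, undo workers move discard, segment creation moves end) must preserve this ordering, which is why each pointer only ever moves forward.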
On Mon, Oct 15, 2018 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > I have also pushed a new WIP version of the lower level undo log
> > storage layer patch set to a public branch[1]. I'll leave the earlier
> > branch[2] there because the record-level patch posted by Dilip depends
> > on it for now.
>
> I have started reading the patch and have a few assorted comments
> which are mentioned below.

Some more comments/questions on the design-level choices you have made in this patch, and some general comments.

1. To allocate an undo log (UndoLogAllocate()), it seems first we create the shared memory state for an undo log, write a WAL record for it, create an actual file and a segment in it, and write a separate WAL record for that. Now imagine the system crashed after creating the shared memory state and before actually allocating an undo log segment; then it is quite possible that after recovery we will block multiple slots for undo logs without having actual undo logs for them. Apart from that, writing separate WAL records for them doesn't appear to be the best way to deal with it, considering that we also need to write a third WAL record to attach an undo log. Now, IIUC, one advantage of arranging things this way is that we avoid dropping a tablespace when a particular undo log exists in it. I understand that this design kind of works, but I think we should try to think of some alternatives here. You might have already thought of making it work similarly to how the interaction for regular tables or temp_tablespaces works with dropping tablespaces, but decided to do something different here. Can you explain why you have made a different design choice here?

2.
extend_undo_log()
{
..
+ /*
+ * Flush the parent dir so that the directory metadata survives a crash
+ * after this point.
+ */
+ UndoLogDirectory(log->meta.tablespace, dir);
+ fsync_fname(dir, true);
+
+ /*
+ * If we're not in recovery, we need to WAL-log the creation of the new
+ * file(s). We do that after the above filesystem modifications, in
+ * violation of the data-before-WAL rule as exempted by
+ * src/backend/access/transam/README. This means that it's possible for
+ * us to crash having made some or all of the filesystem changes but
+ * before WAL logging, but in that case we'll eventually try to create the
+ * same segment(s) again, which is tolerated.
+ */
+ if (!InRecovery)
+ {
+ xl_undolog_extend xlrec;
+ XLogRecPtr ptr;
..
}

I don't understand this WAL logging action. If the crash happens before or during syncing the file, then we anyway don't have WAL to replay. If it happens after WAL writing, then anyway we are sure that the extended undo log segment must be there. Can you explain how this works?

3.
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+ UndoLogOffset end)
{
..
}

What will happen if the transaction creating the undo log segment rolls back? Do we want to have pendingDeletes stuff as we have for normal relation files? This might also help in clearing the shared memory state (undo log slots), if any.

4.
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
{
..
+ /*
+ * Take the tablespace create/drop lock while we look the name up.
+ * This prevents the tablespace from being dropped while we're trying
+ * to resolve the name, or while the called is trying to create an
+ * undo log in it. The caller will have to release this lock.
+ */
+ LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
..

This appears quite expensive; for selecting an undo log to attach to, we might need to wait for an unrelated tablespace create/drop. Have you considered any other ideas to prevent this? How do other callers of get_tablespace_oid prevent it from being dropped? If we don't find any better solution, then I think at the very least we should start a separate thread to get the opinion of others on this matter. I think this is somewhat related to point 1.

5.
+static inline Oid
+UndoRecPtrGetTablespace(UndoRecPtr urp)
+{
+ UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+ UndoLogTableEntry *entry;
+
+ /*
+ * Fast path, for undo logs we've seen before. This is safe because
+ * tablespaces are constant for the lifetime of an undo log number.
+ */
+ entry = undologtable_lookup(undologtable_cache, logno);
+ if (likely(entry))
+ return entry->tablespace;
+
+ /*
+ * Slow path: force cache entry to be created. Raises an error if the
+ * undo log has been entirely discarded, or hasn't been created yet. That
+ * is appropriate here, because this interface is designed for accessing
+ * undo pages via bufmgr, and we should never be trying to access undo
+ * pages that have been discarded.
+ */
+ UndoLogGet(logno, false);

It seems UndoLogGet() probes the hash table first, so what is the need for doing it in the caller? And if you think it is better to perform it in the caller, then maybe we should avoid doing it inside UndoLogGet()->get_undo_log()->undologtable_lookup().

6.
+get_undo_log(UndoLogNumber logno, bool locked)
{
..
+ /*
+ * If we didn't find it, then it must already have been entirely
+ * discarded. We create a negative cache entry so that we can answer
+ * this question quickly next time.
+ *
+ * TODO: We could track the lowest known undo log number, to reduce
+ * the negative cache entry bloat.
+ */
+ if (result == NULL)
+ {
+ /*
+ * Sanity check: the caller should not be asking about undo logs
+ * that have never existed.
+ */
+ if (logno >= shared->next_logno)
+ elog(ERROR, "undo log %u hasn't been created yet", logno);
+ entry = undologtable_insert(undologtable_cache, logno, &found);
+ entry->number = logno;
+ entry->control = NULL;
+ entry->tablespace = 0;
+ }
..
}

Are you planning to take care of this TODO? In any case, do we have any mechanism to clear this bloat, or will it stay till the end of the session? If the latter, then I think it is important to take care of the TODO.

7.
+void UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno);
+/* Redo interface. */
+extern void undolog_redo(XLogReaderState *record);

You might want to add an extra line before /* Redo interface. */, following what has been done earlier in this file.

8.
+ * XXX For now an xl_undolog_meta object is filled in, in case it turns out
+ * to be necessary to write it into the WAL record (like FPI, this must be
+ * logged once for each undo log after each checkpoint). I think this should
+ * be moved out of this interface and done differently -- to review.
+ */
+UndoRecPtr
+UndoLogAllocate(size_t size, UndoPersistence persistence)

This function doesn't appear to be filling xl_undolog_meta. Am I missing something? If not, then this comment needs to be changed.

9.
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+ char *rawname;
+ List *namelist;
+ bool need_to_unlock;
+ int length;
+ int i;
+
+ /* We need a modifiable copy of string. */
+ rawname = pstrdup(undo_tablespaces);

I don't see the usage of rawname outside this function; isn't it better to free it? I understand that this function won't be called frequently enough to matter, but still there is some theoretical danger if a user continuously changes undo_tablespaces.

10.
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
{
..
+ /*
+ * For now we have a simple linked list of unattached undo logs for each
+ * persistence level. We'll grovel though it to find something for the

Typo. /though/through

11.
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
{
..
+ /* WAL-log the creation of this new undo log. */
+ {
+ xl_undolog_create xlrec;
+
+ xlrec.logno = logno;
+ xlrec.tablespace = log->meta.tablespace;
+ xlrec.persistence = log->meta.persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_CREATE);
+ }
..
}

Do we need to WAL-log this for the temporary/unlogged persistence levels?

12.
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
{
..
+ oid = get_tablespace_oid(name, true);
..

Do we need to check permissions to see if the current user is allowed to create in this tablespace?

13.
+UndoLogAllocate(size_t size, UndoPersistence persistence)
{
..
+ log->meta.prevlogno = prevlogno;

Is it okay to update meta information without a lock, or should we do it a few lines down after taking the mutex lock? If it is okay, then it is better to write a comment saying so.

14.
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+ UndoLogOffset end)
{
..
+ /* Flush the contents of the file to disk. */
+ if (pg_fsync(fd) != 0)
+ elog(ERROR, "cannot fsync file \"%s\": %m", path);
..
}

You might want to have a wait event for this, as we do at other places where we perform fsync.

15.
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+ UndoLogOffset end)
{
..
+ if (!InRecovery)
+ {
+ xl_undolog_extend xlrec;
+ XLogRecPtr ptr;
+
+ xlrec.logno = logno;
+ xlrec.end = end;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
+ XLogFlush(ptr);
+ }
..
}

Do we need it for the temporary/unlogged persistence levels?

16.
+static void
+undolog_xlog_create(XLogReaderState *record)
+{
+ xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record);
+ UndoLogControl *log;
+ UndoLogSharedData *shared = MyUndoLogState.shared;
+
+ /* Create meta-data space in shared memory.
*/ + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + /* TODO: assert that it doesn't exist already? */ + log = allocate_undo_log(); + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); Do we need to acquire UndoLogLock during replay? What else can be going in concurrent to this which can create a problem? 17. UndoLogAllocate() { .. + /* + * While we have the lock, check if we have been forcibly detached by + * DROP TABLESPACE. That can only happen between transactions (see + * DropUndoLogsInsTablespace()). + */ .. } Function name in above comment is wrong. 18. + { + {"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT, + gettext_noop("Sets the tablespace(s) to use for undo logs."), + NULL, + GUC_LIST_INPUT | GUC_LIST_QUOTE + }, + &undo_tablespaces, + "", + check_undo_tablespaces, assign_undo_tablespaces, NULL + }, It seems you need to update variable_is_guc_list_quote for this variable. Till now, I have mainly reviewed undo log allocation part. This is a big patch and can take much more time to complete the review. I will review the other parts of the patch later. I have changed the status of this CF entry as "Waiting on Author", feel free to change it once you think all the comments are addressed. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 3, 2018 at 11:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thomas has already posted the latest version of the undo log patches on
the 'Cleaning up orphaned files using undo logs' thread[1], so I have
rebased the undo-interface patch as well. This patch also includes the
latest defect fixes from the main zheap branch [2].

I have also made some changes to the undo-log patches. Basically, it is
just some cleanup work, and it also makes these patches independently
compilable. I have moved some of the code into the undo-log patches and
moved out some code which is not relevant to them. Some examples:

1. Moved the undo log startup and checkpoint related code into
'0001-Add-undo-log-manager_v2.patch'

+ /* Recover undo log meta data corresponding to this checkpoint. */
+ StartupUndoLogs(ControlFile->checkPointCopy.redo);
+

2. Removed the undo-worker related stuff from this patch

+ case WAIT_EVENT_UNDO_DISCARD_WORKER_MAIN:
+ event_name = "UndoDiscardWorkerMain";
+ break;
+ case WAIT_EVENT_UNDO_LAUNCHER_MAIN:
+ event_name = "UndoLauncherMain";
+ break;

[1] https://www.postgresql.org/message-id/flat/CAEepm=0ULqYgM2aFeOnrx6YrtBg3xUdxALoyCG+XpssKqmezug@mail.gmail.com
[2] https://github.com/EnterpriseDB/zheap/

Patch applying order:
0001-Add-undo-log-manager.patch
0002-Provide-access-to-undo-log-data-via-the-buffer-manag.patch
0003-undo-interface-v3.patch
0004-Add-tests-for-the-undo-log-manager.patch
from the 'Cleaning up orphaned files using undo logs' thread[1]
0004-undo-interface-test-v3.patch

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Nov 5, 2018 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> [review for undo record layer (0003-undo-interface-v3)]

I might sound like I'm repeating myself, but just to be clear: I was
involved in the design of this patch and have given a few high-level
inputs for it. I have used this interface in the zheap development, but
haven't done any sort of detailed review, which I am doing now. I
encourage others to review this patch as well.

1.
* NOTES:
+ * Handling multilog -
+ * It is possible that the undo record of a transaction can be spread across
+ * multiple undo log. And, we need some special handling while inserting the
+ * undo for discard and rollback to work sanely.

I think before describing how an undo record is spread across multiple
logs, you can explain how it is laid out when that is not the case. You
can also explain how undo record headers are linked. I am not sure the
file header is the best place, or whether it should be mentioned in the
README, but I think for now we can use the file header for this purpose
and later move it to the README if required.

2.
+/*
+ * FIXME: Do we want to support undo tuple size which is more than the BLCKSZ
+ * if not than undo record can spread across 2 buffers at the max.
+ */
+#define MAX_BUFFER_PER_UNDO 2

I think here the right question is: what is the possibility of an undo
record being greater than BLCKSZ? For zheap, as of today, we don't have
any such requirement, as the largest undo record is written for update
or multi_insert, and in both cases we don't exceed the limit of BLCKSZ.
I guess some user other than zheap could have such a requirement, and I
don't think it is impossible to enhance this if we do.

If anybody else has an opinion here, please feel free to share it.

3.
+/*
+ * FIXME: Do we want to support undo tuple size which is more than the BLCKSZ
+ * if not than undo record can spread across 2 buffers at the max.
+ */
+#define MAX_BUFFER_PER_UNDO 2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record. Hence increased by 1.
+ */
+#define MAX_UNDO_BUFFERS (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
+
+/* Maximum number of undo record that can be prepared before calling insert. */
+#define MAX_PREPARED_UNDO 2

I think it is better to define MAX_PREPARED_UNDO before
MAX_UNDO_BUFFERS, as the first one is used in the definition of the
second.

4.
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert. Upon return, the necessary undo buffers are pinned.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+ TransactionId xid)

This function locks the buffers as well, which is right as we need to
do that before the critical section, but the function header comment
doesn't indicate it. You can modify it as: "Upon return, the necessary
undo buffers are pinned and locked."

Note that a similar modification is required in the .h file as well.

5.
+/*
+ * Insert a previously-prepared undo record. This will lock the buffers
+ * pinned in the previous step, write the actual undo record into them,
+ * and mark them dirty. For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)

Here, the comment is wrong, as the buffers are already locked in the
previous step. A similar change is required in the .h file as well.

6.
+InsertPreparedUndo(void)
{
..
/*
+ * Try to insert the record into the current page. If it doesn't
+ * succeed then recall the routine with the next page.
+ */
+ if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+ {
+ undo_len += already_written;
+ MarkBufferDirty(buffer);
+ break;
+ }
+
+ MarkBufferDirty(buffer);
..
}

Here, you are writing into a shared buffer and marking it dirty; isn't
it a good idea to Assert that we are in a critical section?

7.
+/* Maximum number of undo record that can be prepared before calling insert. */
+#define MAX_PREPARED_UNDO 2

/record/records

I think this definition doesn't define the maximum number of undo
records that can be prepared, as the caller can use UndoSetPrepareSize
to change it. I think you can modify the comment as below or something
along those lines:
"This defines the number of undo records that can be prepared before
calling insert by default. If you need to prepare more than
MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
first."

8.
+ * Caller should call this under log->discard_lock
+ */
+static bool
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+ if (log->oldest_data == InvalidUndoRecPtr)
..

Isn't it a good idea to have an Assert that we already hold the
discard_lock?

9.
+ UnpackedUndoRecord uur; /* prev txn's first undo record.*/
+} PreviousTxnInfo;

An extra space at the end of the comment is required.

10.
+/*
+ * Check if previous transactions undo is already discarded.
+ *
+ * Caller should call this under log->discard_lock
+ */
+static bool
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{

The name suggests that this function is doing something special for the
previous transaction, whereas it just checks whether the undo
corresponding to a particular undo location has been discarded. Isn't
it better to name it UndoRecordExists or UndoRecordIsValid? Then
explain in the comments when you consider a particular record to exist.

Another point to note is that you are not releasing the lock in all
paths, so it is better to mention in the comments when it will be
released and when not.

11.
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+ if (log->oldest_data == InvalidUndoRecPtr)
+ {
+ /*
+ * oldest_data is not yet initialized. We have to check
+ * UndoLogIsDiscarded and if it's already discarded then we have
+ * nothing to do.
+ */
+ LWLockRelease(&log->discard_lock);
+ if (UndoLogIsDiscarded(prev_xact_urp))
+ return true;

The comment in the above code just restates the code in words. I think
here you should explain why we need to call UndoLogIsDiscarded when
oldest_data is not initialized, and/or the scenario in which
oldest_data will not be initialized.

12.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+void
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
..

I find this function name a bit awkward. How about
UndoRecordPrepareTransInfo?

13.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
{
..
+ /*
+ * TODO: For now we don't know how to build a transaction chain for
+ * temporary undo logs. That's because this log might have been used by a
+ * different backend, and we can't access its buffers. What should happen
+ * is that the undo data should be automatically discarded when the other
+ * backend detaches, but that code doesn't exist yet and the undo worker
+ * can't do it either.
+ */
+ if (log->meta.persistence == UNDO_TEMP)
+ return;

Aren't we already dealing with this case in the other patch [1]?
Basically, I think we should discard it at commit time and/or when the
backend is detached.

14.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
{
..
/*
+ * If previous transaction's urp is not valid means this backend is
+ * preparing its first undo so fetch the information from the undo log
+ * if it's still invalid urp means this is the first undo record for this
+ * log and we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
..

This comment is confusing. It appears to say the same thing twice. You
can write it along the lines of: "The absence of the previous
transaction's undo indicates that this backend is preparing its first
undo, in which case we have nothing to update."

15.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
/*
+ * Acquire the discard lock before accessing the undo record so that
+ * discard worker doen't remove the record while we are in process of
+ * reading it.
+ */

Typo: doen't/doesn't. I think you can use 'can't' instead of 'doesn't'.

This is by no means a complete review, rather just a few things I
noticed while reading the patch.

[1] - https://www.postgresql.org/message-id/CAFiTN-t8fv-qYG9zynhS-1jRrvt_o5C-wCMRtzOsK8S%3DMXvKKw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Oct 17, 2018 at 3:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Oct 15, 2018 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro
> > <thomas.munro@enterprisedb.com> wrote:
> > > I have also pushed a new WIP version of the lower level undo log
> > > storage layer patch set to a public branch[1]. I'll leave the earlier
> > > branch[2] there because the record-level patch posted by Dilip depends
> > > on it for now.
> > >
>
> Till now, I have mainly reviewed the undo log allocation part. This is a
> big patch and can take much more time to review completely. I will
> review the other parts of the patch later.

I think I see another issue with this patch. Basically, during
extend_undo_log there is an assumption that no one can update
log->meta.end concurrently, but that is not true: it can also be
updated by UndoLogDiscard, which can lead to an assertion failure in
extend_undo_log.

+static void
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
{
..
/*
+ * Create all the segments needed to increase 'end' to the requested
+ * size. This is quite expensive, so we will try to avoid it completely
+ * by renaming files into place in UndoLogDiscard instead.
+ */
+ end = log->meta.end;
+ while (end < new_end)
+ {
+ allocate_empty_undo_segment(logno, log->meta.tablespace, end);
+ end += UndoLogSegmentSize;
+ }
..
+ Assert(end == new_end);
..
/*
+ * We didn't need to acquire the mutex to read 'end' above because only
+ * we write to it. But we need the mutex to update it, because the
+ * checkpointer might read it concurrently.
+ *
+ * XXX It's possible for meta.end to be higher already during
+ * recovery, because of the timing of a checkpoint; in that case we did
+ * nothing above and we shouldn't update shmem here. That interaction
+ * needs more analysis.
+ */
+ LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+ if (log->meta.end < end)
+ log->meta.end = end;
+ LWLockRelease(&log->mutex);
..
}

Assume that just before we read log->meta.end in the above code, the
discard process concurrently discards the undo and moves the end
pointer to a later location; then the above assertion will fail. And if
the discard happens just after we read log->meta.end and before we do
the other stuff in this function, then it will crash in recovery. Can't
we just remove this Assert?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 5, 2018 at 5:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Sep 3, 2018 at 11:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Thomas has already posted the latest version of the undo log patches
> on the 'Cleaning up orphaned files using undo logs' thread[1], so I
> have rebased the undo-interface patch as well. This patch also
> includes the latest defect fixes from the main zheap branch [2].
>

Hi Thomas,

The latest patch for undo log storage does not compile on head; I think
it needs to be rebased after your commit related to "pg_pread() and
pg_pwrite() for data files and WAL".

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sat, Nov 10, 2018 at 9:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 5, 2018 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > [review for undo record layer (0003-undo-interface-v3)]
>
> I might sound like I'm repeating myself, but just to be clear: I was
> involved in the design of this patch and have given a few high-level
> inputs for it. I have used this interface in the zheap development,
> but haven't done any sort of detailed review, which I am doing now. I
> encourage others to review this patch as well.

Thanks for the review, please find my reply inline.

> 1.
> * NOTES:
> + * Handling multilog -
> + * It is possible that the undo record of a transaction can be spread across
> + * multiple undo log. And, we need some special handling while inserting the
> + * undo for discard and rollback to work sanely.
>
> I think before describing how an undo record is spread across multiple
> logs, you can explain how it is laid out when that is not the case.
> You can also explain how undo record headers are linked. I am not sure
> the file header is the best place, or whether it should be mentioned
> in the README, but I think for now we can use the file header for this
> purpose and later move it to the README if required.

Added in the header.

> 2.
> +/*
> + * FIXME: Do we want to support undo tuple size which is more than the BLCKSZ
> + * if not than undo record can spread across 2 buffers at the max.
> + */
> +#define MAX_BUFFER_PER_UNDO 2
>
> I think here the right question is: what is the possibility of an undo
> record being greater than BLCKSZ? For zheap, as of today, we don't
> have any such requirement, as the largest undo record is written for
> update or multi_insert, and in both cases we don't exceed the limit of
> BLCKSZ. I guess some user other than zheap could have such a
> requirement, and I don't think it is impossible to enhance this if we
> do.
>
> If anybody else has an opinion here, please feel free to share it.

Should we remove this FIXME, or wait for some other opinion? As of now
I have kept it as it is.

> 3.
> +/*
> + * FIXME: Do we want to support undo tuple size which is more than the BLCKSZ
> + * if not than undo record can spread across 2 buffers at the max.
> + */
> +#define MAX_BUFFER_PER_UNDO 2
> +
> +/*
> + * Consider buffers needed for updating previous transaction's
> + * starting undo record. Hence increased by 1.
> + */
> +#define MAX_UNDO_BUFFERS (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
> +
> +/* Maximum number of undo record that can be prepared before calling insert. */
> +#define MAX_PREPARED_UNDO 2
>
> I think it is better to define MAX_PREPARED_UNDO before
> MAX_UNDO_BUFFERS, as the first one is used in the definition of the
> second.

Done

> 4.
> +/*
> + * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
> + * intended to insert. Upon return, the necessary undo buffers are pinned.
> + * This should be done before any critical section is established, since it
> + * can fail.
> + *
> + * If not in recovery, 'xid' should refer to the top transaction id because
> + * undo log only stores mapping for the top most transactions.
> + * If in recovery, 'xid' refers to the transaction id stored in WAL.
> + */
> +UndoRecPtr
> +PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
> + TransactionId xid)
>
> This function locks the buffers as well, which is right as we need to
> do that before the critical section, but the function header comment
> doesn't indicate it. You can modify it as:
> "Upon return, the necessary undo buffers are pinned and locked."
>
> Note that a similar modification is required in the .h file as well.

Done

> 5.
> +/*
> + * Insert a previously-prepared undo record. This will lock the buffers
> + * pinned in the previous step, write the actual undo record into them,
> + * and mark them dirty. For persistent undo, this step should be performed
> + * after entering a critical section; it should never fail.
> + */
> +void
> +InsertPreparedUndo(void)
>
> Here, the comment is wrong, as the buffers are already locked in the
> previous step. A similar change is required in the .h file as well.

Fixed

> 6.
> +InsertPreparedUndo(void)
> {
> ..
> /*
> + * Try to insert the record into the current page. If it doesn't
> + * succeed then recall the routine with the next page.
> + */
> + if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
> + {
> + undo_len += already_written;
> + MarkBufferDirty(buffer);
> + break;
> + }
> +
> + MarkBufferDirty(buffer);
> ..
> }
>
> Here, you are writing into a shared buffer and marking it dirty; isn't
> it a good idea to Assert that we are in a critical section?

Done

> 7.
> +/* Maximum number of undo record that can be prepared before calling insert. */
> +#define MAX_PREPARED_UNDO 2
>
> /record/records
>
> I think this definition doesn't define the maximum number of undo
> records that can be prepared, as the caller can use UndoSetPrepareSize
> to change it. I think you can modify the comment as below or something
> along those lines:
> "This defines the number of undo records that can be prepared before
> calling insert by default. If you need to prepare more than
> MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
> first."

Fixed

> 8.
> + * Caller should call this under log->discard_lock
> + */
> +static bool
> +IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
> +{
> + if (log->oldest_data == InvalidUndoRecPtr)
> ..
>
> Isn't it a good idea to have an Assert that we already hold the
> discard_lock?

Done

> 9.
> + UnpackedUndoRecord uur; /* prev txn's first undo record.*/
> +} PreviousTxnInfo;
>
> An extra space at the end of the comment is required.

Done

> 10.
> +/*
> + * Check if previous transactions undo is already discarded.
> + *
> + * Caller should call this under log->discard_lock
> + */
> +static bool
> +IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
> +{
>
> The name suggests that this function is doing something special for
> the previous transaction, whereas it just checks whether the undo
> corresponding to a particular undo location has been discarded. Isn't
> it better to name it UndoRecordExists or UndoRecordIsValid? Then
> explain in the comments when you consider a particular record to
> exist.

Changed to UndoRecordIsValid

> Another point to note is that you are not releasing the lock in all
> paths, so it is better to mention in the comments when it will be
> released and when not.

Done

> 11.
> +IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
> +{
> + if (log->oldest_data == InvalidUndoRecPtr)
> + {
> + /*
> + * oldest_data is not yet initialized. We have to check
> + * UndoLogIsDiscarded and if it's already discarded then we have
> + * nothing to do.
> + */
> + LWLockRelease(&log->discard_lock);
> + if (UndoLogIsDiscarded(prev_xact_urp))
> + return true;
>
> The comment in the above code just restates the code in words. I think
> here you should explain why we need to call UndoLogIsDiscarded when
> oldest_data is not initialized, and/or the scenario in which
> oldest_data will not be initialized.

Fixed

> 12.
> + * The actual update will be done by UndoRecordUpdateTransInfo under the
> + * critical section.
> + */
> +void
> +PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
> +{
> ..
>
> I find this function name a bit awkward. How about
> UndoRecordPrepareTransInfo?

Changed as per the suggestion

> 13.
> +PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
> {
> ..
> + /*
> + * TODO: For now we don't know how to build a transaction chain for
> + * temporary undo logs. That's because this log might have been used by a
> + * different backend, and we can't access its buffers. What should happen
> + * is that the undo data should be automatically discarded when the other
> + * backend detaches, but that code doesn't exist yet and the undo worker
> + * can't do it either.
> + */
> + if (log->meta.persistence == UNDO_TEMP)
> + return;
>
> Aren't we already dealing with this case in the other patch [1]?
> Basically, I think we should discard it at commit time and/or when the
> backend is detached.

Changed

> 14.
> +PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
> {
> ..
> /*
> + * If previous transaction's urp is not valid means this backend is
> + * preparing its first undo so fetch the information from the undo log
> + * if it's still invalid urp means this is the first undo record for this
> + * log and we have nothing to update.
> + */
> + if (!UndoRecPtrIsValid(prev_xact_urp))
> + return;
> ..
>
> This comment is confusing. It appears to say the same thing twice. You
> can write it along the lines of:
>
> "The absence of the previous transaction's undo indicates that this
> backend is preparing its first undo, in which case we have nothing to
> update."

Done as per the suggestion

> 15.
> +PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
> /*
> + * Acquire the discard lock before accessing the undo record so that
> + * discard worker doen't remove the record while we are in process of
> + * reading it.
> + */
>
> Typo: doen't/doesn't.
>
> I think you can use 'can't' instead of 'doesn't'.

Fixed

> This is by no means a complete review, rather just a few things I
> noticed while reading the patch.
>
> [1] - https://www.postgresql.org/message-id/CAFiTN-t8fv-qYG9zynhS-1jRrvt_o5C-wCMRtzOsK8S%3DMXvKKw%40mail.gmail.com
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Nov 14, 2018 at 2:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Nov 10, 2018 at 9:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Nov 5, 2018 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > [review for undo record layer (0003-undo-interface-v3)]
> >
> > I might sound like I'm repeating myself, but just to be clear: I was
> > involved in the design of this patch and have given a few high-level
> > inputs for it. I have used this interface in the zheap development,
> > but haven't done any sort of detailed review, which I am doing now.
> > I encourage others to review this patch as well.
>
> Thanks for the review, please find my reply inline.
>
> > 1.
> > * NOTES:
> > + * Handling multilog -
> > + * It is possible that the undo record of a transaction can be spread across
> > + * multiple undo log. And, we need some special handling while inserting the
> > + * undo for discard and rollback to work sanely.
> >
> > I think before describing how an undo record is spread across
> > multiple logs, you can explain how it is laid out when that is not
> > the case. You can also explain how undo record headers are linked. I
> > am not sure the file header is the best place, or whether it should
> > be mentioned in the README, but I think for now we can use the file
> > header for this purpose and later move it to the README if required.
>
> Added in the header.
>
> > 2.
> > +/*
> > + * FIXME: Do we want to support undo tuple size which is more than the BLCKSZ
> > + * if not than undo record can spread across 2 buffers at the max.
> > + */
> > +#define MAX_BUFFER_PER_UNDO 2
> >
> > I think here the right question is: what is the possibility of an
> > undo record being greater than BLCKSZ? For zheap, as of today, we
> > don't have any such requirement, as the largest undo record is
> > written for update or multi_insert, and in both cases we don't
> > exceed the limit of BLCKSZ. I guess some user other than zheap could
> > have such a requirement, and I don't think it is impossible to
> > enhance this if we do.
> >
> > If anybody else has an opinion here, please feel free to share it.
>
> Should we remove this FIXME, or wait for some other opinion? As of
> now I have kept it as it is.
>

I think you can keep it with XXX instead of FIXME, as there is nothing
to fix.

Both the patches 0003-undo-interface-v4.patch and
0004-undo-interface-test-v4.patch appear to be the same except for the
name?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 14, 2018 at 2:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think you can keep it with XXX instead of FIXME, as there is nothing
> to fix.

Changed

>
> Both the patches 0003-undo-interface-v4.patch and
> 0004-undo-interface-test-v4.patch appear to be the same except for the
> name?

My bad, please find the updated patch.

>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Nov 14, 2018 at 3:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 14, 2018 at 2:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think you can keep it with XXX instead of FIXME, as there is
> > nothing to fix.
>
> Changed
>
> > Both the patches 0003-undo-interface-v4.patch and
> > 0004-undo-interface-test-v4.patch appear to be the same except for
> > the name?
>
> My bad, please find the updated patch.
>

There was a problem in an assert, and one comment was not aligned
properly, so I have fixed both in the latest patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Nov 15, 2018 at 12:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Updated patch (merged latest code from the zheap main branch [1]).

The main change is removing relfilenode and tablespace id from the undo record and storing reloid instead. Earlier, we kept them thinking that we would perform rollback without a database connection, but that's not the case now.

[1] https://github.com/EnterpriseDB/zheap

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Fri, Nov 16, 2018 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Nov 15, 2018 at 12:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > Updated patch (merged latest code from the zheap main branch [1]).
>

Review comments:
-------------------------------
1.
UndoRecordPrepareTransInfo()
{
..
+ /*
+ * The absence of previous transaction's undo indicate that this backend
+ * is preparing its first undo in which case we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
..
}

It is expected that the caller of UndoRecPtrIsValid should hold the discard lock, but I don't see how the call from this place ensures that.

2.
UndoRecordPrepareTransInfo()
{
..
/*
+ * The absence of previous transaction's undo indicate that this backend
+ * is preparing its first undo in which case we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
+
+ /*
+ * Acquire the discard lock before accessing the undo record so that
+ * discard worker doesn't remove the record while we are in process of
+ * reading it.
+ */
+ LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+ if (!UndoRecordIsValid(log, prev_xact_urp))
+ return;
..
}

I don't understand this logic where you check the same information with and without a lock; is there any reason for that? It seems the first call to UndoRecPtrIsValid is not required.

3.
UndoRecordPrepareTransInfo()
{
..
+ while (true)
+ {
+ bufidx = InsertFindBufferSlot(rnode, cur_blk,
+ RBM_NORMAL,
+ log->meta.persistence);
+ prev_txn_info.prev_txn_undo_buffers[index] = bufidx;
+ buffer = undo_buffer[bufidx].buf;
+ page = BufferGetPage(buffer);
+ index++;
+
+ if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte,
+ &already_decoded, true))
+ break;
+
+ starting_byte = UndoLogBlockHeaderSize;
+ cur_blk++;
+ }

Can you write some commentary on what this code is doing?
There is no need to use index++; as a separate statement, you can do it while assigning the buffer in that index. 4. +UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched) +{ + UndoRecPtr prev_xact_urp; I think you can simply name this variable as xact_urp. All this and related prev_* terminology used for variables seems confusing to me. I understand that you are trying to update the last transactions undo record information, but you can explain that via comments. Keeping such information as part of variable names not only makes their length longer, but is also confusing. 5. /* + * Structure to hold the previous transaction's undo update information. + */ +typedef struct PreviousTxnUndoRecord +{ + UndoRecPtr prev_urecptr; /* prev txn's starting urecptr */ + int prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO]; + UnpackedUndoRecord uur; /* prev txn's first undo record. */ +} PreviousTxnInfo; + +static PreviousTxnInfo prev_txn_info; Due to reasons mentioned in point-4, lets name the structure and it's variables as below: typedef struct XactUndoRecordInfo { UndoRecPtr start_urecptr; /* prev txn's starting urecptr */ int idx_undo_buffers[MAX_BUFFER_PER_UNDO]; UnpackedUndoRecord first_uur; /* prev txn's first undo record. */ } XactUndoRecordInfo; static XactUndoRecordInfo xact_ur_info; 6. +static int +InsertFindBufferSlot(RelFileNode rnode, The name of this function is not clear, can we change it to UndoGetBufferSlot or UndoGetBuffer? 7. +UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords, + UndoPersistence upersistence, TransactionId txid) { .. + /* + * If this is the first undo record for this transaction then set the + * uur_next to the SpecialUndoRecPtr. This is the indication to allocate + * the space for the transaction header and the valid value of the uur_next + * will be updated while preparing the first undo record of the next + * transaction. + */ + first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid); .. 
} I think it will be better if we move this comment few lines down: + if (need_start_undo && i == 0) + { + urec->uur_next = SpecialUndoRecPtr; BTW, is the only reason set a special value (SpecialUndoRecPtr) for uur_next is for allocating transaction header? If so, can't we directly set the corresponding flag (UREC_INFO_TRANSACTION) in uur_info and then remove it from UndoRecordSetInfo? I think it would have been better if there is one central location to set uur_info, but as that is becoming tricky, we should not try to add some special stuff to make it possible. 8. +UndoRecordExpectedSize(UnpackedUndoRecord *uur) +{ + Size size; + + /* Fixme : Temporary hack to allow zheap to set some value for uur_info. */ + /* if (uur->uur_info == 0) */ + UndoRecordSetInfo(uur); Can't we move UndoRecordSetInfo in it's caller (UndoRecordAllocateMulti)? It seems another caller of this function doesn't expect this. If we do that way, then we can have an Assert for non-zero uur_info in UndoRecordExpectedSize. 9. bool +InsertUndoRecord(UnpackedUndoRecord *uur, Page page, + int starting_byte, int *already_written, bool header_only) +{ + char *writeptr = (char *) page + starting_byte; + char *endptr = (char *) page + BLCKSZ; + int my_bytes_written = *already_written; + + if (uur->uur_info == 0) + UndoRecordSetInfo(uur); Do we really need UndoRecordSetInfo here? If not, then just add an assert for non-zero uur_info? 10 UndoRecordAllocateMulti() { .. else + { + /* + * It is okay to initialize these variables as these are used only + * with the first record of transaction. + */ + urec->uur_next = InvalidUndoRecPtr; + urec->uur_xidepoch = 0; + urec->uur_dbid = 0; + urec->uur_progress = 0; + } + + + /* calculate the size of the undo record. */ + size += UndoRecordExpectedSize(urec); + } Remove one extra line before comment "calculate the size of ..". -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
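Several of the points above (7, 8, 9) revolve around who sets uur_info and asserting it is non-zero before computing the record size. The idea is that once the caller has fixed uur_info, the expected size is just the fixed header plus the sum of the optional chunks the flags select. A hedged sketch of that shape — UREC_INFO_TRANSACTION is from the patch, but the other flag names and all the chunk sizes here are invented placeholders:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag bits; only UREC_INFO_TRANSACTION appears in the review. */
#define UREC_INFO_RELATION_DETAILS 0x01
#define UREC_INFO_BLOCK            0x02
#define UREC_INFO_PAYLOAD          0x04
#define UREC_INFO_TRANSACTION      0x08

/* Assumed chunk sizes, for illustration only. */
#define UREC_HEADER_SIZE       16	/* always present */
#define UREC_RELDETAILS_SIZE    8
#define UREC_BLOCK_SIZE        10
#define UREC_PAYLOAD_SIZE       4
#define UREC_TRANSACTION_SIZE  20

/*
 * Per the review, the caller must have set uur_info already, so this
 * asserts rather than calling something like UndoRecordSetInfo itself.
 */
static int
undo_record_expected_size(uint8_t uur_info)
{
	int			size = UREC_HEADER_SIZE;

	assert(uur_info != 0);		/* must contain valid information */
	if (uur_info & UREC_INFO_RELATION_DETAILS)
		size += UREC_RELDETAILS_SIZE;
	if (uur_info & UREC_INFO_BLOCK)
		size += UREC_BLOCK_SIZE;
	if (uur_info & UREC_INFO_PAYLOAD)
		size += UREC_PAYLOAD_SIZE;
	if (uur_info & UREC_INFO_TRANSACTION)
		size += UREC_TRANSACTION_SIZE;
	return size;
}
```

With this structure, setting UREC_INFO_TRANSACTION directly for the first record of a transaction (instead of the SpecialUndoRecPtr sentinel discussed in point 7) automatically allocates space for the transaction header.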
On Sat, Nov 17, 2018 at 5:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 16, 2018 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Nov 15, 2018 at 12:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > Updated patch (merged latest code from the zheap main branch [1]). > > > > Review comments: > ------------------------------- > 1. > UndoRecordPrepareTransInfo() > { > .. > + /* > + * The absence of previous transaction's undo indicate that this backend > + * is preparing its first undo in which case we have nothing to update. > + */ > + if (!UndoRecPtrIsValid(prev_xact_urp)) > + return; > .. > } > > It is expected that the caller of UndoRecPtrIsValid should have > discard lock, but I don't see that how this the call from this place > ensures the same? I think its duplicate code, made mistake while merging from the zheap branch > > > 2. > UndoRecordPrepareTransInfo() > { > .. > /* > + * The absence of previous transaction's undo indicate that this backend > + * is preparing its first undo in which case we have nothing to update. > + */ > + if (!UndoRecPtrIsValid(prev_xact_urp)) > + return; > + > + /* > + * Acquire the discard lock before accessing the undo record so that > + * discard worker doesn't remove the record while we are in process of > + * reading it. > + */ > + LWLockAcquire(&log->discard_lock, LW_SHARED); > + > + if (!UndoRecordIsValid(log, prev_xact_urp)) > + return; > .. > } > > I don't understand this logic where you are checking the same > information with and without a lock, is there any reason for same? It > seems we don't need the first call to UndoRecPtrIsValid is not > required. Removed > > > 3. > UndoRecordPrepareTransInfo() > { > .. 
> + while (true) > + { > + bufidx = InsertFindBufferSlot(rnode, cur_blk, > + RBM_NORMAL, > + log->meta.persistence); > + prev_txn_info.prev_txn_undo_buffers[index] = bufidx; > + buffer = undo_buffer[bufidx].buf; > + page = BufferGetPage(buffer); > + index++; > + > + if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte, > + &already_decoded, true)) > + break; > + > + starting_byte = UndoLogBlockHeaderSize; > + cur_blk++; > + } > > > Can you write some commentary on what this code is doing? Done > > > There is no need to use index++; as a separate statement, you can do > it while assigning the buffer in that index. Done > > > 4. > +UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched) > +{ > + UndoRecPtr prev_xact_urp; > > I think you can simply name this variable as xact_urp. All this and > related prev_* terminology used for variables seems confusing to me. I > understand that you are trying to update the last transactions undo > record information, but you can explain that via comments. Keeping > such information as part of variable names not only makes their length > longer, but is also confusing. > > 5. > /* > + * Structure to hold the previous transaction's undo update information. > + */ > +typedef struct PreviousTxnUndoRecord > +{ > + UndoRecPtr prev_urecptr; /* prev txn's starting urecptr */ > + int prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO]; > + UnpackedUndoRecord uur; /* prev txn's first undo record. */ > +} PreviousTxnInfo; > + > +static PreviousTxnInfo prev_txn_info; > > Due to reasons mentioned in point-4, lets name the structure and it's > variables as below: > > typedef struct XactUndoRecordInfo > { > UndoRecPtr start_urecptr; /* prev txn's starting urecptr */ > int idx_undo_buffers[MAX_BUFFER_PER_UNDO]; > UnpackedUndoRecord first_uur; /* prev txn's first undo record. */ > } XactUndoRecordInfo; > > static XactUndoRecordInfo xact_ur_info; Done, but I have kept start_urecptr as urecptr and first_uur as uur and explained in comment. 
> > > 6. > +static int > +InsertFindBufferSlot(RelFileNode rnode, > > The name of this function is not clear, can we change it to > UndoGetBufferSlot or UndoGetBuffer? Changed to UndoGetBufferSlot > > > 7. > +UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords, > + UndoPersistence upersistence, TransactionId txid) > { > .. > + /* > + * If this is the first undo record for this transaction then set the > + * uur_next to the SpecialUndoRecPtr. This is the indication to allocate > + * the space for the transaction header and the valid value of the uur_next > + * will be updated while preparing the first undo record of the next > + * transaction. > + */ > + first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid); > .. > } Done > > > I think it will be better if we move this comment few lines down: > + if (need_start_undo && i == 0) > + { > + urec->uur_next = SpecialUndoRecPtr; > > BTW, is the only reason set a special value (SpecialUndoRecPtr) for > uur_next is for allocating transaction header? If so, can't we > directly set the corresponding flag (UREC_INFO_TRANSACTION) in > uur_info and then remove it from UndoRecordSetInfo? yeah, Done that way. > > > I think it would have been better if there is one central location to > set uur_info, but as that is becoming tricky, > we should not try to add some special stuff to make it possible. > > 8. > +UndoRecordExpectedSize(UnpackedUndoRecord *uur) > +{ > + Size size; > + > + /* Fixme : Temporary hack to allow zheap to set some value for uur_info. */ > + /* if (uur->uur_info == 0) */ > + UndoRecordSetInfo(uur); > > Can't we move UndoRecordSetInfo in it's caller > (UndoRecordAllocateMulti)? It seems another caller of this function > doesn't expect this. If we do that way, then we can have an Assert > for non-zero uur_info in UndoRecordExpectedSize. Done that way > > > 9. 
> bool > +InsertUndoRecord(UnpackedUndoRecord *uur, Page page, > + int starting_byte, int *already_written, bool header_only) > +{ > + char *writeptr = (char *) page + starting_byte; > + char *endptr = (char *) page + BLCKSZ; > + int my_bytes_written = *already_written; > + > + if (uur->uur_info == 0) > + UndoRecordSetInfo(uur); > > Do we really need UndoRecordSetInfo here? If not, then just add an > assert for non-zero uur_info? Done > > > 10 > UndoRecordAllocateMulti() > { > .. > else > + { > + /* > + * It is okay to initialize these variables as these are used only > + * with the first record of transaction. > + */ > + urec->uur_next = InvalidUndoRecPtr; > + urec->uur_xidepoch = 0; > + urec->uur_dbid = 0; > + urec->uur_progress = 0; > + } > + > + > + /* calculate the size of the undo record. */ > + size += UndoRecordExpectedSize(urec); > + } > > Remove one extra line before comment "calculate the size of ..". Fixed Along with that I have merged latest changes in zheap branch committed by Rafia Sabih for cleaning up the undo buffer information in abort path. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Nov 20, 2018 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Along with that I have merged latest changes in zheap branch committed > by Rafia Sabih for cleaning up the undo buffer information in abort > path. > Thanks, few more comments: 1. @@ -2627,6 +2653,7 @@ AbortTransaction(void) AtEOXact_HashTables(false); AtEOXact_PgStat(false); AtEOXact_ApplyLauncher(false); + AtAbort_ResetUndoBuffers(); Don't we need similar handling in AbortSubTransaction? 2. Read undo record header in by calling UnpackUndoRecord, if the undo + * record header is splited across buffers then we need to read the complete + * header by invoking UnpackUndoRecord multiple times. /splited/splitted. You can just use split here. 3. +/* + * Identifying information for a transaction to which this undo belongs. + * it will also store the total size of the undo for this transaction. + */ +typedef struct UndoRecordTransaction +{ + uint32 urec_progress; /* undo applying progress. */ + uint32 urec_xidepoch; /* epoch of the current transaction */ + Oid urec_dbid; /* database id */ + uint64 urec_next; /* urec pointer of the next transaction */ +} UndoRecordTransaction; /it will/It will. BTW, which field(s) in the above structure stores the size of the undo? 4. + /* + * Set uur_info for an UnpackedUndoRecord appropriately based on which + * fields are set and calculate the size of the undo record based on the + * uur_info. + */ Can we rephrase it as "calculate the size of the undo record based on the info required"? 5. +/* + * Unlock and release undo buffers. This step performed after exiting any + * critical section. + */ +void +UnlockReleaseUndoBuffers(void) Can we change the later sentence as "This step is performed after exiting any critical section where we have performed undo action."? 6. +InsertUndoRecord() { .. + Assert (uur->uur_info != 0); Add a comment above Assert "The undo record must contain a valid information." 6. 
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords, + UndoPersistence upersistence, TransactionId txid) { .. + first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid); + + if ((!InRecovery && prev_txid[upersistence] != txid) || + first_rec_in_recovery) + { + need_start_undo = true; + } Here, I think we can avoid using two boolean variables (first_rec_in_recovery and need_start_undo). Also, this same check is used in this function twice. I have tried to simplify it in the attached. Can you check and let me know if that sounds okay to you? 7. UndoRecordAllocateMulti { .. /* + * If this is the first undo record of the transaction then initialize + * the transaction header fields of the undorecord. Also, set the flag + * in the uur_info to indicate that this record contains the transaction + * header so allocate the space for the same. + */ + if (need_start_undo && i == 0) + { + urec->uur_next = InvalidUndoRecPtr; + urec->uur_xidepoch = GetEpochForXid(txid); + urec->uur_progress = 0; + + /* During recovery, Fetch database id from the undo log state. */ + if (InRecovery) + urec->uur_dbid = UndoLogStateGetDatabaseId(); + else + urec->uur_dbid = MyDatabaseId; + + /* Set uur_info to include the transaction header. */ + urec->uur_info |= UREC_INFO_TRANSACTION; + } .. } It seems here you have written the code in your comments. I have changed it in the attached delta patch. 8. UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords, + TransactionId xid, UndoPersistence upersistence) +{ .. .. + multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare, Can we rename this variable as prepared_urec_ptr or prepared_urp? 9. +void +UndoSetPrepareSize(int max_prepare, I think it will be better to use nrecords instead of 'max_prepare' similar to how you have it in UndoRecordAllocateMulti() 10. 
+ if (!UndoRecPtrIsValid(multi_prep_urp)) + urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid); + else + urecptr = multi_prep_urp; + + size = UndoRecordExpectedSize(urec); .. .. + if (UndoRecPtrIsValid(multi_prep_urp)) + { + UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp); + insert = UndoLogOffsetPlusUsableBytes(insert, size); + multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert); + } Can't we use urecptr instead of multi_prep_urp in above code after urecptr is initialized? 11. +static int max_prepare_undo = MAX_PREPARED_UNDO; Let's change the name of this variable as max_prepared_undo. Already changed in attached delta patch 12. PrepareUndoInsert() { .. + /* Already reached maximum prepared limit. */ + if (prepare_idx == max_prepare_undo) + return InvalidUndoRecPtr; .. } I think in the above condition, we should have elog, otherwise, callers need to be prepared to handle it. 13. UndoRecordAllocateMulti() How about naming it as UndoRecordAllocate as this is used to allocate even a single undo record? 14. If not already done, can you pgindent the new code added by this patch? Attached is a delta patch on top of your previous patch containing some fixes as memtioned above and few other minor changes and cleanup. If you find changes okay, kindly include them in your next version. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
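The UndoSetPrepareSize / PrepareUndoInsert / InsertPreparedUndo discussion above describes a two-phase API: prepare reserves space and records where each undo record will land (advancing a static insert pointer, which is why urecptr cannot replace it), and a later call writes everything inside a critical section. A much-simplified sketch under stated assumptions — records are reduced to byte reservations, abort() stands in for elog(ERROR), and the pointer advance ignores page headers (the real code uses UndoLogOffsetPlusUsableBytes):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_PREPARED_UNDO 2		/* small fixed limit, as in the patch */

typedef uint64_t UndoRecPtr;

typedef struct PreparedUndoSpace
{
	UndoRecPtr	urp;			/* where the record will be written */
	int			size;			/* bytes reserved for it */
} PreparedUndoSpace;

static PreparedUndoSpace prepared_undo[MAX_PREPARED_UNDO];
static int	prepare_idx;
static UndoRecPtr insert_ptr = 0x1000;	/* assumed next free undo offset */

/*
 * Phase 1: reserve space and remember where each record goes.  Erroring
 * out on overflow, rather than returning an invalid pointer, follows
 * review point 12 above.
 */
static UndoRecPtr
prepare_undo_insert(int size)
{
	UndoRecPtr	urp;

	if (prepare_idx == MAX_PREPARED_UNDO)
	{
		fprintf(stderr, "already reached the maximum prepared limit\n");
		abort();				/* stands in for elog(ERROR, ...) */
	}
	urp = insert_ptr;
	prepared_undo[prepare_idx].urp = urp;
	prepared_undo[prepare_idx].size = size;
	prepare_idx++;
	insert_ptr += size;			/* next prepare starts after this record */
	return urp;
}

/*
 * Phase 2: write out all prepared records.  In the patch this runs in a
 * critical section, so it must not fail; here it just returns the count.
 */
static int
insert_prepared_undo(void)
{
	int			n = prepare_idx;

	assert(prepare_idx > 0);	/* something must have been prepared */
	prepare_idx = 0;			/* reset for the next batch */
	return n;
}
```

Keeping the reservation state static between the two calls is the design choice under review: the prepare step can run outside the critical section (where allocation may fail), while the insert step has nothing left to do that can fail.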
On Mon, Nov 26, 2018 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Nov 20, 2018 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Along with that I have merged latest changes in zheap branch committed > > by Rafia Sabih for cleaning up the undo buffer information in abort > > path. > > > > Thanks, few more comments: > > 1. > @@ -2627,6 +2653,7 @@ AbortTransaction(void) > AtEOXact_HashTables(false); > AtEOXact_PgStat(false); > AtEOXact_ApplyLauncher(false); > + AtAbort_ResetUndoBuffers(); > > Don't we need similar handling in AbortSubTransaction? Yeah we do. I have fixed. > > 2. > Read undo record header in by calling UnpackUndoRecord, if the undo > + * record header is splited across buffers then we need to read the complete > + * header by invoking UnpackUndoRecord multiple times. > > /splited/splitted. You can just use split here. Fixed > > 3. > +/* > + * Identifying information for a transaction to which this undo belongs. > + * it will also store the total size of the undo for this transaction. > + */ > +typedef struct UndoRecordTransaction > +{ > + uint32 urec_progress; /* undo applying progress. */ > + uint32 urec_xidepoch; /* epoch of the current transaction */ > + Oid urec_dbid; /* database id */ > + uint64 urec_next; /* urec pointer of the next transaction */ > +} UndoRecordTransaction; > > /it will/It will. > BTW, which field(s) in the above structure stores the size of the undo? Fixed > > 4. > + /* > + * Set uur_info for an UnpackedUndoRecord appropriately based on which > + * fields are set and calculate the size of the undo record based on the > + * uur_info. > + */ > > Can we rephrase it as "calculate the size of the undo record based on > the info required"? Fixed > > 5. > +/* > + * Unlock and release undo buffers. This step performed after exiting any > + * critical section. 
> + */ > +void > +UnlockReleaseUndoBuffers(void) > > Can we change the later sentence as "This step is performed after > exiting any critical section where we have performed undo action."? Done, I mentioned This step is performed after exiting any critical section where we have prepared undo record. > > 6. > +InsertUndoRecord() > { > .. > + Assert (uur->uur_info != 0); > > Add a comment above Assert "The undo record must contain a valid information." Done > > 6. > +UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords, > + UndoPersistence upersistence, TransactionId txid) > { > .. > + first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid); > + > + if ((!InRecovery && prev_txid[upersistence] != txid) || > + first_rec_in_recovery) > + { > + need_start_undo = true; > + } > > Here, I think we can avoid using two boolean variables > (first_rec_in_recovery and need_start_undo). Also, this same check is > used in this function twice. I have tried to simplify it in the > attached. Can you check and let me know if that sounds okay to you? I have taken your changes > > 7. > UndoRecordAllocateMulti > { > .. > /* > + * If this is the first undo record of the transaction then initialize > + * the transaction header fields of the undorecord. Also, set the flag > + * in the uur_info to indicate that this record contains the transaction > + * header so allocate the space for the same. > + */ > + if (need_start_undo && i == 0) > + { > + urec->uur_next = InvalidUndoRecPtr; > + urec->uur_xidepoch = GetEpochForXid(txid); > + urec->uur_progress = 0; > + > + /* During recovery, Fetch database id from the undo log state. */ > + if (InRecovery) > + urec->uur_dbid = UndoLogStateGetDatabaseId(); > + else > + urec->uur_dbid = MyDatabaseId; > + > + /* Set uur_info to include the transaction header. */ > + urec->uur_info |= UREC_INFO_TRANSACTION; > + } > .. > } > > It seems here you have written the code in your comments. 
I have > changed it in the attached delta patch. Taken you changes > > 8. > UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords, > + TransactionId xid, UndoPersistence upersistence) > +{ > .. > .. > + multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare, > > Can we rename this variable as prepared_urec_ptr or prepared_urp? Done > > 9. > +void > +UndoSetPrepareSize(int max_prepare, > > I think it will be better to use nrecords instead of 'max_prepare' > similar to how you have it in UndoRecordAllocateMulti() Done > > 10. > + if (!UndoRecPtrIsValid(multi_prep_urp)) > + urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid); > + else > + urecptr = multi_prep_urp; > + > + size = UndoRecordExpectedSize(urec); > .. > .. > + if (UndoRecPtrIsValid(multi_prep_urp)) > + { > + UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp); > + insert = UndoLogOffsetPlusUsableBytes(insert, size); > + multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert); > + } > > Can't we use urecptr instead of multi_prep_urp in above code after > urecptr is initialized? I think we can't because urecptr is just the current pointer we are going to return but multi_prep_urp is the static variable we need to update so that the next prepare can calculate urecptr from this location. > > 11. > +static int max_prepare_undo = MAX_PREPARED_UNDO; > > Let's change the name of this variable as max_prepared_undo. Already > changed in attached delta patch ok > > 12. > PrepareUndoInsert() > { > .. > + /* Already reached maximum prepared limit. */ > + if (prepare_idx == max_prepare_undo) > + return InvalidUndoRecPtr; > .. > } > > I think in the above condition, we should have elog, otherwise, > callers need to be prepared to handle it. Done > > 13. > UndoRecordAllocateMulti() > > How about naming it as UndoRecordAllocate as this is used to allocate > even a single undo record? Done > > 14. 
> If not already done, can you pgindent the new code added by this patch?

Done

> Attached is a delta patch on top of your previous patch containing
> some fixes as mentioned above and few other minor changes and cleanup.
> If you find changes okay, kindly include them in your next version.

I have taken your changes.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
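Point 6 above (folding the first_rec_in_recovery and need_start_undo booleans into one check) amounts to: a record needs a transaction header exactly when its xid differs from the last xid seen at that persistence level. A hedged sketch of just that check, leaving out the recovery-side branch (IsTransactionFirstRec) entirely:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId 0

/* One slot per persistence level, mirroring the patch's prev_txid[]. */
typedef enum
{
	UNDO_PERMANENT,
	UNDO_UNLOGGED,
	UNDO_TEMP,
	UndoPersistenceLevels
} UndoPersistence;

static TransactionId prev_txid[UndoPersistenceLevels];

/*
 * Return true when 'txid' is writing its first undo record at this
 * persistence level, i.e. when the record must carry the transaction
 * header (UREC_INFO_TRANSACTION).  Recovery is not handled here.
 */
static bool
need_xact_header(TransactionId txid, UndoPersistence p)
{
	if (prev_txid[p] == txid)
		return false;
	prev_txid[p] = txid;
	return true;
}
```

Tracking the last xid per persistence level matters because a single transaction can write to permanent and unlogged undo logs, and each log chain needs its own transaction header.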
On Thu, Nov 29, 2018 at 6:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Nov 26, 2018 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > 10. > > + if (!UndoRecPtrIsValid(multi_prep_urp)) > > + urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid); > > + else > > + urecptr = multi_prep_urp; > > + > > + size = UndoRecordExpectedSize(urec); > > .. > > .. > > + if (UndoRecPtrIsValid(multi_prep_urp)) > > + { > > + UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp); > > + insert = UndoLogOffsetPlusUsableBytes(insert, size); > > + multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert); > > + } > > > > Can't we use urecptr instead of multi_prep_urp in above code after > > urecptr is initialized? > > I think we can't because urecptr is just the current pointer we are > going to return but multi_prep_urp is the static variable we need to > update so that > the next prepare can calculate urecptr from this location. > > Okay, but that was not apparent from the code, so added a comment in the attached delta patch. BTW, wouldn't it be better to move this code to the end of function once prepare for current record is complete. More comments ---------------------------- 1. * We can consider that the log as switched if /that/ needs to be removed. 2. + if (prepare_idx == max_prepared_undo) + elog(ERROR, "Already reached the maximum prepared limit."); a. /Already/already b. we don't use full-stop (.) in error 3. + * also stores the dbid and the progress of the undo apply during rollback. /the progress/ extra space. 4. +UndoSetPrepareSize(int nrecords, UnpackedUndoRecord *undorecords, + TransactionId xid, UndoPersistence upersistence) +{ nrecords should be a second parameter. 5. +UndoRecPtr +PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence, + TransactionId xid) It seems better to have xid parameter before UndoPersistence. 6. + /* FIXME: Should we just report error ? 
*/ + Assert(index < MAX_BUFFER_PER_UNDO); No need of this Fixme. 7. PrepareUndoInsert() { .. do { .. + /* Undo record can not fit into this block so go to the next block. */ + cur_blk++; .. } while (cur_size < size); .. } This comment was making me uneasy, so slightly adjusted the code. Basically, by that time it was not decided whether undo record can fit in current buffer or not. 8. + /* + * Save referenced of undo record pointer as well as undo record. + * InsertPreparedUndo will use these to insert the prepared record. + */ + prepared_undo[prepare_idx].urec = urec; + prepared_undo[prepare_idx].urp = urecptr; Slightly adjust the above comment. 9. +InsertPreparedUndo(void) { .. + Assert(prepare_idx > 0); + + /* This must be called under a critical section. */ + Assert(InRecovery || CritSectionCount > 0); .. } I have added a few more comments for above assertions, see if those are correct. 10. +InsertPreparedUndo(void) { .. + prev_undolen = undo_len; + + UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen); .. } There is no need to use an additional variable prev_undolen in the above code. I have modified the code to remove it's usage, check if that is correct. 11. + * Fetch the next undo record for given blkno, offset and transaction id (if + * valid). We need to match transaction id along with block number and offset + * because in some cases (like reuse of slot for committed transaction), we + * need to skip the record if it is modified by a transaction later than the + * transaction indicated by previous undo record. For example, consider a + * case where tuple (ctid - 0,1) is modified by transaction id 500 which + * belongs to transaction slot 0. Then, the same tuple is modified by + * transaction id 501 which belongs to transaction slot 1. Then, both the + * transaction slots are marked for reuse. Then, again the same tuple is + * modified by transaction id 502 which has used slot 0. 
Now, some
+ * transaction which has started before transaction 500 wants to traverse the
+ * chain to find visible tuple will keep on rotating infinitely between undo
+ * tuple written by 502 and 501. In such a case, we need to skip the undo
+ * tuple written by transaction 502 when we want to find the undo record
+ * indicated by the previous pointer of undo tuple written by transaction 501.
+ * Start the search from urp. Caller need to call UndoRecordRelease to
+ * release the resources allocated by this function.
+ *
+ * urec_ptr_out is undo record pointer of the qualified undo record if valid
+ * pointer is passed.
+ */
+UnpackedUndoRecord *
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr * urec_ptr_out,
+				SatisfyUndoRecordCallback callback)

The comment above UndoFetchRecord is very zheap-specific, so I have
tried to simplify it.  I think we can give such detailed examples only
when we introduce the zheap code.

Apart from the above, there are miscellaneous comments and minor code
edits in the attached delta patch.

12.
PrepareUndoInsert()
{
..
+	/*
+	 * If this is the first undo record for this top transaction add the
+	 * transaction information to the undo record.
+	 *
+	 * XXX there is also an option that instead of adding the information to
+	 * this record we can prepare a new record which only contain transaction
+	 * informations.
+	 */
+	if (xid == InvalidTransactionId)

The above comment seems to be out of place; we are doing nothing like
that here.  This work is done in UndoRecordAllocate, so maybe you can
move the 'XXX ..' part of the comment into that function.

13.
PrepareUndoInsert()
{
..
if (!UndoRecPtrIsValid(prepared_urec_ptr))
+		urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
+	else
+		urecptr = prepared_urec_ptr;
+
+	size = UndoRecordExpectedSize(urec);
..

I think we should make the above code a bit more bulletproof.  As it is
written, there is no guarantee that the size we have allocated is the
same as what we use in this function.  How about taking 'size' as an
output parameter from UndoRecordAllocate and then using it in this
function?  Additionally, we can have an Assert that the size returned
by UndoRecordAllocate is the same as UndoRecordExpectedSize.

14.
+
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int		idx;
+	int		flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int		idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}

A one-line comment atop each of these functions would be good.  It
would also be better to place these functions at the end of the file or
someplace else; right now they sit between the prepare* and insert*
functions, which makes the code flow a bit awkward.

15.
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{

Why should this step be performed in a critical section only for
persistent undo?  I think as this function operates on shared buffers,
it should be done in a critical section even for unlogged undo.

16.
+InsertPreparedUndo(void)
{
..
/* if starting a new log then there is no prevlen to store */
+	if (offset == UndoLogBlockHeaderSize)
+	{
+		if (log->meta.prevlogno != InvalidUndoLogNumber)
+		{
+			UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+
+			uur->uur_prevlen = prevlog->meta.prevlen;
+		}
..
}

The comment here suggests that for new logs we don't need prevlen, but
in one case you are still maintaining the length; can you add a few
comments to explain why?

17.
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
{
..
+		/*
+		 * FIXME: This can be optimized to just fetch header first and only if
+		 * matches with block number and offset then fetch the complete
+		 * record.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
..
}

I don't know how much it matters whether we fetch the complete record
or just its header, unless the record is big or it falls across two
pages.  I think both are boundary cases, and I couldn't see this part
much in perf profiles.  There is nothing to fix here; if you want, you
can add an XXX comment or maybe suggest it as a future optimization.

18.
+UndoFetchRecord()
{
...
+	/*
+	 * Prevent UndoDiscardOneLog() from discarding data while we try to
+	 * read it.  Usually we would acquire log->mutex to read log->meta
+	 * members, but in this case we know that discard can't move without
+	 * also holding log->discard_lock.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+	if (!UndoRecPtrIsValid(log->oldest_data))
+	{
+		/*
+		 * UndoDiscardInfo is not yet initialized.  Hence, we've to check
+		 * UndoLogIsDiscarded and if it's already discarded then we have
+		 * nothing to do.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(urp))
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check if it's already discarded. */
+	if (urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		if (BufferIsValid(urec->uur_buffer))
+			ReleaseBuffer(urec->uur_buffer);
+		return NULL;
+	}
..
}

Can't we replace this logic with UndoRecordIsValid?

19.
UndoFetchRecord()
{
..
while(true)
{
..
/*
+		 * If we have a valid buffer pinned then just ensure that we want to
+		 * find the next tuple from the same block.  Otherwise release the
+		 * buffer and set it invalid
+		 */
+		if (BufferIsValid(urec->uur_buffer))
+		{
+			/*
+			 * Undo buffer will be changed if the next undo record belongs to
+			 * a different block or undo log.
+			 */
+			if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+				(prevrnode.relNode != rnode.relNode))
+			{
+				ReleaseBuffer(urec->uur_buffer);
+				urec->uur_buffer = InvalidBuffer;
+			}
+		}
+		else
+		{
+			/*
+			 * If there is not a valid buffer in urec->uur_buffer that means
+			 * we had copied the payload data and tuple data so free them.
+			 */
+			if (urec->uur_payload.data)
+				pfree(urec->uur_payload.data);
+			if (urec->uur_tuple.data)
+				pfree(urec->uur_tuple.data);
+		}
+
+		/* Reset the urec before fetching the tuple */
+		urec->uur_tuple.data = NULL;
+		urec->uur_tuple.len = 0;
+		urec->uur_payload.data = NULL;
+		urec->uur_payload.len = 0;
+		prevrnode = rnode;
..
}

Can't we move this logic after getting a record with UndoGetOneRecord
and matching it with the callback?  It is certainly required only after
the first record, so it looks a bit odd here.  Also, if possible, can
we move it to a separate function, as this is not the main logic and it
makes the main logic difficult to follow.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
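To make the suggestion in point 19 concrete, the factored-out helper could look roughly like this standalone sketch.  All types and names here are simplified, hypothetical stand-ins (plain `free` instead of `pfree`, a counter instead of the real buffer manager), not the actual PostgreSQL API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real PostgreSQL types. */
typedef int Buffer;
#define InvalidBuffer 0

typedef struct StringData
{
	char	   *data;
	int			len;
} StringData;

typedef struct UnpackedUndoRecord
{
	Buffer		uur_buffer;
	StringData	uur_payload;
	StringData	uur_tuple;
} UnpackedUndoRecord;

static int	buffers_released = 0;

static void
ReleaseBuffer(Buffer buf)
{
	/* Stand-in for the buffer manager's unpin. */
	buffers_released++;
}

/*
 * Factored-out helper: release or keep the cached buffer depending on
 * whether the next record lives in the same block, and reset the copied
 * payload/tuple data either way.
 */
static void
ResetUndoRecord(UnpackedUndoRecord *urec, bool same_block)
{
	if (urec->uur_buffer != InvalidBuffer)
	{
		if (!same_block)
		{
			ReleaseBuffer(urec->uur_buffer);
			urec->uur_buffer = InvalidBuffer;
		}
	}
	else
	{
		/* No pinned buffer: payload/tuple were copied, so free them. */
		free(urec->uur_payload.data);
		free(urec->uur_tuple.data);
	}

	urec->uur_payload.data = NULL;
	urec->uur_payload.len = 0;
	urec->uur_tuple.data = NULL;
	urec->uur_tuple.len = 0;
}
```

With such a helper, the main fetch loop in UndoFetchRecord would only decide whether the next record is in the same block and call it, keeping the visibility-matching logic readable.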
Attachment
On Sat, Dec 1, 2018 at 12:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Nov 29, 2018 at 6:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Nov 26, 2018 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 10.
> > > + if (!UndoRecPtrIsValid(multi_prep_urp))
> > > + urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
> > > + else
> > > + urecptr = multi_prep_urp;
> > > +
> > > + size = UndoRecordExpectedSize(urec);
> > > ..
> > > ..
> > > + if (UndoRecPtrIsValid(multi_prep_urp))
> > > + {
> > > + UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
> > > + insert = UndoLogOffsetPlusUsableBytes(insert, size);
> > > + multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
> > > + }
> > >
> > > Can't we use urecptr instead of multi_prep_urp in above code after
> > > urecptr is initialized?
> >
> > I think we can't because urecptr is just the current pointer we are
> > going to return but multi_prep_urp is the static variable we need to
> > update so that the next prepare can calculate urecptr from this
> > location.
> >
>
> Okay, but that was not apparent from the code, so added a comment in
> the attached delta patch.  BTW, wouldn't it be better to move this
> code to the end of function once prepare for current record is
> complete.
>
> More comments
> ----------------------------
> 1.
> * We can consider that the log as switched if
>
> /that/ needs to be removed.
>
> 2.
> + if (prepare_idx == max_prepared_undo)
> + elog(ERROR, "Already reached the maximum prepared limit.");
>
> a. /Already/already
> b. we don't use full-stop (.) in error

Merged your change.

> 3.
> + * also stores the dbid and the progress of the undo apply during rollback.
>
> /the progress/ extra space.

Merged your change.

> 4.
> +UndoSetPrepareSize(int nrecords, UnpackedUndoRecord *undorecords,
> + TransactionId xid, UndoPersistence upersistence)
> +{
>
> nrecords should be a second parameter.

Merged your change.

> 5.
> +UndoRecPtr
> +PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
> + TransactionId xid)
>
> It seems better to have xid parameter before UndoPersistence.

Merged your change.  Made the same change in UndoRecordAllocate as well.

> 6.
> + /* FIXME: Should we just report error ? */
> + Assert(index < MAX_BUFFER_PER_UNDO);
>
> No need of this Fixme.

Merged your change.

> 7.
> PrepareUndoInsert()
> {
> ..
> do
> {
> ..
> + /* Undo record can not fit into this block so go to the next block. */
> + cur_blk++;
> ..
> } while (cur_size < size);
> ..
> }
>
> This comment was making me uneasy, so slightly adjusted the code.
> Basically, by that time it was not decided whether undo record can fit
> in current buffer or not.

Merged your change.

> 8.
> + /*
> + * Save referenced of undo record pointer as well as undo record.
> + * InsertPreparedUndo will use these to insert the prepared record.
> + */
> + prepared_undo[prepare_idx].urec = urec;
> + prepared_undo[prepare_idx].urp = urecptr;
>
> Slightly adjust the above comment.

Merged your change.

> 9.
> +InsertPreparedUndo(void)
> {
> ..
> + Assert(prepare_idx > 0);
> +
> + /* This must be called under a critical section. */
> + Assert(InRecovery || CritSectionCount > 0);
> ..
> }
>
> I have added a few more comments for above assertions, see if those are correct.

Merged your change.

> 10.
> +InsertPreparedUndo(void)
> {
> ..
> + prev_undolen = undo_len;
> +
> + UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
> ..
> }
>
> There is no need to use an additional variable prev_undolen in the
> above code.  I have modified the code to remove it's usage, check if
> that is correct.

Looks fine to me.

> 11.
> + * Fetch the next undo record for given blkno, offset and transaction id (if
> + * valid).  We need to match transaction id along with block number and offset
> + * because in some cases (like reuse of slot for committed transaction), we
> + * need to skip the record if it is modified by a transaction later than the
> + * transaction indicated by previous undo record.  For example, consider a
> + * case where tuple (ctid - 0,1) is modified by transaction id 500 which
> + * belongs to transaction slot 0.  Then, the same tuple is modified by
> + * transaction id 501 which belongs to transaction slot 1.  Then, both the
> + * transaction slots are marked for reuse.  Then, again the same tuple is
> + * modified by transaction id 502 which has used slot 0.  Now, some
> + * transaction which has started before transaction 500 wants to traverse the
> + * chain to find visible tuple will keep on rotating infinitely between undo
> + * tuple written by 502 and 501.  In such a case, we need to skip the undo
> + * tuple written by transaction 502 when we want to find the undo record
> + * indicated by the previous pointer of undo tuple written by transaction 501.
> + * Start the search from urp.  Caller need to call UndoRecordRelease to
> + * release the resources allocated by this function.
> + *
> + * urec_ptr_out is undo record pointer of the qualified undo record if valid
> + * pointer is passed.
> + */
> +UnpackedUndoRecord *
> +UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
> + TransactionId xid, UndoRecPtr * urec_ptr_out,
> + SatisfyUndoRecordCallback callback)
>
> The comment above UndoFetchRecord is very zheap specific, so I have
> tried to simplify it.  I think we can give so much detailed examples
> only when we introduce zheap code.

Makes sense.

> Apart from above, there are miscellaneous comments and minor code
> edits in the attached delta patch.

I have merged your changes.

> 12.
> PrepareUndoInsert()
> {
> ..
> + /*
> + * If this is the first undo record for this top transaction add the
> + * transaction information to the undo record.
> + *
> + * XXX there is also an option that instead of adding the information to
> + * this record we can prepare a new record which only contain transaction
> + * informations.
> + */
> + if (xid == InvalidTransactionId)
>
> The above comment seems to be out of place, we are doing nothing like
> that here.  This work is done in UndoRecordAllocate, may be you can
> move 'XXX ..' part of the comment in that function.

Done.

> 13.
> PrepareUndoInsert()
> {
> ..
> if (!UndoRecPtrIsValid(prepared_urec_ptr))
> + urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
> + else
> + urecptr = prepared_urec_ptr;
> +
> + size = UndoRecordExpectedSize(urec);
> ..
>
> I think we should make above code a bit more bulletproof.  As it is
> written, there is no guarantee the size we have allocated is same as
> we are using in this function.

I agree.

> How about if we take 'size' as output
> parameter from UndoRecordAllocate and then use it in this function?
> Additionally, we can have an Assert that the size returned by
> UndoRecordAllocate is same as UndoRecordExpectedSize.

With this change we will be able to guarantee it when we are allocating
a single undo record, but multi-prepare will still be a problem.  I
haven't fixed this yet; I will think about how to handle both cases,
i.e. when we prepare once, and when we allocate once and prepare
multiple times.

> 14.
> +
> +void
> +RegisterUndoLogBuffers(uint8 first_block_id)
> +{
> + int idx;
> + int flags;
> +
> + for (idx = 0; idx < buffer_idx; idx++)
> + {
> + flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
> + XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
> + }
> +}
> +
> +void
> +UndoLogBuffersSetLSN(XLogRecPtr recptr)
> +{
> + int idx;
> +
> + for (idx = 0; idx < buffer_idx; idx++)
> + PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
> +}
>
> One line comment atop of these functions will be good.  It will be
> better if we place these functions at the end of file or someplace
> else, as right now they are between prepare* and insert* function
> calls which makes code flow bit awkward.

Done.

> 15.
> + * and mark them dirty.  For persistent undo, this step should be performed
> + * after entering a critical section; it should never fail.
> + */
> +void
> +InsertPreparedUndo(void)
> +{
>
> Why only for persistent undo this step should be performed in the
> critical section?  I think as this function operates on shared buffers,
> even for unlogged undo, it should be done in a critical section.

I think we can remove this comment?  I have removed it in the current
patch.

> 16.
> +InsertPreparedUndo(void)
> {
> ..
> /* if starting a new log then there is no prevlen to store */
> + if (offset == UndoLogBlockHeaderSize)
> + {
> + if (log->meta.prevlogno != InvalidUndoLogNumber)
> + {
> + UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
> +
> + uur->uur_prevlen = prevlog->meta.prevlen;
> + }
> ..
> }
>
> The comment here suggests that for new logs, we don't need prevlen,
> but still, in one case you are maintaining the length, can you add few
> comments to explain why?

Done.

> 17.
> +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
> + UndoPersistence persistence)
> {
> ..
> + /*
> + * FIXME: This can be optimized to just fetch header first and only if
> + * matches with block number and offset then fetch the complete
> + * record.
> + */
> + if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
> + break;
> ..
> }
>
> I don't know how much it matters if we fetch complete record or just
> it's header unless the record is big or it falls in two pages.  I
> think both are boundary cases and I couldn't see this part much in
> perf profiles.  There is nothing to fix here; if you want, you can add
> an XXX comment or maybe suggest it as a future optimization.

Changed.

> 18.
> +UndoFetchRecord()
> {
> ...
> + /*
> + * Prevent UndoDiscardOneLog() from discarding data while we try to
> + * read it.  Usually we would acquire log->mutex to read log->meta
> + * members, but in this case we know that discard can't move without
> + * also holding log->discard_lock.
> + */
> + LWLockAcquire(&log->discard_lock, LW_SHARED);
> + if (!UndoRecPtrIsValid(log->oldest_data))
> + {
> + /*
> + * UndoDiscardInfo is not yet initialized.  Hence, we've to check
> + * UndoLogIsDiscarded and if it's already discarded then we have
> + * nothing to do.
> + */
> + LWLockRelease(&log->discard_lock);
> + if (UndoLogIsDiscarded(urp))
> + {
> + if (BufferIsValid(urec->uur_buffer))
> + ReleaseBuffer(urec->uur_buffer);
> + return NULL;
> + }
> +
> + LWLockAcquire(&log->discard_lock, LW_SHARED);
> + }
> +
> + /* Check if it's already discarded. */
> + if (urp < log->oldest_data)
> + {
> + LWLockRelease(&log->discard_lock);
> + if (BufferIsValid(urec->uur_buffer))
> + ReleaseBuffer(urec->uur_buffer);
> + return NULL;
> + }
> ..
> }
>
> Can't we replace this logic with UndoRecordIsValid?

Done.

> 19.
> UndoFetchRecord()
> {
> ..
> while(true)
> {
> ..
> /*
> + * If we have a valid buffer pinned then just ensure that we want to
> + * find the next tuple from the same block.  Otherwise release the
> + * buffer and set it invalid
> + */
> + if (BufferIsValid(urec->uur_buffer))
> + {
> + /*
> + * Undo buffer will be changed if the next undo record belongs to
> + * a different block or undo log.
> + */
> + if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
> + (prevrnode.relNode != rnode.relNode))
> + {
> + ReleaseBuffer(urec->uur_buffer);
> + urec->uur_buffer = InvalidBuffer;
> + }
> + }
> + else
> + {
> + /*
> + * If there is not a valid buffer in urec->uur_buffer that means
> + * we had copied the payload data and tuple data so free them.
> + */
> + if (urec->uur_payload.data)
> + pfree(urec->uur_payload.data);
> + if (urec->uur_tuple.data)
> + pfree(urec->uur_tuple.data);
> + }
> +
> + /* Reset the urec before fetching the tuple */
> + urec->uur_tuple.data = NULL;
> + urec->uur_tuple.len = 0;
> + urec->uur_payload.data = NULL;
> + urec->uur_payload.len = 0;
> + prevrnode = rnode;
> ..
> }
>
> Can't we move this logic after getting a record with UndoGetOneRecord
> and matching with a callback?  This is certainly required after the
> first record, so it looks a bit odd here.  Also, if possible can we
> move it to a separate function as this is not the main logic and makes
> the main logic difficult to follow.

Fixed.

Apart from addressing these comments, I have also rebased Thomas' undo
log patches on the current head.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
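For reference, the consolidation asked for in point 18 could look roughly like this standalone sketch.  Names and types are hypothetical stand-ins; in the real code the check would run with log->discard_lock held in shared mode, which is omitted here:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real PostgreSQL types. */
typedef unsigned long long UndoRecPtr;
#define InvalidUndoRecPtr ((UndoRecPtr) 0)

typedef struct UndoLogControl
{
	UndoRecPtr	oldest_data;	/* oldest non-discarded record, once known */
} UndoLogControl;

/* Stand-in for UndoLogIsDiscarded(): a fixed discard boundary. */
static const UndoRecPtr discard_horizon = 100;

static bool
undo_log_is_discarded(UndoRecPtr urp)
{
	return urp < discard_horizon;
}

/*
 * Single helper answering "does this undo record pointer still exist?",
 * instead of open-coding the two-step check in UndoFetchRecord().
 */
static bool
undo_record_is_valid(const UndoLogControl *log, UndoRecPtr urp)
{
	if (log->oldest_data == InvalidUndoRecPtr)
	{
		/* Discard info not initialized yet; ask the undo log layer. */
		return !undo_log_is_discarded(urp);
	}
	return urp >= log->oldest_data;
}
```

The caller then only has to release its pinned buffer and bail out when the helper returns false, which keeps the retry-after-release dance in one place.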
Attachment
On Tue, Dec 4, 2018 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Dec 1, 2018 at 12:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > 13.
> > PrepareUndoInsert()
> > {
> > ..
> > if (!UndoRecPtrIsValid(prepared_urec_ptr))
> > + urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
> > + else
> > + urecptr = prepared_urec_ptr;
> > +
> > + size = UndoRecordExpectedSize(urec);
> > ..
> >
> > I think we should make above code a bit more bulletproof.  As it is
> > written, there is no guarantee the size we have allocated is same as
> > we are using in this function.
>
> I agree
>
> > How about if we take 'size' as output
> > parameter from UndoRecordAllocate and then use it in this function?
> > Additionally, we can have an Assert that the size returned by
> > UndoRecordAllocate is same as UndoRecordExpectedSize.
>
> With this change we will be able to guarantee when we are allocating
> single undo record but multi prepare will still be a problem.  I
> haven't fix this as of now.  I will think on how to handle both the
> cases when we have to prepare one time or when we have to allocate
> once and prepare multiple time.
>

Yeah, this appears tricky.  I think we can leave it as it is unless we
come up with a better idea.

1.
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+	prev_txid[persistence] = InvalidTransactionId;
+}

Are we using this API in this or any other patch or in zheap?  I see
this can be a useful API, but unless we have a plan to use it, I don't
think we should retain it.

2.
+ * Handling multi log:
+ *
+ * It is possible that the undo record of a transaction can be spread across
+ * multiple undo log.  And, we need some special handling while inserting the
+ * undo for discard and rollback to work sanely.
+ *
+ * If the undorecord goes to next log then we insert a transaction header for
+ * the first record in the new log and update the transaction header with this
+ * new log's location.  This will allow us to connect transactions across logs
+ * when the same transaction span across log (for this we keep track of the
+ * previous logno in undo log meta) which is required to find the latest undo
+ * record pointer of the aborted transaction for executing the undo actions
+ * before discard.  If the next log get processed first in that case we
+ * don't need to trace back the actual start pointer of the transaction,
+ * in such case we can only execute the undo actions from the current log
+ * because the undo pointer in the slot will be rewound and that will be enough
+ * to avoid executing same actions.  However, there is possibility that after
+ * executing the undo actions the undo pointer got discarded, now in later
+ * stage while processing the previous log it might try to fetch the undo
+ * record in the discarded log while chasing the transaction header chain.
+ * To avoid this situation we first check if the next_urec of the transaction
+ * is already discarded then no need to access that and start executing from
+ * the last undo record in the current log.

I think I see the problem in the discard mechanism when a transaction's
undo is spread across multiple logs.  Basically, if the second log
contains undo of some transaction prior to the transaction which has
just decided to spread its undo into the chosen undo log, then we might
discard the undo of some transaction(s) inadvertently.  Am I missing
something?  If not, then I guess we need to ensure that we don't
immediately discard the undo in the second log when a single
transaction's undo is spread across two logs.

Before choosing a new undo log to span the undo for a transaction, do
we ensure that it is not already linked with some other undo log for a
similar reason?

One more thing in this regard:
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId txid, UndoPersistence upersistence)
{
..
..
+	if (InRecovery)
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+	else
+		urecptr = UndoLogAllocate(size, upersistence);
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+	/*
+	 * By now, we must be attached to some undo log unless we are in recovery.
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * We can consider the log as switched if this is the first record of the
+	 * log and not the first record of the transaction i.e. same transaction
+	 * continued from the previous log.
+	 */
+	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+		log->meta.prevlogno != InvalidUndoLogNumber)
+		log_switched = true;
..
..
}

Isn't there a hidden assumption in the above code that you will always
get a fresh undo log if the undo doesn't fit in the currently attached
log?  What guarantees that?

3.
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
+					 UndoPersistence);
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PreparedUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);

This text is a duplicate of what is mentioned in the .c file, so I have
removed it in the delta patch.  Similarly, I have removed the duplicate
text atop other functions exposed via undorecord.h.

4.
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);

This API is nowhere defined or used.  What is the intention?

5.
+typedef struct UndoRecordHeader
+{
..
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older then RecentGlobalXmin, then we can consider the tuple
+	 * in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
..

/then/than

I think we need to mention oldestXidWithEpochHavingUndo instead of
RecentGlobalXmin.

6.
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS 0x01

I have expanded these comments in the attached delta patch.  I think we
should remove the define UREC_INFO_PAYLOAD_CONTAINS_SLOT from the
patch, as this is zheap-specific and should be added later along with
the zheap code.

7.
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	ForkNumber	urec_fork;		/* fork number */
+} UndoRecordRelationDetails;

This comment seems to be out-dated, so I have modified it in the
attached delta patch.

8.
+typedef struct UndoRecordTransaction
+{
+	uint32		urec_progress;	/* undo applying progress. */
+	uint32		urec_xidepoch;	/* epoch of the current transaction */

Can you expand the comments about how the progress is defined and used?
Also, write a few sentences about why the epoch is captured and/or
used?

9.
+#define urec_next_pos \
+	(SizeOfUndoRecordTransaction - SizeOfUrecNext)

What is its purpose?

10.
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);

Again, I see duplicate text in the .h and .c files, so I have removed
this and similar comments from the .h files.  I have tried to move some
parts of the comments from .h to .c files, so that it is easier to read
in one place rather than referring to two places.  See if I have missed
anything.

Apart from the above, I have made a few other cosmetic changes in the
attached delta patch; if you like them, kindly include them in the main
patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
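To illustrate the layout convention in point 6: since the optional structures always follow in the order the flag bits are defined, the expected record size can be computed by walking the bits in that order.  A simplified standalone sketch (the sizes here are illustrative placeholders, not the real struct sizes):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as in the patch; the section sizes are made up. */
#define UREC_INFO_RELATION_DETAILS	0x01
#define UREC_INFO_BLOCK				0x02
#define UREC_INFO_PAYLOAD			0x04
#define UREC_INFO_TRANSACTION		0x08

enum
{
	HEADER_SIZE = 16,
	RELDETAILS_SIZE = 4,
	BLOCK_SIZE = 8,
	PAYLOAD_SIZE = 8,
	TRANSACTION_SIZE = 24
};

/*
 * Because the optional structures always appear in the order in which
 * their flag bits are defined, the total on-disk size is just the
 * header plus the sizes of whichever sections uur_info says are present.
 */
static size_t
record_expected_size(uint8_t uur_info)
{
	size_t		size = HEADER_SIZE;

	if (uur_info & UREC_INFO_RELATION_DETAILS)
		size += RELDETAILS_SIZE;
	if (uur_info & UREC_INFO_BLOCK)
		size += BLOCK_SIZE;
	if (uur_info & UREC_INFO_PAYLOAD)
		size += PAYLOAD_SIZE;
	if (uur_info & UREC_INFO_TRANSACTION)
		size += TRANSACTION_SIZE;
	return size;
}
```

The unpack side can walk the same bits in the same order, which is why the "structures appear in the order the constants are defined" rule matters.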
Attachment
On Sat, Dec 8, 2018 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Dec 4, 2018 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Dec 1, 2018 at 12:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > 13.
> > PrepareUndoInsert()
> > {
> > ..
> > if (!UndoRecPtrIsValid(prepared_urec_ptr))
> > + urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
> > + else
> > + urecptr = prepared_urec_ptr;
> > +
> > + size = UndoRecordExpectedSize(urec);
> > ..
> >
> > I think we should make above code a bit more bulletproof. As it is
> > written, there is no guarantee the size we have allocated is same as
> > we are using in this function.
> I agree
> How about if we take 'size' as output
> > parameter from UndoRecordAllocate and then use it in this function?
> > Additionally, we can have an Assert that the size returned by
> > UndoRecordAllocate is same as UndoRecordExpectedSize.
> With this change we will be able to guarantee when we are allocating
> single undo record
> but multi prepare will still be a problem. I haven't fix this as of
> now. I will think on how
> to handle both the cases when we have to prepare one time or when we
> have to allocate
> once and prepare multiple time.
>
Yeah, this appears tricky. I think we can leave it as it is unless we
come up with a better idea.
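For the single-record case, the cross-check discussed above could look roughly like this standalone sketch (hypothetical, simplified names; not the actual PostgreSQL API):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real PostgreSQL types. */
typedef unsigned long long UndoRecPtr;

typedef struct UnpackedUndoRecord
{
	size_t		payload_len;	/* illustrative variable-size part */
} UnpackedUndoRecord;

#define UNDO_RECORD_HEADER_SIZE 32

/* Model of UndoRecordExpectedSize(). */
static size_t
undo_record_expected_size(const UnpackedUndoRecord *urec)
{
	return UNDO_RECORD_HEADER_SIZE + urec->payload_len;
}

/*
 * Model of UndoRecordAllocate() with 'size' as an output parameter, so
 * the caller can cross-check the allocation against the size it is
 * about to write.
 */
static UndoRecPtr
undo_record_allocate(const UnpackedUndoRecord *urec, size_t *size_out)
{
	static UndoRecPtr insert_point = 1000;	/* fake allocation cursor */
	UndoRecPtr	urecptr = insert_point;

	*size_out = undo_record_expected_size(urec);
	insert_point += *size_out;
	return urecptr;
}

static UndoRecPtr
prepare_undo_insert(const UnpackedUndoRecord *urec)
{
	size_t		alloc_size;
	UndoRecPtr	urecptr = undo_record_allocate(urec, &alloc_size);

	/* The allocated size must match what we are about to write. */
	assert(alloc_size == undo_record_expected_size(urec));
	return urecptr;
}
```

The multi-prepare case would additionally need the sum of the per-record expected sizes to match the one up-front allocation, which is the part that remains open above.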
1.
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+ prev_txid[persistence] = InvalidTransactionId;
+}
Are we using this API in this or any other patch or in zheap? I see
this can be a useful API, but unless we have a plan to use it, I don't
think we should retain it.
Currently, we are not using it, so I have removed it.
2.
+ * Handling multi log:
+ *
+ * It is possible that the undo record of a transaction can be spread across
+ * multiple undo log. And, we need some special handling while inserting the
+ * undo for discard and rollback to work sanely.
+ *
+ * If the undorecord goes to next log then we insert a transaction header for
+ * the first record in the new log and update the transaction header with this
+ * new log's location. This will allow us to connect transactions across logs
+ * when the same transaction span across log (for this we keep track of the
+ * previous logno in undo log meta) which is required to find the latest undo
+ * record pointer of the aborted transaction for executing the undo actions
+ * before discard. If the next log get processed first in that case we
+ * don't need to trace back the actual start pointer of the transaction,
+ * in such case we can only execute the undo actions from the current log
+ * because the undo pointer in the slot will be rewound and that
will be enough
+ * to avoid executing same actions. However, there is possibility that after
+ * executing the undo actions the undo pointer got discarded, now in later
+ * stage while processing the previous log it might try to fetch the undo
+ * record in the discarded log while chasing the transaction header chain.
+ * To avoid this situation we first check if the next_urec of the transaction
+ * is already discarded then no need to access that and start executing from
+ * the last undo record in the current log.
I think I see the problem in the discard mechanism when a transaction's
undo is spread across multiple logs. Basically, if the second log
contains undo of some transaction prior to the transaction which has
just decided to spread its undo into the chosen undo log, then we might
discard the undo of some transaction(s) inadvertently. Am I
missing something?
Actually, I don't see exactly this problem here, because we only process one undo log at a time, so we will not go to the next undo log and discard some transaction for which we are supposed to retain the undo.
If not, then I guess we need to ensure that we
don't immediately discard the undo in the second log when a single
transaction's undo is spread across two logs.
Before choosing a new undo log to span the undo for a transaction, do
we ensure that it is not already linked with some other undo log for a
similar reason?
One more thing in this regard:
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+ TransactionId txid, UndoPersistence upersistence)
{
..
..
+ if (InRecovery)
+ urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+ else
+ urecptr = UndoLogAllocate(size, upersistence);
+
+ log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+ /*
+ * By now, we must be attached to some undo log unless we are in recovery.
+ */
+ Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+ /*
+ * We can consider the log as switched if this is the first record of the
+ * log and not the first record of the transaction i.e. same transaction
+ * continued from the previous log.
+ */
+ if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+ log->meta.prevlogno != InvalidUndoLogNumber)
+ log_switched = true;
..
..
}
Isn't there a hidden assumption in the above code that you will always
get a fresh undo log if the undo doesn't fit in the currently attached
log? What guarantees that?
Yeah, it's a problem: we might get an undo log which is not empty. One way to avoid this could be that, instead of relying on the check "UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize", we add a flag to the undo log metadata telling whether this is the first record after attach, and decide based on that. But I want to think about a better solution where we can identify this without adding anything extra to the undo meta.
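The flag idea could work roughly like this standalone sketch (hypothetical names and fields, not the actual undo log metadata):

```c
#include <assert.h>
#include <stdbool.h>

#define InvalidUndoLogNumber (-1)

/*
 * Model of the proposed fix: record explicitly whether the next record
 * is the first one written after attaching to this log, instead of
 * inferring the log switch from the record's offset in the log.
 */
typedef struct UndoLogMeta
{
	int			prevlogno;				/* previous log of same transaction */
	bool		first_rec_after_attach; /* proposed new metadata flag */
} UndoLogMeta;

/*
 * Decide whether this insert continues a transaction from a previous
 * log.  Only the first insert after attach can observe the switch; the
 * flag is consumed here.
 */
static bool
log_switched_for_insert(UndoLogMeta *meta)
{
	bool		switched = meta->first_rec_after_attach &&
						   meta->prevlogno != InvalidUndoLogNumber;

	meta->first_rec_after_attach = false;
	return switched;
}
```

This stays correct even when the newly chosen log is not empty, since the decision no longer depends on the insert offset being exactly UndoLogBlockHeaderSize.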
3.
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert. Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
+ UndoPersistence);
+
+/*
+ * Insert a previously-prepared undo record. This will write the actual undo
+ * record into the buffers already pinned and locked in PreparedUndoInsert,
+ * and mark them dirty. For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);
This text is duplicate of what is mentioned in .c file, so I have
removed it in delta patch. Similarly, I have removed duplicate text
atop other functions exposed via undorecord.h
4.
+/*
+ * Forget about any previously-prepared undo record. Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);
This API is nowhere defined or used. What is the intention?
Not required
5.
+typedef struct UndoRecordHeader
+{
..
+ /*
+ * Transaction id that has modified the tuple present in this undo record.
+ * If this is older then RecentGlobalXmin, then we can consider the tuple
+ * in this undo record as visible.
+ */
+ TransactionId urec_prevxid;
..
/then/than
Done
I think we need to mention oldestXidWithEpochHavingUndo instead of
RecentGlobalXmin.
Merged
6.
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here. That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS 0x01
I have expanded these comments in the attached delta patch.
Merged
I think we
should remove the define UREC_INFO_PAYLOAD_CONTAINS_SLOT from the
patch as this is zheap specific and should be added later along with
the zheap code.
Yeah we can, so removed in my new patch.
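As the quoted comment says, the optional structures follow the fixed header in the same order in which the UREC_INFO_* flag constants are defined, so both packing and unpacking can simply walk the flag bits in order. A toy sketch of how record sizing can exploit that invariant (the sizes used here are made up for illustration, not the real struct sizes):

```c
#include <stdint.h>
#include <stddef.h>

/* Flag bits as in the quoted header; the byte sizes below are illustrative. */
#define UREC_INFO_RELATION_DETAILS 0x01
#define UREC_INFO_BLOCK            0x02
#define UREC_INFO_PAYLOAD          0x04

/*
 * Because optional structures are laid out in the order the flag constants
 * are defined, computing the record size (and later unpacking the record)
 * is just a walk over the bits in ascending order.
 */
static size_t
optional_headers_size(uint8_t uur_info)
{
    size_t size = 0;

    if (uur_info & UREC_INFO_RELATION_DETAILS)
        size += 4;              /* e.g. fork number */
    if (uur_info & UREC_INFO_BLOCK)
        size += 12;             /* e.g. block number and offset */
    if (uur_info & UREC_INFO_PAYLOAD)
        size += 8;              /* e.g. payload and tuple lengths */
    return size;
}
```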
7.
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number. If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+ ForkNumber urec_fork; /* fork number */
+} UndoRecordRelationDetails;
This comment seems to be outdated, so I modified it in the attached delta patch.
Merged
8.
+typedef struct UndoRecordTransaction
+{
+ uint32 urec_progress; /* undo applying progress. */
+ uint32 urec_xidepoch; /* epoch of the current transaction */
Can you expand comments about how the progress is defined and used?
Moved your comment from UnpackedUndoRecord to this structure, and in
UnpackedUndoRecord I have mentioned that the detailed comment can be found
in this structure.
Also, write a few sentences about why epoch is captured and or used?
urec_xidepoch is captured mainly for zheap visibility purposes, so isn't it better to mention it there?
9.
+#define urec_next_pos \
+ (SizeOfUndoRecordTransaction - SizeOfUrecNext)
What is its purpose?
It's not required, so removed.
10.
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record. Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
Again, I see duplicate text in the .h and .c files, so I removed this and
similar comments from the .h files. I have also tried to move some parts of
the comments from the .h files to the .c files, so that they can be read in
one place rather than referred to in two. See if I have missed anything.
Apart from the above, I have made a few other cosmetic changes in the
attached delta patch; see if you like those, and kindly include them in
the main patch.
Done
Attachment
On Wed, Dec 12, 2018 at 11:18 AM Dilip Kumar <dilip.kumar@enterprisedb.com> wrote:
>
> On Sat, Dec 8, 2018 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> I think I see the problem in the discard mechanism when the log is
>> spread across multiple logs. Basically, if the second log contains
>> undo of some transaction prior to the transaction which has just
>> decided to spread its undo in the chosen undo log, then we might
>> discard the undo log of some transaction(s) inadvertently. Am I
>> missing something?
>
> Actually, I don't see exactly this problem here, because we only process
> one undo log at a time, so we will not go to the next undo log and
> discard some transaction for which we are supposed to retain the undo.
>

How will rollbacks work for such a case? I have more to say about
this, see below.

>>
>> If not, then I guess we need to ensure that we
>> don't immediately discard the undo in the second log when a single
>> transaction's undo is spread across two logs.
>>
>> Before choosing a new undo log to span the undo for a transaction, do
>> we ensure that it is not already linked with some other undo log for a
>> similar reason?
>>

You seem to have forgotten to answer this.

>> One more thing in this regard:
>> +UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
>> +                   TransactionId txid, UndoPersistence upersistence)
..
>>
>> Isn't there a hidden assumption in the above code that you will always
>> get a fresh undo log if the undo doesn't fit in the currently attached
>> log? What is the guarantee of same?
>
> Yeah, it's a problem, we might get an undo log which may not be empty.
> One way to avoid this could be that instead of relying on the check
> "UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize", we can add
> some flag to the undo log metadata saying whether it's the first record
> after attach or not, and based on that we can decide. But I want to
> think of some better solution where we can identify it without adding
> anything extra to the undo meta.
>

I think what we need to determine here is whether we have switched the
log for some non-first record of the transaction. If so, can't we
detect it by something like:

log = GetLatestUndoLogAmAttachedTo();
UndoLogAllocate();
if (!need_xact_hdr)
{
    currnet_log = GetLatestUndoLogAmAttachedTo();
    if (currnet_log is not same as log)
    {
        Assert(currnet_log->meta.prevlogno == log->logno);
        log_switched = true;
    }
}

Won't a similar problem happen when we read undo records during
rollback? In the code below:

+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+   UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+   UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+   /*
+    * We have reached the first undo record of this undo log, so fetch the
+    * previous undo record of the transaction from the previous log.
+    */
+   if (offset == UndoLogBlockHeaderSize)
+   {
+       UndoLogControl *prevlog,

We seem to be assuming here that the new log starts from the
beginning. IIUC, we can read the record of some other transaction if
the transaction's log spans two logs and the second log contains some
other transaction's log at the beginning.

>>
>> 8.
>> +typedef struct UndoRecordTransaction
>> +{
>> + uint32 urec_progress; /* undo applying progress. */
>> + uint32 urec_xidepoch; /* epoch of the current transaction */
>>
>> Can you expand comments about how the progress is defined and used?
>
> Moved your comment from UnpackedUndoRecord to this structure, and in
> UnpackedUndoRecord I have mentioned that the detailed comment can be
> found in this structure.
>
>> Also, write a few sentences about why epoch is captured and or used?
>
> urec_xidepoch is captured mainly for zheap visibility purposes, so
> isn't it better to mention it there?
>

Okay, you can leave it here as it is. One small point about this structure:

+ uint64 urec_next; /* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+ (offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)

Isn't it better to define urec_next as UndoRecPtr, even though it is
technically the same as per the current code?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 12, 2018 at 3:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Dec 12, 2018 at 11:18 AM Dilip Kumar
> <dilip.kumar@enterprisedb.com> wrote:
> >
> > On Sat, Dec 8, 2018 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> I think I see the problem in the discard mechanism when the log is
> >> spread across multiple logs. Basically, if the second log contains
> >> undo of some transaction prior to the transaction which has just
> >> decided to spread its undo in the chosen undo log, then we might
> >> discard the undo log of some transaction(s) inadvertently. Am I
> >> missing something?
> >
> > Actually, I don't see exactly this problem here, because we only process
> > one undo log at a time, so we will not go to the next undo log and
> > discard some transaction for which we are supposed to retain the undo.
> >
>
> How will rollbacks work for such a case? I have more to say about
> this, see below.

Yeah, I agree rollback will have the problem.

> >>
> >> If not, then I guess we need to ensure that we
> >> don't immediately discard the undo in the second log when a single
> >> transaction's undo is spread across two logs.
> >>
> >> Before choosing a new undo log to span the undo for a transaction, do
> >> we ensure that it is not already linked with some other undo log for a
> >> similar reason?
> >>
>
> You seem to have forgotten to answer this.

In the current patch I have removed the concept of prevlogno in the undo
log meta.

> >> One more thing in this regard:
> >> +UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
> >> +                   TransactionId txid, UndoPersistence upersistence)
> ..
> >>
> >> Isn't there a hidden assumption in the above code that you will always
> >> get a fresh undo log if the undo doesn't fit in the currently attached
> >> log? What is the guarantee of same?
> >
> > Yeah, it's a problem, we might get an undo log which may not be empty.
> > One way to avoid this could be that instead of relying on the check
> > "UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize", we can add
> > some flag to the undo log metadata saying whether it's the first record
> > after attach or not, and based on that we can decide. But I want to
> > think of some better solution where we can identify it without adding
> > anything extra to the undo meta.
> >
>
> I think what we need to determine here is whether we have switched the
> log for some non-first record of the transaction. If so, can't we
> detect it by something like:
>
> log = GetLatestUndoLogAmAttachedTo();
> UndoLogAllocate();
> if (!need_xact_hdr)
> {
>     currnet_log = GetLatestUndoLogAmAttachedTo();
>     if (currnet_log is not same as log)
>     {
>         Assert(currnet_log->meta.prevlogno == log->logno);
>         log_switched = true;
>     }
> }
>
> Won't a similar problem happen when we read undo records during
> rollback? In the code below:
>
> +UndoRecPtr
> +UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
> +{
> +   UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
> +   UndoLogOffset offset = UndoRecPtrGetOffset(urp);
> +
> +   /*
> +    * We have reached the first undo record of this undo log, so fetch the
> +    * previous undo record of the transaction from the previous log.
> +    */
> +   if (offset == UndoLogBlockHeaderSize)
> +   {
> +       UndoLogControl *prevlog,
>
> We seem to be assuming here that the new log starts from the
> beginning. IIUC, we can read the record of some other transaction if
> the transaction's log spans two logs and the second log contains some
> other transaction's log at the beginning.

For addressing these issues related to multiple logs I have changed the
design as we discussed offlist:

1) Now, at Do time we identify the log switch as you mentioned above
(identify which log we are attached to before and after allocate).
And, if the log is switched, we write a WAL record for it; during
recovery, whenever this WAL is replayed, we store the undo record
pointer of the transaction header (which is in the previous undo log)
in UndoLogStateData, read it while allocating space for the undo
record, and immediately reset it.

2) For handling the discard issue, along with updating the current
transaction's start header in the previous undo log, we also update the
previous transaction's start header in the current log if we get
assigned a non-empty undo log.

3) For identifying the previous undo record of the transaction during
rollback (when the undo log is switched), we store the undo record
pointer of the transaction's last record (in the previous undo log) in
the transaction header of the first undo record in the new undo log.

> >>
> >> 8.
> >> +typedef struct UndoRecordTransaction
> >> +{
> >> + uint32 urec_progress; /* undo applying progress. */
> >> + uint32 urec_xidepoch; /* epoch of the current transaction */
> >>
> >> Can you expand comments about how the progress is defined and used?
> >
> > Moved your comment from UnpackedUndoRecord to this structure, and in
> > UnpackedUndoRecord I have mentioned that the detailed comment can be
> > found in this structure.
> >
> >> Also, write a few sentences about why epoch is captured and or used?
> >
> > urec_xidepoch is captured mainly for zheap visibility purposes, so
> > isn't it better to mention it there?
> >
>
> Okay, you can leave it here as it is. One small point about this structure:
>
> + uint64 urec_next; /* urec pointer of the next transaction */
> +} UndoRecordTransaction;
> +
> +#define SizeOfUrecNext (sizeof(UndoRecPtr))
> +#define SizeOfUndoRecordTransaction \
> + (offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
>
> Isn't it better to define urec_next as UndoRecPtr, even though it is
> technically the same as per the current code?

While replying I noticed that I haven't addressed this comment; I will
handle it in the next patch. I have to change this in a couple of
places. Handling multiple logs needed some changes in the
undo-log-manager patch, so I'm attaching the updated version of the
undo-log patches as well.

Commit on which patches were created: bf491a9073e12ce1fc3e6facd0ae1308534df570

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sun, Dec 23, 2018 at 3:49 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Dec 12, 2018 at 3:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> For addressing these issues related to multiple logs I have changed the
> design as we discussed offlist.
> 1) Now, at Do time we identify the log switch as you mentioned above
> (identify which log we are attached to before and after allocate).
> And, if the log is switched we write a WAL record for the same, and
> during recovery whenever this WAL is replayed we store the undo record
> pointer of the transaction header (which is in the previous undo log)
> in UndoLogStateData and read it while allocating space for the undo
> record and immediately reset it.
>
> 2) For handling the discard issue, along with updating the current
> transaction's start header in the previous undo log we also update the
> previous transaction's start header in the current log if we get
> assigned with the non-empty undo log.
>
> 3) For identifying the previous undo record of the transaction during
> rollback (when undo log is switched), we store the transaction's last
> record's (in previous undo log) undo record pointer in the transaction
> header of the first undo record in the new undo log.
>

Thanks, the new changes look mostly okay to me, but I have a few comments:

1.
+ /*
+ * WAL log, for log switch. This is required to identify the log switch
+ * during recovery.
+ */
+ if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
+ {
+ XLogBeginInsert();
+ XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
+ XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
+ }
+

Don't we want to do this under a critical section?

2.
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+ TransactionId txid, UndoPersistence upersistence)
{
..
+ if (log_switched)
+ {
+ /*
+ * If undo log is switched then during rollback we can not go
+ * to the previous undo record of the transaction by prevlen
+ * so we store the previous undo record pointer in the
+ * transaction header.
+ */
+ log = UndoLogGet(prevlogno, false);
+ urec->uur_prevurp = MakeUndoRecPtr(prevlogno,
+ log->meta.insert - log->meta.prevlen);
+ }
..
}

Can we have an Assert for a valid prevlogno in the above condition?

> > + uint64 urec_next; /* urec pointer of the next transaction */
> > +} UndoRecordTransaction;
> > +
> > +#define SizeOfUrecNext (sizeof(UndoRecPtr))
> > +#define SizeOfUndoRecordTransaction \
> > + (offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
> >
> > Isn't it better to define urec_next as UndoRecPtr, even though it is
> > technically the same as per the current code?
>
> While replying I noticed that I haven't address this comment, I will
> handle this in next patch. I have to change this at couple of place.
>

Okay, I think the new variable (uur_prevurp) introduced by this
version of the patch also needs to be changed in a similar way.

Apart from the above, I have made quite a few cosmetic changes and
modified a few comments; most notably, I have updated the comments
related to the handling of multiple logs at the beginning of the
undoinsert.c file. Kindly include these changes in your next
patchset, if they look okay to you.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Jan 1, 2019 at 4:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Thanks, the new changes look mostly okay to me, but I have few comments:
>
> 1.
> + /*
> + * WAL log, for log switch. This is required to identify the log switch
> + * during recovery.
> + */
> + if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
> + {
> + XLogBeginInsert();
> + XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
> + XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
> + }
> +
>
> Don't we want to do this under critical section?

I think we are not making any buffer changes here and are just inserting
a WAL record, so IMHO we don't need any critical section. Am I missing
something?

> 2.
> +UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
> + TransactionId txid, UndoPersistence upersistence)
> {
> ..
> + if (log_switched)
> + {
> + /*
> + * If undo log is switched then during rollback we can not go
> + * to the previous undo record of the transaction by prevlen
> + * so we store the previous undo record pointer in the
> + * transaction header.
> + */
> + log = UndoLogGet(prevlogno, false);
> + urec->uur_prevurp = MakeUndoRecPtr(prevlogno,
> + log->meta.insert - log->meta.prevlen);
> + }
> ..
> }
>
> Can we have an Assert for a valid prevlogno in the above condition?

Done

> > > + uint64 urec_next; /* urec pointer of the next transaction */
> > > +} UndoRecordTransaction;
> > > +
> > > +#define SizeOfUrecNext (sizeof(UndoRecPtr))
> > > +#define SizeOfUndoRecordTransaction \
> > > + (offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
> > >
> > > Isn't it better to define urec_next as UndoRecPtr, even though it is
> > > technically the same as per the current code?
> >
> > While replying I noticed that I haven't address this comment, I will
> > handle this in next patch. I have to change this at couple of place.
>
> Okay, I think the new variable (uur_prevurp) introduced by this
> version of the patch also needs to be changed in a similar way.

Done

> Apart from the above, I have made quite a few cosmetic changes and
> modified few comments, most notably, I have updated the comments
> related to the handling of multiple logs at the beginning of
> undoinsert.c file. Kindly, include these changes in your next
> patchset, if they look okay to you.

I have taken all changes except this one:

if (xact_urec_info_idx > 0)
{
-   int i = 0;
+   int         i = 0;         --> pgindent changed it back to the above one.

    for (i = 0; i < xact_urec_info_idx; i++)
-       UndoRecordUpdateTransInfo(i);
+        UndoRecordUpdateTransInfo(i);    -- This introduces an extra space, so I ignored it.
}

The undo-log patches needed a rebase, so I have done that as well along
with the changes mentioned above.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sat, Jan 5, 2019 at 11:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jan 1, 2019 at 4:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Thanks, the new changes look mostly okay to me, but I have few comments:
> > 1.
> > + /*
> > + * WAL log, for log switch. This is required to identify the log switch
> > + * during recovery.
> > + */
> > + if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
> > + {
> > + XLogBeginInsert();
> > + XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
> > + XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
> > + }
> > +
> >
> > Don't we want to do this under critical section?
>
> I think we are not making any buffer changes here and just inserting a
> WAL, IMHO we don't need any critical section. Am I missing
> something?
>

No, you are correct.

Few more comments:
--------------------------------
1.
+ * undorecord.c
+ * encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group

Change the year in Copyright notice for all new files?

2.
+ * This function sets uur->uur_info as a side effect.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+ int starting_byte, int *already_written, bool header_only)

Is the above part of comment still correct? I don't see uur_info being set here.

3.
+ work_txn.urec_next = uur->uur_next;
+ work_txn.urec_xidepoch = uur->uur_xidepoch;
+ work_txn.urec_progress = uur->uur_progress;
+ work_txn.urec_prevurp = uur->uur_prevurp;
+ work_txn.urec_dbid = uur->uur_dbid;

It would be better if we initialize these members in the order in
which they appear in the actual structure. All other undo header
structures are initialized that way, so this looks out-of-place.

4.
+ * 'my_bytes_written' is a pointer to the count of previous-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out. We must update it for the bytes we write.
+ *
..
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+ char **writeptr, char *endptr,
+ int *my_bytes_written, int *total_bytes_written)
+{
..
+
+ /* Update bookkeeeping infrormation. */
+ *writeptr += can_write;
+ *total_bytes_written += can_write;
+ *my_bytes_written = 0;

I don't understand the above comment where it is written: "We must
update it for the bytes we write.". We always set 'my_bytes_written'
as 0 if we write. Can you clarify? I guess this part of the comment
is about total_bytes_written or here does it mean to say that caller
should update it. I think some wording change might be required based
on what we intend to say here.

Similar to above, there is a confusion in the description of
my_bytes_read atop ReadUndoBytes.

5.
+uint32
+GetEpochForXid(TransactionId xid)
{
..
+ /*
+ * Xid can be on either side when near wrap-around. Xid is certainly
+ * logically later than ckptXid.
..

From the usage of this function in the patch, can we say that Xid is
always later than ckptxid, if so, how? Also, I think you previously
told in this thread that usage of uur_xidepoch is mainly for zheap, so
we might want to postpone the inclusion of this undo records. On
thinking again, I think we should follow your advice as I think the
correct usage here would require the patch by Thomas to fix our epoch
stuff [1]? Am, I correct, if so, I think we should postpone it for
now.

6.
/*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
{
..
}

I think you can use some comments atop this function to explain the
usage of this function or how will callers use it.

I am done with the first level of code-review for this patch. I am
sure we might need few interface changes here and there while
integrating and testing this with other patches, but the basic idea
and code look reasonable to me. I have modified the proposed commit
message in the attached patch, see if that looks fine to you.

To be clear, this patch can't be independently committed/tested, we
need undo logs and undo worker machinery patches to be ready as well.
I will review those next.

[1] - https://www.postgresql.org/message-id/CAEepm%3D2YYAtkSnens%3DTR2S%3DoRcAF9%3D2P7GPMK0wMJtxKF1QRig%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Jan 8, 2019 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Few more comments:
--------------------------------
Few comments:
----------------
1.
+ * undorecord.c
+ * encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
Change the year in Copyright notice for all new files?
Done
2.
+ * This function sets uur->uur_info as a side effect.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+ int starting_byte, int *already_written, bool header_only)
Is the above part of comment still correct? I don't see uur_info being set here.
Changed
3.
+ work_txn.urec_next = uur->uur_next;
+ work_txn.urec_xidepoch = uur->uur_xidepoch;
+ work_txn.urec_progress = uur->uur_progress;
+ work_txn.urec_prevurp = uur->uur_prevurp;
+ work_txn.urec_dbid = uur->uur_dbid;
It would be better if we initialize these members in the order in
which they appear in the actual structure. All other undo header
structures are initialized that way, so this looks out-of-place.
Fixed
4.
+ * 'my_bytes_written' is a pointer to the count of previous-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out. We must update it for the bytes we write.
+ *
..
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+ char **writeptr, char *endptr,
+ int *my_bytes_written, int *total_bytes_written)
+{
..
+
+ /* Update bookkeeeping infrormation. */
+ *writeptr += can_write;
+ *total_bytes_written += can_write;
+ *my_bytes_written = 0;
I don't understand the above comment where it is written: "We must
update it for the bytes we write.". We always set 'my_bytes_written'
as 0 if we write. Can you clarify? I guess this part of the comment
is about total_bytes_written or here does it mean to say that caller
should update it. I think some wording change might be required based
on what we intend to say here.
Similar to above, there is a confusion in the description of
my_bytes_read atop ReadUndoBytes.
Fixed
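For what it's worth, the bookkeeping under discussion can be illustrated with a stand-alone sketch: a partial write leaves 'my_bytes_written' recording how far into the current structure we got, so the retry on the next page resumes from there, while a complete write resets it to 0 so the next structure in the record starts from scratch. This is a simplified model of the quoted InsertUndoBytes(), with assumed semantics, not the actual patch code.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Write up to (endptr - *writeptr) bytes of 'sourceptr' into the current
 * page, resuming from *my_bytes_written.  Returns true once the whole
 * structure has been written.
 */
static bool
insert_undo_bytes(const char *sourceptr, int sourcelen,
                  char **writeptr, char *endptr,
                  int *my_bytes_written, int *total_bytes_written)
{
    int remaining = sourcelen - *my_bytes_written;
    int can_write = (int) (endptr - *writeptr);

    if (can_write > remaining)
        can_write = remaining;

    memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
    *writeptr += can_write;
    *total_bytes_written += can_write;

    if (can_write < remaining)
    {
        /* Page full: remember our progress for the retry on the next page. */
        *my_bytes_written += can_write;
        return false;
    }
    *my_bytes_written = 0;      /* structure fully written; reset for next */
    return true;
}
```

With this reading, the comment's "we must update it" refers to the partial-write path; the reset to 0 on the complete-write path is what lets the same counter be reused for the following structure.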
5.
+uint32
+GetEpochForXid(TransactionId xid)
{
..
+ /*
+ * Xid can be on either side when near wrap-around. Xid is certainly
+ * logically later than ckptXid.
..
From the usage of this function in the patch, can we say that Xid is
always later than ckptxid, if so, how? Also, I think you previously
told in this thread that usage of uur_xidepoch is mainly for zheap, so
we might want to postpone the inclusion of this undo records. On
thinking again, I think we should follow your advice as I think the
correct usage here would require the patch by Thomas to fix our epoch
stuff [1]? Am, I correct, if so, I think we should postpone it for
now.
Removed
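Although the epoch code has now been removed from the patch, the wraparound adjustment being discussed can be sketched as follows: if an xid is known to be logically later than the checkpoint's xid but is numerically smaller, the 32-bit counter must have wrapped since the checkpoint, so the xid belongs to the next epoch. The function name and comparison rule here are illustrative assumptions, not the patch's GetEpochForXid().

```c
#include <stdint.h>

/*
 * Hypothetical sketch: given that 'xid' is logically later than the
 * checkpoint xid, a numerically smaller value means the 32-bit xid counter
 * wrapped around after the checkpoint, so bump the epoch by one.
 */
static uint32_t
epoch_for_xid(uint32_t xid, uint32_t ckpt_xid, uint32_t ckpt_epoch)
{
    uint32_t epoch = ckpt_epoch;

    if (xid < ckpt_xid)
        epoch++;                /* xid was allocated after a wraparound */
    return epoch;
}
```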
6.
/*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
{
..
}
I think you can use some comments atop this function to explain the
usage of this function or how will callers use it.
Done
I am done with the first level of code-review for this patch. I am
sure we might need few interface changes here and there while
integrating and testing this with other patches, but the basic idea
and code look reasonable to me. I have modified the proposed commit
message in the attached patch, see if that looks fine to you.
To be clear, this patch can't be independently committed/tested, we
need undo logs and undo worker machinery patches to be ready as well.
I will review those next.
Makes sense
[1] - https://www.postgresql.org/message-id/CAEepm%3D2YYAtkSnens%3DTR2S%3DoRcAF9%3D2P7GPMK0wMJtxKF1QRig%40mail.gmail.com
Attachment
On Wed, Jan 9, 2019 at 11:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Jan 8, 2019 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
3.
+ work_txn.urec_next = uur->uur_next;
+ work_txn.urec_xidepoch = uur->uur_xidepoch;
+ work_txn.urec_progress = uur->uur_progress;
+ work_txn.urec_prevurp = uur->uur_prevurp;
+ work_txn.urec_dbid = uur->uur_dbid;
It would be better if we initialize these members in the order in
which they appear in the actual structure. All other undo header
structures are initialized that way, so this looks out-of-place.
One more change in ReadUndoBytes along the same lines.
Attachment
Hi,

This thread is currently marked as returned with feedback, set so
2018-12-01. Given there've been several new versions submitted since, is
that accurate?

- Andres
On Sun, Feb 03, 2019 at 02:23:16AM -0800, Andres Freund wrote:
> This thread is curently marked as returned with feedback, set so
> 2018-12-01. Given there've been several new versions submitted since, is
> that accurate?

From the latest status of this thread, there have been new patches but
no reviews on them, so moved to next CF.
--
Michael
Attachment
On Mon, Feb 4, 2019 at 3:55 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Sun, Feb 03, 2019 at 02:23:16AM -0800, Andres Freund wrote:
> > This thread is curently marked as returned with feedback, set so
> > 2018-12-01. Given there've been several new versions submitted since, is
> > that accurate?
>
> From the latest status of this thread, there have been new patches but
> no reviews on them, so moved to next CF.

Thank you. New patches coming soon.

--
Thomas Munro
http://www.enterprisedb.com
On Sun, Feb 3, 2019 at 3:53 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> This thread is curently marked as returned with feedback, set so
> 2018-12-01. Given there've been several new versions submitted since, is
> that accurate?
>

As part of this thread we have been reviewing and fixing the comments
for the undo-interface patch. Michael has already moved it to the new
commitfest with status "needs review", so I guess as of now the status
is correct.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 2019-Feb-04, Thomas Munro wrote:
> On Mon, Feb 4, 2019 at 3:55 PM Michael Paquier <michael@paquier.xyz> wrote:
> > On Sun, Feb 03, 2019 at 02:23:16AM -0800, Andres Freund wrote:
> > > This thread is curently marked as returned with feedback, set so
> > > 2018-12-01. Given there've been several new versions submitted since, is
> > > that accurate?
> >
> > From the latest status of this thread, there have been new patches but
> > no reviews on them, so moved to next CF.
>
> Thank you. New patches coming soon.

This series is for pg13, right? We're not considering any of this for
pg12?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 7, 2019 at 1:16 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2019-Feb-04, Thomas Munro wrote:
> > On Mon, Feb 4, 2019 at 3:55 PM Michael Paquier <michael@paquier.xyz> wrote:
> > > From the latest status of this thread, there have been new patches but
> > > no reviews on them, so moved to next CF.
> >
> > Thank you. New patches coming soon.
>
> This series is for pg13, right? We're not considering any of this for
> pg12?

Correct. Originally the target was 12 but that was a bit too ambitious.

--
Thomas Munro
http://www.enterprisedb.com
On Thu, Feb 07, 2019 at 03:21:09AM +1100, Thomas Munro wrote:
> Correct. Originally the target was 12 but that was a bit too ambitious.

Could it be possible to move the patch set into the first PG-13 commit
fest then? We could use this CF as recipient for now, even if the
schedule for the next development cycle is not set in stone:
https://commitfest.postgresql.org/23/
--
Michael
Attachment
On February 7, 2019 7:21:49 AM GMT+05:30, Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Feb 07, 2019 at 03:21:09AM +1100, Thomas Munro wrote:
> > Correct. Originally the target was 12 but that was a bit too ambitious.
>
> Could it be possible to move the patch set into the first PG-13 commit
> fest then? We could use this CF as recipient for now, even if the
> schedule for next development cycle is not set in stone:
> https://commitfest.postgresql.org/23/

We now have the target version as a field; that should make such moves
unnecessary, right?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Feb 07, 2019 at 07:25:57AM +0530, Andres Freund wrote:
> We now have the target version as a field, that should make such
> moves unnecessary, right?

Oh, I missed this stuff. Thanks for pointing it out.
--
Michael
Attachment
On February 7, 2019 7:34:11 AM GMT+05:30, Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Feb 07, 2019 at 07:25:57AM +0530, Andres Freund wrote:
> > We now have the target version as a field, that should make such
> > moves unnecessary, right?
>
> Oh, I missed this stuff. Thanks for pointing it out.

It was JUST added ... :) I thought I saw you reply on the other thread
about it, but I was wrong...

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Feb 07, 2019 at 07:35:31AM +0530, Andres Freund wrote:
> It was JUST added ... :) thought I saw you reply on the other thread
> about it, but I was wrong...

Six months later without any activity, I am marking this entry as
returned with feedback. The latest patch set does not apply anymore,
so a rebase would be nice if it is submitted again.
--
Michael
On Sat, Nov 30, 2019 at 9:25 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Feb 07, 2019 at 07:35:31AM +0530, Andres Freund wrote:
> > It was JUST added ... :) thought I saw you reply on the other thread
> > about it, but I was wrong...
>
> Six months later without any activity, I am marking this entry as
> returned with feedback. The latest patch set does not apply anymore,
> so having a rebase would be nice if submitted again.

Sounds fair, thanks. Actually, we've rewritten large amounts of this,
but unfortunately not to the point where it's ready to post yet. If
anyone wants to see the development in progress, see

https://github.com/EnterpriseDB/zheap/commits/undo-record-set

This is not really an EnterpriseDB project any more because Andres and
Thomas decided to leave EnterpriseDB, but both expressed an intention
to continue working on the project. So hopefully we'll get there.

That being said, here's what the three of us are working towards:

- Undo locations are identified by a 64-bit UndoRecPtr, which is very
similar to a WAL LSN. However, each undo log (1TB of address space)
has its own insertion point, so that many backends can insert
simultaneously without contending on the insertion point. The code
for this is by Thomas and is mostly the same as before.

- To insert undo data, you create an UndoRecordSet, which has a record
set header followed by any number of records. In the common case, an
UndoRecordSet corresponds to the intersection of a transaction and a
persistence level - that is, XID 12345 could have up to 3
UndoRecordSets, one for permanent undo, one for unlogged undo, and
one for temporary undo. We might in the future have support for other
kinds of UndoRecordSets, e.g. for multixact-like things that are
associated with a group of transactions rather than just one. This
code is new, by Thomas with contributions from me.
- The records that get stored into an UndoRecordSet will be serialized
from an in-memory representation and then deserialized when the data
is read later. Andres is writing the code for this, but hasn't pushed
it to the branch yet. The idea here is to allow a lot of flexibility
about what gets stored, responding to criticisms of the earlier
design from Heikki, while still being efficient about what we
actually write on disk, since we know from testing that undo volume
is a significant performance concern.

- Each transaction that writes permanent or unlogged undo gets an
UndoRequest, which tracks the fact that there is work to do if the
transaction aborts. Undo can be applied either in the foreground
right after abort or in the background. The latter case is necessary
because crashes or FATAL errors can abort transactions, but the
former case is important as a way of keeping the undo work from
ballooning out of control in a workload where people just abort
transactions nonstop; we have to slow things down so that we can keep
up. This code is by me, based on a design sketch from Andres.

Getting all of this working has been harder and slower than I'd hoped,
but I think the new design fixes a lot of things that weren't right in
earlier iterations, so I feel like we are at least headed in the right
direction.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company