Thread: Undo logs
Hello hackers,

As announced elsewhere[1][2][3], at EnterpriseDB we are working on a proposal to add in-place updates with undo logs to PostgreSQL. The goal is to improve performance and resource usage by recycling space better.

The lowest level piece of this work is a physical undo log manager, which I've personally been working on. Later patches will build on top, adding record-oriented access and then the main "zheap" access manager and related infrastructure. My colleagues will write about those.

The README files[4][5] explain in more detail, but here is a bullet-point description of what the attached patch set gives you:

1. Efficient appending of new undo data from many concurrent backends. Like logs.
2. Efficient discarding of old undo data that isn't needed anymore. Like queues.
3. Efficient buffered random reading of undo data. Like relations.

A test module is provided that can be used to exercise the undo log code paths without needing any of the later zheap patches.

This is work in progress. A few aspects are under active development and liable to change, as indicated by comments, and there are no doubt bugs and room for improvement. The code is also available at github.com/EnterpriseDB/zheap (these patches are from the undo-log-storage branch; see also the master branch, which has the full zheap feature). We'd be grateful for any questions, feedback or ideas.

[1] https://amitkapila16.blogspot.com/2018/03/zheap-storage-engine-to-provide-better.html
[2] https://rhaas.blogspot.com/2018/01/do-or-undo-there-is-no-vacuum.html
[3] https://www.pgcon.org/2018/schedule/events/1190.en.html
[4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
[5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr

--
Thomas Munro
http://www.enterprisedb.com
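The three access patterns above (append like a log, discard like a queue, random-read like a relation) can be pictured with a toy single-backend model. This is purely illustrative: the names are made up, and the real patch deals in shared buffers, WAL and many concurrent backends, none of which appears here.

```c
/* Toy sketch of the three undo-log operations. Illustrative names only. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LOG_CAPACITY 4096

typedef struct ToyUndoLog
{
    uint64_t discard;           /* everything before this has been discarded */
    uint64_t insert;            /* the next append goes here */
    char     data[LOG_CAPACITY];
} ToyUndoLog;

/* 1. Append new undo data at the insert pointer. Like logs. */
static uint64_t
toy_append(ToyUndoLog *log, const void *rec, size_t size)
{
    uint64_t at = log->insert;

    assert(at + size <= LOG_CAPACITY);
    memcpy(log->data + at, rec, size);
    log->insert += size;
    return at;                  /* the caller keeps this as a record pointer */
}

/* 2. Discard old undo data up to a given pointer. Like queues. */
static void
toy_discard(ToyUndoLog *log, uint64_t upto)
{
    assert(upto <= log->insert);
    if (upto > log->discard)
        log->discard = upto;
}

/* 3. Random read of undo data by record pointer. Like relations. */
static void
toy_read(const ToyUndoLog *log, uint64_t at, void *out, size_t size)
{
    assert(at >= log->discard && at + size <= log->insert);
    memcpy(out, log->data + at, size);
}
```

A backend would append records during a transaction, readers would fetch them by pointer for rollback or MVC-style lookups, and a background process would advance the discard pointer once no one can need the data any more.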
Exciting stuff! Really looking forward to having a play with this.
James Sewell,
Chief Architect

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
On 25 May 2018 at 08:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
Hello hackers,
As announced elsewhere[1][2][3], at EnterpriseDB we are working on a
proposal to add in-place updates with undo logs to PostgreSQL. The
goal is to improve performance and resource usage by recycling space
better.
The lowest level piece of this work is a physical undo log manager,
which I've personally been working on. Later patches will build on
top, adding record-oriented access and then the main "zheap" access
manager and related infrastructure. My colleagues will write about
those.
The README files[4][5] explain in more detail, but here is a
bullet-point description of what the attached patch set gives you:
1. Efficient appending of new undo data from many concurrent
backends. Like logs.
2. Efficient discarding of old undo data that isn't needed anymore.
Like queues.
3. Efficient buffered random reading of undo data. Like relations.
A test module is provided that can be used to exercise the undo log
code paths without needing any of the later zheap patches.
This is work in progress. A few aspects are under active development
and liable to change, as indicated by comments, and there are no doubt
bugs and room for improvement. The code is also available at
github.com/EnterpriseDB/zheap (these patches are from the
undo-log-storage branch, see also the master branch which has the full
zheap feature). We'd be grateful for any questions, feedback or
ideas.
[1] https://amitkapila16.blogspot.com/2018/03/zheap-storage-engine-to-provide-better.html
[2] https://rhaas.blogspot.com/2018/01/do-or-undo-there-is-no-vacuum.html
[3] https://www.pgcon.org/2018/schedule/events/1190.en.html
[4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
[5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr
--
Thomas Munro
http://www.enterprisedb.com
On 24 May 2018 at 23:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> As announced elsewhere[1][2][3], at EnterpriseDB we are working on a
> proposal to add in-place updates with undo logs to PostgreSQL. The
> goal is to improve performance and resource usage by recycling space
> better.

Cool

> The lowest level piece of this work is a physical undo log manager,

> 1. Efficient appending of new undo data from many concurrent
> backends. Like logs.
> 2. Efficient discarding of old undo data that isn't needed anymore.
> Like queues.
> 3. Efficient buffered random reading of undo data. Like relations.

Like an SLRU?

> [4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
> [5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr

I think there are quite a few design decisions there that need to be discussed, so let's crack on and discuss them please.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Simon,

On Mon, May 28, 2018 at 11:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On 24 May 2018 at 23:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> The lowest level piece of this work is a physical undo log manager,
>
>> 1. Efficient appending of new undo data from many concurrent
>> backends. Like logs.
>> 2. Efficient discarding of old undo data that isn't needed anymore.
>> Like queues.
>> 3. Efficient buffered random reading of undo data. Like relations.
>
> Like an SLRU?

Yes, but with some differences:

1. There is a variable number of undo logs. Each one corresponds to a range of the 64 bit address space, and has its own head and tail pointers, so that concurrent writers don't contend for buffers when appending data. (Unlike SLRUs, which are statically defined: one for clog.c, one for commit_ts.c, ...)
2. Undo logs use regular buffers instead of having their own mini buffer pool, ad hoc search and reclamation algorithm, etc.
3. Undo logs support temporary, unlogged and permanent storage (= local buffers and reset-on-crash-restart, for undo data relating to relations of those persistence levels).
4. Undo log storage files are preallocated (rather than being extended block by block), and the oldest file is renamed to become the newest file in common cases, like WAL.

>> [4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
>> [5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr
>
> I think there are quite a few design decisions there that need to be
> discussed, so lets crack on and discuss them please.

What do you think about using the main buffer pool?
Best case: pgbench-type workload, discard pointer following closely behind insert pointer; we never write anything out to disk (except for checkpoints, when we write a few pages), never advance the buffer pool clock hand, and we use and constantly recycle 1-2 pages per connection via the free list (as can be seen by monitoring insert - discard in the pg_stat_undo_logs view).

Worst case: someone opens a snapshot and goes out to lunch, so we can't discard old undo data, and then we start to compete with other stuff for buffers, and we hope the buffer reclamation algorithm is good at its job (or can be improved).

I just talked about this proposal at a pgcon unconference session. Here's some of the feedback I got:

1. Jeff Davis pointed out that I'm probably wrong about not needing FPI, and there must at least be checksum problems with torn pages. He also gave me an idea on how to fix that very cheaply, and I'm still processing that feedback.
2. Andres Freund thought it seemed OK if we have smgr.c routing to md.c for relations and undofile.c for undo, but if we're going to generalise this technique to put other things into shared buffers eventually too (like the SLRUs, as proposed by Shawn Debnath in another unconf session) then it might be worth investigating how to get md.c to handle all of their needs. They'd all just use fd.c files, after all, so it'd be weird if we had to maintain several different similar things.
3. Andres also suggested that high frequency free page list access might be quite contended in the "best case" described above. I'll look into that.
4. Someone said that segment sizes probably shouldn't be hard coded (cf WAL experience).

I also learned in other sessions that there are other access managers in development that need undo logs. I'm hoping to find out more about that.

--
Thomas Munro
http://www.enterprisedb.com
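Point 1 of the SLRU comparison above — each undo log owning a range of a single 64 bit address space — amounts to packing a log number into the high bits of a record pointer and a byte offset into the low bits. The exact 24/40 bit split below is an assumption for illustration (a 40 bit offset matches the 1TB-per-log figure mentioned later in the thread), not necessarily what the patch uses:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical split of the 64 bit undo address space: the high 24 bits
 * select an undo log, the low 40 bits address up to 1TB within that log.
 */
#define UNDO_OFFSET_BITS 40
#define UNDO_OFFSET_MASK ((UINT64_C(1) << UNDO_OFFSET_BITS) - 1)

typedef uint64_t UndoRecPtr;

/* Build a record pointer from a log number and an offset within it. */
static inline UndoRecPtr
MakeUndoRecPtr(uint32_t logno, uint64_t offset)
{
    return ((uint64_t) logno << UNDO_OFFSET_BITS) | (offset & UNDO_OFFSET_MASK);
}

/* Which undo log does this pointer live in? */
static inline uint32_t
UndoRecPtrGetLogNo(UndoRecPtr urp)
{
    return (uint32_t) (urp >> UNDO_OFFSET_BITS);
}

/* Byte position within that log. */
static inline uint64_t
UndoRecPtrGetOffset(UndoRecPtr urp)
{
    return urp & UNDO_OFFSET_MASK;
}
```

Because each backend appends into its own log (its own slice of the address space), concurrent writers never touch the same insertion point, which is what avoids the buffer contention an SLRU-style single sequence would have.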
Hello hackers,

As Thomas has already mentioned upthread, we are working on an undo-log-based storage, and he has posted the patch sets for the lowest layer, called undo-log-storage.

This is the next layer, which sits on top of the undo log storage and provides an interface to prepare, insert, or fetch undo records. This layer uses undo-log-storage to reserve space for the undo records, and the buffer management routines to write and read them.

To prepare an undo record, first, it will allocate the required space using the undo_log_storage module. Next, it will pin and lock the required buffers and return an undo record pointer where it will insert the record. Finally, it calls the insert routine for final insertion of the prepared record. Additionally, there is a mechanism for multi-insert, wherein multiple records are prepared and inserted at a time.

To fetch an undo record, a caller must provide a valid undo record pointer. Optionally, the caller can provide a callback function with the information of the block and offset, which will help in faster retrieval of the undo record; otherwise, it has to traverse the undo-chain.

These patch sets will apply on top of the undo-log-storage branch [1], commit id fa3803a048955c4961581e8757fe7263a98fe6e6.

[1] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/

undo_interface_v1.patch is the main patch for providing the undo interface.
undo_interface_test_v1.patch is a simple test module to test the undo interface layer.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
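The prepare/insert protocol described in this layer — reserve space and get back a record pointer first, copy the record in later — can be sketched as a toy two-phase API. All names and the in-memory "pool" are hypothetical stand-ins; the real layer pins and locks shared buffers between the two phases and supports multi-insert, which this sketch omits:

```c
/* Toy model of a prepare-then-insert undo record protocol. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define POOL_SIZE 1024

static char   pool[POOL_SIZE];   /* stand-in for undo log storage */
static size_t insert_ptr;        /* next free byte */
static size_t reserved_at;       /* space reserved by the last prepare */
static size_t reserved_len;
static int    prepared;

/* Phase 1: reserve space (the real code also pins/locks buffers here)
 * and return the undo record pointer where the record will land. */
static uint64_t
toy_prepare(size_t size)
{
    assert(!prepared && insert_ptr + size <= POOL_SIZE);
    reserved_at = insert_ptr;
    reserved_len = size;
    insert_ptr += size;
    prepared = 1;
    return reserved_at;
}

/* Phase 2: write the record into the space reserved by toy_prepare(). */
static void
toy_insert(const void *rec)
{
    assert(prepared);
    memcpy(pool + reserved_at, rec, reserved_len);
    prepared = 0;
}

/* Fetch a record back, given a valid undo record pointer. */
static void
toy_fetch(uint64_t urp, void *out, size_t size)
{
    assert(urp + size <= insert_ptr);
    memcpy(pool + urp, out, 0);  /* no-op; real code would locate buffers */
    memcpy(out, pool + urp, size);
}
```

Splitting allocation from insertion lets the caller gather everything it needs (and take all buffer locks) before entering the critical section in which the record and the corresponding WAL must be written together.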
On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Hello hackers,
>
> As Thomas has already mentioned upthread that we are working on an
> undo-log based storage and he has posted the patch sets for the lowest
> layer called undo-log-storage.
>
> undo_interface_v1.patch is the main patch for providing the undo interface.
> undo_interface_test_v1.patch is a simple test module to test the undo
> interface layer.

Thanks to Robert Haas for designing an early prototype for forming undo records. Later, I've completed the remaining parts of the code, including undo record prepare, insert, fetch and other related APIs, with help from Rafia Sabih. Thanks to Amit Kapila for providing valuable design inputs.
-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 31, 2018 at 10:24 PM Dilip Kumar <dilip.kumar@enterprisedb.com> wrote:
> On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > As Thomas has already mentioned upthread that we are working on an
> > undo-log based storage and he has posted the patch sets for the lowest
> > layer called undo-log-storage.
> >
> > This is the next layer which sits on top of the undo log storage,
> > which will provide an interface for prepare, insert, or fetch the undo
> > records. This layer will use undo-log-storage to reserve the space for
> > the undo records and buffer management routine to write and read the
> > undo records.

I have also pushed a new WIP version of the lower level undo log storage layer patch set to a public branch[1]. I'll leave the earlier branch[2] there because the record-level patch posted by Dilip depends on it for now.

The changes are mostly internal: it doesn't use DSM segments any more. Originally I wanted to use DSM because I didn't want arbitrary limits, but in fact DSM slots can run out in unpredictable ways, and unlike parallel query the undo log subsystem doesn't have a plan B for when it can't get the space it needs due to concurrent queries. Instead, this version uses a pool of size 4 * max_connections, fixed at startup in regular shared memory. This creates an arbitrary limit on transaction size, but it's large, at 1TB per slot; it can be increased, doesn't disappear unpredictably, is easy to monitor (pg_stat_undo_logs), and is probably a useful brake on a system in trouble.

More soon.

[1] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage-v2
[2] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage
--
Thomas Munro
http://www.enterprisedb.com
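The capacity arithmetic behind that trade-off is easy to check. Taking the figures from the mail above (a pool of 4 * max_connections slots, 1TB per slot) and PostgreSQL's default max_connections of 100, a quick sketch:

```c
#include <stdint.h>

/* Slot count from the mail above: a fixed pool of 4 * max_connections. */
static int
toy_undo_slots(int max_connections)
{
    return 4 * max_connections;
}

/* Total addressable undo, assuming the stated 1TB (2^40 bytes) per slot. */
static uint64_t
toy_total_undo_capacity(int max_connections)
{
    return (uint64_t) toy_undo_slots(max_connections) * (UINT64_C(1) << 40);
}
```

With the default of 100 connections that is 400 slots, and a single transaction attached to one slot is capped at 1TB of undo data, which is the "arbitrary but large" limit the mail describes.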
On Sun, Sep 2, 2018 at 12:18 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Fri, Aug 31, 2018 at 10:24 PM Dilip Kumar
> <dilip.kumar@enterprisedb.com> wrote:
>> On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>> > As Thomas has already mentioned upthread that we are working on an
>> > undo-log based storage and he has posted the patch sets for the lowest
>> > layer called undo-log-storage.
>
> I have also pushed a new WIP version of the lower level undo log
> storage layer patch set to a public branch[1]. I'll leave the earlier
> branch[2] there because the record-level patch posted by Dilip depends
> on it for now.

Rebased the undo_interface patches on top of the new undo-log-storage branch[1].

[1] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage-v2

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I have also pushed a new WIP version of the lower level undo log
> storage layer patch set to a public branch[1]. I'll leave the earlier
> branch[2] there because the record-level patch posted by Dilip depends
> on it for now.

I have started reading the patch and have a few assorted comments, which are mentioned below. I have been involved in the high-level design of this module and I have also given some suggestions during development, but this is mainly Thomas's work with some help from Dilip. It would be good if other members of the community also review the design or participate in the discussion.

Comments
------------------

undo/README
-----------------------

1.
+The undo log subsystem provides a way to store data that is needed for
+a limited time. Undo data is generated whenever zheap relations are
+modified, but it is only useful until (1) the generating transaction
+is committed or rolled back and (2) there is no snapshot that might
+need it for MVCC purposes.

I think the snapshots need it for MVCC purposes, and we need it till the transaction is committed and all-visible.

2.
+* the tablespace that holds its segment files
+* the persistence level (permanent, unlogged, temporary)
+* the "discard" pointer; data before this point has been discarded
+* the "insert" pointer: new data will be written here
+* the "end" pointer: a new undo segment file will be needed at this point
+
+The three pointers discard, insert and end move strictly forwards
+until the whole undo log has been exhausted. At all times discard <=
+insert <= end. When discard == insert, the undo log is empty
+(everything that has ever been inserted has since been discarded).
+The insert pointer advances when regular backends allocate new space,
+and the discard pointer usually advances when an undo worker process
+determines that no session could need the data either for rollback or
+for finding old versions of tuples to satisfy a snapshot. In some
+special cases including single-user mode and temporary undo logs the
+discard pointer might also be advanced synchronously by a foreground
+session.

Here, the use of the insert and discard pointers is explained nicely. Can you elaborate on the usage of the end pointer as well?

3.
+UndoLogControl objects corresponding to the current set of active undo
+logs are held in a fixed-sized pool in shared memory. The size of
+the array is a multiple of max_connections, and limits the total size of
+transactions.

Here, it is mentioned that the array is a multiple of max_connections, but the code uses MaxBackends. Can you sync them?

4.
+The meta-data for all undo logs is written to disk at every
+checkpoint. It is stored in files under PGDATA/pg_undo/, using the
+checkpoint's redo point (a WAL LSN) as its filename. At startup time,
+the redo point's file can be used to restore all undo logs' meta-data
+as of the moment of the redo point into shared memory. Changes to the
+discard pointer and end pointer are WAL-logged by undolog.c and will
+bring the in-memory meta-data up to date in the event of recovery
+after a crash. Changes to insert pointers are included in other WAL
+records (see below).

I see one inconvenience in using the checkpoint's redo point for the meta file name: what if someone uses pg_resetxlog to truncate the redo? Is there any reason we can't keep a different name for the meta file?

5.
+stabilize on one undo log per active writing backend (or more if
+different tablespaces are persistence levels are used).

/tablespaces are persistence levels/tablespaces and persistence levels

I think due to the above design, we can now reach the maximum number of undo logs quickly, as the patch now uses fixed shared memory to represent them. I am not sure if there is an easy way to avoid that. Can we try to expose a GUC for the maximum number of undo slots, such that instead of MaxBackends * 4, it could be MaxBackends * <new_guc>?

undo-log-manager patch
------------------------------------

6.
@@ -127,6 +128,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
  size = add_size(size, ProcGlobalShmemSize());
  size = add_size(size, XLOGShmemSize());
  size = add_size(size, CLOGShmemSize());
+ size = add_size(size, UndoLogShmemSize());
  size = add_size(size, CommitTsShmemSize());
  size = add_size(size, SUBTRANSShmemSize());
  size = add_size(size, TwoPhaseShmemSize());
@@ -219,6 +221,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
  */
  XLOGShmemInit();
  CLOGShmemInit();
+ UndoLogShmemInit();

It seems to me that we will always allocate shared memory for undo logs, irrespective of whether someone wants to use them or not. Am I right? If so, isn't it better if we find some way that this memory is allocated only when someone has a need for it?

7.
+/*
+ * How many undo logs can be active at a time? This creates a theoretical
+ * maximum transaction size, but it we set it to a factor the maximum number
+ * of backends it will be a very high limit. Alternative designs involving
+ * demand paging or dynamic shared memory could remove this limit but
+ * introduce other problems.
+ */
+static inline size_t
+UndoLogNumSlots(void)
+{
+ return MaxBackends * 4;
+}

Seems like typos in the above comment:
/but it we/but if we
/factor the maximum number -- the sentence is not completely clear.

8.
+ * Extra shared memory will be managed using DSM segments.
+ */
+Size
+UndoLogShmemSize(void)

You told in the email that the patch doesn't use DSM segments anymore, but the comment seems to indicate otherwise.

9.
/*
+ * TODO: Should this function be usable in a critical section?
+ * Woudl it make sense to detect that we are in a critical

Typo /Woudl/Would

10.
+static void
+undolog_xlog_attach(XLogReaderState *record)
+{
+ xl_undolog_attach *xlrec = (xl_undolog_attach *) XLogRecGetData(record);
+ UndoLogControl *log;
+
+ undolog_xid_map_add(xlrec->xid, xlrec->logno);
+
+ /*
+ * Whatever follows is the first record for this transaction. Zheap will
+ * use this to add UREC_INFO_TRANSACTION.
+ */
+ log = get_undo_log(xlrec->logno, false);
+ /* TODO */

There are a lot of TODO's in the code; among them, the above is not at all clear.

11.
+ UndoRecPtr oldest_data;
+
+} UndoLogControl;

Extra space after the last member looks odd.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
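The pointer discipline quoted from the README above (discard <= insert <= end, all moving strictly forward, with discard == insert meaning the log is empty) can be expressed as a couple of small helpers. This is just an illustration of the stated rule, not code from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct ToyUndoLogMeta
{
    uint64_t discard;   /* data before this point has been discarded */
    uint64_t insert;    /* new data will be written here */
    uint64_t end;       /* a new segment file will be needed at this point */
} ToyUndoLogMeta;

/* The invariant the README states: at all times discard <= insert <= end. */
static bool
undo_meta_valid(const ToyUndoLogMeta *m)
{
    return m->discard <= m->insert && m->insert <= m->end;
}

/* When discard == insert, everything ever inserted has been discarded. */
static bool
undo_log_is_empty(const ToyUndoLogMeta *m)
{
    return m->discard == m->insert;
}
```

Anything that advances a pointer (backends allocating space move insert, undo workers move discard, segment creation moves end) must preserve this ordering, which is why each pointer only ever moves forward.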
On Mon, Oct 15, 2018 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > I have also pushed a new WIP version of the lower level undo log
> > storage layer patch set to a public branch[1]. I'll leave the earlier
> > branch[2] there because the record-level patch posted by Dilip depends
> > on it for now.
>
> I have started reading the patch and have a few assorted comments
> which are mentioned below.

Some more comments/questions on the design-level choices you have made in this patch, and some general comments.

1. To allocate an undo log (UndoLogAllocate()), it seems first we create the shared memory state for an undo log, write a WAL record for it, create an actual file and a segment in it, and write a separate WAL record for that. Now imagine the system crashed after creating the shared memory state and before actually allocating an undo log segment; then it is quite possible that after recovery we will block multiple slots for undo logs without having actual undo logs for them. Apart from that, writing separate WAL records for them doesn't appear to be the best way to deal with it, considering that we also need to write a third WAL record to attach an undo log. Now, IIUC, one advantage of arranging things this way is that we avoid dropping a tablespace when a particular undo log exists in it. I understand that this design kind of works, but I think we should try to think of some alternatives here. You might have already thought of making it work similarly to how the interaction for regular tables or temp_tablespaces works with dropping tablespaces, but decided to do something different here. Can you explain why you have made a different design choice here?

2.
extend_undo_log()
{
..
+ /*
+ * Flush the parent dir so that the directory metadata survives a crash
+ * after this point.
+ */
+ UndoLogDirectory(log->meta.tablespace, dir);
+ fsync_fname(dir, true);
+
+ /*
+ * If we're not in recovery, we need to WAL-log the creation of the new
+ * file(s). We do that after the above filesystem modifications, in
+ * violation of the data-before-WAL rule as exempted by
+ * src/backend/access/transam/README. This means that it's possible for
+ * us to crash having made some or all of the filesystem changes but
+ * before WAL logging, but in that case we'll eventually try to create the
+ * same segment(s) again, which is tolerated.
+ */
+ if (!InRecovery)
+ {
+ xl_undolog_extend xlrec;
+ XLogRecPtr ptr;
..
}

I don't understand this WAL logging action. If the crash happens before or during syncing the file, then we anyway don't have WAL to replay. If it happens after WAL writing, then anyway we are sure that the extended undo log segment must be there. Can you explain how this works?

3.
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+ UndoLogOffset end)
{
..
}

What will happen if the transaction creating the undo log segment rolls back? Do we want to have pendingDeletes stuff as we have for normal relation files? This might also help in clearing the shared memory state (undo log slots), if any.

4.
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
{
..
+ /*
+ * Take the tablespace create/drop lock while we look the name up.
+ * This prevents the tablespace from being dropped while we're trying
+ * to resolve the name, or while the called is trying to create an
+ * undo log in it. The caller will have to release this lock.
+ */
+ LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
..

This appears quite expensive; for selecting an undo log to attach to, we might need to wait for an unrelated tablespace create/drop. Have you considered any other ideas to prevent this? How do other callers of get_tablespace_oid prevent it from being dropped? If we don't find any better solution, then I think at the very least we should start a separate thread to get the opinion of others on this matter. I think this is somewhat related to point 1.

5.
+static inline Oid
+UndoRecPtrGetTablespace(UndoRecPtr urp)
+{
+ UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+ UndoLogTableEntry *entry;
+
+ /*
+ * Fast path, for undo logs we've seen before. This is safe because
+ * tablespaces are constant for the lifetime of an undo log number.
+ */
+ entry = undologtable_lookup(undologtable_cache, logno);
+ if (likely(entry))
+ return entry->tablespace;
+
+ /*
+ * Slow path: force cache entry to be created. Raises an error if the
+ * undo log has been entirely discarded, or hasn't been created yet. That
+ * is appropriate here, because this interface is designed for accessing
+ * undo pages via bufmgr, and we should never be trying to access undo
+ * pages that have been discarded.
+ */
+ UndoLogGet(logno, false);

It seems UndoLogGet() probes the hash table first, so what is the need for doing it in the caller? And if you think it is better to perform it in the caller, then maybe we should avoid doing it inside UndoLogGet()->get_undo_log()->undologtable_lookup().

6.
+get_undo_log(UndoLogNumber logno, bool locked)
{
..
+ /*
+ * If we didn't find it, then it must already have been entirely
+ * discarded. We create a negative cache entry so that we can answer
+ * this question quickly next time.
+ *
+ * TODO: We could track the lowest known undo log number, to reduce
+ * the negative cache entry bloat.
+ */
+ if (result == NULL)
+ {
+ /*
+ * Sanity check: the caller should not be asking about undo logs
+ * that have never existed.
+ */
+ if (logno >= shared->next_logno)
+ elog(ERROR, "undo log %u hasn't been created yet", logno);
+ entry = undologtable_insert(undologtable_cache, logno, &found);
+ entry->number = logno;
+ entry->control = NULL;
+ entry->tablespace = 0;
+ }
..
}

Are you planning to take care of this TODO? In any case, do we have any mechanism to clear this bloat, or will it stay till the end of the session? If the latter, then I think it is important to take care of the TODO.

7.
+void UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno);
+/* Redo interface. */
+extern void undolog_redo(XLogReaderState *record);

You might want to add an extra line before /* Redo interface. */, following what has been done earlier in this file.

8.
+ * XXX For now an xl_undolog_meta object is filled in, in case it turns out
+ * to be necessary to write it into the WAL record (like FPI, this must be
+ * logged once for each undo log after each checkpoint). I think this should
+ * be moved out of this interface and done differently -- to review.
+ */
+UndoRecPtr
+UndoLogAllocate(size_t size, UndoPersistence persistence)

This function doesn't appear to be filling xl_undolog_meta. Am I missing something? If not, then this comment needs to be changed.

9.
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+ char *rawname;
+ List *namelist;
+ bool need_to_unlock;
+ int length;
+ int i;
+
+ /* We need a modifiable copy of string. */
+ rawname = pstrdup(undo_tablespaces);

I don't see the usage of rawname outside this function; isn't it better to free it? I understand that this function won't be called frequently enough to matter, but still there is some theoretical danger if a user continuously changes undo_tablespaces.

10.
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
{
..
+ /*
+ * For now we have a simple linked list of unattached undo logs for each
+ * persistence level. We'll grovel though it to find something for the

Typo. /though/through

11.
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
{
..
+ /* WAL-log the creation of this new undo log. */
+ {
+ xl_undolog_create xlrec;
+
+ xlrec.logno = logno;
+ xlrec.tablespace = log->meta.tablespace;
+ xlrec.persistence = log->meta.persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_CREATE);
+ }
..
}

Do we need to WAL-log this for the temporary/unlogged persistence levels?

12.
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
{
..
+ oid = get_tablespace_oid(name, true);
..

Do we need to check permissions to see if the current user is allowed to create in this tablespace?

13.
+UndoLogAllocate(size_t size, UndoPersistence persistence)
{
..
+ log->meta.prevlogno = prevlogno;

Is it okay to update meta information without a lock, or should we do it a few lines down after taking the mutex lock? If it is okay, then it is better to write a comment saying so.

14.
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+ UndoLogOffset end)
{
..
+ /* Flush the contents of the file to disk. */
+ if (pg_fsync(fd) != 0)
+ elog(ERROR, "cannot fsync file \"%s\": %m", path);
..
}

You might want to have a wait event for this, as we do at other places where we perform fsync.

15.
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+ UndoLogOffset end)
{
..
+ if (!InRecovery)
+ {
+ xl_undolog_extend xlrec;
+ XLogRecPtr ptr;
+
+ xlrec.logno = logno;
+ xlrec.end = end;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
+ XLogFlush(ptr);
+ }
..
}

Do we need it for the temporary/unlogged persistence levels?

16.
+static void
+undolog_xlog_create(XLogReaderState *record)
+{
+ xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record);
+ UndoLogControl *log;
+ UndoLogSharedData *shared = MyUndoLogState.shared;
+
+ /* Create meta-data space in shared memory.
*/ + LWLockAcquire(UndoLogLock, LW_EXCLUSIVE); + /* TODO: assert that it doesn't exist already? */ + log = allocate_undo_log(); + LWLockAcquire(&log->mutex, LW_EXCLUSIVE); Do we need to acquire UndoLogLock during replay? What else can be going in concurrent to this which can create a problem? 17. UndoLogAllocate() { .. + /* + * While we have the lock, check if we have been forcibly detached by + * DROP TABLESPACE. That can only happen between transactions (see + * DropUndoLogsInsTablespace()). + */ .. } Function name in above comment is wrong. 18. + { + {"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT, + gettext_noop("Sets the tablespace(s) to use for undo logs."), + NULL, + GUC_LIST_INPUT | GUC_LIST_QUOTE + }, + &undo_tablespaces, + "", + check_undo_tablespaces, assign_undo_tablespaces, NULL + }, It seems you need to update variable_is_guc_list_quote for this variable. Till now, I have mainly reviewed undo log allocation part. This is a big patch and can take much more time to complete the review. I will review the other parts of the patch later. I have changed the status of this CF entry as "Waiting on Author", feel free to change it once you think all the comments are addressed. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Mon, Sep 3, 2018 at 11:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thomas has already posted the latest version of the undo log patches on
the 'Cleaning up orphaned files using undo logs' thread[1], so I have
rebased the undo-interface patch as well. This patch also includes the
latest defect fixes from the main zheap branch [2].

I have also made some changes to the undo-log patches. Basically, it is
just some cleanup work, and it also makes these patches independently
compilable. I have moved some of the code into the undo-log patches and
moved out some code which is not relevant to them. Some examples:

1. Moved the undo log startup and checkpoint related code into
'0001-Add-undo-log-manager_v2.patch'

+ /* Recover undo log meta data corresponding to this checkpoint. */
+ StartupUndoLogs(ControlFile->checkPointCopy.redo);
+

2. Removed the undo-worker related stuff from this patch

+ case WAIT_EVENT_UNDO_DISCARD_WORKER_MAIN:
+ event_name = "UndoDiscardWorkerMain";
+ break;
+ case WAIT_EVENT_UNDO_LAUNCHER_MAIN:
+ event_name = "UndoLauncherMain";
+ break;

[1] https://www.postgresql.org/message-id/flat/CAEepm=0ULqYgM2aFeOnrx6YrtBg3xUdxALoyCG+XpssKqmezug@mail.gmail.com
[2] https://github.com/EnterpriseDB/zheap/

Patch applying order:
0001-Add-undo-log-manager.patch
0002-Provide-access-to-undo-log-data-via-the-buffer-manag.patch
0003-undo-interface-v3.patch
0004-Add-tests-for-the-undo-log-manager.patch
from the 'Cleaning up orphaned files using undo logs' thread[1]
0004-undo-interface-test-v3.patch

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Mon, Nov 5, 2018 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> [review for undo record layer (0003-undo-interface-v3)]

I might sound like I'm repeating myself, but just to be clear: I was
involved in the design of this patch and have given a few high-level
inputs for it. I have used this interface in the zheap development, but
haven't done any sort of detailed review, which I am doing now. I
encourage others to review this patch as well.

1.
* NOTES:
+ * Handling multilog -
+ * It is possible that the undo record of a transaction can be spread across
+ * multiple undo log. And, we need some special handling while inserting the
+ * undo for discard and rollback to work sanely.

I think before describing how an undo record is spread across multiple
logs, you can explain how it is laid out when that is not the case. You
can also explain how undo record headers are linked. I am not sure the
file header is the best place, or whether it should be mentioned in the
README, but I think for now we can use the file header for this purpose
and later move it to the README if required.

2.
+/*
+ * FIXME: Do we want to support undo tuple size which is more than the BLCKSZ
+ * if not than undo record can spread across 2 buffers at the max.
+ */
+#define MAX_BUFFER_PER_UNDO 2

I think here the right question is: what is the possibility of an undo
record being greater than BLCKSZ? For zheap, as of today, we don't have
any such requirement, as the largest undo record is written for update
or multi_insert, and in both cases we don't exceed the limit of BLCKSZ.
I guess some user other than zheap could have such a requirement, and I
don't think it is impossible to enhance this if we do.

If anybody else has an opinion here, please feel free to share it.

3.
+/*
+ * FIXME: Do we want to support undo tuple size which is more than the BLCKSZ
+ * if not than undo record can spread across 2 buffers at the max.
+ */
+#define MAX_BUFFER_PER_UNDO 2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record. Hence increased by 1.
+ */
+#define MAX_UNDO_BUFFERS (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
+
+/* Maximum number of undo record that can be prepared before calling insert. */
+#define MAX_PREPARED_UNDO 2

I think it is better to define MAX_PREPARED_UNDO before
MAX_UNDO_BUFFERS, as the first one is used in the definition of the
second.

4.
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert. Upon return, the necessary undo buffers are pinned.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+ TransactionId xid)

This function locks the buffers as well, which is right as we need to
do that before the critical section, but the function header comment
doesn't indicate it. You can modify it as: "Upon return, the necessary
undo buffers are pinned and locked."

Note that a similar modification is required in the .h file as well.

5.
+/*
+ * Insert a previously-prepared undo record. This will lock the buffers
+ * pinned in the previous step, write the actual undo record into them,
+ * and mark them dirty. For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)

Here, the comment is wrong, as the buffers are already locked in the
previous step. A similar change is required in the .h file as well.

6.
+InsertPreparedUndo(void)
{
..
/*
+ * Try to insert the record into the current page. If it doesn't
+ * succeed then recall the routine with the next page.
+ */
+ if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+ {
+ undo_len += already_written;
+ MarkBufferDirty(buffer);
+ break;
+ }
+
+ MarkBufferDirty(buffer);
..
}

Here, you are writing into a shared buffer and marking it dirty; isn't
it a good idea to Assert that we are in a critical section?

7.
+/* Maximum number of undo record that can be prepared before calling insert. */
+#define MAX_PREPARED_UNDO 2

/record/records

I think this definition doesn't define the maximum number of undo
records that can be prepared, as the caller can use UndoSetPrepareSize
to change it. I think you can modify the comment as below or something
along those lines:
"This defines the number of undo records that can be prepared before
calling insert by default. If you need to prepare more than
MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
first."

8.
+ * Caller should call this under log->discard_lock
+ */
+static bool
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+ if (log->oldest_data == InvalidUndoRecPtr)
..

Isn't it a good idea to have an Assert that we already hold the
discard_lock?

9.
+ UnpackedUndoRecord uur; /* prev txn's first undo record.*/
+} PreviousTxnInfo;

An extra space at the end of the comment is required.

10.
+/*
+ * Check if previous transactions undo is already discarded.
+ *
+ * Caller should call this under log->discard_lock
+ */
+static bool
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{

The name suggests that this function is doing something special for the
previous transaction, whereas it just checks whether the undo
corresponding to a particular undo location has been discarded. Isn't
it better to name it UndoRecordExists or UndoRecordIsValid? Then
explain in the comments when you consider a particular record to exist.

Another point to note is that you are not releasing the lock in all
paths, so it is better to mention in the comments when it will be
released and when not.

11.
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+ if (log->oldest_data == InvalidUndoRecPtr)
+ {
+ /*
+ * oldest_data is not yet initialized. We have to check
+ * UndoLogIsDiscarded and if it's already discarded then we have
+ * nothing to do.
+ */
+ LWLockRelease(&log->discard_lock);
+ if (UndoLogIsDiscarded(prev_xact_urp))
+ return true;

The comment in the above code just restates the code in words. I think
here you should explain why we need to call UndoLogIsDiscarded when
oldest_data is not initialized, and/or the scenario in which
oldest_data will not be initialized.

12.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+void
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
..

I find this function name a bit awkward. How about
UndoRecordPrepareTransInfo?

13.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
{
..
+ /*
+ * TODO: For now we don't know how to build a transaction chain for
+ * temporary undo logs. That's because this log might have been used by a
+ * different backend, and we can't access its buffers. What should happen
+ * is that the undo data should be automatically discarded when the other
+ * backend detaches, but that code doesn't exist yet and the undo worker
+ * can't do it either.
+ */
+ if (log->meta.persistence == UNDO_TEMP)
+ return;

Aren't we already dealing with this case in the other patch [1]?
Basically, I think we should discard it at commit time and/or when the
backend is detached.

14.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
{
..
/*
+ * If previous transaction's urp is not valid means this backend is
+ * preparing its first undo so fetch the information from the undo log
+ * if it's still invalid urp means this is the first undo record for this
+ * log and we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
..

This comment is confusing. It appears to say the same thing twice. You
can write it along the lines of: "The absence of the previous
transaction's undo indicates that this backend is preparing its first
undo, in which case we have nothing to update."

15.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
/*
+ * Acquire the discard lock before accessing the undo record so that
+ * discard worker doen't remove the record while we are in process of
+ * reading it.
+ */

Typo: doen't/doesn't. I think you can use 'can't' instead of 'doesn't'.

This is by no means a complete review, rather just a few things I
noticed while reading the patch.

[1] - https://www.postgresql.org/message-id/CAFiTN-t8fv-qYG9zynhS-1jRrvt_o5C-wCMRtzOsK8S%3DMXvKKw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Oct 17, 2018 at 3:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Oct 15, 2018 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro
> > <thomas.munro@enterprisedb.com> wrote:
> > > I have also pushed a new WIP version of the lower level undo log
> > > storage layer patch set to a public branch[1]. I'll leave the earlier
> > > branch[2] there because the record-level patch posted by Dilip depends
> > > on it for now.
> > >
>
> Till now, I have mainly reviewed the undo log allocation part. This is a
> big patch and can take much more time to review completely. I will
> review the other parts of the patch later.

I think I see another issue with this patch. Basically, during
extend_undo_log there is an assumption that no one can update
log->meta.end concurrently, but that is not true: it can also be
updated by UndoLogDiscard, which can lead to an assertion failure in
extend_undo_log.

+static void
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
{
..
/*
+ * Create all the segments needed to increase 'end' to the requested
+ * size. This is quite expensive, so we will try to avoid it completely
+ * by renaming files into place in UndoLogDiscard instead.
+ */
+ end = log->meta.end;
+ while (end < new_end)
+ {
+ allocate_empty_undo_segment(logno, log->meta.tablespace, end);
+ end += UndoLogSegmentSize;
+ }
..
+ Assert(end == new_end);
..
/*
+ * We didn't need to acquire the mutex to read 'end' above because only
+ * we write to it. But we need the mutex to update it, because the
+ * checkpointer might read it concurrently.
+ *
+ * XXX It's possible for meta.end to be higher already during
+ * recovery, because of the timing of a checkpoint; in that case we did
+ * nothing above and we shouldn't update shmem here. That interaction
+ * needs more analysis.
+ */
+ LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+ if (log->meta.end < end)
+ log->meta.end = end;
+ LWLockRelease(&log->mutex);
..
}

Assume that just before we read log->meta.end in the above code, the
discard process concurrently discards the undo and moves the end
pointer to a later location; then the above assertion will fail. And if
the discard happens just after we read log->meta.end and before we do
the other stuff in this function, then it will crash in recovery. Can't
we just remove this Assert?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Nov 5, 2018 at 5:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Sep 3, 2018 at 11:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Thomas has already posted the latest version of the undo log patches
> on the 'Cleaning up orphaned files using undo logs' thread[1], so I
> have rebased the undo-interface patch as well. This patch also
> includes the latest defect fixes from the main zheap branch [2].
>

Hi Thomas,

The latest patch for undo log storage does not compile on head; I think
it needs to be rebased after your commit related to "pg_pread() and
pg_pwrite() for data files and WAL".

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sat, Nov 10, 2018 at 9:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 5, 2018 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > [review for undo record layer (0003-undo-interface-v3)]
>
> I might sound like I'm repeating myself, but just to be clear: I was
> involved in the design of this patch and have given a few high-level
> inputs for it. I have used this interface in the zheap development,
> but haven't done any sort of detailed review, which I am doing now. I
> encourage others to review this patch as well.

Thanks for the review, please find my reply inline.

> 1.
> * NOTES:
> + * Handling multilog -
> + * It is possible that the undo record of a transaction can be spread across
> + * multiple undo log. And, we need some special handling while inserting the
> + * undo for discard and rollback to work sanely.
>
> I think before describing how an undo record is spread across multiple
> logs, you can explain how it is laid out when that is not the case.
> You can also explain how undo record headers are linked. I am not sure
> the file header is the best place, or whether it should be mentioned
> in the README, but I think for now we can use the file header for this
> purpose and later move it to the README if required.

Added in the header.

> 2.
> +/*
> + * FIXME: Do we want to support undo tuple size which is more than the BLCKSZ
> + * if not than undo record can spread across 2 buffers at the max.
> + */
> +#define MAX_BUFFER_PER_UNDO 2
>
> I think here the right question is: what is the possibility of an undo
> record being greater than BLCKSZ? For zheap, as of today, we don't
> have any such requirement, as the largest undo record is written for
> update or multi_insert, and in both cases we don't exceed the limit of
> BLCKSZ. I guess some user other than zheap could have such a
> requirement, and I don't think it is impossible to enhance this if we
> do.
>
> If anybody else has an opinion here, please feel free to share it.

Should we remove this FIXME, or wait for some other opinion? As of now
I have kept it as it is.

> 3.
> +/*
> + * FIXME: Do we want to support undo tuple size which is more than the BLCKSZ
> + * if not than undo record can spread across 2 buffers at the max.
> + */
> +#define MAX_BUFFER_PER_UNDO 2
> +
> +/*
> + * Consider buffers needed for updating previous transaction's
> + * starting undo record. Hence increased by 1.
> + */
> +#define MAX_UNDO_BUFFERS (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
> +
> +/* Maximum number of undo record that can be prepared before calling insert. */
> +#define MAX_PREPARED_UNDO 2
>
> I think it is better to define MAX_PREPARED_UNDO before
> MAX_UNDO_BUFFERS, as the first one is used in the definition of the
> second.

Done

> 4.
> +/*
> + * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
> + * intended to insert. Upon return, the necessary undo buffers are pinned.
> + * This should be done before any critical section is established, since it
> + * can fail.
> + *
> + * If not in recovery, 'xid' should refer to the top transaction id because
> + * undo log only stores mapping for the top most transactions.
> + * If in recovery, 'xid' refers to the transaction id stored in WAL.
> + */
> +UndoRecPtr
> +PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
> + TransactionId xid)
>
> This function locks the buffers as well, which is right as we need to
> do that before the critical section, but the function header comment
> doesn't indicate it. You can modify it as:
> "Upon return, the necessary undo buffers are pinned and locked."
>
> Note that a similar modification is required in the .h file as well.

Done

> 5.
> +/*
> + * Insert a previously-prepared undo record. This will lock the buffers
> + * pinned in the previous step, write the actual undo record into them,
> + * and mark them dirty. For persistent undo, this step should be performed
> + * after entering a critical section; it should never fail.
> + */
> +void
> +InsertPreparedUndo(void)
>
> Here, the comment is wrong, as the buffers are already locked in the
> previous step. A similar change is required in the .h file as well.

Fixed

> 6.
> +InsertPreparedUndo(void)
> {
> ..
> /*
> + * Try to insert the record into the current page. If it doesn't
> + * succeed then recall the routine with the next page.
> + */
> + if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
> + {
> + undo_len += already_written;
> + MarkBufferDirty(buffer);
> + break;
> + }
> +
> + MarkBufferDirty(buffer);
> ..
> }
>
> Here, you are writing into a shared buffer and marking it dirty; isn't
> it a good idea to Assert that we are in a critical section?

Done

> 7.
> +/* Maximum number of undo record that can be prepared before calling insert. */
> +#define MAX_PREPARED_UNDO 2
>
> /record/records
>
> I think this definition doesn't define the maximum number of undo
> records that can be prepared, as the caller can use UndoSetPrepareSize
> to change it. I think you can modify the comment as below or something
> along those lines:
> "This defines the number of undo records that can be prepared before
> calling insert by default. If you need to prepare more than
> MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
> first."

Fixed

> 8.
> + * Caller should call this under log->discard_lock
> + */
> +static bool
> +IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
> +{
> + if (log->oldest_data == InvalidUndoRecPtr)
> ..
>
> Isn't it a good idea to have an Assert that we already hold the
> discard_lock?

Done

> 9.
> + UnpackedUndoRecord uur; /* prev txn's first undo record.*/
> +} PreviousTxnInfo;
>
> An extra space at the end of the comment is required.

Done

> 10.
> +/*
> + * Check if previous transactions undo is already discarded.
> + *
> + * Caller should call this under log->discard_lock
> + */
> +static bool
> +IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
> +{
>
> The name suggests that this function is doing something special for
> the previous transaction, whereas it just checks whether the undo
> corresponding to a particular undo location has been discarded. Isn't
> it better to name it UndoRecordExists or UndoRecordIsValid? Then
> explain in the comments when you consider a particular record to
> exist.

Changed to UndoRecordIsValid

> Another point to note is that you are not releasing the lock in all
> paths, so it is better to mention in the comments when it will be
> released and when not.

Done

> 11.
> +IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
> +{
> + if (log->oldest_data == InvalidUndoRecPtr)
> + {
> + /*
> + * oldest_data is not yet initialized. We have to check
> + * UndoLogIsDiscarded and if it's already discarded then we have
> + * nothing to do.
> + */
> + LWLockRelease(&log->discard_lock);
> + if (UndoLogIsDiscarded(prev_xact_urp))
> + return true;
>
> The comment in the above code just restates the code in words. I think
> here you should explain why we need to call UndoLogIsDiscarded when
> oldest_data is not initialized, and/or the scenario in which
> oldest_data will not be initialized.

Fixed

> 12.
> + * The actual update will be done by UndoRecordUpdateTransInfo under the
> + * critical section.
> + */
> +void
> +PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
> +{
> ..
>
> I find this function name a bit awkward. How about
> UndoRecordPrepareTransInfo?

Changed as per the suggestion

> 13.
> +PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
> {
> ..
> + /*
> + * TODO: For now we don't know how to build a transaction chain for
> + * temporary undo logs. That's because this log might have been used by a
> + * different backend, and we can't access its buffers. What should happen
> + * is that the undo data should be automatically discarded when the other
> + * backend detaches, but that code doesn't exist yet and the undo worker
> + * can't do it either.
> + */
> + if (log->meta.persistence == UNDO_TEMP)
> + return;
>
> Aren't we already dealing with this case in the other patch [1]?
> Basically, I think we should discard it at commit time and/or when the
> backend is detached.

Changed

> 14.
> +PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
> {
> ..
> /*
> + * If previous transaction's urp is not valid means this backend is
> + * preparing its first undo so fetch the information from the undo log
> + * if it's still invalid urp means this is the first undo record for this
> + * log and we have nothing to update.
> + */
> + if (!UndoRecPtrIsValid(prev_xact_urp))
> + return;
> ..
>
> This comment is confusing. It appears to say the same thing twice. You
> can write it along the lines of:
>
> "The absence of the previous transaction's undo indicates that this
> backend is preparing its first undo, in which case we have nothing to
> update."

Done as per the suggestion

> 15.
> +PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
> /*
> + * Acquire the discard lock before accessing the undo record so that
> + * discard worker doen't remove the record while we are in process of
> + * reading it.
> + */
>
> Typo: doen't/doesn't.
>
> I think you can use 'can't' instead of 'doesn't'.

Fixed

> This is by no means a complete review, rather just a few things I
> noticed while reading the patch.
>
> [1] - https://www.postgresql.org/message-id/CAFiTN-t8fv-qYG9zynhS-1jRrvt_o5C-wCMRtzOsK8S%3DMXvKKw%40mail.gmail.com
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Nov 14, 2018 at 2:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Nov 10, 2018 at 9:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Mon, Nov 5, 2018 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > [review for undo record layer (0003-undo-interface-v3)]
> >
> > I might sound like I'm repeating myself, but just to be clear: I was
> > involved in the design of this patch and have given a few high-level
> > inputs for it. I have used this interface in the zheap development,
> > but haven't done any sort of detailed review, which I am doing now.
> > I encourage others to review this patch as well.
>
> Thanks for the review, please find my reply inline.
>
> > 1.
> > * NOTES:
> > + * Handling multilog -
> > + * It is possible that the undo record of a transaction can be spread across
> > + * multiple undo log. And, we need some special handling while inserting the
> > + * undo for discard and rollback to work sanely.
> >
> > I think before describing how an undo record is spread across
> > multiple logs, you can explain how it is laid out when that is not
> > the case. You can also explain how undo record headers are linked. I
> > am not sure the file header is the best place, or whether it should
> > be mentioned in the README, but I think for now we can use the file
> > header for this purpose and later move it to the README if required.
>
> Added in the header.
>
> > 2.
> > +/*
> > + * FIXME: Do we want to support undo tuple size which is more than the BLCKSZ
> > + * if not than undo record can spread across 2 buffers at the max.
> > + */
> > +#define MAX_BUFFER_PER_UNDO 2
> >
> > I think here the right question is: what is the possibility of an
> > undo record being greater than BLCKSZ? For zheap, as of today, we
> > don't have any such requirement, as the largest undo record is
> > written for update or multi_insert, and in both cases we don't
> > exceed the limit of BLCKSZ. I guess some user other than zheap could
> > have such a requirement, and I don't think it is impossible to
> > enhance this if we do.
> >
> > If anybody else has an opinion here, please feel free to share it.
>
> Should we remove this FIXME, or wait for some other opinion? As of
> now I have kept it as it is.
>

I think you can keep it with XXX instead of FIXME, as there is nothing
to fix.

Both the patches 0003-undo-interface-v4.patch and
0004-undo-interface-test-v4.patch appear to be the same except for the
name?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 14, 2018 at 2:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think you can keep it with XXX instead of FIXME, as there is nothing
> to fix.

Changed

>
> Both the patches 0003-undo-interface-v4.patch and
> 0004-undo-interface-test-v4.patch appear to be the same except for the
> name?

My bad, please find the updated patch.

>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Nov 14, 2018 at 3:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Nov 14, 2018 at 2:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I think you can keep it with XXX instead of FIXME, as there is
> > nothing to fix.
>
> Changed
>
> > Both the patches 0003-undo-interface-v4.patch and
> > 0004-undo-interface-test-v4.patch appear to be the same except for
> > the name?
>
> My bad, please find the updated patch.
>

There was a problem in an assert, and one comment was not aligned
properly, so I have fixed both in the latest patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Thu, Nov 15, 2018 at 12:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> Updated patch (merged latest code from the zheap main branch [1]).

The main change is removing relfilenode and tablespace id from the undo record and storing reloid instead. Earlier, we kept them thinking that we would perform rollback without a database connection, but that's not the case now.

[1] https://github.com/EnterpriseDB/zheap

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Fri, Nov 16, 2018 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Nov 15, 2018 at 12:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> > Updated patch (merged latest code from the zheap main branch [1]).
>

Review comments:
-------------------------------
1.
UndoRecordPrepareTransInfo()
{
..
+ /*
+ * The absence of previous transaction's undo indicate that this backend
+ * is preparing its first undo in which case we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
..
}

It is expected that the caller of UndoRecPtrIsValid should hold the discard lock, but I don't see how the call from this place ensures that.

2.
UndoRecordPrepareTransInfo()
{
..
/*
+ * The absence of previous transaction's undo indicate that this backend
+ * is preparing its first undo in which case we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
+
+ /*
+ * Acquire the discard lock before accessing the undo record so that
+ * discard worker doesn't remove the record while we are in process of
+ * reading it.
+ */
+ LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+ if (!UndoRecordIsValid(log, prev_xact_urp))
+ return;
..
}

I don't understand this logic where you check the same information with and without a lock; is there any reason for that? It seems the first call to UndoRecPtrIsValid is not required.

3.
UndoRecordPrepareTransInfo()
{
..
+ while (true)
+ {
+ bufidx = InsertFindBufferSlot(rnode, cur_blk,
+ RBM_NORMAL,
+ log->meta.persistence);
+ prev_txn_info.prev_txn_undo_buffers[index] = bufidx;
+ buffer = undo_buffer[bufidx].buf;
+ page = BufferGetPage(buffer);
+ index++;
+
+ if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte,
+ &already_decoded, true))
+ break;
+
+ starting_byte = UndoLogBlockHeaderSize;
+ cur_blk++;
+ }

Can you write some commentary on what this code is doing?
There is no need to use index++; as a separate statement, you can do it while assigning the buffer in that index. 4. +UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched) +{ + UndoRecPtr prev_xact_urp; I think you can simply name this variable as xact_urp. All this and related prev_* terminology used for variables seems confusing to me. I understand that you are trying to update the last transactions undo record information, but you can explain that via comments. Keeping such information as part of variable names not only makes their length longer, but is also confusing. 5. /* + * Structure to hold the previous transaction's undo update information. + */ +typedef struct PreviousTxnUndoRecord +{ + UndoRecPtr prev_urecptr; /* prev txn's starting urecptr */ + int prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO]; + UnpackedUndoRecord uur; /* prev txn's first undo record. */ +} PreviousTxnInfo; + +static PreviousTxnInfo prev_txn_info; Due to reasons mentioned in point-4, lets name the structure and it's variables as below: typedef struct XactUndoRecordInfo { UndoRecPtr start_urecptr; /* prev txn's starting urecptr */ int idx_undo_buffers[MAX_BUFFER_PER_UNDO]; UnpackedUndoRecord first_uur; /* prev txn's first undo record. */ } XactUndoRecordInfo; static XactUndoRecordInfo xact_ur_info; 6. +static int +InsertFindBufferSlot(RelFileNode rnode, The name of this function is not clear, can we change it to UndoGetBufferSlot or UndoGetBuffer? 7. +UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords, + UndoPersistence upersistence, TransactionId txid) { .. + /* + * If this is the first undo record for this transaction then set the + * uur_next to the SpecialUndoRecPtr. This is the indication to allocate + * the space for the transaction header and the valid value of the uur_next + * will be updated while preparing the first undo record of the next + * transaction. + */ + first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid); .. 
} I think it will be better if we move this comment few lines down: + if (need_start_undo && i == 0) + { + urec->uur_next = SpecialUndoRecPtr; BTW, is the only reason set a special value (SpecialUndoRecPtr) for uur_next is for allocating transaction header? If so, can't we directly set the corresponding flag (UREC_INFO_TRANSACTION) in uur_info and then remove it from UndoRecordSetInfo? I think it would have been better if there is one central location to set uur_info, but as that is becoming tricky, we should not try to add some special stuff to make it possible. 8. +UndoRecordExpectedSize(UnpackedUndoRecord *uur) +{ + Size size; + + /* Fixme : Temporary hack to allow zheap to set some value for uur_info. */ + /* if (uur->uur_info == 0) */ + UndoRecordSetInfo(uur); Can't we move UndoRecordSetInfo in it's caller (UndoRecordAllocateMulti)? It seems another caller of this function doesn't expect this. If we do that way, then we can have an Assert for non-zero uur_info in UndoRecordExpectedSize. 9. bool +InsertUndoRecord(UnpackedUndoRecord *uur, Page page, + int starting_byte, int *already_written, bool header_only) +{ + char *writeptr = (char *) page + starting_byte; + char *endptr = (char *) page + BLCKSZ; + int my_bytes_written = *already_written; + + if (uur->uur_info == 0) + UndoRecordSetInfo(uur); Do we really need UndoRecordSetInfo here? If not, then just add an assert for non-zero uur_info? 10 UndoRecordAllocateMulti() { .. else + { + /* + * It is okay to initialize these variables as these are used only + * with the first record of transaction. + */ + urec->uur_next = InvalidUndoRecPtr; + urec->uur_xidepoch = 0; + urec->uur_dbid = 0; + urec->uur_progress = 0; + } + + + /* calculate the size of the undo record. */ + size += UndoRecordExpectedSize(urec); + } Remove one extra line before comment "calculate the size of ..". -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
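Several of the points above (7, 8, 9) revolve around who sets uur_info and asserting it is non-zero before computing the record size. The idea is that once the caller has fixed uur_info, the expected size is just the fixed header plus the sum of the optional chunks the flags select. A hedged sketch of that shape — UREC_INFO_TRANSACTION is from the patch, but the other flag names and all the chunk sizes here are invented placeholders:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag bits; only UREC_INFO_TRANSACTION appears in the review. */
#define UREC_INFO_RELATION_DETAILS 0x01
#define UREC_INFO_BLOCK            0x02
#define UREC_INFO_PAYLOAD          0x04
#define UREC_INFO_TRANSACTION      0x08

/* Assumed chunk sizes, for illustration only. */
#define UREC_HEADER_SIZE       16	/* always present */
#define UREC_RELDETAILS_SIZE    8
#define UREC_BLOCK_SIZE        10
#define UREC_PAYLOAD_SIZE       4
#define UREC_TRANSACTION_SIZE  20

/*
 * Per the review, the caller must have set uur_info already, so this
 * asserts rather than calling something like UndoRecordSetInfo itself.
 */
static int
undo_record_expected_size(uint8_t uur_info)
{
	int			size = UREC_HEADER_SIZE;

	assert(uur_info != 0);		/* must contain valid information */
	if (uur_info & UREC_INFO_RELATION_DETAILS)
		size += UREC_RELDETAILS_SIZE;
	if (uur_info & UREC_INFO_BLOCK)
		size += UREC_BLOCK_SIZE;
	if (uur_info & UREC_INFO_PAYLOAD)
		size += UREC_PAYLOAD_SIZE;
	if (uur_info & UREC_INFO_TRANSACTION)
		size += UREC_TRANSACTION_SIZE;
	return size;
}
```

With this structure, setting UREC_INFO_TRANSACTION directly for the first record of a transaction (instead of the SpecialUndoRecPtr sentinel discussed in point 7) automatically allocates space for the transaction header.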
On Sat, Nov 17, 2018 at 5:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Nov 16, 2018 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Thu, Nov 15, 2018 at 12:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > Updated patch (merged latest code from the zheap main branch [1]). > > > > Review comments: > ------------------------------- > 1. > UndoRecordPrepareTransInfo() > { > .. > + /* > + * The absence of previous transaction's undo indicate that this backend > + * is preparing its first undo in which case we have nothing to update. > + */ > + if (!UndoRecPtrIsValid(prev_xact_urp)) > + return; > .. > } > > It is expected that the caller of UndoRecPtrIsValid should have > discard lock, but I don't see that how this the call from this place > ensures the same? I think its duplicate code, made mistake while merging from the zheap branch > > > 2. > UndoRecordPrepareTransInfo() > { > .. > /* > + * The absence of previous transaction's undo indicate that this backend > + * is preparing its first undo in which case we have nothing to update. > + */ > + if (!UndoRecPtrIsValid(prev_xact_urp)) > + return; > + > + /* > + * Acquire the discard lock before accessing the undo record so that > + * discard worker doesn't remove the record while we are in process of > + * reading it. > + */ > + LWLockAcquire(&log->discard_lock, LW_SHARED); > + > + if (!UndoRecordIsValid(log, prev_xact_urp)) > + return; > .. > } > > I don't understand this logic where you are checking the same > information with and without a lock, is there any reason for same? It > seems we don't need the first call to UndoRecPtrIsValid is not > required. Removed > > > 3. > UndoRecordPrepareTransInfo() > { > .. 
> + while (true) > + { > + bufidx = InsertFindBufferSlot(rnode, cur_blk, > + RBM_NORMAL, > + log->meta.persistence); > + prev_txn_info.prev_txn_undo_buffers[index] = bufidx; > + buffer = undo_buffer[bufidx].buf; > + page = BufferGetPage(buffer); > + index++; > + > + if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte, > + &already_decoded, true)) > + break; > + > + starting_byte = UndoLogBlockHeaderSize; > + cur_blk++; > + } > > > Can you write some commentary on what this code is doing? Done > > > There is no need to use index++; as a separate statement, you can do > it while assigning the buffer in that index. Done > > > 4. > +UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched) > +{ > + UndoRecPtr prev_xact_urp; > > I think you can simply name this variable as xact_urp. All this and > related prev_* terminology used for variables seems confusing to me. I > understand that you are trying to update the last transactions undo > record information, but you can explain that via comments. Keeping > such information as part of variable names not only makes their length > longer, but is also confusing. > > 5. > /* > + * Structure to hold the previous transaction's undo update information. > + */ > +typedef struct PreviousTxnUndoRecord > +{ > + UndoRecPtr prev_urecptr; /* prev txn's starting urecptr */ > + int prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO]; > + UnpackedUndoRecord uur; /* prev txn's first undo record. */ > +} PreviousTxnInfo; > + > +static PreviousTxnInfo prev_txn_info; > > Due to reasons mentioned in point-4, lets name the structure and it's > variables as below: > > typedef struct XactUndoRecordInfo > { > UndoRecPtr start_urecptr; /* prev txn's starting urecptr */ > int idx_undo_buffers[MAX_BUFFER_PER_UNDO]; > UnpackedUndoRecord first_uur; /* prev txn's first undo record. */ > } XactUndoRecordInfo; > > static XactUndoRecordInfo xact_ur_info; Done, but I have kept start_urecptr as urecptr and first_uur as uur and explained in comment. 
> > > 6. > +static int > +InsertFindBufferSlot(RelFileNode rnode, > > The name of this function is not clear, can we change it to > UndoGetBufferSlot or UndoGetBuffer? Changed to UndoGetBufferSlot > > > 7. > +UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords, > + UndoPersistence upersistence, TransactionId txid) > { > .. > + /* > + * If this is the first undo record for this transaction then set the > + * uur_next to the SpecialUndoRecPtr. This is the indication to allocate > + * the space for the transaction header and the valid value of the uur_next > + * will be updated while preparing the first undo record of the next > + * transaction. > + */ > + first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid); > .. > } Done > > > I think it will be better if we move this comment few lines down: > + if (need_start_undo && i == 0) > + { > + urec->uur_next = SpecialUndoRecPtr; > > BTW, is the only reason set a special value (SpecialUndoRecPtr) for > uur_next is for allocating transaction header? If so, can't we > directly set the corresponding flag (UREC_INFO_TRANSACTION) in > uur_info and then remove it from UndoRecordSetInfo? yeah, Done that way. > > > I think it would have been better if there is one central location to > set uur_info, but as that is becoming tricky, > we should not try to add some special stuff to make it possible. > > 8. > +UndoRecordExpectedSize(UnpackedUndoRecord *uur) > +{ > + Size size; > + > + /* Fixme : Temporary hack to allow zheap to set some value for uur_info. */ > + /* if (uur->uur_info == 0) */ > + UndoRecordSetInfo(uur); > > Can't we move UndoRecordSetInfo in it's caller > (UndoRecordAllocateMulti)? It seems another caller of this function > doesn't expect this. If we do that way, then we can have an Assert > for non-zero uur_info in UndoRecordExpectedSize. Done that way > > > 9. 
> bool > +InsertUndoRecord(UnpackedUndoRecord *uur, Page page, > + int starting_byte, int *already_written, bool header_only) > +{ > + char *writeptr = (char *) page + starting_byte; > + char *endptr = (char *) page + BLCKSZ; > + int my_bytes_written = *already_written; > + > + if (uur->uur_info == 0) > + UndoRecordSetInfo(uur); > > Do we really need UndoRecordSetInfo here? If not, then just add an > assert for non-zero uur_info? Done > > > 10 > UndoRecordAllocateMulti() > { > .. > else > + { > + /* > + * It is okay to initialize these variables as these are used only > + * with the first record of transaction. > + */ > + urec->uur_next = InvalidUndoRecPtr; > + urec->uur_xidepoch = 0; > + urec->uur_dbid = 0; > + urec->uur_progress = 0; > + } > + > + > + /* calculate the size of the undo record. */ > + size += UndoRecordExpectedSize(urec); > + } > > Remove one extra line before comment "calculate the size of ..". Fixed Along with that I have merged latest changes in zheap branch committed by Rafia Sabih for cleaning up the undo buffer information in abort path. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Nov 20, 2018 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > Along with that I have merged latest changes in zheap branch committed > by Rafia Sabih for cleaning up the undo buffer information in abort > path. > Thanks, few more comments: 1. @@ -2627,6 +2653,7 @@ AbortTransaction(void) AtEOXact_HashTables(false); AtEOXact_PgStat(false); AtEOXact_ApplyLauncher(false); + AtAbort_ResetUndoBuffers(); Don't we need similar handling in AbortSubTransaction? 2. Read undo record header in by calling UnpackUndoRecord, if the undo + * record header is splited across buffers then we need to read the complete + * header by invoking UnpackUndoRecord multiple times. /splited/splitted. You can just use split here. 3. +/* + * Identifying information for a transaction to which this undo belongs. + * it will also store the total size of the undo for this transaction. + */ +typedef struct UndoRecordTransaction +{ + uint32 urec_progress; /* undo applying progress. */ + uint32 urec_xidepoch; /* epoch of the current transaction */ + Oid urec_dbid; /* database id */ + uint64 urec_next; /* urec pointer of the next transaction */ +} UndoRecordTransaction; /it will/It will. BTW, which field(s) in the above structure stores the size of the undo? 4. + /* + * Set uur_info for an UnpackedUndoRecord appropriately based on which + * fields are set and calculate the size of the undo record based on the + * uur_info. + */ Can we rephrase it as "calculate the size of the undo record based on the info required"? 5. +/* + * Unlock and release undo buffers. This step performed after exiting any + * critical section. + */ +void +UnlockReleaseUndoBuffers(void) Can we change the later sentence as "This step is performed after exiting any critical section where we have performed undo action."? 6. +InsertUndoRecord() { .. + Assert (uur->uur_info != 0); Add a comment above Assert "The undo record must contain a valid information." 6. 
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords, + UndoPersistence upersistence, TransactionId txid) { .. + first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid); + + if ((!InRecovery && prev_txid[upersistence] != txid) || + first_rec_in_recovery) + { + need_start_undo = true; + } Here, I think we can avoid using two boolean variables (first_rec_in_recovery and need_start_undo). Also, this same check is used in this function twice. I have tried to simplify it in the attached. Can you check and let me know if that sounds okay to you? 7. UndoRecordAllocateMulti { .. /* + * If this is the first undo record of the transaction then initialize + * the transaction header fields of the undorecord. Also, set the flag + * in the uur_info to indicate that this record contains the transaction + * header so allocate the space for the same. + */ + if (need_start_undo && i == 0) + { + urec->uur_next = InvalidUndoRecPtr; + urec->uur_xidepoch = GetEpochForXid(txid); + urec->uur_progress = 0; + + /* During recovery, Fetch database id from the undo log state. */ + if (InRecovery) + urec->uur_dbid = UndoLogStateGetDatabaseId(); + else + urec->uur_dbid = MyDatabaseId; + + /* Set uur_info to include the transaction header. */ + urec->uur_info |= UREC_INFO_TRANSACTION; + } .. } It seems here you have written the code in your comments. I have changed it in the attached delta patch. 8. UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords, + TransactionId xid, UndoPersistence upersistence) +{ .. .. + multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare, Can we rename this variable as prepared_urec_ptr or prepared_urp? 9. +void +UndoSetPrepareSize(int max_prepare, I think it will be better to use nrecords instead of 'max_prepare' similar to how you have it in UndoRecordAllocateMulti() 10. 
+ if (!UndoRecPtrIsValid(multi_prep_urp)) + urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid); + else + urecptr = multi_prep_urp; + + size = UndoRecordExpectedSize(urec); .. .. + if (UndoRecPtrIsValid(multi_prep_urp)) + { + UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp); + insert = UndoLogOffsetPlusUsableBytes(insert, size); + multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert); + } Can't we use urecptr instead of multi_prep_urp in above code after urecptr is initialized? 11. +static int max_prepare_undo = MAX_PREPARED_UNDO; Let's change the name of this variable as max_prepared_undo. Already changed in attached delta patch 12. PrepareUndoInsert() { .. + /* Already reached maximum prepared limit. */ + if (prepare_idx == max_prepare_undo) + return InvalidUndoRecPtr; .. } I think in the above condition, we should have elog, otherwise, callers need to be prepared to handle it. 13. UndoRecordAllocateMulti() How about naming it as UndoRecordAllocate as this is used to allocate even a single undo record? 14. If not already done, can you pgindent the new code added by this patch? Attached is a delta patch on top of your previous patch containing some fixes as memtioned above and few other minor changes and cleanup. If you find changes okay, kindly include them in your next version. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Attachment
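The UndoSetPrepareSize / PrepareUndoInsert / InsertPreparedUndo discussion above describes a two-phase API: prepare reserves space and records where each undo record will land (advancing a static insert pointer, which is why urecptr cannot replace it), and a later call writes everything inside a critical section. A much-simplified sketch under stated assumptions — records are reduced to byte reservations, abort() stands in for elog(ERROR), and the pointer advance ignores page headers (the real code uses UndoLogOffsetPlusUsableBytes):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_PREPARED_UNDO 2		/* small fixed limit, as in the patch */

typedef uint64_t UndoRecPtr;

typedef struct PreparedUndoSpace
{
	UndoRecPtr	urp;			/* where the record will be written */
	int			size;			/* bytes reserved for it */
} PreparedUndoSpace;

static PreparedUndoSpace prepared_undo[MAX_PREPARED_UNDO];
static int	prepare_idx;
static UndoRecPtr insert_ptr = 0x1000;	/* assumed next free undo offset */

/*
 * Phase 1: reserve space and remember where each record goes.  Erroring
 * out on overflow, rather than returning an invalid pointer, follows
 * review point 12 above.
 */
static UndoRecPtr
prepare_undo_insert(int size)
{
	UndoRecPtr	urp;

	if (prepare_idx == MAX_PREPARED_UNDO)
	{
		fprintf(stderr, "already reached the maximum prepared limit\n");
		abort();				/* stands in for elog(ERROR, ...) */
	}
	urp = insert_ptr;
	prepared_undo[prepare_idx].urp = urp;
	prepared_undo[prepare_idx].size = size;
	prepare_idx++;
	insert_ptr += size;			/* next prepare starts after this record */
	return urp;
}

/*
 * Phase 2: write out all prepared records.  In the patch this runs in a
 * critical section, so it must not fail; here it just returns the count.
 */
static int
insert_prepared_undo(void)
{
	int			n = prepare_idx;

	assert(prepare_idx > 0);	/* something must have been prepared */
	prepare_idx = 0;			/* reset for the next batch */
	return n;
}
```

Keeping the reservation state static between the two calls is the design choice under review: the prepare step can run outside the critical section (where allocation may fail), while the insert step has nothing left to do that can fail.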
On Mon, Nov 26, 2018 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Nov 20, 2018 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Along with that I have merged latest changes in zheap branch committed > > by Rafia Sabih for cleaning up the undo buffer information in abort > > path. > > > > Thanks, few more comments: > > 1. > @@ -2627,6 +2653,7 @@ AbortTransaction(void) > AtEOXact_HashTables(false); > AtEOXact_PgStat(false); > AtEOXact_ApplyLauncher(false); > + AtAbort_ResetUndoBuffers(); > > Don't we need similar handling in AbortSubTransaction? Yeah we do. I have fixed. > > 2. > Read undo record header in by calling UnpackUndoRecord, if the undo > + * record header is splited across buffers then we need to read the complete > + * header by invoking UnpackUndoRecord multiple times. > > /splited/splitted. You can just use split here. Fixed > > 3. > +/* > + * Identifying information for a transaction to which this undo belongs. > + * it will also store the total size of the undo for this transaction. > + */ > +typedef struct UndoRecordTransaction > +{ > + uint32 urec_progress; /* undo applying progress. */ > + uint32 urec_xidepoch; /* epoch of the current transaction */ > + Oid urec_dbid; /* database id */ > + uint64 urec_next; /* urec pointer of the next transaction */ > +} UndoRecordTransaction; > > /it will/It will. > BTW, which field(s) in the above structure stores the size of the undo? Fixed > > 4. > + /* > + * Set uur_info for an UnpackedUndoRecord appropriately based on which > + * fields are set and calculate the size of the undo record based on the > + * uur_info. > + */ > > Can we rephrase it as "calculate the size of the undo record based on > the info required"? Fixed > > 5. > +/* > + * Unlock and release undo buffers. This step performed after exiting any > + * critical section. 
> + */ > +void > +UnlockReleaseUndoBuffers(void) > > Can we change the later sentence as "This step is performed after > exiting any critical section where we have performed undo action."? Done, I mentioned This step is performed after exiting any critical section where we have prepared undo record. > > 6. > +InsertUndoRecord() > { > .. > + Assert (uur->uur_info != 0); > > Add a comment above Assert "The undo record must contain a valid information." Done > > 6. > +UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords, > + UndoPersistence upersistence, TransactionId txid) > { > .. > + first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid); > + > + if ((!InRecovery && prev_txid[upersistence] != txid) || > + first_rec_in_recovery) > + { > + need_start_undo = true; > + } > > Here, I think we can avoid using two boolean variables > (first_rec_in_recovery and need_start_undo). Also, this same check is > used in this function twice. I have tried to simplify it in the > attached. Can you check and let me know if that sounds okay to you? I have taken your changes > > 7. > UndoRecordAllocateMulti > { > .. > /* > + * If this is the first undo record of the transaction then initialize > + * the transaction header fields of the undorecord. Also, set the flag > + * in the uur_info to indicate that this record contains the transaction > + * header so allocate the space for the same. > + */ > + if (need_start_undo && i == 0) > + { > + urec->uur_next = InvalidUndoRecPtr; > + urec->uur_xidepoch = GetEpochForXid(txid); > + urec->uur_progress = 0; > + > + /* During recovery, Fetch database id from the undo log state. */ > + if (InRecovery) > + urec->uur_dbid = UndoLogStateGetDatabaseId(); > + else > + urec->uur_dbid = MyDatabaseId; > + > + /* Set uur_info to include the transaction header. */ > + urec->uur_info |= UREC_INFO_TRANSACTION; > + } > .. > } > > It seems here you have written the code in your comments. 
I have > changed it in the attached delta patch. Taken you changes > > 8. > UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords, > + TransactionId xid, UndoPersistence upersistence) > +{ > .. > .. > + multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare, > > Can we rename this variable as prepared_urec_ptr or prepared_urp? Done > > 9. > +void > +UndoSetPrepareSize(int max_prepare, > > I think it will be better to use nrecords instead of 'max_prepare' > similar to how you have it in UndoRecordAllocateMulti() Done > > 10. > + if (!UndoRecPtrIsValid(multi_prep_urp)) > + urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid); > + else > + urecptr = multi_prep_urp; > + > + size = UndoRecordExpectedSize(urec); > .. > .. > + if (UndoRecPtrIsValid(multi_prep_urp)) > + { > + UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp); > + insert = UndoLogOffsetPlusUsableBytes(insert, size); > + multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert); > + } > > Can't we use urecptr instead of multi_prep_urp in above code after > urecptr is initialized? I think we can't because urecptr is just the current pointer we are going to return but multi_prep_urp is the static variable we need to update so that the next prepare can calculate urecptr from this location. > > 11. > +static int max_prepare_undo = MAX_PREPARED_UNDO; > > Let's change the name of this variable as max_prepared_undo. Already > changed in attached delta patch ok > > 12. > PrepareUndoInsert() > { > .. > + /* Already reached maximum prepared limit. */ > + if (prepare_idx == max_prepare_undo) > + return InvalidUndoRecPtr; > .. > } > > I think in the above condition, we should have elog, otherwise, > callers need to be prepared to handle it. Done > > 13. > UndoRecordAllocateMulti() > > How about naming it as UndoRecordAllocate as this is used to allocate > even a single undo record? Done > > 14. 
> If not already done, can you pgindent the new code added by this patch?

Done

> Attached is a delta patch on top of your previous patch containing
> some fixes as mentioned above and few other minor changes and cleanup.
> If you find changes okay, kindly include them in your next version.

I have taken your changes.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
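Point 6 above (folding the first_rec_in_recovery and need_start_undo booleans into one check) amounts to: a record needs a transaction header exactly when its xid differs from the last xid seen at that persistence level. A hedged sketch of just that check, leaving out the recovery-side branch (IsTransactionFirstRec) entirely:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
#define InvalidTransactionId 0

/* One slot per persistence level, mirroring the patch's prev_txid[]. */
typedef enum
{
	UNDO_PERMANENT,
	UNDO_UNLOGGED,
	UNDO_TEMP,
	UndoPersistenceLevels
} UndoPersistence;

static TransactionId prev_txid[UndoPersistenceLevels];

/*
 * Return true when 'txid' is writing its first undo record at this
 * persistence level, i.e. when the record must carry the transaction
 * header (UREC_INFO_TRANSACTION).  Recovery is not handled here.
 */
static bool
need_xact_header(TransactionId txid, UndoPersistence p)
{
	if (prev_txid[p] == txid)
		return false;
	prev_txid[p] = txid;
	return true;
}
```

Tracking the last xid per persistence level matters because a single transaction can write to permanent and unlogged undo logs, and each log chain needs its own transaction header.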
On Thu, Nov 29, 2018 at 6:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Nov 26, 2018 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > 10. > > + if (!UndoRecPtrIsValid(multi_prep_urp)) > > + urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid); > > + else > > + urecptr = multi_prep_urp; > > + > > + size = UndoRecordExpectedSize(urec); > > .. > > .. > > + if (UndoRecPtrIsValid(multi_prep_urp)) > > + { > > + UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp); > > + insert = UndoLogOffsetPlusUsableBytes(insert, size); > > + multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert); > > + } > > > > Can't we use urecptr instead of multi_prep_urp in above code after > > urecptr is initialized? > > I think we can't because urecptr is just the current pointer we are > going to return but multi_prep_urp is the static variable we need to > update so that > the next prepare can calculate urecptr from this location. > > Okay, but that was not apparent from the code, so added a comment in the attached delta patch. BTW, wouldn't it be better to move this code to the end of function once prepare for current record is complete. More comments ---------------------------- 1. * We can consider that the log as switched if /that/ needs to be removed. 2. + if (prepare_idx == max_prepared_undo) + elog(ERROR, "Already reached the maximum prepared limit."); a. /Already/already b. we don't use full-stop (.) in error 3. + * also stores the dbid and the progress of the undo apply during rollback. /the progress/ extra space. 4. +UndoSetPrepareSize(int nrecords, UnpackedUndoRecord *undorecords, + TransactionId xid, UndoPersistence upersistence) +{ nrecords should be a second parameter. 5. +UndoRecPtr +PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence, + TransactionId xid) It seems better to have xid parameter before UndoPersistence. 6. + /* FIXME: Should we just report error ? 
*/ + Assert(index < MAX_BUFFER_PER_UNDO); No need of this Fixme. 7. PrepareUndoInsert() { .. do { .. + /* Undo record can not fit into this block so go to the next block. */ + cur_blk++; .. } while (cur_size < size); .. } This comment was making me uneasy, so slightly adjusted the code. Basically, by that time it was not decided whether undo record can fit in current buffer or not. 8. + /* + * Save referenced of undo record pointer as well as undo record. + * InsertPreparedUndo will use these to insert the prepared record. + */ + prepared_undo[prepare_idx].urec = urec; + prepared_undo[prepare_idx].urp = urecptr; Slightly adjust the above comment. 9. +InsertPreparedUndo(void) { .. + Assert(prepare_idx > 0); + + /* This must be called under a critical section. */ + Assert(InRecovery || CritSectionCount > 0); .. } I have added a few more comments for above assertions, see if those are correct. 10. +InsertPreparedUndo(void) { .. + prev_undolen = undo_len; + + UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen); .. } There is no need to use an additional variable prev_undolen in the above code. I have modified the code to remove it's usage, check if that is correct. 11. + * Fetch the next undo record for given blkno, offset and transaction id (if + * valid). We need to match transaction id along with block number and offset + * because in some cases (like reuse of slot for committed transaction), we + * need to skip the record if it is modified by a transaction later than the + * transaction indicated by previous undo record. For example, consider a + * case where tuple (ctid - 0,1) is modified by transaction id 500 which + * belongs to transaction slot 0. Then, the same tuple is modified by + * transaction id 501 which belongs to transaction slot 1. Then, both the + * transaction slots are marked for reuse. Then, again the same tuple is + * modified by transaction id 502 which has used slot 0. 
Now, some
+ * transaction which has started before transaction 500 wants to traverse the
+ * chain to find visible tuple will keep on rotating infinitely between undo
+ * tuple written by 502 and 501. In such a case, we need to skip the undo
+ * tuple written by transaction 502 when we want to find the undo record
+ * indicated by the previous pointer of undo tuple written by transaction 501.
+ * Start the search from urp. Caller need to call UndoRecordRelease to
+ * release the resources allocated by this function.
+ *
+ * urec_ptr_out is undo record pointer of the qualified undo record if valid
+ * pointer is passed.
+ */
+UnpackedUndoRecord *
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr * urec_ptr_out,
+				SatisfyUndoRecordCallback callback)

The comment above UndoFetchRecord is very zheap-specific, so I have
tried to simplify it.  I think we can give such detailed examples only
when we introduce the zheap code.

Apart from the above, there are miscellaneous comments and minor code
edits in the attached delta patch.

12.
PrepareUndoInsert()
{
..
+	/*
+	 * If this is the first undo record for this top transaction add the
+	 * transaction information to the undo record.
+	 *
+	 * XXX there is also an option that instead of adding the information to
+	 * this record we can prepare a new record which only contain transaction
+	 * informations.
+	 */
+	if (xid == InvalidTransactionId)

The above comment seems to be out of place; we are doing nothing like
that here.  This work is done in UndoRecordAllocate, so maybe you can
move the 'XXX ..' part of the comment into that function.

13.
PrepareUndoInsert()
{
..
if (!UndoRecPtrIsValid(prepared_urec_ptr))
+		urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
+	else
+		urecptr = prepared_urec_ptr;
+
+	size = UndoRecordExpectedSize(urec);
..

I think we should make the above code a bit more bulletproof.  As it is
written, there is no guarantee that the size we have allocated is the
same as what we use in this function.  How about taking 'size' as an
output parameter from UndoRecordAllocate and then using it in this
function?  Additionally, we can have an Assert that the size returned
by UndoRecordAllocate is the same as UndoRecordExpectedSize.

14.
+
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int		idx;
+	int		flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int		idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}

A one-line comment atop each of these functions would be good.  It
would also be better to place these functions at the end of the file or
someplace else; right now they sit between the prepare* and insert*
functions, which makes the code flow a bit awkward.

15.
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{

Why should this step be performed in a critical section only for
persistent undo?  I think as this function operates on shared buffers,
it should be done in a critical section even for unlogged undo.

16.
+InsertPreparedUndo(void)
{
..
/* if starting a new log then there is no prevlen to store */
+	if (offset == UndoLogBlockHeaderSize)
+	{
+		if (log->meta.prevlogno != InvalidUndoLogNumber)
+		{
+			UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+
+			uur->uur_prevlen = prevlog->meta.prevlen;
+		}
..
}

The comment here suggests that for new logs we don't need prevlen, but
in one case you are still maintaining the length; can you add a few
comments to explain why?

17.
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
{
..
+		/*
+		 * FIXME: This can be optimized to just fetch header first and only if
+		 * matches with block number and offset then fetch the complete
+		 * record.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
..
}

I don't know how much it matters whether we fetch the complete record
or just its header, unless the record is big or it falls across two
pages.  I think both are boundary cases, and I couldn't see this part
much in perf profiles.  There is nothing to fix here; if you want, you
can add an XXX comment or maybe suggest it as a future optimization.

18.
+UndoFetchRecord()
{
...
+	/*
+	 * Prevent UndoDiscardOneLog() from discarding data while we try to
+	 * read it.  Usually we would acquire log->mutex to read log->meta
+	 * members, but in this case we know that discard can't move without
+	 * also holding log->discard_lock.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+	if (!UndoRecPtrIsValid(log->oldest_data))
+	{
+		/*
+		 * UndoDiscardInfo is not yet initialized.  Hence, we've to check
+		 * UndoLogIsDiscarded and if it's already discarded then we have
+		 * nothing to do.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(urp))
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check if it's already discarded. */
+	if (urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		if (BufferIsValid(urec->uur_buffer))
+			ReleaseBuffer(urec->uur_buffer);
+		return NULL;
+	}
..
}

Can't we replace this logic with UndoRecordIsValid?

19.
UndoFetchRecord()
{
..
while(true)
{
..
/*
+		 * If we have a valid buffer pinned then just ensure that we want to
+		 * find the next tuple from the same block.  Otherwise release the
+		 * buffer and set it invalid
+		 */
+		if (BufferIsValid(urec->uur_buffer))
+		{
+			/*
+			 * Undo buffer will be changed if the next undo record belongs to
+			 * a different block or undo log.
+			 */
+			if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+				(prevrnode.relNode != rnode.relNode))
+			{
+				ReleaseBuffer(urec->uur_buffer);
+				urec->uur_buffer = InvalidBuffer;
+			}
+		}
+		else
+		{
+			/*
+			 * If there is not a valid buffer in urec->uur_buffer that means
+			 * we had copied the payload data and tuple data so free them.
+			 */
+			if (urec->uur_payload.data)
+				pfree(urec->uur_payload.data);
+			if (urec->uur_tuple.data)
+				pfree(urec->uur_tuple.data);
+		}
+
+		/* Reset the urec before fetching the tuple */
+		urec->uur_tuple.data = NULL;
+		urec->uur_tuple.len = 0;
+		urec->uur_payload.data = NULL;
+		urec->uur_payload.len = 0;
+		prevrnode = rnode;
..
}

Can't we move this logic after getting a record with UndoGetOneRecord
and matching it with the callback?  It is certainly required only after
the first record, so it looks a bit odd here.  Also, if possible, can
we move it to a separate function, as this is not the main logic and it
makes the main logic difficult to follow.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
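To make the suggestion in point 19 concrete, the factored-out helper could look roughly like this standalone sketch.  All types and names here are simplified, hypothetical stand-ins (plain `free` instead of `pfree`, a counter instead of the real buffer manager), not the actual PostgreSQL API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real PostgreSQL types. */
typedef int Buffer;
#define InvalidBuffer 0

typedef struct StringData
{
	char	   *data;
	int			len;
} StringData;

typedef struct UnpackedUndoRecord
{
	Buffer		uur_buffer;
	StringData	uur_payload;
	StringData	uur_tuple;
} UnpackedUndoRecord;

static int	buffers_released = 0;

static void
ReleaseBuffer(Buffer buf)
{
	/* Stand-in for the buffer manager's unpin. */
	buffers_released++;
}

/*
 * Factored-out helper: release or keep the cached buffer depending on
 * whether the next record lives in the same block, and reset the copied
 * payload/tuple data either way.
 */
static void
ResetUndoRecord(UnpackedUndoRecord *urec, bool same_block)
{
	if (urec->uur_buffer != InvalidBuffer)
	{
		if (!same_block)
		{
			ReleaseBuffer(urec->uur_buffer);
			urec->uur_buffer = InvalidBuffer;
		}
	}
	else
	{
		/* No pinned buffer: payload/tuple were copied, so free them. */
		free(urec->uur_payload.data);
		free(urec->uur_tuple.data);
	}

	urec->uur_payload.data = NULL;
	urec->uur_payload.len = 0;
	urec->uur_tuple.data = NULL;
	urec->uur_tuple.len = 0;
}
```

With such a helper, the main fetch loop in UndoFetchRecord would only decide whether the next record is in the same block and call it, keeping the visibility-matching logic readable.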
Attachment
On Sat, Dec 1, 2018 at 12:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Nov 29, 2018 at 6:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Mon, Nov 26, 2018 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > 10.
> > > + if (!UndoRecPtrIsValid(multi_prep_urp))
> > > + urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
> > > + else
> > > + urecptr = multi_prep_urp;
> > > +
> > > + size = UndoRecordExpectedSize(urec);
> > > ..
> > > ..
> > > + if (UndoRecPtrIsValid(multi_prep_urp))
> > > + {
> > > + UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
> > > + insert = UndoLogOffsetPlusUsableBytes(insert, size);
> > > + multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
> > > + }
> > >
> > > Can't we use urecptr instead of multi_prep_urp in above code after
> > > urecptr is initialized?
> >
> > I think we can't because urecptr is just the current pointer we are
> > going to return but multi_prep_urp is the static variable we need to
> > update so that the next prepare can calculate urecptr from this
> > location.
> >
>
> Okay, but that was not apparent from the code, so added a comment in
> the attached delta patch.  BTW, wouldn't it be better to move this
> code to the end of function once prepare for current record is
> complete.
>
> More comments
> ----------------------------
> 1.
> * We can consider that the log as switched if
>
> /that/ needs to be removed.
>
> 2.
> + if (prepare_idx == max_prepared_undo)
> + elog(ERROR, "Already reached the maximum prepared limit.");
>
> a. /Already/already
> b. we don't use full-stop (.) in error

Merged your change.

> 3.
> + * also stores the dbid and the progress of the undo apply during rollback.
>
> /the progress/ extra space.

Merged your change.

> 4.
> +UndoSetPrepareSize(int nrecords, UnpackedUndoRecord *undorecords,
> + TransactionId xid, UndoPersistence upersistence)
> +{
>
> nrecords should be a second parameter.

Merged your change.

> 5.
> +UndoRecPtr
> +PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
> + TransactionId xid)
>
> It seems better to have xid parameter before UndoPersistence.

Merged your change.  Made the same change in UndoRecordAllocate as well.

> 6.
> + /* FIXME: Should we just report error ? */
> + Assert(index < MAX_BUFFER_PER_UNDO);
>
> No need of this Fixme.

Merged your change.

> 7.
> PrepareUndoInsert()
> {
> ..
> do
> {
> ..
> + /* Undo record can not fit into this block so go to the next block. */
> + cur_blk++;
> ..
> } while (cur_size < size);
> ..
> }
>
> This comment was making me uneasy, so slightly adjusted the code.
> Basically, by that time it was not decided whether undo record can fit
> in current buffer or not.

Merged your change.

> 8.
> + /*
> + * Save referenced of undo record pointer as well as undo record.
> + * InsertPreparedUndo will use these to insert the prepared record.
> + */
> + prepared_undo[prepare_idx].urec = urec;
> + prepared_undo[prepare_idx].urp = urecptr;
>
> Slightly adjust the above comment.

Merged your change.

> 9.
> +InsertPreparedUndo(void)
> {
> ..
> + Assert(prepare_idx > 0);
> +
> + /* This must be called under a critical section. */
> + Assert(InRecovery || CritSectionCount > 0);
> ..
> }
>
> I have added a few more comments for above assertions, see if those are correct.

Merged your change.

> 10.
> +InsertPreparedUndo(void)
> {
> ..
> + prev_undolen = undo_len;
> +
> + UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
> ..
> }
>
> There is no need to use an additional variable prev_undolen in the
> above code.  I have modified the code to remove it's usage, check if
> that is correct.

Looks fine to me.

> 11.
> + * Fetch the next undo record for given blkno, offset and transaction id (if
> + * valid).  We need to match transaction id along with block number and offset
> + * because in some cases (like reuse of slot for committed transaction), we
> + * need to skip the record if it is modified by a transaction later than the
> + * transaction indicated by previous undo record.  For example, consider a
> + * case where tuple (ctid - 0,1) is modified by transaction id 500 which
> + * belongs to transaction slot 0.  Then, the same tuple is modified by
> + * transaction id 501 which belongs to transaction slot 1.  Then, both the
> + * transaction slots are marked for reuse.  Then, again the same tuple is
> + * modified by transaction id 502 which has used slot 0.  Now, some
> + * transaction which has started before transaction 500 wants to traverse the
> + * chain to find visible tuple will keep on rotating infinitely between undo
> + * tuple written by 502 and 501.  In such a case, we need to skip the undo
> + * tuple written by transaction 502 when we want to find the undo record
> + * indicated by the previous pointer of undo tuple written by transaction 501.
> + * Start the search from urp.  Caller need to call UndoRecordRelease to
> + * release the resources allocated by this function.
> + *
> + * urec_ptr_out is undo record pointer of the qualified undo record if valid
> + * pointer is passed.
> + */
> +UnpackedUndoRecord *
> +UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
> + TransactionId xid, UndoRecPtr * urec_ptr_out,
> + SatisfyUndoRecordCallback callback)
>
> The comment above UndoFetchRecord is very zheap specific, so I have
> tried to simplify it.  I think we can give so much detailed examples
> only when we introduce zheap code.

Makes sense.

> Apart from above, there are miscellaneous comments and minor code
> edits in the attached delta patch.

I have merged your changes.

> 12.
> PrepareUndoInsert()
> {
> ..
> + /*
> + * If this is the first undo record for this top transaction add the
> + * transaction information to the undo record.
> + *
> + * XXX there is also an option that instead of adding the information to
> + * this record we can prepare a new record which only contain transaction
> + * informations.
> + */
> + if (xid == InvalidTransactionId)
>
> The above comment seems to be out of place, we are doing nothing like
> that here.  This work is done in UndoRecordAllocate, may be you can
> move 'XXX ..' part of the comment in that function.

Done.

> 13.
> PrepareUndoInsert()
> {
> ..
> if (!UndoRecPtrIsValid(prepared_urec_ptr))
> + urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
> + else
> + urecptr = prepared_urec_ptr;
> +
> + size = UndoRecordExpectedSize(urec);
> ..
>
> I think we should make above code a bit more bulletproof.  As it is
> written, there is no guarantee the size we have allocated is same as
> we are using in this function.

I agree.

> How about if we take 'size' as output
> parameter from UndoRecordAllocate and then use it in this function?
> Additionally, we can have an Assert that the size returned by
> UndoRecordAllocate is same as UndoRecordExpectedSize.

With this change we will be able to guarantee it when we are allocating
a single undo record, but multi-prepare will still be a problem.  I
haven't fixed this yet; I will think about how to handle both cases,
i.e. when we prepare once, and when we allocate once and prepare
multiple times.

> 14.
> +
> +void
> +RegisterUndoLogBuffers(uint8 first_block_id)
> +{
> + int idx;
> + int flags;
> +
> + for (idx = 0; idx < buffer_idx; idx++)
> + {
> + flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
> + XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
> + }
> +}
> +
> +void
> +UndoLogBuffersSetLSN(XLogRecPtr recptr)
> +{
> + int idx;
> +
> + for (idx = 0; idx < buffer_idx; idx++)
> + PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
> +}
>
> One line comment atop of these functions will be good.  It will be
> better if we place these functions at the end of file or someplace
> else, as right now they are between prepare* and insert* function
> calls which makes code flow bit awkward.

Done.

> 15.
> + * and mark them dirty.  For persistent undo, this step should be performed
> + * after entering a critical section; it should never fail.
> + */
> +void
> +InsertPreparedUndo(void)
> +{
>
> Why only for persistent undo this step should be performed in the
> critical section?  I think as this function operates on shared buffers,
> even for unlogged undo, it should be done in a critical section.

I think we can remove this comment?  I have removed it in the current
patch.

> 16.
> +InsertPreparedUndo(void)
> {
> ..
> /* if starting a new log then there is no prevlen to store */
> + if (offset == UndoLogBlockHeaderSize)
> + {
> + if (log->meta.prevlogno != InvalidUndoLogNumber)
> + {
> + UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
> +
> + uur->uur_prevlen = prevlog->meta.prevlen;
> + }
> ..
> }
>
> The comment here suggests that for new logs, we don't need prevlen,
> but still, in one case you are maintaining the length, can you add few
> comments to explain why?

Done.

> 17.
> +UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
> + UndoPersistence persistence)
> {
> ..
> + /*
> + * FIXME: This can be optimized to just fetch header first and only if
> + * matches with block number and offset then fetch the complete
> + * record.
> + */
> + if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
> + break;
> ..
> }
>
> I don't know how much it matters if we fetch complete record or just
> it's header unless the record is big or it falls in two pages.  I
> think both are boundary cases and I couldn't see this part much in
> perf profiles.  There is nothing to fix here; if you want, you can add
> an XXX comment or maybe suggest it as a future optimization.

Changed.

> 18.
> +UndoFetchRecord()
> {
> ...
> + /*
> + * Prevent UndoDiscardOneLog() from discarding data while we try to
> + * read it.  Usually we would acquire log->mutex to read log->meta
> + * members, but in this case we know that discard can't move without
> + * also holding log->discard_lock.
> + */
> + LWLockAcquire(&log->discard_lock, LW_SHARED);
> + if (!UndoRecPtrIsValid(log->oldest_data))
> + {
> + /*
> + * UndoDiscardInfo is not yet initialized.  Hence, we've to check
> + * UndoLogIsDiscarded and if it's already discarded then we have
> + * nothing to do.
> + */
> + LWLockRelease(&log->discard_lock);
> + if (UndoLogIsDiscarded(urp))
> + {
> + if (BufferIsValid(urec->uur_buffer))
> + ReleaseBuffer(urec->uur_buffer);
> + return NULL;
> + }
> +
> + LWLockAcquire(&log->discard_lock, LW_SHARED);
> + }
> +
> + /* Check if it's already discarded. */
> + if (urp < log->oldest_data)
> + {
> + LWLockRelease(&log->discard_lock);
> + if (BufferIsValid(urec->uur_buffer))
> + ReleaseBuffer(urec->uur_buffer);
> + return NULL;
> + }
> ..
> }
>
> Can't we replace this logic with UndoRecordIsValid?

Done.

> 19.
> UndoFetchRecord()
> {
> ..
> while(true)
> {
> ..
> /*
> + * If we have a valid buffer pinned then just ensure that we want to
> + * find the next tuple from the same block.  Otherwise release the
> + * buffer and set it invalid
> + */
> + if (BufferIsValid(urec->uur_buffer))
> + {
> + /*
> + * Undo buffer will be changed if the next undo record belongs to
> + * a different block or undo log.
> + */
> + if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
> + (prevrnode.relNode != rnode.relNode))
> + {
> + ReleaseBuffer(urec->uur_buffer);
> + urec->uur_buffer = InvalidBuffer;
> + }
> + }
> + else
> + {
> + /*
> + * If there is not a valid buffer in urec->uur_buffer that means
> + * we had copied the payload data and tuple data so free them.
> + */
> + if (urec->uur_payload.data)
> + pfree(urec->uur_payload.data);
> + if (urec->uur_tuple.data)
> + pfree(urec->uur_tuple.data);
> + }
> +
> + /* Reset the urec before fetching the tuple */
> + urec->uur_tuple.data = NULL;
> + urec->uur_tuple.len = 0;
> + urec->uur_payload.data = NULL;
> + urec->uur_payload.len = 0;
> + prevrnode = rnode;
> ..
> }
>
> Can't we move this logic after getting a record with UndoGetOneRecord
> and matching with a callback?  This is certainly required after the
> first record, so it looks a bit odd here.  Also, if possible can we
> move it to a separate function as this is not the main logic and makes
> the main logic difficult to follow.

Fixed.

Apart from addressing these comments, I have also rebased Thomas' undo
log patches on the current head.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
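For reference, the consolidation asked for in point 18 could look roughly like this standalone sketch.  Names and types are hypothetical stand-ins; in the real code the check would run with log->discard_lock held in shared mode, which is omitted here:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real PostgreSQL types. */
typedef unsigned long long UndoRecPtr;
#define InvalidUndoRecPtr ((UndoRecPtr) 0)

typedef struct UndoLogControl
{
	UndoRecPtr	oldest_data;	/* oldest non-discarded record, once known */
} UndoLogControl;

/* Stand-in for UndoLogIsDiscarded(): a fixed discard boundary. */
static const UndoRecPtr discard_horizon = 100;

static bool
undo_log_is_discarded(UndoRecPtr urp)
{
	return urp < discard_horizon;
}

/*
 * Single helper answering "does this undo record pointer still exist?",
 * instead of open-coding the two-step check in UndoFetchRecord().
 */
static bool
undo_record_is_valid(const UndoLogControl *log, UndoRecPtr urp)
{
	if (log->oldest_data == InvalidUndoRecPtr)
	{
		/* Discard info not initialized yet; ask the undo log layer. */
		return !undo_log_is_discarded(urp);
	}
	return urp >= log->oldest_data;
}
```

The caller then only has to release its pinned buffer and bail out when the helper returns false, which keeps the retry-after-release dance in one place.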
Attachment
On Tue, Dec 4, 2018 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Dec 1, 2018 at 12:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > 13.
> > PrepareUndoInsert()
> > {
> > ..
> > if (!UndoRecPtrIsValid(prepared_urec_ptr))
> > + urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
> > + else
> > + urecptr = prepared_urec_ptr;
> > +
> > + size = UndoRecordExpectedSize(urec);
> > ..
> >
> > I think we should make above code a bit more bulletproof.  As it is
> > written, there is no guarantee the size we have allocated is same as
> > we are using in this function.
>
> I agree
>
> > How about if we take 'size' as output
> > parameter from UndoRecordAllocate and then use it in this function?
> > Additionally, we can have an Assert that the size returned by
> > UndoRecordAllocate is same as UndoRecordExpectedSize.
>
> With this change we will be able to guarantee when we are allocating
> single undo record but multi prepare will still be a problem.  I
> haven't fix this as of now.  I will think on how to handle both the
> cases when we have to prepare one time or when we have to allocate
> once and prepare multiple time.
>

Yeah, this appears tricky.  I think we can leave it as it is unless we
come up with a better idea.

1.
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+	prev_txid[persistence] = InvalidTransactionId;
+}

Are we using this API in this or any other patch or in zheap?  I see
this can be a useful API, but unless we have a plan to use it, I don't
think we should retain it.

2.
+ * Handling multi log:
+ *
+ * It is possible that the undo record of a transaction can be spread across
+ * multiple undo log.  And, we need some special handling while inserting the
+ * undo for discard and rollback to work sanely.
+ *
+ * If the undorecord goes to next log then we insert a transaction header for
+ * the first record in the new log and update the transaction header with this
+ * new log's location.  This will allow us to connect transactions across logs
+ * when the same transaction span across log (for this we keep track of the
+ * previous logno in undo log meta) which is required to find the latest undo
+ * record pointer of the aborted transaction for executing the undo actions
+ * before discard.  If the next log get processed first in that case we
+ * don't need to trace back the actual start pointer of the transaction,
+ * in such case we can only execute the undo actions from the current log
+ * because the undo pointer in the slot will be rewound and that will be enough
+ * to avoid executing same actions.  However, there is possibility that after
+ * executing the undo actions the undo pointer got discarded, now in later
+ * stage while processing the previous log it might try to fetch the undo
+ * record in the discarded log while chasing the transaction header chain.
+ * To avoid this situation we first check if the next_urec of the transaction
+ * is already discarded then no need to access that and start executing from
+ * the last undo record in the current log.

I think I see the problem in the discard mechanism when a transaction's
undo is spread across multiple logs.  Basically, if the second log
contains undo of some transaction prior to the transaction which has
just decided to spread its undo into the chosen undo log, then we might
discard the undo of some transaction(s) inadvertently.  Am I missing
something?  If not, then I guess we need to ensure that we don't
immediately discard the undo in the second log when a single
transaction's undo is spread across two logs.

Before choosing a new undo log to span the undo for a transaction, do
we ensure that it is not already linked with some other undo log for a
similar reason?

One more thing in this regard:
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId txid, UndoPersistence upersistence)
{
..
..
+	if (InRecovery)
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+	else
+		urecptr = UndoLogAllocate(size, upersistence);
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+	/*
+	 * By now, we must be attached to some undo log unless we are in recovery.
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * We can consider the log as switched if this is the first record of the
+	 * log and not the first record of the transaction i.e. same transaction
+	 * continued from the previous log.
+	 */
+	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+		log->meta.prevlogno != InvalidUndoLogNumber)
+		log_switched = true;
..
..
}

Isn't there a hidden assumption in the above code that you will always
get a fresh undo log if the undo doesn't fit in the currently attached
log?  What guarantees that?

3.
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
+					 UndoPersistence);
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PreparedUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);

This text is a duplicate of what is mentioned in the .c file, so I have
removed it in the delta patch.  Similarly, I have removed the duplicate
text atop other functions exposed via undorecord.h.

4.
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);

This API is nowhere defined or used.  What is the intention?

5.
+typedef struct UndoRecordHeader
+{
..
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older then RecentGlobalXmin, then we can consider the tuple
+	 * in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
..

/then/than

I think we need to mention oldestXidWithEpochHavingUndo instead of
RecentGlobalXmin.

6.
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS 0x01

I have expanded these comments in the attached delta patch.  I think we
should remove the define UREC_INFO_PAYLOAD_CONTAINS_SLOT from the
patch, as this is zheap-specific and should be added later along with
the zheap code.

7.
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	ForkNumber	urec_fork;		/* fork number */
+} UndoRecordRelationDetails;

This comment seems to be out-dated, so I have modified it in the
attached delta patch.

8.
+typedef struct UndoRecordTransaction
+{
+	uint32		urec_progress;	/* undo applying progress. */
+	uint32		urec_xidepoch;	/* epoch of the current transaction */

Can you expand the comments about how the progress is defined and used?
Also, write a few sentences about why the epoch is captured and/or
used?

9.
+#define urec_next_pos \
+	(SizeOfUndoRecordTransaction - SizeOfUrecNext)

What is its purpose?

10.
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);

Again, I see duplicate text in the .h and .c files, so I have removed
this and similar comments from the .h files.  I have tried to move some
parts of the comments from .h to .c files, so that it is easier to read
in one place rather than referring to two places.  See if I have missed
anything.

Apart from the above, I have made a few other cosmetic changes in the
attached delta patch; if you like them, kindly include them in the main
patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
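To illustrate the layout convention in point 6: since the optional structures always follow in the order the flag bits are defined, the expected record size can be computed by walking the bits in that order.  A simplified standalone sketch (the sizes here are illustrative placeholders, not the real struct sizes):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as in the patch; the section sizes are made up. */
#define UREC_INFO_RELATION_DETAILS	0x01
#define UREC_INFO_BLOCK				0x02
#define UREC_INFO_PAYLOAD			0x04
#define UREC_INFO_TRANSACTION		0x08

enum
{
	HEADER_SIZE = 16,
	RELDETAILS_SIZE = 4,
	BLOCK_SIZE = 8,
	PAYLOAD_SIZE = 8,
	TRANSACTION_SIZE = 24
};

/*
 * Because the optional structures always appear in the order in which
 * their flag bits are defined, the total on-disk size is just the
 * header plus the sizes of whichever sections uur_info says are present.
 */
static size_t
record_expected_size(uint8_t uur_info)
{
	size_t		size = HEADER_SIZE;

	if (uur_info & UREC_INFO_RELATION_DETAILS)
		size += RELDETAILS_SIZE;
	if (uur_info & UREC_INFO_BLOCK)
		size += BLOCK_SIZE;
	if (uur_info & UREC_INFO_PAYLOAD)
		size += PAYLOAD_SIZE;
	if (uur_info & UREC_INFO_TRANSACTION)
		size += TRANSACTION_SIZE;
	return size;
}
```

The unpack side can walk the same bits in the same order, which is why the "structures appear in the order the constants are defined" rule matters.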
Attachment
On Sat, Dec 8, 2018 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Dec 4, 2018 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Sat, Dec 1, 2018 at 12:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> >
> > 13.
> > PrepareUndoInsert()
> > {
> > ..
> > if (!UndoRecPtrIsValid(prepared_urec_ptr))
> > + urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
> > + else
> > + urecptr = prepared_urec_ptr;
> > +
> > + size = UndoRecordExpectedSize(urec);
> > ..
> >
> > I think we should make above code a bit more bulletproof. As it is
> > written, there is no guarantee the size we have allocated is same as
> > we are using in this function.
> I agree
> How about if we take 'size' as output
> > parameter from UndoRecordAllocate and then use it in this function?
> > Additionally, we can have an Assert that the size returned by
> > UndoRecordAllocate is same as UndoRecordExpectedSize.
> With this change we will be able to guarantee when we are allocating
> single undo record
> but multi prepare will still be a problem. I haven't fix this as of
> now. I will think on how
> to handle both the cases when we have to prepare one time or when we
> have to allocate
> once and prepare multiple time.
>
Yeah, this appears tricky. I think we can leave it as it is unless we
come up with a better idea.
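For the single-record case, the cross-check discussed above could look roughly like this standalone sketch (hypothetical, simplified names; not the actual PostgreSQL API):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real PostgreSQL types. */
typedef unsigned long long UndoRecPtr;

typedef struct UnpackedUndoRecord
{
	size_t		payload_len;	/* illustrative variable-size part */
} UnpackedUndoRecord;

#define UNDO_RECORD_HEADER_SIZE 32

/* Model of UndoRecordExpectedSize(). */
static size_t
undo_record_expected_size(const UnpackedUndoRecord *urec)
{
	return UNDO_RECORD_HEADER_SIZE + urec->payload_len;
}

/*
 * Model of UndoRecordAllocate() with 'size' as an output parameter, so
 * the caller can cross-check the allocation against the size it is
 * about to write.
 */
static UndoRecPtr
undo_record_allocate(const UnpackedUndoRecord *urec, size_t *size_out)
{
	static UndoRecPtr insert_point = 1000;	/* fake allocation cursor */
	UndoRecPtr	urecptr = insert_point;

	*size_out = undo_record_expected_size(urec);
	insert_point += *size_out;
	return urecptr;
}

static UndoRecPtr
prepare_undo_insert(const UnpackedUndoRecord *urec)
{
	size_t		alloc_size;
	UndoRecPtr	urecptr = undo_record_allocate(urec, &alloc_size);

	/* The allocated size must match what we are about to write. */
	assert(alloc_size == undo_record_expected_size(urec));
	return urecptr;
}
```

The multi-prepare case would additionally need the sum of the per-record expected sizes to match the one up-front allocation, which is the part that remains open above.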
1.
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+ prev_txid[persistence] = InvalidTransactionId;
+}
Are we using this API in this or any other patch or in zheap? I see
this can be a useful API, but unless we have a plan to use it, I don't
think we should retain it.
Currently, we are not using it, so I have removed it.
2.
+ * Handling multi log:
+ *
+ * It is possible that the undo record of a transaction can be spread across
+ * multiple undo log. And, we need some special handling while inserting the
+ * undo for discard and rollback to work sanely.
+ *
+ * If the undorecord goes to next log then we insert a transaction header for
+ * the first record in the new log and update the transaction header with this
+ * new log's location. This will allow us to connect transactions across logs
+ * when the same transaction span across log (for this we keep track of the
+ * previous logno in undo log meta) which is required to find the latest undo
+ * record pointer of the aborted transaction for executing the undo actions
+ * before discard. If the next log get processed first in that case we
+ * don't need to trace back the actual start pointer of the transaction,
+ * in such case we can only execute the undo actions from the current log
+ * because the undo pointer in the slot will be rewound and that
will be enough
+ * to avoid executing same actions. However, there is possibility that after
+ * executing the undo actions the undo pointer got discarded, now in later
+ * stage while processing the previous log it might try to fetch the undo
+ * record in the discarded log while chasing the transaction header chain.
+ * To avoid this situation we first check if the next_urec of the transaction
+ * is already discarded then no need to access that and start executing from
+ * the last undo record in the current log.
I think I see the problem in the discard mechanism when a transaction's
undo is spread across multiple logs. Basically, if the second log
contains undo of some transaction prior to the transaction which has
just decided to spread its undo into the chosen undo log, then we might
discard the undo of some transaction(s) inadvertently. Am I
missing something?
Actually, I don't see exactly this problem here, because we only process one undo log at a time, so we will not go to the next undo log and discard some transaction for which we are supposed to retain the undo.
If not, then I guess we need to ensure that we
don't immediately discard the undo in the second log when a single
transaction's undo is spread across two logs.
Before choosing a new undo log to span the undo for a transaction, do
we ensure that it is not already linked with some other undo log for a
similar reason?
One more thing in this regard:
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+ TransactionId txid, UndoPersistence upersistence)
{
..
..
+ if (InRecovery)
+ urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+ else
+ urecptr = UndoLogAllocate(size, upersistence);
+
+ log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+ /*
+ * By now, we must be attached to some undo log unless we are in recovery.
+ */
+ Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+ /*
+ * We can consider the log as switched if this is the first record of the
+ * log and not the first record of the transaction i.e. same transaction
+ * continued from the previous log.
+ */
+ if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+ log->meta.prevlogno != InvalidUndoLogNumber)
+ log_switched = true;
..
..
}
Isn't there a hidden assumption in the above code that you will always
get a fresh undo log if the undo doesn't fit in the currently attached
log? What guarantees that?
Yeah, it's a problem: we might get an undo log which is not empty. One way to avoid this could be that, instead of relying on the check "UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize", we add a flag to the undo log metadata telling whether this is the first record after attach, and decide based on that. But I want to think about a better solution where we can identify this without adding anything extra to the undo meta.
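The flag idea could work roughly like this standalone sketch (hypothetical names and fields, not the actual undo log metadata):

```c
#include <assert.h>
#include <stdbool.h>

#define InvalidUndoLogNumber (-1)

/*
 * Model of the proposed fix: record explicitly whether the next record
 * is the first one written after attaching to this log, instead of
 * inferring the log switch from the record's offset in the log.
 */
typedef struct UndoLogMeta
{
	int			prevlogno;				/* previous log of same transaction */
	bool		first_rec_after_attach; /* proposed new metadata flag */
} UndoLogMeta;

/*
 * Decide whether this insert continues a transaction from a previous
 * log.  Only the first insert after attach can observe the switch; the
 * flag is consumed here.
 */
static bool
log_switched_for_insert(UndoLogMeta *meta)
{
	bool		switched = meta->first_rec_after_attach &&
						   meta->prevlogno != InvalidUndoLogNumber;

	meta->first_rec_after_attach = false;
	return switched;
}
```

This stays correct even when the newly chosen log is not empty, since the decision no longer depends on the insert offset being exactly UndoLogBlockHeaderSize.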
3.
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert. Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
+ UndoPersistence);
+
+/*
+ * Insert a previously-prepared undo record. This will write the actual undo
+ * record into the buffers already pinned and locked in PreparedUndoInsert,
+ * and mark them dirty. For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);
This text is duplicate of what is mentioned in .c file, so I have
removed it in delta patch. Similarly, I have removed duplicate text
atop other functions exposed via undorecord.h
4.
+/*
+ * Forget about any previously-prepared undo record. Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);
This API is nowhere defined or used. What is the intention?
Not required
5.
+typedef struct UndoRecordHeader
+{
..
+ /*
+ * Transaction id that has modified the tuple present in this undo record.
+ * If this is older then RecentGlobalXmin, then we can consider the tuple
+ * in this undo record as visible.
+ */
+ TransactionId urec_prevxid;
..
/then/than
Done
I think we need to mention oldestXidWithEpochHavingUndo instead of
RecentGlobalXmin.
Merged
6.
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here. That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS 0x01
I have expanded these comments in the attached delta patch.
Merged
I think we
should remove the define UREC_INFO_PAYLOAD_CONTAINS_SLOT from the
patch as this is zheap specific and should be added later along with
the zheap code.
Yeah we can, so removed in my new patch.
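As the quoted comment says, the optional structures follow the fixed header in the same order in which the UREC_INFO_* flag constants are defined, so both packing and unpacking can simply walk the flag bits in order. A toy sketch of how record sizing can exploit that invariant (the sizes used here are made up for illustration, not the real struct sizes):

```c
#include <stdint.h>
#include <stddef.h>

/* Flag bits as in the quoted header; the byte sizes below are illustrative. */
#define UREC_INFO_RELATION_DETAILS 0x01
#define UREC_INFO_BLOCK            0x02
#define UREC_INFO_PAYLOAD          0x04

/*
 * Because optional structures are laid out in the order the flag constants
 * are defined, computing the record size (and later unpacking the record)
 * is just a walk over the bits in ascending order.
 */
static size_t
optional_headers_size(uint8_t uur_info)
{
    size_t size = 0;

    if (uur_info & UREC_INFO_RELATION_DETAILS)
        size += 4;              /* e.g. fork number */
    if (uur_info & UREC_INFO_BLOCK)
        size += 12;             /* e.g. block number and offset */
    if (uur_info & UREC_INFO_PAYLOAD)
        size += 8;              /* e.g. payload and tuple lengths */
    return size;
}
```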
7.
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number. If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+ ForkNumber urec_fork; /* fork number */
+} UndoRecordRelationDetails;
This comment seems to be outdated, so I modified it in the attached delta patch.
Merged
8.
+typedef struct UndoRecordTransaction
+{
+ uint32 urec_progress; /* undo applying progress. */
+ uint32 urec_xidepoch; /* epoch of the current transaction */
Can you expand comments about how the progress is defined and used?
Moved your comment from UnpackedUndoRecord to this structure, and in
UnpackedUndoRecord I have mentioned that the detailed comment can be found
in this structure.
Also, write a few sentences about why epoch is captured and or used?
urec_xidepoch is captured mainly for zheap visibility purposes, so isn't it better to mention it there?
9.
+#define urec_next_pos \
+ (SizeOfUndoRecordTransaction - SizeOfUrecNext)
What is its purpose?
It's not required, so removed.
10.
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record. Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
Again, I see duplicate text in the .h and .c files, so I removed this and
similar comments from the .h files. I have also tried to move some parts of
the comments from the .h files to the .c files, so that they can be read in
one place rather than referred to in two. See if I have missed anything.
Apart from the above, I have made a few other cosmetic changes in the
attached delta patch; see if you like those, and kindly include them in
the main patch.
Done
Attachment
On Wed, Dec 12, 2018 at 11:18 AM Dilip Kumar <dilip.kumar@enterprisedb.com> wrote:
>
> On Sat, Dec 8, 2018 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> I think I see the problem in the discard mechanism when the log is
>> spread across multiple logs. Basically, if the second log contains
>> undo of some transaction prior to the transaction which has just
>> decided to spread its undo in the chosen undo log, then we might
>> discard the undo log of some transaction(s) inadvertently. Am I
>> missing something?
>
> Actually, I don't see exactly this problem here, because we only process
> one undo log at a time, so we will not go to the next undo log and
> discard some transaction for which we are supposed to retain the undo.
>

How will rollbacks work for such a case? I have more to say about
this, see below.

>>
>> If not, then I guess we need to ensure that we
>> don't immediately discard the undo in the second log when a single
>> transaction's undo is spread across two logs.
>>
>> Before choosing a new undo log to span the undo for a transaction, do
>> we ensure that it is not already linked with some other undo log for a
>> similar reason?
>>

You seem to have forgotten to answer this.

>> One more thing in this regard:
>> +UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
>> +                   TransactionId txid, UndoPersistence upersistence)
..
>>
>> Isn't there a hidden assumption in the above code that you will always
>> get a fresh undo log if the undo doesn't fit in the currently attached
>> log? What is the guarantee of same?
>
> Yeah, it's a problem, we might get an undo log which may not be empty.
> One way to avoid this could be that instead of relying on the check
> "UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize", we can add
> some flag to the undo log metadata saying whether it's the first record
> after attach or not, and based on that we can decide. But I want to
> think of some better solution where we can identify it without adding
> anything extra to the undo meta.
>

I think what we need to determine here is whether we have switched the
log for some non-first record of the transaction. If so, can't we
detect it by something like:

log = GetLatestUndoLogAmAttachedTo();
UndoLogAllocate();
if (!need_xact_hdr)
{
    currnet_log = GetLatestUndoLogAmAttachedTo();
    if (currnet_log is not same as log)
    {
        Assert(currnet_log->meta.prevlogno == log->logno);
        log_switched = true;
    }
}

Won't a similar problem happen when we read undo records during
rollback? In the code below:

+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+   UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+   UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+   /*
+    * We have reached the first undo record of this undo log, so fetch the
+    * previous undo record of the transaction from the previous log.
+    */
+   if (offset == UndoLogBlockHeaderSize)
+   {
+       UndoLogControl *prevlog,

We seem to be assuming here that the new log starts from the
beginning. IIUC, we can read the record of some other transaction if
the transaction's log spans two logs and the second log contains some
other transaction's log at the beginning.

>>
>> 8.
>> +typedef struct UndoRecordTransaction
>> +{
>> + uint32 urec_progress; /* undo applying progress. */
>> + uint32 urec_xidepoch; /* epoch of the current transaction */
>>
>> Can you expand comments about how the progress is defined and used?
>
> Moved your comment from UnpackedUndoRecord to this structure, and in
> UnpackedUndoRecord I have mentioned that the detailed comment can be
> found in this structure.
>
>> Also, write a few sentences about why epoch is captured and or used?
>
> urec_xidepoch is captured mainly for zheap visibility purposes, so
> isn't it better to mention it there?
>

Okay, you can leave it here as it is. One small point about this structure:

+ uint64 urec_next; /* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+ (offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)

Isn't it better to define urec_next as UndoRecPtr, even though it is
technically the same as per the current code?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Dec 12, 2018 at 3:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Dec 12, 2018 at 11:18 AM Dilip Kumar
> <dilip.kumar@enterprisedb.com> wrote:
> >
> > On Sat, Dec 8, 2018 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >>
> >> I think I see the problem in the discard mechanism when the log is
> >> spread across multiple logs. Basically, if the second log contains
> >> undo of some transaction prior to the transaction which has just
> >> decided to spread its undo in the chosen undo log, then we might
> >> discard the undo log of some transaction(s) inadvertently. Am I
> >> missing something?
> >
> > Actually, I don't see exactly this problem here, because we only process
> > one undo log at a time, so we will not go to the next undo log and
> > discard some transaction for which we are supposed to retain the undo.
> >
>
> How will rollbacks work for such a case? I have more to say about
> this, see below.

Yeah, I agree rollback will have the problem.

> >>
> >> If not, then I guess we need to ensure that we
> >> don't immediately discard the undo in the second log when a single
> >> transaction's undo is spread across two logs.
> >>
> >> Before choosing a new undo log to span the undo for a transaction, do
> >> we ensure that it is not already linked with some other undo log for a
> >> similar reason?
> >>
>
> You seem to have forgotten to answer this.

In the current patch I have removed the concept of prevlogno in the undo
log meta.

> >> One more thing in this regard:
> >> +UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
> >> +                   TransactionId txid, UndoPersistence upersistence)
> ..
> >>
> >> Isn't there a hidden assumption in the above code that you will always
> >> get a fresh undo log if the undo doesn't fit in the currently attached
> >> log? What is the guarantee of same?
> >
> > Yeah, it's a problem, we might get an undo log which may not be empty.
> > One way to avoid this could be that instead of relying on the check
> > "UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize", we can add
> > some flag to the undo log metadata saying whether it's the first record
> > after attach or not, and based on that we can decide. But I want to
> > think of some better solution where we can identify it without adding
> > anything extra to the undo meta.
> >
>
> I think what we need to determine here is whether we have switched the
> log for some non-first record of the transaction. If so, can't we
> detect it by something like:
>
> log = GetLatestUndoLogAmAttachedTo();
> UndoLogAllocate();
> if (!need_xact_hdr)
> {
>     currnet_log = GetLatestUndoLogAmAttachedTo();
>     if (currnet_log is not same as log)
>     {
>         Assert(currnet_log->meta.prevlogno == log->logno);
>         log_switched = true;
>     }
> }
>
> Won't a similar problem happen when we read undo records during
> rollback? In the code below:
>
> +UndoRecPtr
> +UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
> +{
> +   UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
> +   UndoLogOffset offset = UndoRecPtrGetOffset(urp);
> +
> +   /*
> +    * We have reached the first undo record of this undo log, so fetch the
> +    * previous undo record of the transaction from the previous log.
> +    */
> +   if (offset == UndoLogBlockHeaderSize)
> +   {
> +       UndoLogControl *prevlog,
>
> We seem to be assuming here that the new log starts from the
> beginning. IIUC, we can read the record of some other transaction if
> the transaction's log spans two logs and the second log contains some
> other transaction's log at the beginning.

For addressing these issues related to multiple logs I have changed the
design as we discussed offlist:

1) Now, at Do time we identify the log switch as you mentioned above
(identify which log we are attached to before and after allocate).
And, if the log is switched, we write a WAL record for it; during
recovery, whenever this WAL is replayed, we store the undo record
pointer of the transaction header (which is in the previous undo log)
in UndoLogStateData, read it while allocating space for the undo
record, and immediately reset it.

2) For handling the discard issue, along with updating the current
transaction's start header in the previous undo log, we also update the
previous transaction's start header in the current log if we get
assigned a non-empty undo log.

3) For identifying the previous undo record of the transaction during
rollback (when the undo log is switched), we store the undo record
pointer of the transaction's last record (in the previous undo log) in
the transaction header of the first undo record in the new undo log.

> >>
> >> 8.
> >> +typedef struct UndoRecordTransaction
> >> +{
> >> + uint32 urec_progress; /* undo applying progress. */
> >> + uint32 urec_xidepoch; /* epoch of the current transaction */
> >>
> >> Can you expand comments about how the progress is defined and used?
> >
> > Moved your comment from UnpackedUndoRecord to this structure, and in
> > UnpackedUndoRecord I have mentioned that the detailed comment can be
> > found in this structure.
> >
> >> Also, write a few sentences about why epoch is captured and or used?
> >
> > urec_xidepoch is captured mainly for zheap visibility purposes, so
> > isn't it better to mention it there?
> >
>
> Okay, you can leave it here as it is. One small point about this structure:
>
> + uint64 urec_next; /* urec pointer of the next transaction */
> +} UndoRecordTransaction;
> +
> +#define SizeOfUrecNext (sizeof(UndoRecPtr))
> +#define SizeOfUndoRecordTransaction \
> + (offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
>
> Isn't it better to define urec_next as UndoRecPtr, even though it is
> technically the same as per the current code?

While replying I noticed that I haven't addressed this comment; I will
handle it in the next patch. I have to change this in a couple of
places. Handling multiple logs needed some changes in the
undo-log-manager patch, so I'm attaching the updated version of the
undo-log patches as well.

Commit on which patches were created: bf491a9073e12ce1fc3e6facd0ae1308534df570

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sun, Dec 23, 2018 at 3:49 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Wed, Dec 12, 2018 at 3:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> For addressing these issues related to multiple logs I have changed the
> design as we discussed offlist.
> 1) Now, at Do time we identify the log switch as you mentioned above
> (identify which log we are attached to before and after allocate).
> And, if the log is switched we write a WAL record for the same, and
> during recovery whenever this WAL is replayed we store the undo record
> pointer of the transaction header (which is in the previous undo log)
> in UndoLogStateData and read it while allocating space for the undo
> record and immediately reset it.
>
> 2) For handling the discard issue, along with updating the current
> transaction's start header in the previous undo log we also update the
> previous transaction's start header in the current log if we get
> assigned with the non-empty undo log.
>
> 3) For identifying the previous undo record of the transaction during
> rollback (when undo log is switched), we store the transaction's last
> record's (in previous undo log) undo record pointer in the transaction
> header of the first undo record in the new undo log.
>

Thanks, the new changes look mostly okay to me, but I have a few comments:

1.
+ /*
+ * WAL log, for log switch. This is required to identify the log switch
+ * during recovery.
+ */
+ if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
+ {
+ XLogBeginInsert();
+ XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
+ XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
+ }
+

Don't we want to do this under a critical section?

2.
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+ TransactionId txid, UndoPersistence upersistence)
{
..
+ if (log_switched)
+ {
+ /*
+ * If undo log is switched then during rollback we can not go
+ * to the previous undo record of the transaction by prevlen
+ * so we store the previous undo record pointer in the
+ * transaction header.
+ */
+ log = UndoLogGet(prevlogno, false);
+ urec->uur_prevurp = MakeUndoRecPtr(prevlogno,
+ log->meta.insert - log->meta.prevlen);
+ }
..
}

Can we have an Assert for a valid prevlogno in the above condition?

> > + uint64 urec_next; /* urec pointer of the next transaction */
> > +} UndoRecordTransaction;
> > +
> > +#define SizeOfUrecNext (sizeof(UndoRecPtr))
> > +#define SizeOfUndoRecordTransaction \
> > + (offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
> >
> > Isn't it better to define urec_next as UndoRecPtr, even though it is
> > technically the same as per the current code?
>
> While replying I noticed that I haven't address this comment, I will
> handle this in next patch. I have to change this at couple of place.
>

Okay, I think the new variable (uur_prevurp) introduced by this
version of the patch also needs to be changed in a similar way.

Apart from the above, I have made quite a few cosmetic changes and
modified a few comments; most notably, I have updated the comments
related to the handling of multiple logs at the beginning of the
undoinsert.c file. Kindly include these changes in your next
patchset, if they look okay to you.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Jan 1, 2019 at 4:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Thanks, the new changes look mostly okay to me, but I have few comments:
>
> 1.
> + /*
> + * WAL log, for log switch. This is required to identify the log switch
> + * during recovery.
> + */
> + if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
> + {
> + XLogBeginInsert();
> + XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
> + XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
> + }
> +
>
> Don't we want to do this under critical section?

I think we are not making any buffer changes here and are just inserting
a WAL record, so IMHO we don't need any critical section. Am I missing
something?

> 2.
> +UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
> + TransactionId txid, UndoPersistence upersistence)
> {
> ..
> + if (log_switched)
> + {
> + /*
> + * If undo log is switched then during rollback we can not go
> + * to the previous undo record of the transaction by prevlen
> + * so we store the previous undo record pointer in the
> + * transaction header.
> + */
> + log = UndoLogGet(prevlogno, false);
> + urec->uur_prevurp = MakeUndoRecPtr(prevlogno,
> + log->meta.insert - log->meta.prevlen);
> + }
> ..
> }
>
> Can we have an Assert for a valid prevlogno in the above condition?

Done

> > > + uint64 urec_next; /* urec pointer of the next transaction */
> > > +} UndoRecordTransaction;
> > > +
> > > +#define SizeOfUrecNext (sizeof(UndoRecPtr))
> > > +#define SizeOfUndoRecordTransaction \
> > > + (offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
> > >
> > > Isn't it better to define urec_next as UndoRecPtr, even though it is
> > > technically the same as per the current code?
> >
> > While replying I noticed that I haven't address this comment, I will
> > handle this in next patch. I have to change this at couple of place.
>
> Okay, I think the new variable (uur_prevurp) introduced by this
> version of the patch also needs to be changed in a similar way.

Done

> Apart from the above, I have made quite a few cosmetic changes and
> modified few comments, most notably, I have updated the comments
> related to the handling of multiple logs at the beginning of
> undoinsert.c file. Kindly, include these changes in your next
> patchset, if they look okay to you.

I have taken all changes except this one:

if (xact_urec_info_idx > 0)
{
-   int i = 0;
+   int         i = 0;         --> pgindent changed it back to the above one.

    for (i = 0; i < xact_urec_info_idx; i++)
-       UndoRecordUpdateTransInfo(i);
+        UndoRecordUpdateTransInfo(i);    -- This introduces an extra space, so I ignored it.
}

The undo-log patches needed a rebase, so I have done that as well along
with the changes mentioned above.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Sat, Jan 5, 2019 at 11:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Jan 1, 2019 at 4:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Thanks, the new changes look mostly okay to me, but I have few comments:
> > 1.
> > + /*
> > + * WAL log, for log switch. This is required to identify the log switch
> > + * during recovery.
> > + */
> > + if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
> > + {
> > + XLogBeginInsert();
> > + XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
> > + XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
> > + }
> > +
> >
> > Don't we want to do this under critical section?
>
> I think we are not making any buffer changes here and just inserting a
> WAL, IMHO we don't need any critical section. Am I missing
> something?
>

No, you are correct.

Few more comments:
--------------------------------
1.
+ * undorecord.c
+ * encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group

Change the year in Copyright notice for all new files?

2.
+ * This function sets uur->uur_info as a side effect.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+ int starting_byte, int *already_written, bool header_only)

Is the above part of comment still correct? I don't see uur_info being set here.

3.
+ work_txn.urec_next = uur->uur_next;
+ work_txn.urec_xidepoch = uur->uur_xidepoch;
+ work_txn.urec_progress = uur->uur_progress;
+ work_txn.urec_prevurp = uur->uur_prevurp;
+ work_txn.urec_dbid = uur->uur_dbid;

It would be better if we initialize these members in the order in
which they appear in the actual structure. All other undo header
structures are initialized that way, so this looks out-of-place.

4.
+ * 'my_bytes_written' is a pointer to the count of previous-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out. We must update it for the bytes we write.
+ *
..
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+ char **writeptr, char *endptr,
+ int *my_bytes_written, int *total_bytes_written)
+{
..
+
+ /* Update bookkeeeping infrormation. */
+ *writeptr += can_write;
+ *total_bytes_written += can_write;
+ *my_bytes_written = 0;

I don't understand the above comment where it is written: "We must
update it for the bytes we write.". We always set 'my_bytes_written'
as 0 if we write. Can you clarify? I guess this part of the comment
is about total_bytes_written or here does it mean to say that caller
should update it. I think some wording change might be required based
on what we intend to say here.

Similar to above, there is a confusion in the description of
my_bytes_read atop ReadUndoBytes.

5.
+uint32
+GetEpochForXid(TransactionId xid)
{
..
+ /*
+ * Xid can be on either side when near wrap-around. Xid is certainly
+ * logically later than ckptXid.
..

From the usage of this function in the patch, can we say that Xid is
always later than ckptxid, if so, how? Also, I think you previously
told in this thread that usage of uur_xidepoch is mainly for zheap, so
we might want to postpone the inclusion of this undo records. On
thinking again, I think we should follow your advice as I think the
correct usage here would require the patch by Thomas to fix our epoch
stuff [1]? Am, I correct, if so, I think we should postpone it for
now.

6.
/*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
{
..
}

I think you can use some comments atop this function to explain the
usage of this function or how will callers use it.

I am done with the first level of code-review for this patch. I am
sure we might need few interface changes here and there while
integrating and testing this with other patches, but the basic idea
and code look reasonable to me. I have modified the proposed commit
message in the attached patch, see if that looks fine to you.

To be clear, this patch can't be independently committed/tested, we
need undo logs and undo worker machinery patches to be ready as well.
I will review those next.

[1] - https://www.postgresql.org/message-id/CAEepm%3D2YYAtkSnens%3DTR2S%3DoRcAF9%3D2P7GPMK0wMJtxKF1QRig%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Jan 8, 2019 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
Few more comments:
--------------------------------
Few comments:
----------------
1.
+ * undorecord.c
+ * encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
Change the year in Copyright notice for all new files?
Done
2.
+ * This function sets uur->uur_info as a side effect.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+ int starting_byte, int *already_written, bool header_only)
Is the above part of comment still correct? I don't see uur_info being set here.
Changed
3.
+ work_txn.urec_next = uur->uur_next;
+ work_txn.urec_xidepoch = uur->uur_xidepoch;
+ work_txn.urec_progress = uur->uur_progress;
+ work_txn.urec_prevurp = uur->uur_prevurp;
+ work_txn.urec_dbid = uur->uur_dbid;
It would be better if we initialize these members in the order in
which they appear in the actual structure. All other undo header
structures are initialized that way, so this looks out-of-place.
Fixed
4.
+ * 'my_bytes_written' is a pointer to the count of previous-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out. We must update it for the bytes we write.
+ *
..
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+ char **writeptr, char *endptr,
+ int *my_bytes_written, int *total_bytes_written)
+{
..
+
+ /* Update bookkeeeping infrormation. */
+ *writeptr += can_write;
+ *total_bytes_written += can_write;
+ *my_bytes_written = 0;
I don't understand the above comment where it is written: "We must
update it for the bytes we write.". We always set 'my_bytes_written'
as 0 if we write. Can you clarify? I guess this part of the comment
is about total_bytes_written or here does it mean to say that caller
should update it. I think some wording change might be required based
on what we intend to say here.
Similar to above, there is a confusion in the description of
my_bytes_read atop ReadUndoBytes.
Fixed
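For what it's worth, the bookkeeping under discussion can be illustrated with a stand-alone sketch: a partial write leaves 'my_bytes_written' recording how far into the current structure we got, so the retry on the next page resumes from there, while a complete write resets it to 0 so the next structure in the record starts from scratch. This is a simplified model of the quoted InsertUndoBytes(), with assumed semantics, not the actual patch code.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Write up to (endptr - *writeptr) bytes of 'sourceptr' into the current
 * page, resuming from *my_bytes_written.  Returns true once the whole
 * structure has been written.
 */
static bool
insert_undo_bytes(const char *sourceptr, int sourcelen,
                  char **writeptr, char *endptr,
                  int *my_bytes_written, int *total_bytes_written)
{
    int remaining = sourcelen - *my_bytes_written;
    int can_write = (int) (endptr - *writeptr);

    if (can_write > remaining)
        can_write = remaining;

    memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
    *writeptr += can_write;
    *total_bytes_written += can_write;

    if (can_write < remaining)
    {
        /* Page full: remember our progress for the retry on the next page. */
        *my_bytes_written += can_write;
        return false;
    }
    *my_bytes_written = 0;      /* structure fully written; reset for next */
    return true;
}
```

With this reading, the comment's "we must update it" refers to the partial-write path; the reset to 0 on the complete-write path is what lets the same counter be reused for the following structure.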
5.
+uint32
+GetEpochForXid(TransactionId xid)
{
..
+ /*
+ * Xid can be on either side when near wrap-around. Xid is certainly
+ * logically later than ckptXid.
..
From the usage of this function in the patch, can we say that Xid is
always later than ckptxid, if so, how? Also, I think you previously
told in this thread that usage of uur_xidepoch is mainly for zheap, so
we might want to postpone the inclusion of this undo records. On
thinking again, I think we should follow your advice as I think the
correct usage here would require the patch by Thomas to fix our epoch
stuff [1]? Am, I correct, if so, I think we should postpone it for
now.
Removed
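Although the epoch code has now been removed from the patch, the wraparound adjustment being discussed can be sketched as follows: if an xid is known to be logically later than the checkpoint's xid but is numerically smaller, the 32-bit counter must have wrapped since the checkpoint, so the xid belongs to the next epoch. The function name and comparison rule here are illustrative assumptions, not the patch's GetEpochForXid().

```c
#include <stdint.h>

/*
 * Hypothetical sketch: given that 'xid' is logically later than the
 * checkpoint xid, a numerically smaller value means the 32-bit xid counter
 * wrapped around after the checkpoint, so bump the epoch by one.
 */
static uint32_t
epoch_for_xid(uint32_t xid, uint32_t ckpt_xid, uint32_t ckpt_epoch)
{
    uint32_t epoch = ckpt_epoch;

    if (xid < ckpt_xid)
        epoch++;                /* xid was allocated after a wraparound */
    return epoch;
}
```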
6.
/*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
{
..
}
I think you can use some comments atop this function to explain the
usage of this function or how will callers use it.
Done
I am done with the first level of code-review for this patch. I am
sure we might need few interface changes here and there while
integrating and testing this with other patches, but the basic idea
and code look reasonable to me. I have modified the proposed commit
message in the attached patch, see if that looks fine to you.
To be clear, this patch can't be independently committed/tested, we
need undo logs and undo worker machinery patches to be ready as well.
I will review those next.
Makes sense
[1] - https://www.postgresql.org/message-id/CAEepm%3D2YYAtkSnens%3DTR2S%3DoRcAF9%3D2P7GPMK0wMJtxKF1QRig%40mail.gmail.com
Attachment
On Wed, Jan 9, 2019 at 11:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Jan 8, 2019 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
3.
+ work_txn.urec_next = uur->uur_next;
+ work_txn.urec_xidepoch = uur->uur_xidepoch;
+ work_txn.urec_progress = uur->uur_progress;
+ work_txn.urec_prevurp = uur->uur_prevurp;
+ work_txn.urec_dbid = uur->uur_dbid;
It would be better if we initialize these members in the order in
which they appear in the actual structure. All other undo header
structures are initialized that way, so this looks out-of-place.
One more change in ReadUndoBytes along the same lines.
Attachment
Hi,

This thread is currently marked as returned with feedback, set so
2018-12-01. Given there've been several new versions submitted since, is
that accurate?

- Andres
On Sun, Feb 03, 2019 at 02:23:16AM -0800, Andres Freund wrote:
> This thread is curently marked as returned with feedback, set so
> 2018-12-01. Given there've been several new versions submitted since, is
> that accurate?

From the latest status of this thread, there have been new patches but
no reviews on them, so moved to next CF.
--
Michael
Attachment
On Mon, Feb 4, 2019 at 3:55 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Sun, Feb 03, 2019 at 02:23:16AM -0800, Andres Freund wrote:
> > This thread is curently marked as returned with feedback, set so
> > 2018-12-01. Given there've been several new versions submitted since, is
> > that accurate?
>
> From the latest status of this thread, there have been new patches but
> no reviews on them, so moved to next CF.

Thank you. New patches coming soon.

--
Thomas Munro
http://www.enterprisedb.com
On Sun, Feb 3, 2019 at 3:53 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> This thread is curently marked as returned with feedback, set so
> 2018-12-01. Given there've been several new versions submitted since, is
> that accurate?
>

As part of this thread we have been reviewing and fixing the comments
for the undo-interface patch. Michael has already moved it to the new
commitfest with status "needs review", so I guess as of now the status
is correct.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On 2019-Feb-04, Thomas Munro wrote:
> On Mon, Feb 4, 2019 at 3:55 PM Michael Paquier <michael@paquier.xyz> wrote:
> > On Sun, Feb 03, 2019 at 02:23:16AM -0800, Andres Freund wrote:
> > > This thread is curently marked as returned with feedback, set so
> > > 2018-12-01. Given there've been several new versions submitted since, is
> > > that accurate?
> >
> > From the latest status of this thread, there have been new patches but
> > no reviews on them, so moved to next CF.
>
> Thank you. New patches coming soon.

This series is for pg13, right? We're not considering any of this for
pg12?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 7, 2019 at 1:16 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> On 2019-Feb-04, Thomas Munro wrote:
> > On Mon, Feb 4, 2019 at 3:55 PM Michael Paquier <michael@paquier.xyz> wrote:
> > > From the latest status of this thread, there have been new patches but
> > > no reviews on them, so moved to next CF.
> >
> > Thank you. New patches coming soon.
>
> This series is for pg13, right? We're not considering any of this for
> pg12?

Correct. Originally the target was 12 but that was a bit too ambitious.

--
Thomas Munro
http://www.enterprisedb.com
On Thu, Feb 07, 2019 at 03:21:09AM +1100, Thomas Munro wrote:
> Correct. Originally the target was 12 but that was a bit too ambitious.

Could it be possible to move the patch set into the first PG-13 commit
fest then? We could use this CF as recipient for now, even if the
schedule for the next development cycle is not set in stone:
https://commitfest.postgresql.org/23/
--
Michael
Attachment
On February 7, 2019 7:21:49 AM GMT+05:30, Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Feb 07, 2019 at 03:21:09AM +1100, Thomas Munro wrote:
> > Correct. Originally the target was 12 but that was a bit too ambitious.
>
> Could it be possible to move the patch set into the first PG-13 commit
> fest then? We could use this CF as recipient for now, even if the
> schedule for next development cycle is not set in stone:
> https://commitfest.postgresql.org/23/

We now have the target version as a field; that should make such moves
unnecessary, right?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Feb 07, 2019 at 07:25:57AM +0530, Andres Freund wrote:
> We now have the target version as a field, that should make such
> moves unnecessary, right?

Oh, I missed this stuff. Thanks for pointing it out.
--
Michael
Attachment
On February 7, 2019 7:34:11 AM GMT+05:30, Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Feb 07, 2019 at 07:25:57AM +0530, Andres Freund wrote:
> > We now have the target version as a field, that should make such
> > moves unnecessary, right?
>
> Oh, I missed this stuff. Thanks for pointing it out.

It was JUST added ... :) I thought I saw you reply on the other thread
about it, but I was wrong...

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Feb 07, 2019 at 07:35:31AM +0530, Andres Freund wrote:
> It was JUST added ... :) thought I saw you reply on the other thread
> about it, but I was wrong...

Six months later without any activity, I am marking this entry as
returned with feedback. The latest patch set does not apply anymore,
so a rebase would be nice if it is submitted again.
--
Michael
On Sat, Nov 30, 2019 at 9:25 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Thu, Feb 07, 2019 at 07:35:31AM +0530, Andres Freund wrote:
> > It was JUST added ... :) thought I saw you reply on the other thread
> > about it, but I was wrong...
>
> Six months later without any activity, I am marking this entry as
> returned with feedback. The latest patch set does not apply anymore,
> so having a rebase would be nice if submitted again.

Sounds fair, thanks. Actually, we've rewritten large amounts of this,
but unfortunately not to the point where it's ready to post yet. If
anyone wants to see the development in progress, see

https://github.com/EnterpriseDB/zheap/commits/undo-record-set

This is not really an EnterpriseDB project any more because Andres and
Thomas decided to leave EnterpriseDB, but both expressed an intention
to continue working on the project. So hopefully we'll get there.

That being said, here's what the three of us are working towards:

- Undo locations are identified by a 64-bit UndoRecPtr, which is very
similar to a WAL LSN. However, each undo log (1TB of address space)
has its own insertion point, so that many backends can insert
simultaneously without contending on the insertion point. The code
for this is by Thomas and is mostly the same as before.

- To insert undo data, you create an UndoRecordSet, which has a record
set header followed by any number of records. In the common case, an
UndoRecordSet corresponds to the intersection of a transaction and a
persistence level - that is, XID 12345 could have up to 3
UndoRecordSets, one for permanent undo, one for unlogged undo, and
one for temporary undo. We might in the future have support for other
kinds of UndoRecordSets, e.g. for multixact-like things that are
associated with a group of transactions rather than just one. This
code is new, by Thomas with contributions from me.
- The records that get stored into an UndoRecordSet will be serialized
from an in-memory representation and then deserialized when the data
is read later. Andres is writing the code for this, but hasn't pushed
it to the branch yet. The idea here is to allow a lot of flexibility
about what gets stored, responding to criticisms of the earlier
design from Heikki, while still being efficient about what we
actually write on disk, since we know from testing that undo volume
is a significant performance concern.

- Each transaction that writes permanent or unlogged undo gets an
UndoRequest, which tracks the fact that there is work to do if the
transaction aborts. Undo can be applied either in the foreground
right after abort or in the background. The latter case is necessary
because crashes or FATAL errors can abort transactions, but the
former case is important as a way of keeping the undo work from
ballooning out of control in a workload where people just abort
transactions nonstop; we have to slow things down so that we can keep
up. This code is by me, based on a design sketch from Andres.

Getting all of this working has been harder and slower than I'd hoped,
but I think the new design fixes a lot of things that weren't right in
earlier iterations, so I feel like we are at least headed in the right
direction.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company