Thread: WIP: Pg_upgrade - page layout converter (PLC) hook

WIP: Pg_upgrade - page layout converter (PLC) hook

From

Zdenek Kotala

Date:

15 April 2008, 07:48:34

I attached patch which implemented page layout converter (PLC) hook. It is base
stone for in-place upgrade.

How it works:

When PLC module is loaded, then for each page which does not have native page
version conversion routine is called. Buffer is mark as a dirty and upgraded
page is inserted into WAL.

Performance:

I executed "select count(*) from table" on 2,2GB table (4671039 rows) (without
any tunning) and with conversion 2033s (34min) and after conversion and server
restart 31s (0,5min).

Request for comments:

1) I not sure if calling log_newpage is correct.

   a) Calling from storage something in access method seems to me as bad think.
I'm thinking to move log_newpage to storage, but it invokes more question about
placement, RM ...

   b) log_newpage is used for new page logging, but I use it for storing
converted page. It seems to me that it safe and heap_xlog_newpage correctly
works for new and converted page. I have only doubt about assert macro
mdextend/mdwrite which checks extend vs.write.


2) PLC module placement. I'm looking for best place (directory) where I can put
  PLC code. One possibility is to put under contrib/pg_upgrade another
possibility is to put into backend/storage/upgrade/, but in this location it
will not be possible make it as a module.

3) data structures version tracking

For PLC I need to have old version of data structures like page header, tuple
header and so on. It is also useful for external tools to handle more version of
postgresql easily (e.g. pg_control should show data from all supported
postgresql versions).

My idea is to have for each structure version keep own header e.g. bufpage_03.h,
bufpage_04.h ... which will contain typedef struct PageHeaderData_03 ... and
generic bufpage.h  with following content:

...
#include "bufpage_04.h"
...
typedef PageHeaderData_04 PageHeaderData;

#define PageGetPageSize(page) PageGetPageSize_04(page)
...


4) how to handle corrupted page? If page is corrupted it could invoke false
calling of convert routine. It could hide problems and conversion could "fix" it
in wrong way. Probably we need to have PageHeaderIsValid for all page layout
version.



        Thanks for your comments







Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /zfs_data/cvs_pgsql/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.228
diff -c -r1.228 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c    1 Jan 2008 19:45:51 -0000    1.228
--- src/backend/storage/buffer/bufmgr.c    11 Apr 2008 15:30:28 -0000
***************
*** 41,46 ****
--- 41,47 ----
  #include "storage/proc.h"
  #include "storage/smgr.h"
  #include "utils/resowner.h"
+ #include "access/heapam.h"
  #include "pgstat.h"


***************
*** 67,72 ****
--- 68,75 ----
                                   * bufmgr */
  long        NDirectFileWrite;    /* e.g., I/O in psort and hashjoin. */

+ /* Hook for page layout convertor */
+ plc_hook_type plc_hook = NULL;

  /* local state for StartBufferIO and related functions */
  static volatile BufferDesc *InProgressBuf = NULL;
***************
*** 290,296 ****
--- 293,308 ----
          if (zeroPage)
              MemSet((char *) bufBlock, 0, BLCKSZ);
          else
+         {
              smgrread(reln->rd_smgr, blockNum, (char *) bufBlock);
+             /* Page Layout Convertor hook. We assume that page version is on same place. */
+             if( plc_hook &&  PageGetPageLayoutVersion(bufBlock) != PG_PAGE_LAYOUT_VERSION )
+             {
+                 plc_hook((char *)bufBlock);
+                 bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED);
+                 log_newpage(&reln->rd_node, blockNum ,bufBlock);
+             }
+         }
          /* check for garbage data */
          if (!PageHeaderIsValid((PageHeader) bufBlock))
          {
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /zfs_data/cvs_pgsql/cvsroot/pgsql/src/include/storage/bufmgr.h,v
retrieving revision 1.111
diff -c -r1.111 bufmgr.h
*** src/include/storage/bufmgr.h    1 Jan 2008 19:45:58 -0000    1.111
--- src/include/storage/bufmgr.h    28 Mar 2008 14:23:03 -0000
***************
*** 28,33 ****
--- 28,37 ----
      BAS_VACUUM                    /* VACUUM */
  } BufferAccessStrategyType;

+ /* Hook for page layout convertor plugin */
+ typedef void (*plc_hook_type)(char *buffer);
+ extern PGDLLIMPORT plc_hook_type plc_hook;
+
  /* in globals.c ... this duplicates miscadmin.h */
  extern PGDLLIMPORT int NBuffers;

Re: WIP: Pg_upgrade - page layout converter (PLC) hook

From

Heikki Linnakangas

Date:

15 April 2008, 12:58:20

Zdenek Kotala wrote:
> I attached patch which implemented page layout converter (PLC) hook. It
> is base stone for in-place upgrade.

I agree it's an important piece of the puzzle, but not the most 
complicated one by far. How will you deal with catalog changes for 
example? I'm going to assume that you'll use pg_migrator for that, and 
this page layout conversion is just the final step of the migration and 
all the other things like the catalog, clog, config file etc. have 
already been converted.

I would suggest starting with putting some serious effort into 
pg_migrator. This page layout conversion is actually pretty trivial 
compared to all that, and even more so if you can get the page layout 
conversion working in pg_migrator first.

Which versions do you plan to support, initially?

> How it works:
> 
> When PLC module is loaded, then for each page which does not have native 
> page version conversion routine is called. Buffer is mark as a dirty and 
> upgraded page is inserted into WAL.

I would suggest delegating the WAL logging to the PLC module. In some 
cases it's going to be just a matter of changing

> Performance:
> 
> I executed "select count(*) from table" on 2,2GB table (4671039 rows) 
> (without any tunning) and with conversion 2033s (34min) and after 
> conversion and server restart 31s (0,5min).

Wow, that's a big slowdown. Where's the time spent? Is it the extra I/O 
of rewriting the table?

> Request for comments:
> 
> 1) I not sure if calling log_newpage is correct.
> 
>   a) Calling from storage something in access method seems to me as bad 
> think. I'm thinking to move log_newpage to storage, but it invokes more 
> question about placement, RM ...

Yeah, it probably should be moved somewhere else. We already use it not 
only for heapam, but for indexes as well.

>   b) log_newpage is used for new page logging, but I use it for storing 
> converted page. It seems to me that it safe and heap_xlog_newpage 
> correctly works for new and converted page. I have only doubt about 
> assert macro mdextend/mdwrite which checks extend vs.write.

That should be fine. In WAL replay, we don't distinguish between write 
and extend.

> 3) data structures version tracking
> 
> For PLC I need to have old version of data structures like page header, 
> tuple header and so on. It is also useful for external tools to handle 
> more version of postgresql easily (e.g. pg_control should show data from 
> all supported postgresql versions).
> 
> My idea is to have for each structure version keep own header e.g. 
> bufpage_03.h, bufpage_04.h ... which will contain typedef struct 
> PageHeaderData_03 ... and generic bufpage.h  with following content:
> 
> ...
> #include "bufpage_04.h"
> ...
> typedef PageHeaderData_04 PageHeaderData;
> 
> #define PageGetPageSize(page) PageGetPageSize_04(page)
> ...

That's not enough, in general. There might be changes in other header 
files that affect the page layout. Like the packed varlen change, which 
affected c.h.

> + /* Hook for page layout convertor plugin */
> + typedef void (*plc_hook_type)(char *buffer);
> + extern PGDLLIMPORT plc_hook_type plc_hook;

That's not flexible enough. We've changed the internal representations 
of data types in the past, and will likely do that in the future. The 
hook therefore needs to have at least the tuple descriptor, to know how 
to convert the tuples. I would suggest passing Relation, we have that 
available at the call site, and that should contain all the necessary 
information.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: WIP: Pg_upgrade - page layout converter (PLC) hook

From

Tom Lane

Date:

15 April 2008, 13:15:51

Heikki Linnakangas <heikki@enterprisedb.com> writes:
> Zdenek Kotala wrote:
>> I attached patch which implemented page layout converter (PLC) hook. It
>> is base stone for in-place upgrade.

> I agree it's an important piece of the puzzle, but not the most 
> complicated one by far.

In fact, it's not even worth thinking about at this stage, because you
couldn't meaningfully use or test it before you have dealt with catalog
update issues.

Even more to the point, we already had consensus that we would try to
avoid page-level changes once we had a working migration solution,
so it's not clear that there will ever be a big need for page layout
conversion.

I think this patch is a complete waste of effort at this stage, and
should not be considered for application until we have a context in
which we could meaningfully test it.
        regards, tom lane

Re: WIP: Pg_upgrade - page layout converter (PLC) hook

From

Zdenek Kotala

Date:

15 April 2008, 16:38:29

Thanks for your comment. See my comments inline:

Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:
>> I attached patch which implemented page layout converter (PLC) hook. It
>> is base stone for in-place upgrade.
> 
> I agree it's an important piece of the puzzle, but not the most 
> complicated one by far. How will you deal with catalog changes for 
> example? I'm going to assume that you'll use pg_migrator for that, and 
> this page layout conversion is just the final step of the migration and 
> all the other things like the catalog, clog, config file etc. have 
> already been converted.

I'm sorry I forgot to link to my original proposal. 
http://archives.postgresql.org/pgsql-hackers/2007-07/msg00057.php

I agree is that it is easiest part, but it is base for my work and upgrade in 
place itself. Final state should have everything inside postgres include catalog 
conversion. New version should startup on old cluster and converted data "on the 
fly".

For testing I'm currently use my own ksh script which is similar to pg_migrator, 
because I need database with upgraded catalog. And you cannot upgrade catalog 
internally until you are not able to convert page layout. egg chicken problem

> I would suggest starting with putting some serious effort into 
> pg_migrator. This page layout conversion is actually pretty trivial 
> compared to all that, and even more so if you can get the page layout 
> conversion working in pg_migrator first.

By my opinion pg_migrator is good workaround but it is not suitable for real 
production. For example TOAST relid protection is dirty hack and you have 
problem with tablespaces and so on...

> Which versions do you plan to support, initially?

Currently I'm able to upgrade from 8.1, 8.2 to 8.3, 8.4. It means conversion 
between layout version 3 and 4. I'm verifying PLC now and when this part will be 
ready it is very easy to implement others as well.

>> How it works:
>>
>> When PLC module is loaded, then for each page which does not have 
>> native page version conversion routine is called. Buffer is mark as a 
>> dirty and upgraded page is inserted into WAL.
> 
> I would suggest delegating the WAL logging to the PLC module. In some 
> cases it's going to be just a matter of changing

I rejected this idea. PLC function should be used separately. Put WAL logging 
inside makes stronger dependency by my opinion.

>> Performance:
>>
>> I executed "select count(*) from table" on 2,2GB table (4671039 rows) 
>> (without any tunning) and with conversion 2033s (34min) and after 
>> conversion and server restart 31s (0,5min).
> 
> Wow, that's a big slowdown. Where's the time spent? Is it the extra I/O 
> of rewriting the table?

Not idea yet. I going to test it. But I guess that it is I/O problem. Each page 
is written twice.

>> Request for comments:
>>
>> 1) I not sure if calling log_newpage is correct.
>>
>>   a) Calling from storage something in access method seems to me as 
>> bad think. I'm thinking to move log_newpage to storage, but it invokes 
>> more question about placement, RM ...
> 
> Yeah, it probably should be moved somewhere else. We already use it not 
> only for heapam, but for indexes as well.

But question where is good place? Because it is related to whole page I suggest  to put it in smgr.c and under SGMR
RM.

>>   b) log_newpage is used for new page logging, but I use it for 
>> storing converted page. It seems to me that it safe and 
>> heap_xlog_newpage correctly works for new and converted page. I have 
>> only doubt about assert macro mdextend/mdwrite which checks extend 
>> vs.write.
> 
> That should be fine. In WAL replay, we don't distinguish between write 
> and extend.

But in this case the asserts macros in mdexted/mdwrite are odd and the should be 
removed.

>> 3) data structures version tracking
>>
>> For PLC I need to have old version of data structures like page 
>> header, tuple header and so on. It is also useful for external tools 
>> to handle more version of postgresql easily (e.g. pg_control should 
>> show data from all supported postgresql versions).
>>
>> My idea is to have for each structure version keep own header e.g. 
>> bufpage_03.h, bufpage_04.h ... which will contain typedef struct 
>> PageHeaderData_03 ... and generic bufpage.h  with following content:
>>
>> ...
>> #include "bufpage_04.h"
>> ...
>> typedef PageHeaderData_04 PageHeaderData;
>>
>> #define PageGetPageSize(page) PageGetPageSize_04(page)
>> ...
> 
> That's not enough, in general. There might be changes in other header 
> files that affect the page layout. Like the packed varlen change, which 
> affected c.h.

Yes, I know, tuple headers and so on. It is only example.

>> + /* Hook for page layout convertor plugin */
>> + typedef void (*plc_hook_type)(char *buffer);
>> + extern PGDLLIMPORT plc_hook_type plc_hook;
> 
> That's not flexible enough. We've changed the internal representations 
> of data types in the past, and will likely do that in the future. The 
> hook therefore needs to have at least the tuple descriptor, to know how 
> to convert the tuples. I would suggest passing Relation, we have that 
> available at the call site, and that should contain all the necessary 
> information.
> 

Good point, I thought about it. I currently don't use it but in the future it 
could be useful. I will extend it.
    thank Zdenek

Re: WIP: Pg_upgrade - page layout converter (PLC) hook

From

Zdenek Kotala

Date:

15 April 2008, 17:17:04

Tom Lane napsal(a):
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Zdenek Kotala wrote:
>>> I attached patch which implemented page layout converter (PLC) hook. It
>>> is base stone for in-place upgrade.
> 
>> I agree it's an important piece of the puzzle, but not the most 
>> complicated one by far.
> 
> In fact, it's not even worth thinking about at this stage, because you
> couldn't meaningfully use or test it before you have dealt with catalog
> update issues.

It is egg and chicken problem. I'm able to test it because I use own script for 
catalog upgrade (similar to pg_migrator).

> Even more to the point, we already had consensus that we would try to
> avoid page-level changes once we had a working migration solution,
> so it's not clear that there will ever be a big need for page layout
> conversion.

I'm really not convenient that page layout will never changed in the future. It 
is modified each second version. And you need way how to convert data from 
pre8.4 versions as well. And PLC convert page header, tuple headers and it could 
convert also data items.

> I think this patch is a complete waste of effort at this stage, and
> should not be considered for application until we have a context in
> which we could meaningfully test it.

I will send PLC code soon. I'm now testing it. For catalog upgrade you can use 
pg_migrator or pg_upgrade.sh script 
(http://src.opensolaris.org/source/xref/sfw/usr/src/cmd/postgres/postgresql-upgrade/)
    Zdenek

Re: WIP: Pg_upgrade - page layout converter (PLC) hook

From

Heikki Linnakangas

Date:

16 April 2008, 02:34:47

Zdenek Kotala wrote:
> Heikki Linnakangas napsal(a):
>> I would suggest starting with putting some serious effort into 
>> pg_migrator. This page layout conversion is actually pretty trivial 
>> compared to all that, and even more so if you can get the page layout 
>> conversion working in pg_migrator first.
> 
> By my opinion pg_migrator is good workaround but it is not suitable for 
> real production. For example TOAST relid protection is dirty hack and 
> you have problem with tablespaces and so on...

Sure, it's not perfect, that's exactly why I suggested working on it! If 
you have something else that works better, that's fine, but you will 
need to release it under the BSD license if you want help with it.

>> Which versions do you plan to support, initially?
> 
> Currently I'm able to upgrade from 8.1, 8.2 to 8.3, 8.4. It means 
> conversion between layout version 3 and 4. I'm verifying PLC now and 
> when this part will be ready it is very easy to implement others as well.

Hmm. I don't see any of that code in the directory you posted. ?

>>>   b) log_newpage is used for new page logging, but I use it for 
>>> storing converted page. It seems to me that it safe and 
>>> heap_xlog_newpage correctly works for new and converted page. I have 
>>> only doubt about assert macro mdextend/mdwrite which checks extend 
>>> vs.write.
>>
>> That should be fine. In WAL replay, we don't distinguish between write 
>> and extend.
> 
> But in this case the asserts macros in mdexted/mdwrite are odd and the 
> should be removed.

Those asserts are enforced during normal operation. It's just in WAL 
replay that we extend the relation automatically when full pages are 
restored. See XLogReadBuffer().

>>> + /* Hook for page layout convertor plugin */
>>> + typedef void (*plc_hook_type)(char *buffer);
>>> + extern PGDLLIMPORT plc_hook_type plc_hook;
>>
>> That's not flexible enough. We've changed the internal representations 
>> of data types in the past, and will likely do that in the future. The 
>> hook therefore needs to have at least the tuple descriptor, to know 
>> how to convert the tuples. I would suggest passing Relation, we have 
>> that available at the call site, and that should contain all the 
>> necessary information.
> 
> Good point, I thought about it. I currently don't use it but in the 
> future it could be useful. I will extend it.

Surely you need it already to do the packed varlen conversion?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: WIP: Pg_upgrade - page layout converter (PLC) hook

From

Zdenek Kotala

Date:

16 April 2008, 04:37:59

Heikki Linnakangas napsal(a):
> Zdenek Kotala wrote:
>> Heikki Linnakangas napsal(a):
>>> I would suggest starting with putting some serious effort into 
>>> pg_migrator. This page layout conversion is actually pretty trivial 
>>> compared to all that, and even more so if you can get the page layout 
>>> conversion working in pg_migrator first.
>>
>> By my opinion pg_migrator is good workaround but it is not suitable 
>> for real production. For example TOAST relid protection is dirty hack 
>> and you have problem with tablespaces and so on...
> 
> Sure, it's not perfect, that's exactly why I suggested working on it! If 
> you have something else that works better, that's fine, but you will 
> need to release it under the BSD license if you want help with it.

Sure, my upgrade script is in opensolaris and it is under CDDL license. By my 
meaning it is only workaround until PostgreSQL 8.x will not have upgrade 
integrated inside. After that this script will be removed.

Of course all work will be under BSD.

>>> Which versions do you plan to support, initially?
>>
>> Currently I'm able to upgrade from 8.1, 8.2 to 8.3, 8.4. It means 
>> conversion between layout version 3 and 4. I'm verifying PLC now and 
>> when this part will be ready it is very easy to implement others as well.
> 
> Hmm. I don't see any of that code in the directory you posted. ?

PLC is not there yet, because it is not reviewed, tested and verified. And I 
tested it only on SPARC where varlen works without any modification, but on x86 
it needs extra work yet.

>>>>   b) log_newpage is used for new page logging, but I use it for 
>>>> storing converted page. It seems to me that it safe and 
>>>> heap_xlog_newpage correctly works for new and converted page. I have 
>>>> only doubt about assert macro mdextend/mdwrite which checks extend 
>>>> vs.write.
>>>
>>> That should be fine. In WAL replay, we don't distinguish between 
>>> write and extend.
>>
>> But in this case the asserts macros in mdexted/mdwrite are odd and the 
>> should be removed.
> 
> Those asserts are enforced during normal operation. It's just in WAL 
> replay that we extend the relation automatically when full pages are 
> restored. See XLogReadBuffer().

OK

>>>> + /* Hook for page layout convertor plugin */
>>>> + typedef void (*plc_hook_type)(char *buffer);
>>>> + extern PGDLLIMPORT plc_hook_type plc_hook;
>>>
>>> That's not flexible enough. We've changed the internal 
>>> representations of data types in the past, and will likely do that in 
>>> the future. The hook therefore needs to have at least the tuple 
>>> descriptor, to know how to convert the tuples. I would suggest 
>>> passing Relation, we have that available at the call site, and that 
>>> should contain all the necessary information.
>>
>> Good point, I thought about it. I currently don't use it but in the 
>> future it could be useful. I will extend it.
> 
> Surely you need it already to do the packed varlen conversion?
> 

It works on SPARC without any varlen modification :-). I originally planed to 
implement varlen modification in another way but it seems to be better to do it 
in this place as well.
    Thanks for your comment

Re: WIP: Pg_upgrade - page layout converter (PLC) hook

From

Zdenek Kotala

Date:

17 May 2008, 17:06:54

Zdenek Kotala napsal(a):

<snip>

> How it works:
> 
> When PLC module is loaded, then for each page which does not have native 
> page version conversion routine is called. Buffer is mark as a dirty and 
> upgraded page is inserted into WAL.
> 

Unfortunately, this approach does not work between layout 3 and 4. It works only 
for heap on platfrom with maxallign=4.

The main problem is that PageHeader has been extended to 24 bytes and it 
requires reindexing, TOAST chunk resizing, converted tuples does not fit on page 
on platform where maxallign=8.

I'm now working on offline conversion method.
    Zdenek

Re: WIP: Pg_upgrade - page layout converter (PLC) hook

From

Josh Berkus

Date:

17 May 2008, 17:21:14

Zdenek,

> Unfortunately, this approach does not work between layout 3 and 4. It
> works only for heap on platfrom with maxallign=4.
>
> The main problem is that PageHeader has been extended to 24 bytes and it
> requires reindexing, TOAST chunk resizing, converted tuples does not fit
> on page on platform where maxallign=8.
>
> I'm now working on offline conversion method.

Hmmm.  I thought we determined earlier that the use case for *offline* 
binary conversion was rather limited.  It would have to be 10x faster than 
dump/reload to be worthwhile.  Do you think this is possible?

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

Re: WIP: Pg_upgrade - page layout converter (PLC) hook

From

Zdenek Kotala

Date:

17 May 2008, 17:43:17

Josh Berkus napsal(a):
> Zdenek,
> 
>> Unfortunately, this approach does not work between layout 3 and 4. It
>> works only for heap on platfrom with maxallign=4.
>>
>> The main problem is that PageHeader has been extended to 24 bytes and it
>> requires reindexing, TOAST chunk resizing, converted tuples does not fit
>> on page on platform where maxallign=8.
>>
>> I'm now working on offline conversion method.
> 
> Hmmm.  I thought we determined earlier that the use case for *offline* 
> binary conversion was rather limited.  It would have to be 10x faster than 
> dump/reload to be worthwhile.  Do you think this is possible?
> 

It is difficult to say without prototype. However, binary conversion can be 
performed without (or minimal) WAL logging and without sync. I think async write  is a big improvement. And there is
notproblem convert tables in parallel mode.
 
And of course you don't need bin<->text<->bin conversion. Unfortunately reindex 
is necessary in both cases.
Zdenek