Thread: WIP: Pg_upgrade - page layout converter (PLC) hook
I attached patch which implemented page layout converter (PLC) hook. It is base stone for in-place upgrade. How it works: When PLC module is loaded, then for each page which does not have native page version conversion routine is called. Buffer is mark as a dirty and upgraded page is inserted into WAL. Performance: I executed "select count(*) from table" on 2,2GB table (4671039 rows) (without any tunning) and with conversion 2033s (34min) and after conversion and server restart 31s (0,5min). Request for comments: 1) I not sure if calling log_newpage is correct. a) Calling from storage something in access method seems to me as bad think. I'm thinking to move log_newpage to storage, but it invokes more question about placement, RM ... b) log_newpage is used for new page logging, but I use it for storing converted page. It seems to me that it safe and heap_xlog_newpage correctly works for new and converted page. I have only doubt about assert macro mdextend/mdwrite which checks extend vs.write. 2) PLC module placement. I'm looking for best place (directory) where I can put PLC code. One possibility is to put under contrib/pg_upgrade another possibility is to put into backend/storage/upgrade/, but in this location it will not be possible make it as a module. 3) data structures version tracking For PLC I need to have old version of data structures like page header, tuple header and so on. It is also useful for external tools to handle more version of postgresql easily (e.g. pg_control should show data from all supported postgresql versions). My idea is to have for each structure version keep own header e.g. bufpage_03.h, bufpage_04.h ... which will contain typedef struct PageHeaderData_03 ... and generic bufpage.h with following content: ... #include "bufpage_04.h" ... typedef PageHeaderData_04 PageHeaderData; #define PageGetPageSize(page) PageGetPageSize_04(page) ... 4) how to handle corrupted page? If page is corrupted it could invoke false calling of convert routine. It could hide problems and conversion could "fix" it in wrong way. Probably we need to have PageHeaderIsValid for all page layout version. Thanks for your comments Index: src/backend/storage/buffer/bufmgr.c =================================================================== RCS file: /zfs_data/cvs_pgsql/cvsroot/pgsql/src/backend/storage/buffer/bufmgr.c,v retrieving revision 1.228 diff -c -r1.228 bufmgr.c *** src/backend/storage/buffer/bufmgr.c 1 Jan 2008 19:45:51 -0000 1.228 --- src/backend/storage/buffer/bufmgr.c 11 Apr 2008 15:30:28 -0000 *************** *** 41,46 **** --- 41,47 ---- #include "storage/proc.h" #include "storage/smgr.h" #include "utils/resowner.h" + #include "access/heapam.h" #include "pgstat.h" *************** *** 67,72 **** --- 68,75 ---- * bufmgr */ long NDirectFileWrite; /* e.g., I/O in psort and hashjoin. */ + /* Hook for page layout convertor */ + plc_hook_type plc_hook = NULL; /* local state for StartBufferIO and related functions */ static volatile BufferDesc *InProgressBuf = NULL; *************** *** 290,296 **** --- 293,308 ---- if (zeroPage) MemSet((char *) bufBlock, 0, BLCKSZ); else + { smgrread(reln->rd_smgr, blockNum, (char *) bufBlock); + /* Page Layout Convertor hook. We assume that page version is on same place. */ + if( plc_hook && PageGetPageLayoutVersion(bufBlock) != PG_PAGE_LAYOUT_VERSION ) + { + plc_hook((char *)bufBlock); + bufHdr->flags |= (BM_DIRTY | BM_JUST_DIRTIED); + log_newpage(&reln->rd_node, blockNum ,bufBlock); + } + } /* check for garbage data */ if (!PageHeaderIsValid((PageHeader) bufBlock)) { Index: src/include/storage/bufmgr.h =================================================================== RCS file: /zfs_data/cvs_pgsql/cvsroot/pgsql/src/include/storage/bufmgr.h,v retrieving revision 1.111 diff -c -r1.111 bufmgr.h *** src/include/storage/bufmgr.h 1 Jan 2008 19:45:58 -0000 1.111 --- src/include/storage/bufmgr.h 28 Mar 2008 14:23:03 -0000 *************** *** 28,33 **** --- 28,37 ---- BAS_VACUUM /* VACUUM */ } BufferAccessStrategyType; + /* Hook for page layout convertor plugin */ + typedef void (*plc_hook_type)(char *buffer); + extern PGDLLIMPORT plc_hook_type plc_hook; + /* in globals.c ... this duplicates miscadmin.h */ extern PGDLLIMPORT int NBuffers;
Zdenek Kotala wrote: > I attached patch which implemented page layout converter (PLC) hook. It > is base stone for in-place upgrade. I agree it's an important piece of the puzzle, but not the most complicated one by far. How will you deal with catalog changes for example? I'm going to assume that you'll use pg_migrator for that, and this page layout conversion is just the final step of the migration and all the other things like the catalog, clog, config file etc. have already been converted. I would suggest starting with putting some serious effort into pg_migrator. This page layout conversion is actually pretty trivial compared to all that, and even more so if you can get the page layout conversion working in pg_migrator first. Which versions do you plan to support, initially? > How it works: > > When PLC module is loaded, then for each page which does not have native > page version conversion routine is called. Buffer is mark as a dirty and > upgraded page is inserted into WAL. I would suggest delegating the WAL logging to the PLC module. In some cases it's going to be just a matter of changing > Performance: > > I executed "select count(*) from table" on 2,2GB table (4671039 rows) > (without any tunning) and with conversion 2033s (34min) and after > conversion and server restart 31s (0,5min). Wow, that's a big slowdown. Where's the time spent? Is it the extra I/O of rewriting the table? > Request for comments: > > 1) I not sure if calling log_newpage is correct. > > a) Calling from storage something in access method seems to me as bad > think. I'm thinking to move log_newpage to storage, but it invokes more > question about placement, RM ... Yeah, it probably should be moved somewhere else. We already use it not only for heapam, but for indexes as well. > b) log_newpage is used for new page logging, but I use it for storing > converted page. It seems to me that it safe and heap_xlog_newpage > correctly works for new and converted page. I have only doubt about > assert macro mdextend/mdwrite which checks extend vs.write. That should be fine. In WAL replay, we don't distinguish between write and extend. > 3) data structures version tracking > > For PLC I need to have old version of data structures like page header, > tuple header and so on. It is also useful for external tools to handle > more version of postgresql easily (e.g. pg_control should show data from > all supported postgresql versions). > > My idea is to have for each structure version keep own header e.g. > bufpage_03.h, bufpage_04.h ... which will contain typedef struct > PageHeaderData_03 ... and generic bufpage.h with following content: > > ... > #include "bufpage_04.h" > ... > typedef PageHeaderData_04 PageHeaderData; > > #define PageGetPageSize(page) PageGetPageSize_04(page) > ... That's not enough, in general. There might be changes in other header files that affect the page layout. Like the packed varlen change, which affected c.h. > + /* Hook for page layout convertor plugin */ > + typedef void (*plc_hook_type)(char *buffer); > + extern PGDLLIMPORT plc_hook_type plc_hook; That's not flexible enough. We've changed the internal representations of data types in the past, and will likely do that in the future. The hook therefore needs to have at least the tuple descriptor, to know how to convert the tuples. I would suggest passing Relation, we have that available at the call site, and that should contain all the necessary information. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Zdenek Kotala wrote: >> I attached patch which implemented page layout converter (PLC) hook. It >> is base stone for in-place upgrade. > I agree it's an important piece of the puzzle, but not the most > complicated one by far. In fact, it's not even worth thinking about at this stage, because you couldn't meaningfully use or test it before you have dealt with catalog update issues. Even more to the point, we already had consensus that we would try to avoid page-level changes once we had a working migration solution, so it's not clear that there will ever be a big need for page layout conversion. I think this patch is a complete waste of effort at this stage, and should not be considered for application until we have a context in which we could meaningfully test it. regards, tom lane
Thanks for your comment. See my comments inline: Heikki Linnakangas napsal(a): > Zdenek Kotala wrote: >> I attached patch which implemented page layout converter (PLC) hook. It >> is base stone for in-place upgrade. > > I agree it's an important piece of the puzzle, but not the most > complicated one by far. How will you deal with catalog changes for > example? I'm going to assume that you'll use pg_migrator for that, and > this page layout conversion is just the final step of the migration and > all the other things like the catalog, clog, config file etc. have > already been converted. I'm sorry I forgot to link to my original proposal. http://archives.postgresql.org/pgsql-hackers/2007-07/msg00057.php I agree is that it is easiest part, but it is base for my work and upgrade in place itself. Final state should have everything inside postgres include catalog conversion. New version should startup on old cluster and converted data "on the fly". For testing I'm currently use my own ksh script which is similar to pg_migrator, because I need database with upgraded catalog. And you cannot upgrade catalog internally until you are not able to convert page layout. egg chicken problem > I would suggest starting with putting some serious effort into > pg_migrator. This page layout conversion is actually pretty trivial > compared to all that, and even more so if you can get the page layout > conversion working in pg_migrator first. By my opinion pg_migrator is good workaround but it is not suitable for real production. For example TOAST relid protection is dirty hack and you have problem with tablespaces and so on... > Which versions do you plan to support, initially? Currently I'm able to upgrade from 8.1, 8.2 to 8.3, 8.4. It means conversion between layout version 3 and 4. I'm verifying PLC now and when this part will be ready it is very easy to implement others as well. >> How it works: >> >> When PLC module is loaded, then for each page which does not have >> native page version conversion routine is called. Buffer is mark as a >> dirty and upgraded page is inserted into WAL. > > I would suggest delegating the WAL logging to the PLC module. In some > cases it's going to be just a matter of changing I rejected this idea. PLC function should be used separately. Put WAL logging inside makes stronger dependency by my opinion. >> Performance: >> >> I executed "select count(*) from table" on 2,2GB table (4671039 rows) >> (without any tunning) and with conversion 2033s (34min) and after >> conversion and server restart 31s (0,5min). > > Wow, that's a big slowdown. Where's the time spent? Is it the extra I/O > of rewriting the table? Not idea yet. I going to test it. But I guess that it is I/O problem. Each page is written twice. >> Request for comments: >> >> 1) I not sure if calling log_newpage is correct. >> >> a) Calling from storage something in access method seems to me as >> bad think. I'm thinking to move log_newpage to storage, but it invokes >> more question about placement, RM ... > > Yeah, it probably should be moved somewhere else. We already use it not > only for heapam, but for indexes as well. But question where is good place? Because it is related to whole page I suggest to put it in smgr.c and under SGMR RM. >> b) log_newpage is used for new page logging, but I use it for >> storing converted page. It seems to me that it safe and >> heap_xlog_newpage correctly works for new and converted page. I have >> only doubt about assert macro mdextend/mdwrite which checks extend >> vs.write. > > That should be fine. In WAL replay, we don't distinguish between write > and extend. But in this case the asserts macros in mdexted/mdwrite are odd and the should be removed. >> 3) data structures version tracking >> >> For PLC I need to have old version of data structures like page >> header, tuple header and so on. It is also useful for external tools >> to handle more version of postgresql easily (e.g. pg_control should >> show data from all supported postgresql versions). >> >> My idea is to have for each structure version keep own header e.g. >> bufpage_03.h, bufpage_04.h ... which will contain typedef struct >> PageHeaderData_03 ... and generic bufpage.h with following content: >> >> ... >> #include "bufpage_04.h" >> ... >> typedef PageHeaderData_04 PageHeaderData; >> >> #define PageGetPageSize(page) PageGetPageSize_04(page) >> ... > > That's not enough, in general. There might be changes in other header > files that affect the page layout. Like the packed varlen change, which > affected c.h. Yes, I know, tuple headers and so on. It is only example. >> + /* Hook for page layout convertor plugin */ >> + typedef void (*plc_hook_type)(char *buffer); >> + extern PGDLLIMPORT plc_hook_type plc_hook; > > That's not flexible enough. We've changed the internal representations > of data types in the past, and will likely do that in the future. The > hook therefore needs to have at least the tuple descriptor, to know how > to convert the tuples. I would suggest passing Relation, we have that > available at the call site, and that should contain all the necessary > information. > Good point, I thought about it. I currently don't use it but in the future it could be useful. I will extend it. thank Zdenek
Tom Lane napsal(a): > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> Zdenek Kotala wrote: >>> I attached patch which implemented page layout converter (PLC) hook. It >>> is base stone for in-place upgrade. > >> I agree it's an important piece of the puzzle, but not the most >> complicated one by far. > > In fact, it's not even worth thinking about at this stage, because you > couldn't meaningfully use or test it before you have dealt with catalog > update issues. It is egg and chicken problem. I'm able to test it because I use own script for catalog upgrade (similar to pg_migrator). > Even more to the point, we already had consensus that we would try to > avoid page-level changes once we had a working migration solution, > so it's not clear that there will ever be a big need for page layout > conversion. I'm really not convenient that page layout will never changed in the future. It is modified each second version. And you need way how to convert data from pre8.4 versions as well. And PLC convert page header, tuple headers and it could convert also data items. > I think this patch is a complete waste of effort at this stage, and > should not be considered for application until we have a context in > which we could meaningfully test it. I will send PLC code soon. I'm now testing it. For catalog upgrade you can use pg_migrator or pg_upgrade.sh script (http://src.opensolaris.org/source/xref/sfw/usr/src/cmd/postgres/postgresql-upgrade/) Zdenek
Zdenek Kotala wrote: > Heikki Linnakangas napsal(a): >> I would suggest starting with putting some serious effort into >> pg_migrator. This page layout conversion is actually pretty trivial >> compared to all that, and even more so if you can get the page layout >> conversion working in pg_migrator first. > > By my opinion pg_migrator is good workaround but it is not suitable for > real production. For example TOAST relid protection is dirty hack and > you have problem with tablespaces and so on... Sure, it's not perfect, that's exactly why I suggested working on it! If you have something else that works better, that's fine, but you will need to release it under the BSD license if you want help with it. >> Which versions do you plan to support, initially? > > Currently I'm able to upgrade from 8.1, 8.2 to 8.3, 8.4. It means > conversion between layout version 3 and 4. I'm verifying PLC now and > when this part will be ready it is very easy to implement others as well. Hmm. I don't see any of that code in the directory you posted. ? >>> b) log_newpage is used for new page logging, but I use it for >>> storing converted page. It seems to me that it safe and >>> heap_xlog_newpage correctly works for new and converted page. I have >>> only doubt about assert macro mdextend/mdwrite which checks extend >>> vs.write. >> >> That should be fine. In WAL replay, we don't distinguish between write >> and extend. > > But in this case the asserts macros in mdexted/mdwrite are odd and the > should be removed. Those asserts are enforced during normal operation. It's just in WAL replay that we extend the relation automatically when full pages are restored. See XLogReadBuffer(). >>> + /* Hook for page layout convertor plugin */ >>> + typedef void (*plc_hook_type)(char *buffer); >>> + extern PGDLLIMPORT plc_hook_type plc_hook; >> >> That's not flexible enough. We've changed the internal representations >> of data types in the past, and will likely do that in the future. The >> hook therefore needs to have at least the tuple descriptor, to know >> how to convert the tuples. I would suggest passing Relation, we have >> that available at the call site, and that should contain all the >> necessary information. > > Good point, I thought about it. I currently don't use it but in the > future it could be useful. I will extend it. Surely you need it already to do the packed varlen conversion? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas napsal(a): > Zdenek Kotala wrote: >> Heikki Linnakangas napsal(a): >>> I would suggest starting with putting some serious effort into >>> pg_migrator. This page layout conversion is actually pretty trivial >>> compared to all that, and even more so if you can get the page layout >>> conversion working in pg_migrator first. >> >> By my opinion pg_migrator is good workaround but it is not suitable >> for real production. For example TOAST relid protection is dirty hack >> and you have problem with tablespaces and so on... > > Sure, it's not perfect, that's exactly why I suggested working on it! If > you have something else that works better, that's fine, but you will > need to release it under the BSD license if you want help with it. Sure, my upgrade script is in opensolaris and it is under CDDL license. By my meaning it is only workaround until PostgreSQL 8.x will not have upgrade integrated inside. After that this script will be removed. Of course all work will be under BSD. >>> Which versions do you plan to support, initially? >> >> Currently I'm able to upgrade from 8.1, 8.2 to 8.3, 8.4. It means >> conversion between layout version 3 and 4. I'm verifying PLC now and >> when this part will be ready it is very easy to implement others as well. > > Hmm. I don't see any of that code in the directory you posted. ? PLC is not there yet, because it is not reviewed, tested and verified. And I tested it only on SPARC where varlen works without any modification, but on x86 it needs extra work yet. >>>> b) log_newpage is used for new page logging, but I use it for >>>> storing converted page. It seems to me that it safe and >>>> heap_xlog_newpage correctly works for new and converted page. I have >>>> only doubt about assert macro mdextend/mdwrite which checks extend >>>> vs.write. >>> >>> That should be fine. In WAL replay, we don't distinguish between >>> write and extend. >> >> But in this case the asserts macros in mdexted/mdwrite are odd and the >> should be removed. > > Those asserts are enforced during normal operation. It's just in WAL > replay that we extend the relation automatically when full pages are > restored. See XLogReadBuffer(). OK >>>> + /* Hook for page layout convertor plugin */ >>>> + typedef void (*plc_hook_type)(char *buffer); >>>> + extern PGDLLIMPORT plc_hook_type plc_hook; >>> >>> That's not flexible enough. We've changed the internal >>> representations of data types in the past, and will likely do that in >>> the future. The hook therefore needs to have at least the tuple >>> descriptor, to know how to convert the tuples. I would suggest >>> passing Relation, we have that available at the call site, and that >>> should contain all the necessary information. >> >> Good point, I thought about it. I currently don't use it but in the >> future it could be useful. I will extend it. > > Surely you need it already to do the packed varlen conversion? > It works on SPARC without any varlen modification :-). I originally planed to implement varlen modification in another way but it seems to be better to do it in this place as well. Thanks for your comment
Zdenek Kotala napsal(a): <snip> > How it works: > > When PLC module is loaded, then for each page which does not have native > page version conversion routine is called. Buffer is mark as a dirty and > upgraded page is inserted into WAL. > Unfortunately, this approach does not work between layout 3 and 4. It works only for heap on platfrom with maxallign=4. The main problem is that PageHeader has been extended to 24 bytes and it requires reindexing, TOAST chunk resizing, converted tuples does not fit on page on platform where maxallign=8. I'm now working on offline conversion method. Zdenek
Zdenek, > Unfortunately, this approach does not work between layout 3 and 4. It > works only for heap on platfrom with maxallign=4. > > The main problem is that PageHeader has been extended to 24 bytes and it > requires reindexing, TOAST chunk resizing, converted tuples does not fit > on page on platform where maxallign=8. > > I'm now working on offline conversion method. Hmmm. I thought we determined earlier that the use case for *offline* binary conversion was rather limited. It would have to be 10x faster than dump/reload to be worthwhile. Do you think this is possible? -- --Josh Josh Berkus PostgreSQL @ Sun San Francisco
Josh Berkus napsal(a): > Zdenek, > >> Unfortunately, this approach does not work between layout 3 and 4. It >> works only for heap on platfrom with maxallign=4. >> >> The main problem is that PageHeader has been extended to 24 bytes and it >> requires reindexing, TOAST chunk resizing, converted tuples does not fit >> on page on platform where maxallign=8. >> >> I'm now working on offline conversion method. > > Hmmm. I thought we determined earlier that the use case for *offline* > binary conversion was rather limited. It would have to be 10x faster than > dump/reload to be worthwhile. Do you think this is possible? > It is difficult to say without prototype. However, binary conversion can be performed without (or minimal) WAL logging and without sync. I think async write is a big improvement. And there is notproblem convert tables in parallel mode. And of course you don't need bin<->text<->bin conversion. Unfortunately reindex is necessary in both cases. Zdenek