Re: [SQL] Duplicate tuples with unique index - Mailing list pgsql-sql
From: Tom Lane
Subject: Re: [SQL] Duplicate tuples with unique index
Msg-id: 16091.948865123@sss.pgh.pa.us
In response to: Re: [SQL] Duplicate tuples with unique index (Palle Girgensohn <girgen@partitur.se>)
Responses: Re: [SQL] Duplicate tuples with unique index
List: pgsql-sql
Palle Girgensohn <girgen@partitur.se> writes:
>>>> Nope. pg_upgrade was "disabled in this release because the
>>>> internal blahblahblah disk layout changed from previous versions".
>>
>> Sorry, you have to edit the script to reenable it.

> That's OK. I didn't bother to check. The dump/restore sequence was pretty quick.
> The layout hasn't really changed since 6.5, right?

No; the disk layout is the same. There is room for trouble nonetheless.

There is some doubt about whether pg_upgrade will work right under MVCC semantics: in MVCC, whether a tuple stored in a table is considered valid or not will depend on whether its creator transaction is marked committed in pg_log (and, if it is marked with a deletor transaction, whether the deletor is NOT committed). pg_upgrade tries to deal with this by copying the old database installation's pg_log into the new one. OK, that ensures that all the user-table tuples preserve their commit state; but what about tuples in the system tables? The idea behind pg_upgrade is to paste together a set of user tables with a set of system tables --- but unless the transaction-number history of the user tables is exactly the same as the transaction-number history of the system tables, there is clearly a risk that committed tuples will suddenly be considered not-committed or vice versa. And ordinarily those histories will *not* be identical.

The only reason pg_upgrade has any chance at all of working is that, for efficiency reasons, we don't want to go back and consult pg_log for every single tuple we read. So there are really six states of a tuple on disk, which may be described as:

1. I was created by transaction N, but I dunno if it committed.
2. I was created by a transaction that definitely committed, so I'm good; no need to look at pg_log.
3. I was created by a transaction that definitely aborted, so I'm dead; no need to look at pg_log.
4. I was deleted by transaction N, but I dunno if it committed.
5. I was deleted by a transaction that definitely committed, so I'm dead; no need to look at pg_log.
6. I was deleted by a transaction that definitely aborted, so I'm still good; no need to look at pg_log.

(I'm fuzzing over some fine points that arise when multiple transactions try to delete the same tuple, but this level of detail will do for now.)

Whenever any backend examines a tuple in state 1 or 4, it will consult pg_log to discover the state of the source transaction. If the source transaction has by now committed or aborted, the tuple will be rewritten with the new state (2, 3, 5, or 6 as appropriate) so that future readers of the tuple don't have to look at pg_log again.

Now states 2, 3, 5, and 6 do not depend on what pg_log says. Therefore, pg_upgrade's copy of an old pg_log file into a new database will work *if and only if* all the tuples in the system tables are in one of these states, and none of them are in state 1 or 4. We can survive with user-table tuples that are in the uncertain states, because the associated pg_log info will be copied over with them. We cannot afford to have any uncertainly-committed system-table tuples, because we will overwrite the pg_log data about their status.

Bruce thinks that the pg_upgrade script will ensure that the system-table tuples are all in frozen states (by VACUUMing them). I don't trust it worth a dime, myself. Maybe it will work, but it hasn't been proven in the field.

So, if you'd like to try it, by all means do so --- but make a pg_dump backup first! And let us know whether you have problems or not!

			regards, tom lane
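[Editorial note: to make the six tuple states and the rewrite-on-first-read behaviour described above concrete, here is a minimal C sketch. It is not PostgreSQL source code; the type names, enums, and the even/odd stand-in for a pg_log lookup are all invented for illustration of the mechanism the message describes.]

```c
/*
 * Minimal sketch (not PostgreSQL source) of the six on-disk tuple states
 * described above, and of how a reader consults the transaction log only
 * for the "uncertain" states (1 and 4), rewriting the tuple's hint once
 * the outcome is known so later readers never touch the log again.
 */
#include <stdbool.h>
#include <stdio.h>

typedef unsigned int Xid;

typedef enum {
    XMIN_IN_PROGRESS,   /* state 1: created by xmin, outcome unknown   */
    XMIN_COMMITTED,     /* state 2: creator known committed            */
    XMIN_ABORTED        /* state 3: creator known aborted              */
} XminHint;

typedef enum {
    XMAX_NONE,          /* never deleted                               */
    XMAX_IN_PROGRESS,   /* state 4: deleted by xmax, outcome unknown   */
    XMAX_COMMITTED,     /* state 5: deleter known committed            */
    XMAX_ABORTED        /* state 6: deleter known aborted              */
} XmaxHint;

typedef struct {
    Xid xmin, xmax;     /* creating / deleting transaction ids         */
    XminHint xmin_hint;
    XmaxHint xmax_hint;
} Tuple;

/* Stand-in for a pg_log lookup: did transaction xid commit?
 * (Invented rule: pretend even-numbered transactions committed.) */
static bool xact_committed(Xid xid)
{
    return (xid % 2) == 0;
}

/*
 * Is the tuple valid?  Only states 1 and 4 force a log lookup; once the
 * answer is known the hint is rewritten in place.  Tuples already in
 * states 2, 3, 5, or 6 never consult the log -- which is exactly why
 * overwriting pg_log is safe *only* for tuples in those states.
 */
static bool tuple_is_valid(Tuple *tup)
{
    if (tup->xmin_hint == XMIN_IN_PROGRESS)
        tup->xmin_hint = xact_committed(tup->xmin) ? XMIN_COMMITTED : XMIN_ABORTED;
    if (tup->xmin_hint == XMIN_ABORTED)
        return false;                        /* creator aborted: dead  */

    if (tup->xmax_hint == XMAX_NONE)
        return true;                         /* never deleted: good    */
    if (tup->xmax_hint == XMAX_IN_PROGRESS)
        tup->xmax_hint = xact_committed(tup->xmax) ? XMAX_COMMITTED : XMAX_ABORTED;

    return tup->xmax_hint != XMAX_COMMITTED; /* committed delete: dead */
}

int main(void)
{
    Tuple t = { 4, 7, XMIN_IN_PROGRESS, XMAX_IN_PROGRESS };
    printf("valid: %d, xmin hint now %d, xmax hint now %d\n",
           tuple_is_valid(&t), t.xmin_hint, t.xmax_hint);
    return 0;
}
```

Under these invented assumptions the example reports the tuple as valid (even-numbered creator treated as committed, odd-numbered deleter as aborted), and both hints are left in "certain" states afterwards; the point is that after that first lookup the tuple no longer depends on pg_log at all, which is the property pg_upgrade relies on for the system tables.]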