BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated) - Mailing list pgsql-bugs
From | tgarnett@panjiva.com
Subject | BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
Date |
Msg-id | 20150406192130.2573.22545@wrigleys.postgresql.org
Responses | Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
 | Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
 | Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
List | pgsql-bugs
The following bug has been logged on the website:

Bug reference:      12990
Logged by:          Timothy Garnett
Email address:      tgarnett@panjiva.com
PostgreSQL version: 9.3.5
Operating system:   Ubuntu Linux x86_64 12.04.5 LTS
Description:

At 4/1 5:10pm our production database started throwing these kinds of errors (all of our hot spares were also similarly corrupted):

ERROR: could not access status of transaction 303450738
DETAIL: Could not open file "pg_multixact/members/7B49": No such file or directory.

This is possibly related to bug 8673 (though the mailing list sounds like most issues there were resolved by 9.3.4, and we are on 9.3.5). Also of note, this db cluster was pg_upgraded from 9.2 at one point (to 9.3.x, then further upgraded to 9.3.5) (see http://www.postgresql.org/message-id/CAAS3ty+2ynCyf_YmRn6WuqSF8EmJMDypAkc7uD_EXTZJ7usOSg@mail.gmail.com , though it doesn't seem like that's involved here).

We have a file-system level snapshot of a hot spare of the db from about 28 hours before the errors started and sufficient wal files to do point in time recovery. We recovered a copy of the database to about 1 hour before the errors started and promoted that to our production db (losing a couple of hours of commits).

Walking through the point in time recovery we see the pg_multixact/members folder fill up and eventually suffer a massive truncation when the newest file overtakes the oldest file. The range from oldest to newest (through the wrap point, 14078) appears to be continuous.

Time   Newest-File   Oldest-File
1 am   72BA          98A4 (last Oct.)
10am   72BC          98A4
11am   72BD          98A4
12pm   7C4E          98A4
1 pm   7E88          98A4
2 pm   7FFF          98A4
3 pm   884E          98A4
4 pm   905F          98A4    * here we forked our new production
4:30   94AF          98D2
5 pm   984D          98D2
5:15   98D2          9900    * now errors, all files outside of 98D2-9900 are gone

During the rapid growth we were generating 10-12 pg_multixact/members files per minute. After we forked off a new production we identified the workload that was causing this rapid growth and stopped it. We then used vacuumdb -a -F to vacuum freeze the whole database cluster. Our presumption was that this would free up the members files, but it only freed up a few. The current oldest file is 9E30 (~Jan. 1st), so it moved up from 98A4, and the newest file is 906A.

We're concerned this might be a time-bomb waiting for us in the future (though at the slower growth rate of 16 files a day or so, potentially a while in the future), as the members file namespace is still 95+% consumed post vacuum freeze. We do plan to upgrade to 9.4 sometime in the next couple of months and are curious if we can use pg_upgrade or if we will need to dump / restore the full (multi TiB) cluster.

As for the workload causing the rapid growth, it involved something like:

    while [many millions of things to update / insert]
      BEGIN;
      SELECT state FROM table_A WHERE id = 1 FOR SHARE;
      if state != 'blocked'
        update / insert 2000-10000 rows in table_B, other stuff
        COMMIT;
      else
        COMMIT;
        sleep wait

being run in a bunch of connections (26) to the db. Row 1 of table_A was being used effectively as a share / exclusive lock, as another, different process would update the state to 'blocked' to block the first process in order to manage some state shared outside the db.

We've retained the pre-failure snapshot and wal files for now (though we will need to free them up at some point), so we can point in time recover to any point from 28 hours before to several hours after the problem surfaced, if that's helpful at all.
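For reference, a rough sketch of one way to track the oldest/newest segment numbers shown in the table above and the freeze horizon that gates their removal (the data directory path below is the Ubuntu package default and is an assumption, not taken from our tooling; run as a user that can read the data directory):

    #!/bin/sh
    PGDATA=/var/lib/postgresql/9.3/main    # assumed Ubuntu default; adjust as needed

    # newest and oldest members segments on disk, by modification time
    # (names are hex SLRU segment numbers, so mtime order is more reliable
    # than name order once the namespace has wrapped)
    ls -t "$PGDATA"/pg_multixact/members | sed -n '1p;$p'

    # member segments can only be truncated up to the offsets still needed by
    # the oldest multixact, which is bounded by the smallest datminmxid here
    psql -X -c "SELECT datname, datminmxid FROM pg_database;"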
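To illustrate why that access pattern consumes member space so quickly (as we understand it, each additional transaction that share-locks an already share-locked row allocates a new multixact whose member list contains all the existing lockers plus the new one, so overlapping lockers on one hot row burn member space roughly quadratically), here is a hypothetical standalone demo against a scratch database, not our schema; database name and paths are illustrative:

    #!/bin/sh
    DB=mxtest
    createdb "$DB" 2>/dev/null

    psql -X -d "$DB" -c "DROP TABLE IF EXISTS table_a;
                         CREATE TABLE table_a (id int PRIMARY KEY, state text);
                         INSERT INTO table_a VALUES (1, 'open');"

    # 26 overlapping sessions share-lock the same row, like our workload did;
    # pg_sleep keeps the locks overlapping so each new locker finds existing lockers
    for i in $(seq 1 26); do
      psql -X -d "$DB" -c "BEGIN;
                           SELECT state FROM table_a WHERE id = 1 FOR SHARE;
                           SELECT pg_sleep(5);
                           COMMIT;" &
    done
    wait

    # multixact counters as of the last checkpoint (run as the postgres user;
    # paths are the Ubuntu package defaults and may need adjusting)
    /usr/lib/postgresql/9.3/bin/pg_controldata /var/lib/postgresql/9.3/main | grep -i multi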
Bugs / Questions:
- the members files wrapped over themselves, leading to corruption
- why didn't vacuumdb -a -F free up more members files?
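In case it helps with the second question, this is the kind of check we can run to see what is still holding the multixact freeze horizon back after the vacuumdb -a -F (illustrative only; "our_db" is a placeholder, and as we understand it truncation of members segments is bounded by the smallest datminmxid across all databases, which in turn is bounded by the oldest relminmxid in each):

    #!/bin/sh
    # smallest datminmxid across the cluster (template databases count too)
    psql -X -c "SELECT datname, datminmxid FROM pg_database;"

    # within one database, which tables still carry the oldest relminmxid
    psql -X -d our_db -c "SELECT relname, relminmxid
                          FROM pg_class
                          WHERE relkind IN ('r', 't', 'm')
                          ORDER BY relminmxid::text::bigint
                          LIMIT 10;"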