Write Ahead Logging for Hash Indexes - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Write Ahead Logging for Hash Indexes |
Date | |
Msg-id | CAA4eK1JOBX=YU33631Qh-XivYXtPSALh514+jR8XeD7v+K3r_Q@mail.gmail.com |
Responses | Re: Write Ahead Logging for Hash Indexes (4 replies) |
List | pgsql-hackers |
$SUBJECT will make hash indexes reliable and usable on standby servers. AFAIU, hash indexes are currently not recommended for production use mainly because they are not crash-safe; with this patch, I hope we can address that limitation and recommend them for production use. This patch is built on top of my earlier patch [1] that makes hash indexes concurrent. The main reason for that dependency is that the earlier patch makes it possible to complete an interrupted split operation and uses light-weight locking, which allows operations to be logged at a granular level.

WAL for different operations:

This has been explained in the README as well, but I am writing it here again for the reader's convenience.

=================================

Multiple WAL records are written for the create index operation: first one for initializing the metapage, followed by one for each new bucket created during the operation, followed by one for initializing the bitmap page. If the system crashes after any of these, the whole operation is rolled back. I considered writing a single WAL record for the whole operation, but for that we would need to limit the number of initial buckets that can be created during the operation. As the current XLog machinery can log only a fixed number of pages per record (XLR_MAX_BLOCK_ID, i.e. 32), it is better to write multiple WAL records for this operation. The downside of restricting the number of buckets is that we would need to perform split operations if the number of tuples is more than what the initial set of buckets can accommodate, and it is not unusual to have a large number of tuples during a create index operation.

Ordinary item insertions (those that don't force a page split or need a new overflow page) are single WAL entries. They touch a single bucket page and the metapage; the metapage is updated during replay just as it is updated during the original operation. (A minimal sketch of such a record appears below.)

An insertion that causes the addition of an overflow page is logged as a single WAL entry, preceded by a WAL entry for the new overflow page required to insert the tuple. There is a corner case where, by the time we try to use the newly allocated overflow page, it has already been used by concurrent insertions; in that case, another overflow page is allocated and a separate WAL entry is written for it.

An insertion that causes a bucket split is logged as a single WAL entry, followed by a WAL entry for allocating a new bucket, followed by a WAL entry for each overflow bucket page in the new bucket to which tuples are moved from the old bucket, followed by a WAL entry indicating that the split is complete for both the old and new buckets. A split operation that needs overflow pages to complete will write a WAL record for each newly allocated overflow page.

As splitting involves multiple atomic actions, it is possible that the system crashes while tuples are being moved from the bucket pages of the old bucket to the new bucket. After recovery, both the old and new buckets will be marked with the incomplete-split flag. The reader algorithm works correctly, as it will scan both the old and new buckets. We finish the split at the next insert or split operation on the old bucket. It could be done during searches, too, but it seems best not to put any extra updates in what would otherwise be a read-only operation (updating is not possible in hot standby mode anyway). It would seem natural to complete the split in VACUUM, but since splitting a bucket might require allocating a new page, it might fail if you run out of disk space. That would be bad during VACUUM - the reason for running VACUUM in the first place might be that you have run out of disk space, and then VACUUM itself would not finish because you are out of disk space. In contrast, an insertion can require enlarging the physical file anyway.
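To make the ordinary-insertion case above concrete, here is a minimal sketch of how such a single record could be built with the XLog machinery. The opcode XLOG_HASH_INSERT, the struct xl_hash_insert, and the helper log_hash_insert() are names I am using for illustration only; they may not match the attached patch.

```c
/*
 * Illustrative sketch only -- not the attached patch.  Shows how a plain
 * hash-index insertion (no split, no new overflow page) could be logged
 * as a single WAL record covering the bucket page and the metapage.
 */
#include "postgres.h"

#include "access/itup.h"
#include "access/xloginsert.h"
#include "storage/bufmgr.h"

#define XLOG_HASH_INSERT 0x30	/* assumed opcode for this sketch */

typedef struct xl_hash_insert
{
	OffsetNumber offnum;		/* where the tuple was placed on the page */
} xl_hash_insert;

/*
 * Caller is assumed to hold exclusive locks on both buffers and to call
 * this inside a critical section, after modifying the pages.
 */
static void
log_hash_insert(Buffer bucketbuf, Buffer metabuf,
				OffsetNumber offnum, IndexTuple itup)
{
	xl_hash_insert xlrec;
	XLogRecPtr	recptr;

	xlrec.offnum = offnum;

	XLogBeginInsert();
	XLogRegisterData((char *) &xlrec, sizeof(xl_hash_insert));

	/* Block 0: bucket (or overflow) page; the new tuple travels with it */
	XLogRegisterBuffer(0, bucketbuf, REGBUF_STANDARD);
	XLogRegisterBufData(0, (char *) itup, IndexTupleSize(itup));

	/*
	 * Block 1: the metapage.  Registering it lets replay bump the tuple
	 * count the same way the original operation does.
	 */
	XLogRegisterBuffer(1, metabuf, REGBUF_STANDARD);

	recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_INSERT);

	PageSetLSN(BufferGetPage(bucketbuf), recptr);
	PageSetLSN(BufferGetPage(metabuf), recptr);
}
```

Since both pages are registered in one record, a crash before XLogInsert() returns leaves nothing on disk that replay needs to fix, and once the record is in the log, replay can redo both pages from it.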
Deletion of tuples from a bucket is performed for two reasons: to remove dead tuples, and to remove tuples that were moved by a split. A WAL entry is made for each bucket page from which tuples are removed, followed by a WAL entry to clear the garbage flag if the tuples being removed are ones moved by a split. A separate WAL entry is made for updating the metapage if the deletion is performed to remove dead tuples during vacuum. As deletion involves multiple atomic operations, it is quite possible that the system crashes (a) after removing tuples from some of the bucket pages, (b) before clearing the garbage flag, or (c) before updating the metapage. If the system crashes before completing (b), it will again try to clean the bucket during the next vacuum or insert after recovery, which can have some performance impact, but it will work fine. If the system crashes before completing (c), after recovery there could be some additional splits until the next vacuum updates the metapage, but the other operations like insert, delete, and scan will work correctly. We could fix this problem by actually updating the metapage based on the delete operation during replay, but I am not sure whether that is worth the complication.

The squeeze operation moves tuples from a bucket page later in the overflow chain to one earlier in the chain, and writes a WAL record whenever either the page it is writing tuples to becomes full or the page it is removing tuples from becomes empty. As the squeeze operation involves multiple atomic operations, it is quite possible that the system crashes before the operation completes for the entire bucket. After recovery, all operations will work correctly, but the index will remain bloated, which can impact the performance of read and insert operations until the next vacuum squeezes the bucket completely.

=====================================
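As a companion sketch for the replay side (again with an assumed record name and layout, not necessarily what the attached patch does), the redo routine for a bucket-page delete record could follow the usual pattern of applying the change only when the page is older than the record:

```c
/*
 * Illustrative sketch only.  Redo for a hypothetical XLOG_HASH_DELETE
 * record whose block-0 data is the array of line-pointer offsets that
 * were removed from the bucket page.
 */
#include "postgres.h"

#include "access/xlogutils.h"
#include "storage/bufmgr.h"

static void
hash_xlog_delete(XLogReaderState *record)
{
	XLogRecPtr	lsn = record->EndRecPtr;
	Buffer		buffer;

	/* Apply the change only if the page predates this WAL record */
	if (XLogReadBufferForRedo(record, 0, &buffer) == BLK_NEEDS_REDO)
	{
		Page		page = BufferGetPage(buffer);
		Size		len;
		OffsetNumber *toDelete;

		toDelete = (OffsetNumber *) XLogRecGetBlockData(record, 0, &len);
		if (len > 0)
			PageIndexMultiDelete(page, toDelete,
								 len / sizeof(OffsetNumber));

		PageSetLSN(page, lsn);
		MarkBufferDirty(buffer);
	}
	if (BufferIsValid(buffer))
		UnlockReleaseBuffer(buffer);
}
```

The garbage-flag and metapage updates described above would each get their own record and an analogous redo routine; this per-record LSN interlock is what makes a crash between (a), (b), and (c) recoverable.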
One of the challenges in writing this patch was that the current code was not written with the mindset that WAL would need to be written for the different operations. A typical example is _hash_addovflpage(), where pages are modified across different function calls and all the modifications need to be done atomically, so I had to refactor some code so that the operations can be logged sensibly.

Thanks to Ashutosh Sharma, who has helped me complete the patch by writing WAL for the create index and delete operations and by testing the patch in detail using the pg_filedump tool. I think it is better if he explains himself the testing he has done to ensure the correctness of the patch.

Thoughts?

Note - To use this patch, first apply the latest version of the concurrent hash index patch [2].

[1] - https://commitfest.postgresql.org/10/647/
[2] - https://www.postgresql.org/message-id/CAA4eK1LkQ_Udism-Z2Dq6cUvjH3dB5FNFNnEzZBPsRjw0haFqA@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com