Re: Hash Indexes - Mailing list pgsql-hackers
| From | Amit Kapila |
| --- | --- |
| Subject | Re: Hash Indexes |
| Date | |
| Msg-id | CAA4eK1+Yxoe4PWR8mKSKTNQE1v3xF=YEAnK==NixHbU-zi31=g@mail.gmail.com |
| In response to | Re: Hash Indexes (Jeff Janes <jeff.janes@gmail.com>) |
| Responses | Re: Hash Indexes |
| List | pgsql-hackers |
On Tue, Sep 13, 2016 at 10:01 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Sep 7, 2016 at 9:32 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> Now, I think we shouldn't be inconsistent in using them. I will
>> change to make it same if I find any inconsistency based on what you
>> or other people think is the better way to refer overflow space.
>
> I think just "overflow page" or "buffer containing the overflow page".
>

Okay, changed to overflow page.

> Here are some more notes I've taken, mostly about the README and comments.
>
> It took me a while to understand that once a tuple is marked as moved by
> split, it stays that way forever. It doesn't mean "recently moved by
> split", but "ever moved by split". Which works, but is rather subtle.
> Perhaps this deserves a parenthetical comment in the README the first time
> the flag is mentioned.
>

I have added an additional paragraph explaining the moved-by-split flag
along with the explanation of the split operation.

> ========
>
> #define INDEX_SIZE_MASK 0x1FFF
> /* bit 0x2000 is not used at present */
>
> This is no longer true, maybe:
> /* bit 0x2000 is reserved for index-AM specific usage */
>

Changed as per suggestion.

> ========
>
> Note that this is designed to allow concurrent splits and scans. If a
> split occurs, tuples relocated into the new bucket will be visited twice
> by the scan, but that does no harm. As we are releasing the locks during
> scan of a bucket, it will allow concurrent scan to start on a bucket and
> ensures that scan will always be behind cleanup.
>
> Above, the abrupt transition from splits (first sentence) to cleanup is
> confusing. If the cleanup referred to is vacuuming, it should be a new
> paragraph or at least have a transition sentence. Or is it referring to
> clean-up locks used for control purposes, rather than for actual vacuum
> clean-up? I think it is the first one, the vacuum.
>

Yes, it is the first one.

> (I find the committed version of this comment confusing as well--how in
> the committed code would a tuple be visited twice, and why does that not
> do harm in the committed coding? So maybe the issue here is me, not the
> comment.)
>

You have to read this scan as the scan during vacuum. Whatever is written
in the committed code is right; let me try to explain with an example.
Suppose there are two buckets at the start of vacuum. After vacuum
completes the first bucket, and before or during the vacuum of the second
bucket, a split of the first bucket occurs, so now we have three buckets.
If you look at the code (hashbulkdelete), after completing the vacuum of
the first and second buckets, it notices the split and performs the
vacuum of the third bucket as well. This is why the README mentions that
tuples relocated into the new bucket will be visited twice. This whole
explanation is in the garbage collection section, so to me it looks
clear. However, I have changed some wording; see if it makes sense to you
now.
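Roughly, the control flow I am describing looks like the sketch below (a
simplified sketch with hypothetical helper names, not the actual
hashbulkdelete code):

```c
#include "postgres.h"

/* Hypothetical helpers standing in for the real metapage/bucket code. */
extern uint32 read_max_bucket_from_metapage(void);
extern void vacuum_one_bucket(uint32 bucket);

/*
 * Sketch of the re-scan behaviour described above: if a concurrent split
 * raises the bucket count while vacuum is running, the newly created
 * bucket is vacuumed too, so tuples relocated by the split are visited a
 * second time.
 */
static void
vacuum_all_buckets_sketch(void)
{
    uint32 cur_bucket = 0;
    uint32 cur_maxbucket = read_max_bucket_from_metapage();

loop_top:
    for (; cur_bucket <= cur_maxbucket; cur_bucket++)
        vacuum_one_bucket(cur_bucket);

    /* A concurrent split may have added buckets; vacuum those as well. */
    if (read_max_bucket_from_metapage() != cur_maxbucket)
    {
        cur_maxbucket = read_max_bucket_from_metapage();
        goto loop_top;
    }
}
```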
> =======
>
> +Vacuum acquires cleanup lock on bucket to remove the dead tuples and or tuples
> +that are moved due to split. The need for cleanup lock to remove dead tuples
> +is to ensure that scans' returns correct results. Scan that returns multiple
> +tuples from the same bucket page always restart the scan from the previous
> +offset number from which it has returned last tuple.
>
> Perhaps it would be better to teach scans to restart anywhere on the page,
> than to force more cleanup locks to be taken?
>

Yeah, we can do that by making hash index scans work a page at a time, as
we do for btree scans. However, as mentioned earlier, this is on my Todo
list, and I think it is better to do it as a separate patch on top of
this work. Do you think that's reasonable, or do you have some strong
reason why we should consider it as part of this patch itself?

> =======
>
> This comment no longer seems accurate (as long as it is just an ERROR and
> not a PANIC):
>
>  * XXX we have a problem here if we fail to get space for a
>  * new overflow page: we'll error out leaving the bucket split
>  * only partially complete, meaning the index is corrupt,
>  * since searches may fail to find entries they should find.
>
> The split will still be marked as being in progress, so any scanner will
> have to scan the old page and see the tuple there.
>

I have removed that part of the comment. I think in the PANIC case the
hash index will be corrupt anyway, so we might not need to mention
anything about it.

> ========
>
> in _hash_splitbucket comments, this needs updating:
>
>  * The caller must hold exclusive locks on both buckets to ensure that
>  * no one else is trying to access them (see README).
>
> The true prereq here is a buffer clean up lock (pin plus exclusive buffer
> content lock), correct?
>

Right, and I have changed it accordingly.

> And then:
>
>  * Split needs to hold pin on primary bucket pages of both old and new
>  * buckets till end of operation.
>
> 'retain' is probably better than 'hold', to emphasize that we are dropping
> the buffer content lock part of the clean-up lock, but that the pin part of
> it is kept continuously (this also matches the variable name used in the
> code).

Okay, changed to retain.

> Also, the paragraph after that one seems to be obsolete and
> contradictory with the newly added comments.
>

Are you talking about:

 * In addition, the caller must have created the new bucket's base page, ..

If yes, then I think that paragraph is still valid. It mainly highlights
two points. First, the new bucket's base page should be pinned and
write-locked before calling this API, and both will be released by this
API. Second, we must do _hash_getnewbuf() before releasing the metapage
write lock. Both points still seem to be valid.

> ===========
>
>  /*
>   * Acquiring cleanup lock to clear the split-in-progress flag ensures that
>   * there is no pending scan that has seen the flag after it is cleared.
>   */
>
> But, we are not acquiring a clean up lock. We already have a pin, and we do
> acquire a write buffer-content lock, but don't observe that our pin is the
> only one. I don't see why it is necessary to have a clean up lock (what
> harm is done if a under-way scan thinks it is scanning a bucket that is
> being split when it actually just finished the split?), but if it is
> necessary then I think this code is wrong. If not necessary, the comment is
> wrong.
>

The comment is wrong, and I have removed it. It is a remnant of a
previous idea that I wanted to try, but I found problems with it and
didn't pursue it.

> Also, why must we hold a write lock on both old and new primary bucket pages
> simultaneously? Is this in anticipation of the WAL patch?

Yes, clearing the flag on both buckets needs to be an atomic operation.
Apart from that, it is also not good to write two different WAL records
(one for clearing the flag on the old bucket and another for the new
bucket).

> The contract for
> the function does say that it returns both pages write locked, but I don't
> see a reason for that part of the contract at the moment.

Just refer to its usage in the _hash_finish_split() cleanup flow. The
reason is that we need to retain the lock on one of the buckets,
depending on the case.
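To make the atomicity point concrete, the flag clearing happens while
both primary bucket pages are write-locked, roughly as in the sketch
below (simplified; the flag and helper names are illustrative stand-ins,
not the exact ones from the patch):

```c
#include "postgres.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/* Hypothetical flag values and helper, standing in for the patch's real ones. */
#define SKETCH_BUCKET_BEING_SPLIT      0x0001
#define SKETCH_BUCKET_BEING_POPULATED  0x0002
extern void clear_bucket_flag(Page page, int flag);

/*
 * Sketch of clearing the split-in-progress state on both buckets in one
 * critical section; both pages stay write-locked on return.
 */
static void
finish_split_sketch(Buffer obuf, Buffer nbuf)
{
    /* Both primary bucket pages are already pinned; lock old, then new. */
    LockBuffer(obuf, BUFFER_LOCK_EXCLUSIVE);
    LockBuffer(nbuf, BUFFER_LOCK_EXCLUSIVE);

    /*
     * Clear "being split" on the old bucket and "being populated" on the
     * new bucket together, so no scan can ever see one flag cleared while
     * the other is still set, and so a single WAL record can later cover
     * both changes.
     */
    clear_bucket_flag(BufferGetPage(obuf), SKETCH_BUCKET_BEING_SPLIT);
    clear_bucket_flag(BufferGetPage(nbuf), SKETCH_BUCKET_BEING_POPULATED);

    MarkBufferDirty(obuf);
    MarkBufferDirty(nbuf);

    /*
     * Both pages are returned write-locked; the caller in the
     * _hash_finish_split() path decides which lock it needs to retain and
     * releases the other.
     */
}
```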
> =========
>
> To avoid deadlock between readers and inserters, whenever there is a need
> to lock multiple buckets, we always take in the order suggested in Locking
> Definitions above. This algorithm allows them a very high degree of
> concurrency.
>
> The section referred to is actually spelled "Lock Definitions", no "ing".
>
> The Lock Definitions section doesn't mention the meta page at all.

Okay, changed.

> I think
> there needs be something added to it about how the meta page gets locked and
> why that is deadlock free. (But we could be optimistic and assume the patch
> to implement caching of the metapage will go in and will take care of that).
>

I don't think caching the meta page will eliminate the need to lock the
meta page. However, this patch has not changed anything relevant in meta
page locking that could impact deadlock detection. I have thought about
it, but I am not sure what more to write beyond what is already mentioned
at various places about the meta page in the README. Let me know if you
have something specific in mind.

> =========
>
> And an operational question on this: A lot of stuff is done conditionally
> here. Under high concurrency, do splits ever actually occur? It seems like
> they could easily be permanently starved.
>

Maybe, but the situation won't be worse than what we have in HEAD. Even
under high concurrency, it can arise only if there is always a reader on
a bucket before we try to split it. The point to note here is that once
the split has started, concurrent readers are allowed, which was not
allowed previously. I think the same argument applies to other places
where readers and writers contend for the same lock, for example
ProcArrayLock: theoretically readers can starve writers forever, but in
practice such situations are rare.
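For reference, the reason a split can be deferred indefinitely is that it
is only attempted when the cleanup lock on the old bucket is available
immediately, roughly as in the sketch below (simplified, with a
hypothetical helper, not the exact patch code):

```c
#include "postgres.h"
#include "storage/bufmgr.h"

/* Hypothetical stand-in for the real split machinery. */
extern void try_to_split(Buffer old_bucket_buf);

/*
 * Sketch of the conditional split attempt: the caller already holds a pin
 * on the old bucket's primary page.
 */
static void
maybe_expand_table_sketch(Buffer old_bucket_buf)
{
    if (ConditionalLockBufferForCleanup(old_bucket_buf))
    {
        /* No other pins on the bucket: safe to split it right now. */
        try_to_split(old_bucket_buf);
    }
    else
    {
        /*
         * Someone else holds a pin (e.g. a scan); skip the split for now
         * and let the insertion use overflow pages instead. A steady
         * stream of readers can therefore keep deferring the split, which
         * is the starvation scenario discussed above.
         */
    }
}
```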
Apart from fixing the above review comments, I have fixed the issue
reported by Ashutosh Sharma [1].

Many thanks, Jeff, for the detailed review.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BfMUpJoAp5MXKRSv9193JXn25qtG%2BZrYUwb4dUuqmHrA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com