Re: Hash Indexes - Mailing list pgsql-hackers
From: Amit Kapila
Subject: Re: Hash Indexes
Msg-id: CAA4eK1LtONpESxHURD6KLLYxB0Pj8WXvxb=-VZg2_hEruW3BUg@mail.gmail.com
In response to: Re: Hash Indexes (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Thu, Nov 17, 2016 at 3:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> You are right and I have changed the code as per your suggestion.
>
> So...
>
> +    /*
> +     * We always maintain the pin on bucket page for whole scan operation,
> +     * so releasing the additional pin we have acquired here.
> +     */
> +    if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
> +        _hash_dropbuf(rel, *bufp);
>
> This relies on the page contents to know whether we took a pin; that
> seems like a bad plan. We need to know intrinsically whether we took
> a pin.
>

Okay, changed to not rely on page contents.

> + * If the bucket split is in progress, then we need to skip tuples that
> + * are moved from old bucket. To ensure that vacuum doesn't clean any
> + * tuples from old or new buckets till this scan is in progress, maintain
> + * a pin on both of the buckets. Here, we have to be cautious about
>
> It wouldn't be a problem if VACUUM removed tuples from the new bucket,
> because they'd have to be dead anyway. It also wouldn't be a problem
> if it removed tuples from the old bucket that were actually dead. The
> real issue isn't vacuum anyway, but the process of cleaning up after a
> split. We need to hold the pin so that tuples being moved from the
> old bucket to the new bucket by the split don't get removed from the
> old bucket until our scan is done.
>

Updated the comments to explain this clearly.

> + old_blkno = _hash_get_oldblock_from_newbucket(rel, opaque->hasho_bucket);
>
> Couldn't you pass "bucket" here instead of "opaque->hasho_bucket"? I
> feel like I'm repeating this ad nauseam, but I really think it's bad
> to rely on the special space instead of our own local variables!
>

Okay, changed as per suggestion.

> -    /* we ran off the end of the bucket without finding a match */
> +    /*
> +     * We ran off the end of the bucket without finding a match.
> +     * Release the pin on bucket buffers. Normally, such pins are
> +     * released at end of scan, however scrolling cursors can
> +     * reacquire the bucket lock and pin in the same scan multiple
> +     * times.
> +     */
>      *bufP = so->hashso_curbuf = InvalidBuffer;
>      ItemPointerSetInvalid(current);
> +    _hash_dropscanbuf(rel, so);
>
> I think this comment is saying that we'll release the pin on the
> primary bucket page for now, and then reacquire it later if the user
> reverses the scan direction. But that doesn't sound very safe,
> because the bucket could be split in the meantime and the order in
> which tuples are returned could change. I think we want that to
> remain stable within a single query execution.
>

As explained in [1], this shouldn't be a problem.

> + _hash_readnext(rel, &buf, &page, &opaque,
> +                (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);
>
> Same comment: don't rely on the special space to figure this out.
> Keep track. Also != 0 would be better than ? true : false.
>

After gluing the scans of the old and new buckets inside the _hash_read* APIs, this is no longer required.
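To illustrate the recurring review point about not depending on a page's special space, here is a small standalone C sketch; it is not code from the patch, and the names FakeBuffer and ScanState are invented for illustration. The idea is simply that the scan remembers in its own state whether it acquired the extra pin, so releasing it never requires re-reading page contents.

```c
/*
 * Standalone sketch (not PostgreSQL code): track "did this scan take an
 * extra pin?" in scan-local state instead of inferring it from the page.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct FakeBuffer
{
    int         pin_count;          /* stands in for a shared-buffer pin count */
} FakeBuffer;

typedef struct ScanState
{
    FakeBuffer *bucket_buf;         /* primary bucket page, if pinned */
    bool        holds_extra_pin;    /* recorded locally, never read from the page */
} ScanState;

static void
scan_pin_bucket(ScanState *scan, FakeBuffer *buf)
{
    buf->pin_count++;
    scan->bucket_buf = buf;
    scan->holds_extra_pin = true;   /* record the decision at acquisition time */
}

static void
scan_release_extra_pin(ScanState *scan)
{
    /* The check depends only on what this scan did, not on page contents. */
    if (scan->holds_extra_pin)
    {
        scan->bucket_buf->pin_count--;
        scan->holds_extra_pin = false;
    }
}

int
main(void)
{
    FakeBuffer  buf = {0};
    ScanState   scan = {NULL, false};

    scan_pin_bucket(&scan, &buf);
    scan_release_extra_pin(&scan);
    assert(buf.pin_count == 0);
    printf("pin count back to %d\n", buf.pin_count);
    return 0;
}
```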
> +            /*
> +             * setting hashso_skip_moved_tuples to false
> +             * ensures that we don't check for tuples that are
> +             * moved by split in old bucket and it also
> +             * ensures that we won't retry to scan the old
> +             * bucket once the scan for same is finished.
> +             */
> +            so->hashso_skip_moved_tuples = false;
>
> I think you've got a big problem here. Suppose the user starts the
> scan in the new bucket and runs it forward until they end up in the
> old bucket. Then they turn around and run the scan backward. When
> they reach the beginning of the old bucket, they're going to stop, not
> move back to the new bucket, AFAICS. Oops.
>
> _hash_first() has a related problem: a backward scan starts at the end
> of the new bucket and moves backward, but it should start at the end
> of the old bucket, and then when it reaches the beginning, flip to the
> new bucket and move backward through that one. Otherwise, a backward
> scan and a forward scan don't return tuples in opposite order, which
> they should.
>
> I think what you need to do to fix both of these problems is a more
> thorough job gluing the two buckets together. I'd suggest that the
> responsibility for switching between the two buckets should probably
> be given to _hash_readprev() and _hash_readnext(), because every place
> that needs to advance to the next or previous page cares about this.
> Right now you are trying to handle it mostly in the functions that
> call those functions, but that is prone to errors of omission.
>

Changed as per this idea: the APIs now do the switching, which fixes both problems.

> Also, I think that so->hashso_skip_moved_tuples is badly designed.
> There are two separate facts you need to know: (1) whether you are
> scanning a bucket that was still being populated at the start of your
> scan and (2) if yes, whether you are scanning the bucket being
> populated or whether you are instead scanning the corresponding "old"
> bucket. You're trying to keep track of that using one Boolean, but
> one Boolean only has two states and there are three possible states
> here.
>

The updated patch uses two boolean variables to track the bucket state.
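As an aside on the two-Boolean design, the following self-contained C sketch shows how a pair of flags can encode the three states Robert describes. The field names here (bucket_being_populated, scanning_old_bucket) are illustrative stand-ins, not necessarily the names used in the patch.

```c
/*
 * Illustrative sketch only: two booleans covering the three scan states.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct ScanSplitState
{
    /* true if the bucket being scanned was still being populated at scan start */
    bool        bucket_being_populated;
    /*
     * Meaningful only when the flag above is true: true while reading the
     * "old" bucket, false while reading the new (being-populated) bucket.
     */
    bool        scanning_old_bucket;
} ScanSplitState;

static const char *
describe(const ScanSplitState *s)
{
    if (!s->bucket_being_populated)
        return "no split in progress: scan only the target bucket";
    return s->scanning_old_bucket
        ? "split in progress: currently in the old bucket"
        : "split in progress: currently in the new (being-populated) bucket";
}

int
main(void)
{
    ScanSplitState states[] = {
        {false, false},
        {true, false},
        {true, true},
    };

    for (int i = 0; i < 3; i++)
        printf("%s\n", describe(&states[i]));
    return 0;
}
```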
> +    if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
> +    {
> +
> +        /* release the lock on bucket buffer, before completing the split. */
>
> Extra blank line.
>

Removed.

> +moved-by-split flag on a tuple indicates that tuple is moved from old to new
> +bucket. The concurrent scans can skip such tuples till the split operation is
> +finished. Once the tuple is marked as moved-by-split, it will remain so
> +forever but that does no harm. We have intentionally not cleared it as that
> +can generate an additional I/O which is not necessary.
>
> The first sentence needs to start with "the" but the second sentence shouldn't.
>

Changed.

> It would be good to adjust this part a bit to more clearly explain
> that the split-in-progress and split-cleanup flags are bucket-level
> flags, while moved-by-split is a per-tuple flag. It's possible to
> figure this out from what you've written, but I think it could be more
> clear. Another thing that is strange is that the code uses THREE
> flags, bucket-being-split, bucket-being-populated, and
> needs-split-cleanup, but the README conflates the first two and uses a
> different name.
>

Updated the patch to use bucket-being-split and bucket-being-populated when explaining the split operation in the README, and also changed the README to clearly indicate which flags are bucket-level and which are tuple-level.

> +previously-acquired content lock, but not pin and repeat the process using the
>
> s/but not pin/but not the pin,/
>

Changed.

>  A problem is that if a split fails partway through (eg due to insufficient
> -disk space) the index is left corrupt. The probability of that could be
> -made quite low if we grab a free page or two before we update the meta
> -page, but the only real solution is to treat a split as a WAL-loggable,
> +disk space or crash) the index is left corrupt. The probability of that
> +could be made quite low if we grab a free page or two before we update the
> +meta page, but the only real solution is to treat a split as a WAL-loggable,
>  must-complete action. I'm not planning to teach hash about WAL in this
> -go-round.
> +go-round. However, we do try to finish the incomplete splits during insert
> +and split.
>
> I think this paragraph needs a much heavier rewrite explaining the new
> incomplete split handling. It's basically wrong now. Perhaps replace
> it with something like this:
>
> --
> If a split fails partway through (e.g. due to insufficient disk space
> or an interrupt), the index will not be corrupted. Instead, we'll
> retry the split every time a tuple is inserted into the old bucket
> prior to inserting the new tuple; eventually, we should succeed. The
> fact that a split is left unfinished doesn't prevent subsequent
> buckets from being split, but we won't try to split the bucket again
> until the prior split is finished. In other words, a bucket can be in
> the middle of being split for some time, but ti can't be in the middle
> of two splits at the same time.
>
> Although we can survive a failure to split a bucket, a crash is likely
> to corrupt the index, since hash indexes are not yet WAL-logged.
> --
>

s/ti/it: fixed the typo and used the suggested text in the README.

> +   Acquire cleanup lock on target bucket
> +   Scan and remove tuples
> +   For overflow page, first we need to lock the next page and then
> +   release the lock on current bucket or overflow page
> +   Ensure to have buffer content lock in exclusive mode on bucket page
> +   If buffer pincount is one, then compact free space as needed
> +   Release lock
>
> I don't think this summary is particularly correct. You would never
> guess from this that we lock each bucket page in turn and then go back
> and try to relock the primary bucket page at the end. It's more like:
>
> acquire cleanup lock on primary bucket page
> loop:
>     scan and remove tuples
>     if this is the last bucket page, break out of loop
>     pin and x-lock next page
>     release prior lock and pin (except keep pin on primary bucket page)
> if the page we have locked is not the primary bucket page:
>     release lock and take exclusive lock on primary bucket page
> if there are no other pins on the primary bucket page:
>     squeeze the bucket to remove free space
>

Yeah, that is clearer, so I have used it in the README.

> Come to think of it, I'm a little worried about the locking in
> _hash_squeezebucket(). It seems like we drop the lock on each "write"
> bucket page before taking the lock on the next one. So a concurrent
> scan could get ahead of the cleanup process. That would be bad,
> wouldn't it?
>

As discussed in [2], I have changed the code to use lock-chaining during the squeeze phase.
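For readers unfamiliar with lock-chaining (hand-over-hand locking), here is a minimal standalone illustration using POSIX threads rather than PostgreSQL's buffer locks; FakePage and walk_chained are invented names, and error checking is omitted for brevity. The point is that the next page's lock is always taken before the current one is released, so a concurrent walker of the same chain can never overtake the lock holder.

```c
/*
 * Standalone illustration of lock-chaining over a page chain (pthreads).
 */
#include <pthread.h>
#include <stdio.h>

#define NPAGES 4

typedef struct FakePage
{
    pthread_mutex_t lock;
    int             pageno;
    struct FakePage *next;
} FakePage;

/* Walk the chain with hand-over-hand (chained) locking. */
static void
walk_chained(FakePage *page, const char *who)
{
    pthread_mutex_lock(&page->lock);
    while (page != NULL)
    {
        FakePage   *next = page->next;

        printf("%s is working on page %d\n", who, page->pageno);
        if (next != NULL)
            pthread_mutex_lock(&next->lock);    /* lock next page first ... */
        pthread_mutex_unlock(&page->lock);      /* ... only then release current */
        page = next;
    }
}

static void *
worker(void *arg)
{
    walk_chained((FakePage *) arg, "scan");
    return NULL;
}

int
main(void)
{
    FakePage    pages[NPAGES];
    pthread_t   thread;

    for (int i = 0; i < NPAGES; i++)
    {
        pthread_mutex_init(&pages[i].lock, NULL);
        pages[i].pageno = i;
        pages[i].next = (i + 1 < NPAGES) ? &pages[i + 1] : NULL;
    }

    /* Both walkers traverse in the same direction, so neither can overtake. */
    pthread_create(&thread, NULL, worker, &pages[0]);
    walk_chained(&pages[0], "squeeze");
    pthread_join(thread, NULL);
    return 0;
}
```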
Apart from the above, I have also fixed a bug in the calculation of lowmask in _hash_get_oldblock_from_newbucket().

[1] - https://www.postgresql.org/message-id/CAA4eK1JJDWFY0_Ezs4ZxXgnrGtTn48vFuXniOLmL7FOWX-tKNw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1J%2B0OYWKswWYNEjrBk3LfGpGJ9iSV8bYPQ3M%3D-qpkMtwQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com