Re: Hash Indexes - Mailing list pgsql-hackers
From: Amit Kapila
Subject: Re: Hash Indexes
Msg-id: CAA4eK1LtONpESxHURD6KLLYxB0Pj8WXvxb=-VZg2_hEruW3BUg@mail.gmail.com
In response to: Re: Hash Indexes (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Thu, Nov 17, 2016 at 3:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Nov 12, 2016 at 12:49 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> You are right and I have changed the code as per your suggestion.
>
> So...
>
> +    /*
> +     * We always maintain the pin on bucket page for whole scan operation,
> +     * so releasing the additional pin we have acquired here.
> +     */
> +    if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
> +        _hash_dropbuf(rel, *bufp);
>
> This relies on the page contents to know whether we took a pin; that
> seems like a bad plan. We need to know intrinsically whether we took
> a pin.
>

Okay, changed to not rely on page contents.

> + * If the bucket split is in progress, then we need to skip tuples that
> + * are moved from old bucket. To ensure that vacuum doesn't clean any
> + * tuples from old or new buckets till this scan is in progress, maintain
> + * a pin on both of the buckets. Here, we have to be cautious about
>
> It wouldn't be a problem if VACUUM removed tuples from the new bucket,
> because they'd have to be dead anyway. It also wouldn't be a problem
> if it removed tuples from the old bucket that were actually dead. The
> real issue isn't vacuum anyway, but the process of cleaning up after a
> split. We need to hold the pin so that tuples being moved from the
> old bucket to the new bucket by the split don't get removed from the
> old bucket until our scan is done.
>

Updated the comments to explain this clearly.

> + old_blkno = _hash_get_oldblock_from_newbucket(rel, opaque->hasho_bucket);
>
> Couldn't you pass "bucket" here instead of "opaque->hasho_bucket"? I
> feel like I'm repeating this ad nauseam, but I really think it's bad
> to rely on the special space instead of our own local variables!
>

Okay, changed as per suggestion.

> -    /* we ran off the end of the bucket without finding a match */
> +    /*
> +     * We ran off the end of the bucket without finding a match.
> +     * Release the pin on bucket buffers. Normally, such pins are
> +     * released at end of scan, however scrolling cursors can
> +     * reacquire the bucket lock and pin in the same scan multiple
> +     * times.
> +     */
>      *bufP = so->hashso_curbuf = InvalidBuffer;
>      ItemPointerSetInvalid(current);
> +    _hash_dropscanbuf(rel, so);
>
> I think this comment is saying that we'll release the pin on the
> primary bucket page for now, and then reacquire it later if the user
> reverses the scan direction. But that doesn't sound very safe,
> because the bucket could be split in the meantime and the order in
> which tuples are returned could change. I think we want that to
> remain stable within a single query execution.
>

As explained in [1], this shouldn't be a problem.

> + _hash_readnext(rel, &buf, &page, &opaque,
> +                (opaque->hasho_flag & LH_BUCKET_PAGE) ? true : false);
>
> Same comment: don't rely on the special space to figure this out.
> Keep track. Also != 0 would be better than ? true : false.
>

After gluing the scans of the old and new buckets inside the _hash_read* APIs, this is no longer required.
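To illustrate the recurring review point about not depending on a page's special space, here is a small standalone C sketch; it is not code from the patch, and the names FakeBuffer and ScanState are invented for illustration. The idea is simply that the scan remembers in its own state whether it acquired the extra pin, so releasing it never requires re-reading page contents.

```c
/*
 * Standalone sketch (not PostgreSQL code): track "did this scan take an
 * extra pin?" in scan-local state instead of inferring it from the page.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct FakeBuffer
{
    int         pin_count;          /* stands in for a shared-buffer pin count */
} FakeBuffer;

typedef struct ScanState
{
    FakeBuffer *bucket_buf;         /* primary bucket page, if pinned */
    bool        holds_extra_pin;    /* recorded locally, never read from the page */
} ScanState;

static void
scan_pin_bucket(ScanState *scan, FakeBuffer *buf)
{
    buf->pin_count++;
    scan->bucket_buf = buf;
    scan->holds_extra_pin = true;   /* record the decision at acquisition time */
}

static void
scan_release_extra_pin(ScanState *scan)
{
    /* The check depends only on what this scan did, not on page contents. */
    if (scan->holds_extra_pin)
    {
        scan->bucket_buf->pin_count--;
        scan->holds_extra_pin = false;
    }
}

int
main(void)
{
    FakeBuffer  buf = {0};
    ScanState   scan = {NULL, false};

    scan_pin_bucket(&scan, &buf);
    scan_release_extra_pin(&scan);
    assert(buf.pin_count == 0);
    printf("pin count back to %d\n", buf.pin_count);
    return 0;
}
```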
> +            /*
> +             * setting hashso_skip_moved_tuples to false
> +             * ensures that we don't check for tuples that are
> +             * moved by split in old bucket and it also
> +             * ensures that we won't retry to scan the old
> +             * bucket once the scan for same is finished.
> +             */
> +            so->hashso_skip_moved_tuples = false;
>
> I think you've got a big problem here. Suppose the user starts the
> scan in the new bucket and runs it forward until they end up in the
> old bucket. Then they turn around and run the scan backward. When
> they reach the beginning of the old bucket, they're going to stop, not
> move back to the new bucket, AFAICS. Oops.
>
> _hash_first() has a related problem: a backward scan starts at the end
> of the new bucket and moves backward, but it should start at the end
> of the old bucket, and then when it reaches the beginning, flip to the
> new bucket and move backward through that one. Otherwise, a backward
> scan and a forward scan don't return tuples in opposite order, which
> they should.
>
> I think what you need to do to fix both of these problems is a more
> thorough job gluing the two buckets together. I'd suggest that the
> responsibility for switching between the two buckets should probably
> be given to _hash_readprev() and _hash_readnext(), because every place
> that needs to advance to the next or previous page cares about this.
> Right now you are trying to handle it mostly in the functions that
> call those functions, but that is prone to errors of omission.
>

Changed as per this idea: the APIs now do the switching, which fixes both problems.

> Also, I think that so->hashso_skip_moved_tuples is badly designed.
> There are two separate facts you need to know: (1) whether you are
> scanning a bucket that was still being populated at the start of your
> scan and (2) if yes, whether you are scanning the bucket being
> populated or whether you are instead scanning the corresponding "old"
> bucket. You're trying to keep track of that using one Boolean, but
> one Boolean only has two states and there are three possible states
> here.
>

The updated patch uses two boolean variables to track the bucket state.
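As an aside on the two-Boolean design, the following self-contained C sketch shows how a pair of flags can encode the three states Robert describes. The field names here (bucket_being_populated, scanning_old_bucket) are illustrative stand-ins, not necessarily the names used in the patch.

```c
/*
 * Illustrative sketch only: two booleans covering the three scan states.
 */
#include <stdbool.h>
#include <stdio.h>

typedef struct ScanSplitState
{
    /* true if the bucket being scanned was still being populated at scan start */
    bool        bucket_being_populated;
    /*
     * Meaningful only when the flag above is true: true while reading the
     * "old" bucket, false while reading the new (being-populated) bucket.
     */
    bool        scanning_old_bucket;
} ScanSplitState;

static const char *
describe(const ScanSplitState *s)
{
    if (!s->bucket_being_populated)
        return "no split in progress: scan only the target bucket";
    return s->scanning_old_bucket
        ? "split in progress: currently in the old bucket"
        : "split in progress: currently in the new (being-populated) bucket";
}

int
main(void)
{
    ScanSplitState states[] = {
        {false, false},
        {true, false},
        {true, true},
    };

    for (int i = 0; i < 3; i++)
        printf("%s\n", describe(&states[i]));
    return 0;
}
```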
> +    if (H_BUCKET_BEING_SPLIT(pageopaque) && IsBufferCleanupOK(buf))
> +    {
> +
> +        /* release the lock on bucket buffer, before completing the split. */
>
> Extra blank line.
>

Removed.

> +moved-by-split flag on a tuple indicates that tuple is moved from old to new
> +bucket. The concurrent scans can skip such tuples till the split operation is
> +finished. Once the tuple is marked as moved-by-split, it will remain so
> +forever but that does no harm. We have intentionally not cleared it as that
> +can generate an additional I/O which is not necessary.
>
> The first sentence needs to start with "the" but the second sentence shouldn't.
>

Changed.

> It would be good to adjust this part a bit to more clearly explain
> that the split-in-progress and split-cleanup flags are bucket-level
> flags, while moved-by-split is a per-tuple flag. It's possible to
> figure this out from what you've written, but I think it could be more
> clear. Another thing that is strange is that the code uses THREE
> flags, bucket-being-split, bucket-being-populated, and
> needs-split-cleanup, but the README conflates the first two and uses a
> different name.
>

Updated the patch to use bucket-being-split and bucket-being-populated when explaining the split operation in the README, and also changed the README to clearly indicate which flags are bucket-level and which are tuple-level.

> +previously-acquired content lock, but not pin and repeat the process using the
>
> s/but not pin/but not the pin,/
>

Changed.

>  A problem is that if a split fails partway through (eg due to insufficient
> -disk space) the index is left corrupt. The probability of that could be
> -made quite low if we grab a free page or two before we update the meta
> -page, but the only real solution is to treat a split as a WAL-loggable,
> +disk space or crash) the index is left corrupt. The probability of that
> +could be made quite low if we grab a free page or two before we update the
> +meta page, but the only real solution is to treat a split as a WAL-loggable,
>  must-complete action. I'm not planning to teach hash about WAL in this
> -go-round.
> +go-round. However, we do try to finish the incomplete splits during insert
> +and split.
>
> I think this paragraph needs a much heavier rewrite explaining the new
> incomplete split handling. It's basically wrong now. Perhaps replace
> it with something like this:
>
> --
> If a split fails partway through (e.g. due to insufficient disk space
> or an interrupt), the index will not be corrupted. Instead, we'll
> retry the split every time a tuple is inserted into the old bucket
> prior to inserting the new tuple; eventually, we should succeed. The
> fact that a split is left unfinished doesn't prevent subsequent
> buckets from being split, but we won't try to split the bucket again
> until the prior split is finished. In other words, a bucket can be in
> the middle of being split for some time, but ti can't be in the middle
> of two splits at the same time.
>
> Although we can survive a failure to split a bucket, a crash is likely
> to corrupt the index, since hash indexes are not yet WAL-logged.
> --
>

s/ti/it: fixed the typo and used the suggested text in the README.

> +   Acquire cleanup lock on target bucket
> +   Scan and remove tuples
> +   For overflow page, first we need to lock the next page and then
> +   release the lock on current bucket or overflow page
> +   Ensure to have buffer content lock in exclusive mode on bucket page
> +   If buffer pincount is one, then compact free space as needed
> +   Release lock
>
> I don't think this summary is particularly correct. You would never
> guess from this that we lock each bucket page in turn and then go back
> and try to relock the primary bucket page at the end. It's more like:
>
> acquire cleanup lock on primary bucket page
> loop:
>     scan and remove tuples
>     if this is the last bucket page, break out of loop
>     pin and x-lock next page
>     release prior lock and pin (except keep pin on primary bucket page)
> if the page we have locked is not the primary bucket page:
>     release lock and take exclusive lock on primary bucket page
> if there are no other pins on the primary bucket page:
>     squeeze the bucket to remove free space
>

Yeah, that is clearer, so I have used it in the README.

> Come to think of it, I'm a little worried about the locking in
> _hash_squeezebucket(). It seems like we drop the lock on each "write"
> bucket page before taking the lock on the next one. So a concurrent
> scan could get ahead of the cleanup process. That would be bad,
> wouldn't it?
>

As discussed in [2], I have changed the code to use lock-chaining during the squeeze phase.
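For readers unfamiliar with lock-chaining (hand-over-hand locking), here is a minimal standalone illustration using POSIX threads rather than PostgreSQL's buffer locks; FakePage and walk_chained are invented names, and error checking is omitted for brevity. The point is that the next page's lock is always taken before the current one is released, so a concurrent walker of the same chain can never overtake the lock holder.

```c
/*
 * Standalone illustration of lock-chaining over a page chain (pthreads).
 */
#include <pthread.h>
#include <stdio.h>

#define NPAGES 4

typedef struct FakePage
{
    pthread_mutex_t lock;
    int             pageno;
    struct FakePage *next;
} FakePage;

/* Walk the chain with hand-over-hand (chained) locking. */
static void
walk_chained(FakePage *page, const char *who)
{
    pthread_mutex_lock(&page->lock);
    while (page != NULL)
    {
        FakePage   *next = page->next;

        printf("%s is working on page %d\n", who, page->pageno);
        if (next != NULL)
            pthread_mutex_lock(&next->lock);    /* lock next page first ... */
        pthread_mutex_unlock(&page->lock);      /* ... only then release current */
        page = next;
    }
}

static void *
worker(void *arg)
{
    walk_chained((FakePage *) arg, "scan");
    return NULL;
}

int
main(void)
{
    FakePage    pages[NPAGES];
    pthread_t   thread;

    for (int i = 0; i < NPAGES; i++)
    {
        pthread_mutex_init(&pages[i].lock, NULL);
        pages[i].pageno = i;
        pages[i].next = (i + 1 < NPAGES) ? &pages[i + 1] : NULL;
    }

    /* Both walkers traverse in the same direction, so neither can overtake. */
    pthread_create(&thread, NULL, worker, &pages[0]);
    walk_chained(&pages[0], "squeeze");
    pthread_join(thread, NULL);
    return 0;
}
```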
Apart from the above, I have also fixed a bug in the calculation of lowmask in _hash_get_oldblock_from_newbucket().

[1] - https://www.postgresql.org/message-id/CAA4eK1JJDWFY0_Ezs4ZxXgnrGtTn48vFuXniOLmL7FOWX-tKNw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1J%2B0OYWKswWYNEjrBk3LfGpGJ9iSV8bYPQ3M%3D-qpkMtwQ%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com