Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers
From | Melanie Plageman
---|---
Subject | Re: BitmapHeapScan streaming read user and prelim refactoring
Date |
Msg-id | CAAKRu_YHAk2z22Puast888f4izfeezfEUB0wWC0MVWoqh=51vQ@mail.gmail.com
In response to | Re: BitmapHeapScan streaming read user and prelim refactoring (Melanie Plageman <melanieplageman@gmail.com>)
Responses | Re: BitmapHeapScan streaming read user and prelim refactoring
List | pgsql-hackers
On Fri, Mar 1, 2024 at 2:31 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Thu, Feb 29, 2024 at 7:29 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> >
> > On Thu, Feb 29, 2024 at 5:44 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> > >
> > > On 2/29/24 22:19, Melanie Plageman wrote:
> > > > On Thu, Feb 29, 2024 at 7:54 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> > > > >
> > > > > On 2/29/24 00:40, Melanie Plageman wrote:
> > > > > > On Wed, Feb 28, 2024 at 6:17 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> > > > > > >
> > > > > > > On 2/28/24 21:06, Melanie Plageman wrote:
> > > > > > > > On Wed, Feb 28, 2024 at 2:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> > > > > > > > >
> > > > > > > > > On 2/28/24 15:56, Tomas Vondra wrote:
> > > > > > > > > > > ...
> > > > > > > > > >
> > > > > > > > > > Sure, I can do that. It'll take a couple hours to get the results, I'll share them when I have them.
> > > > > > > > >
> > > > > > > > > Here are the results with only patches 0001 - 0012 applied (i.e. without the patch introducing the streaming read API, and the patch switching the bitmap heap scan to use it).
> > > > > > > > >
> > > > > > > > > The changes in performance don't disappear entirely, but the scale is certainly much smaller - both in the complete results for all runs, and for the "optimal" runs that would actually pick bitmapscan.
> > > > > > > >
> > > > > > > > Hmm. I'm trying to think how my refactor could have had this impact. It seems like all the most notable regressions are with 4 parallel workers. What do the numeric column labels mean across the top (2,4,8,16...) -- are they related to "matches"? And if so, what does that mean?
> > > > > > >
> > > > > > > That's the number of distinct values matched by the query, which should be an approximation of the number of matching rows. The number of distinct values differs by data set, but for 1M rows it's roughly like this:
> > > > > > >
> > > > > > > uniform: 10k
> > > > > > > linear: 10k
> > > > > > > cyclic: 100
> > > > > > >
> > > > > > > So for example matches=128 means ~1% of rows for the uniform/linear data sets, and 100% for the cyclic data set.
> > > > > >
> > > > > > Ah, thank you for the explanation. I also looked at your script after having sent this email and saw that it is clear in your script what "matches" is.
> > > > > >
> > > > > > > As for the possible cause, I think it's clear most of the difference comes from the last patch that actually switches bitmap heap scan to the streaming read API. That's mostly expected/understandable, although we probably need to look into the regressions or cases with e_i_c=0.
> > > > > >
> > > > > > Right, I'm mostly surprised about the regressions for patches 0001-0012.
> > > > > >
> > > > > > Re eic 0: Thomas Munro and I chatted off-list, and you bring up a great point about eic 0. In the old bitmapheapscan code, eic 0 basically disabled prefetching, but with the streaming read API it will still issue fadvises when eic is 0. That is an easy one-line fix. Thomas prefers to fix it by always avoiding an fadvise for the last buffer in a range before issuing a read (since we are about to read it anyway, best not to fadvise it too). This will fix eic 0 and also cut one system call from each invocation of the streaming read machinery.
> > > > > >
> > > > > > > To analyze the 0001-0012 patches, maybe it'd be helpful to run tests for individual patches. I can try doing that tomorrow. It'll have to be a limited set of tests, to reduce the time, but it might tell us whether it's due to a single patch or multiple patches.
> > > > > >
> > > > > > Yes, tomorrow I planned to start trying to repro some of the "red" cases myself. Any one of the commits could cause a slight regression, but a 3.5x regression is quite surprising, so I might focus on trying to repro that locally and then narrow down which patch causes it.
> > > > > >
> > > > > > For the non-cached regressions, perhaps the commit to use the correct recheck flag (0004) when prefetching could be the culprit. And for the cached regressions, my money is on the commit which changes the whole control flow of BitmapHeapNext() and the next_block() and next_tuple() functions (0010).
> > > > >
> > > > > I do have some partial results comparing the patches. I only ran one of the more affected workloads (cyclic) on the xeon; attached is a PDF comparing master and the 0001-0014 patches. The percentages are timing vs. the preceding patch (green - faster, red - slower).
> > > >
> > > > Just confirming: the results are for uncached?
> > >
> > > Yes, cyclic data set, uncached case. I picked this because it seemed like one of the most affected cases. Do you want me to test some other cases too?
> >
> > So, I actually may have found the source of at least part of the regression with 0010. I was able to reproduce the regression with patch 0010 applied for the uncached case with 4 workers, eic 8, and 100000000 rows for the cyclic dataset. I see it for all numbers of matches. The regression went away (for this specific example) when I moved the BitmapAdjustPrefetchIterator call back up to before the call to table_scan_bitmap_next_block() like this:
> >
> > diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
> > index f7ecc060317..268996bdeea 100644
> > --- a/src/backend/executor/nodeBitmapHeapscan.c
> > +++ b/src/backend/executor/nodeBitmapHeapscan.c
> > @@ -279,6 +279,8 @@ BitmapHeapNext(BitmapHeapScanState *node)
> >  		}
> >
> >  new_page:
> > +		BitmapAdjustPrefetchIterator(node, node->blockno);
> > +
> >  		if (!table_scan_bitmap_next_block(scan, &node->recheck,
> >  										  &lossy, &node->blockno))
> >  			break;
> >
> > @@ -287,7 +289,6 @@ new_page:
> >  		else
> >  			node->exact_pages++;
> >
> > -		BitmapAdjustPrefetchIterator(node, node->blockno);
> >  		/* Adjust the prefetch target */
> >  		BitmapAdjustPrefetchTarget(node);
> >  	}
> >
> > It makes sense this would fix it. I haven't tried all the combinations you tried. Do you mind running your tests with the new code? I've pushed it into this branch: https://github.com/melanieplageman/postgres/commits/bhs_pgsr/
>
> Hold the phone on this one. I realized why I moved BitmapAdjustPrefetchIterator after table_scan_bitmap_next_block() in the first place -- master calls BitmapAdjustPrefetchIterator after the tbm_iterate() for the current block -- otherwise, with eic = 1, it considers the prefetch iterator to be behind the current block iterator. I'm going to go through and figure out what order this must be done in and fix it.
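[As an aside for readers following the eic = 0 discussion quoted above, here is a minimal, self-contained sketch of the "skip the fadvise for the last buffer in a range" idea. It is illustrative only: the function name, the fixed 8kB block size, and the bare file descriptor are assumptions made for the example, not the actual streaming read (read_stream) code.]

    /*
     * Sketch only: advise the kernel about every block in the run except the
     * last one.  The last block is read synchronously right away, so advising
     * it would only add a system call.  With effective_io_concurrency = 0 the
     * run length is 1, so no advice is issued at all, matching the behavior of
     * the old bitmap heap scan prefetching code.
     */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdint.h>

    #define EXAMPLE_BLCKSZ 8192		/* assumed block size for the example */

    static void
    advise_all_but_last(int fd, uint32_t first_block, int nblocks)
    {
    	for (int i = 0; i < nblocks - 1; i++)
    		(void) posix_fadvise(fd,
    							 (off_t) (first_block + i) * EXAMPLE_BLCKSZ,
    							 EXAMPLE_BLCKSZ, POSIX_FADV_WILLNEED);

    	/* caller immediately goes on to read blocks first_block .. first_block + nblocks - 1 */
    }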
So, I investigated this further and, as far as I can tell, for parallel bitmapheapscan the timing around when workers decrement prefetch_pages causes the performance differences with patch 0010 applied. It makes very little sense to me, but some of the queries I borrowed from your regression examples are up to 30% slower when this code from BitmapAdjustPrefetchIterator() is after table_scan_bitmap_next_block() instead of before it:

    SpinLockAcquire(&pstate->mutex);
    if (pstate->prefetch_pages > 0)
        pstate->prefetch_pages--;
    SpinLockRelease(&pstate->mutex);

I did some stracing and did see much more time spent in futex/wait with this code after the call to table_scan_bitmap_next_block() vs before it (table_scan_bitmap_next_block() calls ReadBuffer()).

In my branch, I've now moved only the parallel prefetch_pages-- code to before table_scan_bitmap_next_block():

https://github.com/melanieplageman/postgres/tree/bhs_pgsr

I'd be interested to know if you see the regressions go away with 0010 applied (commit message "Make table_scan_bitmap_next_block() async friendly" and sha bfdcbfee7be8e2c461).

- Melanie
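[For context, a schematic fragment of the reordering described above, in the same shape as the diff quoted earlier in the thread. This is a sketch, not the actual commit on the bhs_pgsr branch; the point is that only the decrement of the shared prefetch_pages counter is hoisted above the read.]

    new_page:
    	/*
    	 * Sketch only: decrement the shared prefetch_pages counter before this
    	 * worker potentially blocks on I/O inside table_scan_bitmap_next_block()
    	 * (which calls ReadBuffer()), rather than after, so other workers see an
    	 * up-to-date count while the read is in progress.  pstate is NULL for a
    	 * non-parallel scan.
    	 */
    	if (pstate != NULL)
    	{
    		SpinLockAcquire(&pstate->mutex);
    		if (pstate->prefetch_pages > 0)
    			pstate->prefetch_pages--;
    		SpinLockRelease(&pstate->mutex);
    	}

    	/* may block here while the heap page is read in */
    	if (!table_scan_bitmap_next_block(scan, &node->recheck,
    									  &lossy, &node->blockno))
    		break;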