Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers
From | Melanie Plageman
---|---
Subject | Re: BitmapHeapScan streaming read user and prelim refactoring
Date |
Msg-id | CAAKRu_YHAk2z22Puast888f4izfeezfEUB0wWC0MVWoqh=51vQ@mail.gmail.com
In response to | Re: BitmapHeapScan streaming read user and prelim refactoring (Melanie Plageman <melanieplageman@gmail.com>)
Responses | Re: BitmapHeapScan streaming read user and prelim refactoring
List | pgsql-hackers
On Fri, Mar 1, 2024 at 2:31 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
>
> On Thu, Feb 29, 2024 at 7:29 PM Melanie Plageman <melanieplageman@gmail.com> wrote:
> >
> > On Thu, Feb 29, 2024 at 5:44 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> > >
> > > On 2/29/24 22:19, Melanie Plageman wrote:
> > > > On Thu, Feb 29, 2024 at 7:54 AM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> > > > >
> > > > > On 2/29/24 00:40, Melanie Plageman wrote:
> > > > > > On Wed, Feb 28, 2024 at 6:17 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> > > > > > >
> > > > > > > On 2/28/24 21:06, Melanie Plageman wrote:
> > > > > > > > On Wed, Feb 28, 2024 at 2:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> > > > > > > > >
> > > > > > > > > On 2/28/24 15:56, Tomas Vondra wrote:
> > > > > > > > > > > ...
> > > > > > > > > >
> > > > > > > > > > Sure, I can do that. It'll take a couple hours to get the results, I'll share them when I have them.
> > > > > > > > >
> > > > > > > > > Here are the results with only patches 0001 - 0012 applied (i.e. without the patch introducing the streaming read API, and the patch switching the bitmap heap scan to use it).
> > > > > > > > >
> > > > > > > > > The changes in performance don't disappear entirely, but the scale is certainly much smaller - both in the complete results for all runs, and for the "optimal" runs that would actually pick bitmapscan.
> > > > > > > >
> > > > > > > > Hmm. I'm trying to think how my refactor could have had this impact. It seems like all the most notable regressions are with 4 parallel workers. What do the numeric column labels mean across the top (2,4,8,16...) -- are they related to "matches"? And if so, what does that mean?
> > > > > > >
> > > > > > > That's the number of distinct values matched by the query, which should be an approximation of the number of matching rows. The number of distinct values differs by data set, but for 1M rows it's roughly like this:
> > > > > > >
> > > > > > > uniform: 10k
> > > > > > > linear: 10k
> > > > > > > cyclic: 100
> > > > > > >
> > > > > > > So for example matches=128 means ~1% of rows for the uniform/linear data sets, and 100% for the cyclic data set.
> > > > > >
> > > > > > Ah, thank you for the explanation. I also looked at your script after having sent this email and saw that it is clear in your script what "matches" is.
> > > > > >
> > > > > > > As for the possible cause, I think it's clear most of the difference comes from the last patch that actually switches bitmap heap scan to the streaming read API. That's mostly expected/understandable, although we probably need to look into the regressions or cases with e_i_c=0.
> > > > > >
> > > > > > Right, I'm mostly surprised about the regressions for patches 0001-0012.
> > > > > >
> > > > > > Re eic 0: Thomas Munro and I chatted off-list, and you bring up a great point about eic 0. In the old bitmapheapscan code, eic 0 basically disabled prefetching, but with the streaming read API it will still issue fadvises when eic is 0. That is an easy one-line fix. Thomas prefers to fix it by always avoiding an fadvise for the last buffer in a range before issuing a read (since we are about to read it anyway, best not to fadvise it too). This will fix eic 0 and also cut one system call from each invocation of the streaming read machinery.
> > > > > >
> > > > > > > To analyze the 0001-0012 patches, maybe it'd be helpful to run tests for individual patches. I can try doing that tomorrow. It'll have to be a limited set of tests, to reduce the time, but it might tell us whether it's due to a single patch or multiple patches.
> > > > > >
> > > > > > Yes, tomorrow I planned to start trying to repro some of the "red" cases myself. Any one of the commits could cause a slight regression, but a 3.5x regression is quite surprising, so I might focus on trying to repro that locally and then narrow down which patch causes it.
> > > > > >
> > > > > > For the non-cached regressions, perhaps the commit to use the correct recheck flag (0004) when prefetching could be the culprit. And for the cached regressions, my money is on the commit which changes the whole control flow of BitmapHeapNext() and the next_block() and next_tuple() functions (0010).
> > > > >
> > > > > I do have some partial results comparing the patches. I only ran one of the more affected workloads (cyclic) on the xeon; attached is a PDF comparing master and the 0001-0014 patches. The percentages are timing vs. the preceding patch (green - faster, red - slower).
> > > >
> > > > Just confirming: the results are for uncached?
> > >
> > > Yes, cyclic data set, uncached case. I picked this because it seemed like one of the most affected cases. Do you want me to test some other cases too?
> >
> > So, I actually may have found the source of at least part of the regression with 0010. I was able to reproduce the regression with patch 0010 applied for the uncached case with 4 workers, eic 8, and 100000000 rows for the cyclic dataset. I see it for all numbers of matches. The regression went away (for this specific example) when I moved the BitmapAdjustPrefetchIterator call back up to before the call to table_scan_bitmap_next_block() like this:
> >
> > diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
> > index f7ecc060317..268996bdeea 100644
> > --- a/src/backend/executor/nodeBitmapHeapscan.c
> > +++ b/src/backend/executor/nodeBitmapHeapscan.c
> > @@ -279,6 +279,8 @@ BitmapHeapNext(BitmapHeapScanState *node)
> >  		}
> >
> >  new_page:
> > +		BitmapAdjustPrefetchIterator(node, node->blockno);
> > +
> >  		if (!table_scan_bitmap_next_block(scan, &node->recheck,
> >  										  &lossy, &node->blockno))
> >  			break;
> >
> > @@ -287,7 +289,6 @@ new_page:
> >  		else
> >  			node->exact_pages++;
> >
> > -		BitmapAdjustPrefetchIterator(node, node->blockno);
> >  		/* Adjust the prefetch target */
> >  		BitmapAdjustPrefetchTarget(node);
> >  	}
> >
> > It makes sense this would fix it. I haven't tried all the combinations you tried. Do you mind running your tests with the new code? I've pushed it into this branch: https://github.com/melanieplageman/postgres/commits/bhs_pgsr/
>
> Hold the phone on this one. I realized why I moved BitmapAdjustPrefetchIterator after table_scan_bitmap_next_block() in the first place -- master calls BitmapAdjustPrefetchIterator after the tbm_iterate() for the current block -- otherwise, with eic = 1, it considers the prefetch iterator to be behind the current block iterator. I'm going to go through and figure out what order this must be done in and fix it.
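[As an aside for readers following the eic = 0 discussion quoted above, here is a minimal, self-contained sketch of the "skip the fadvise for the last buffer in a range" idea. It is illustrative only: the function name, the fixed 8kB block size, and the bare file descriptor are assumptions made for the example, not the actual streaming read (read_stream) code.]

    /*
     * Sketch only: advise the kernel about every block in the run except the
     * last one.  The last block is read synchronously right away, so advising
     * it would only add a system call.  With effective_io_concurrency = 0 the
     * run length is 1, so no advice is issued at all, matching the behavior of
     * the old bitmap heap scan prefetching code.
     */
    #include <sys/types.h>
    #include <fcntl.h>
    #include <stdint.h>

    #define EXAMPLE_BLCKSZ 8192		/* assumed block size for the example */

    static void
    advise_all_but_last(int fd, uint32_t first_block, int nblocks)
    {
    	for (int i = 0; i < nblocks - 1; i++)
    		(void) posix_fadvise(fd,
    							 (off_t) (first_block + i) * EXAMPLE_BLCKSZ,
    							 EXAMPLE_BLCKSZ, POSIX_FADV_WILLNEED);

    	/* caller immediately goes on to read blocks first_block .. first_block + nblocks - 1 */
    }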
So, I investigated this further and, as far as I can tell, for parallel bitmapheapscan the timing around when workers decrement prefetch_pages causes the performance differences with patch 0010 applied. It makes very little sense to me, but some of the queries I borrowed from your regression examples are up to 30% slower when this code from BitmapAdjustPrefetchIterator() is after table_scan_bitmap_next_block() instead of before it:

    SpinLockAcquire(&pstate->mutex);
    if (pstate->prefetch_pages > 0)
        pstate->prefetch_pages--;
    SpinLockRelease(&pstate->mutex);

I did some stracing and did see much more time spent in futex/wait with this code after the call to table_scan_bitmap_next_block() vs before it (table_scan_bitmap_next_block() calls ReadBuffer()).

In my branch, I've now moved only the parallel prefetch_pages-- code to before table_scan_bitmap_next_block():

https://github.com/melanieplageman/postgres/tree/bhs_pgsr

I'd be interested to know if you see the regressions go away with 0010 applied (commit message "Make table_scan_bitmap_next_block() async friendly" and sha bfdcbfee7be8e2c461).

- Melanie
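[For context, a schematic fragment of the reordering described above, in the same shape as the diff quoted earlier in the thread. This is a sketch, not the actual commit on the bhs_pgsr branch; the point is that only the decrement of the shared prefetch_pages counter is hoisted above the read.]

    new_page:
    	/*
    	 * Sketch only: decrement the shared prefetch_pages counter before this
    	 * worker potentially blocks on I/O inside table_scan_bitmap_next_block()
    	 * (which calls ReadBuffer()), rather than after, so other workers see an
    	 * up-to-date count while the read is in progress.  pstate is NULL for a
    	 * non-parallel scan.
    	 */
    	if (pstate != NULL)
    	{
    		SpinLockAcquire(&pstate->mutex);
    		if (pstate->prefetch_pages > 0)
    			pstate->prefetch_pages--;
    		SpinLockRelease(&pstate->mutex);
    	}

    	/* may block here while the heap page is read in */
    	if (!table_scan_bitmap_next_block(scan, &node->recheck,
    									  &lossy, &node->blockno))
    		break;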