Re: [HACKERS] Parallel bitmap heap scan - Mailing list pgsql-hackers
From: Dilip Kumar
Subject: Re: [HACKERS] Parallel bitmap heap scan
Msg-id: CAFiTN-s5KuRuDrQCEpiHHzmVf7JTtbnb8eb10c-6AywJDxbWrA@mail.gmail.com
In response to: Re: [HACKERS] Parallel bitmap heap scan (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: [HACKERS] Parallel bitmap heap scan
List: pgsql-hackers
On Fri, Feb 17, 2017 at 2:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:

> + * in order to share relptrs of the chunk and the pages arrays and other
> + * TBM related information with other workers.
>
> No relptrs any more.

Fixed.

> + dsa_pointer dsapagetable;    /* dsa_pointer to the element array */
> + dsa_pointer dsapagetableold; /* dsa_pointer to the old element array */
> + dsa_area   *dsa;             /* reference to per-query dsa area */
> + dsa_pointer ptpages;         /* dsa_pointer to the page array */
> + dsa_pointer ptchunks;        /* dsa_pointer to the chunk array */
>
> Let's put the DSA pointer first and then the other stuff after it.
> That seems more logical.

Done that way.

> +typedef struct TBMSharedIteratorState
> +{
> +    int         spageptr;    /* next spages index */
> +    int         schunkptr;   /* next schunks index */
> +    int         schunkbit;   /* next bit to check in current schunk */
> +    LWLock      lock;        /* lock to protect above members */
> +    dsa_pointer pagetable;   /* dsa pointers to head of pagetable data */
> +    dsa_pointer spages;      /* dsa pointer to page array */
> +    dsa_pointer schunks;     /* dsa pointer to chunk array */
> +    int         nentries;    /* number of entries in pagetable */
> +    int         maxentries;  /* limit on same to meet maxbytes */
> +    int         npages;      /* number of exact entries in pagetable */
> +    int         nchunks;     /* number of lossy entries in pagetable */
> +} TBMSharedIteratorState;
>
> I think you've got this largely backwards from the most logical order.
> Let's order it like this: nentries, maxentries, npages, nchunks,
> pagetable, spages, schunks, lock (to protect BELOW members), spageptr,
> schunkptr, schunkbit.

Done.

> +struct TBMSharedIterator
> +{
> +    PagetableEntry *ptbase;  /* pointer to the pagetable element array */
> +    dsa_area   *dsa;         /* reference to per-query dsa area */
> +    int        *ptpages;     /* sorted exact page index list */
> +    int        *ptchunks;    /* sorted lossy page index list */
> +    TBMSharedIteratorState *state; /* shared members */
> +    TBMIterateResult output; /* MUST BE LAST (because variable-size) */
> +};
>
> Do we really need "dsa" here when it's already present in the shared
> state? It doesn't seem like we even use it for anything. It's needed
> to create each backend's TBMSharedIterator, but not afterwards, I
> think.

Right, removed.

> In terms of ordering, I'd move "state" to the top of the structure,
> just like "tbm" comes first in a TBMIterator.

Yeah, that looks better. Done that way.

> + * total memory consumption. If dsa passed to this function is not NULL
> + * then the memory for storing elements of the underlying page table will
> + * be allocated from the DSA.
>
> Notice that you capitalized "DSA" in two different ways in the same
> sentence; I'd go for the all-caps version. Also, you need the word
> "the" before the first one.

Fixed all such instances.

> + if (tbm->status == TBM_HASH && (tbm->iterating == TBM_NOT_ITERATING))
>
> Excess parentheses.

Fixed.

> + * tbm_prepare_shared_iterate - prepare to iterator through a TIDBitmap
> + * by multiple workers using shared iterator.
>
> Prepare to iterate, not prepare to iterator. I suggest rephrasing
> this as "prepare shared iteration state for a TIDBitmap".

Fixed.

> + * The TBMSharedIteratorState will be allocated from DSA so that multiple
> + * worker can attach to it and iterate jointly.
>
> Maybe change to "The necessary shared state will be allocated from the
> DSA passed to tbm_create, so that multiple workers can attach to it
> and iterate jointly".

Done.
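For reference, here is roughly how the two structures end up looking with the
suggested member ordering applied. This is a sketch reconstructed from the
quoted patch fragments and the review comments, not necessarily the exact
final layout:

typedef struct TBMSharedIteratorState
{
    int         nentries;       /* number of entries in pagetable */
    int         maxentries;     /* limit on same to meet maxbytes */
    int         npages;         /* number of exact entries in pagetable */
    int         nchunks;        /* number of lossy entries in pagetable */
    dsa_pointer pagetable;      /* dsa pointer to head of pagetable data */
    dsa_pointer spages;         /* dsa pointer to page array */
    dsa_pointer schunks;        /* dsa pointer to chunk array */
    LWLock      lock;           /* lock to protect the following members */
    int         spageptr;       /* next spages index */
    int         schunkptr;      /* next schunks index */
    int         schunkbit;      /* next bit to check in current schunk */
} TBMSharedIteratorState;

struct TBMSharedIterator
{
    TBMSharedIteratorState *state;  /* shared state, lives in DSA */
    PagetableEntry *ptbase;     /* pointer to the pagetable element array */
    int        *ptpages;        /* sorted exact page index list */
    int        *ptchunks;       /* sorted lossy page index list */
    TBMIterateResult output;    /* MUST BE LAST (because variable-size) */
};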
> + * This will convert the pagetable hash into page and chunk array of the index
> + * into pagetable array so that multiple workers can use it to get the actual
> + * page entry.
>
> I think you can leave off everything starting from "so that". It's
> basically redundant with what you already said.

Done.

> + dsa_pointer iterator;
> + TBMSharedIteratorState *iterator_state;
>
> These aren't really great variable names, because the latter isn't the
> state associated with the former. They're both the same object.
> Maybe just "dp" and "istate".

Done.

> I think this function should Assert(tbm->dsa != NULL) and
> Assert(tbm->iterating != TBM_ITERATING_PRIVATE), and similarly
> tbm_begin_iterate should Assert(tbm->iterating != TBM_ITERATE_SHARED).

Done.

> + /*
> +  * If we have a hashtable, create and fill the sorted page lists, unless
> +  * we already did that for a previous iterator. Note that the lists are
> +  * attached to the bitmap not the iterator, so they can be used by more
> +  * than one iterator. However, we keep dsa_pointer to these in the shared
> +  * iterator so that other workers can access them directly.
> +  */
>
> This is mostly cut-and-pasted from tbm_begin_iterate() but it's not
> really correct here now because (1) we're no longer trying to fake up
> a TIDBitmap proper in every backend and (2) this code runs even if we
> don't have a hashtable. I think the comment should be something like
> "If we're not already iterating, create and fill the sorted page
> lists. (If we are, the sorted page lists are already stored in the
> TIDBitmap, and we can just reuse them.)"

Done.

> + * Create page list and chunk list using relptr so that we can share
> + * this information across multiple workers.
>
> No relptrs any more.

Done.

> + tbm->ptpages = dsa_allocate(tbm->dsa, tbm->npages * (sizeof(int)));
>
> Extra parentheses.
>
> + tbm->nchunks * (sizeof(int)));
>
> Extra parentheses.

Fixed.

> + * If TBM status is TBM_HASH then iterate over the pagetable and
>
> "If the TBM status is"...
>
> + * directly store it's index i.e. 0 in page array
>
> s/it's/its/

Done.

> Don't you also need some code here to handle the TBM_EMPTY case?

IMHO, we don't need to do anything for TBM_EMPTY: npages and nchunks will
both be zero, so the iterator will handle that case just as it does in the
non-parallel path.

> + /*
> +  * Store the TBM member in the shared state so that we can share them
> +  * across multiple workers.
> +  */
> + iterator_state->maxentries = tbm->maxentries;
> + iterator_state->nchunks = tbm->nchunks;
> + iterator_state->nentries = tbm->nentries;
> + iterator_state->npages = tbm->npages;
> + iterator_state->pagetable = tbm->dsapagetable;
> + iterator_state->spages = tbm->ptpages;
> + iterator_state->schunks = tbm->ptchunks;
> +
> + /* Initialize the shared iterator state */
> + iterator_state->schunkbit = 0;
> + iterator_state->schunkptr = 0;
> + iterator_state->spageptr = 0;
> +
> + /* Initialize the iterator lock */
> + LWLockInitialize(&iterator_state->lock, LWTRANCHE_PARALLEL_TBM_ITERATOR);
>
> Set all of the structure members in the same order that you declare them.

Done.

> + * tbm_extract_page_tuple - extract the tuple offsets from the page
>
> s/the page/a page/
>
> + * Process the page bits to extract the tuple offsets and store them into
> + * TBMIterateResult.
>
> This duplicates the preceding, a bit. Maybe just "The extracted
> offsets are stored into the TBMIterateResult".

Done.
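Concretely, a sketch of what the entry checks and shared-state setup in
tbm_prepare_shared_iterate might look like once the assignments follow
declaration order (istate is the TBMSharedIteratorState just allocated from
the DSA; the surrounding allocation code is omitted):

    Assert(tbm->dsa != NULL);
    Assert(tbm->iterating != TBM_ITERATING_PRIVATE);

    /* ... allocate istate and build the sorted page/chunk index lists ... */

    /* Store the TBM members in the shared state, in declaration order */
    istate->nentries = tbm->nentries;
    istate->maxentries = tbm->maxentries;
    istate->npages = tbm->npages;
    istate->nchunks = tbm->nchunks;
    istate->pagetable = tbm->dsapagetable;
    istate->spages = tbm->ptpages;
    istate->schunks = tbm->ptchunks;

    /* Initialize the lock and the per-iteration progress counters */
    LWLockInitialize(&istate->lock, LWTRANCHE_TBM);
    istate->spageptr = 0;
    istate->schunkptr = 0;
    istate->schunkbit = 0;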
> +/*
> + * tbm_advance_schunkbit
> + *
> + * Advance the chunkbit to get the next page entry from the chunk
> + */
>
> The formatting of this header comment is randomly different from the
> preceding and following header comments.

Different functions in this file seem to use different comment formatting,
but I've made this one match the nearby functions.

> I would change the argument to schunkbitp, declare a local variable
> named schunkbit, and do int schunkbit = *schunkbitp at the top and
> *schunkbitp = schunkbit at the bottom.

Fixed.

> + * As above, but this will iterate using shared iterator which is shared
> + * across multiple workers. We need to acquire the iterator LWLock, before
> + * accessing the shared members.
>
> "using shared iterator which is shared" -> "using an iterator which is shared"
> "multiple workers" -> "multiple processes"

Done.

> + PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];
>
> Maybe &ptbase[ptchunks[state->schunkptr]] would look a little nicer.
>
> + PagetableEntry *chunk = ptbase + ptchunks[state->schunkptr];
> + PagetableEntry *page = ptbase + ptpages[state->spageptr];
> + PagetableEntry *page = ptbase + ptpages[state->spageptr];
>
> Similarly.

Done.

> + * tbm_end_shared_iterate - finish an iteration over a TIDBitmap
>
> Maybe s/an iteration/a shared iteration/
>
> + * As above, but it frees the memory of TBMSharedIterator.
>
> Instead of this, I'd write "This doesn't free any of the shared state
> associated with the iterator, just our backend-private state."

Done.

> + PagetableEntry *lpage = ((PagetableEntry *) arg) + *((int *) left);
> + PagetableEntry *rpage = ((PagetableEntry *) arg) + *((int *) right);
>
> Again, try to use array dereference notation rather than pointer
> arithmetic where you can. Maybe:
>
> PageTableEntry *base = (PageTableEntry *) arg;
> PageTableEntry *lpage = &base[*(int *) left];
> etc.

Done.

> +tbm_attach_shared_iterate(dsa_area *dsa, dsa_pointer iterator)
> +{
> + TBMSharedIterator *shared_iterator;
> + TBMSharedIteratorState *shared_state;
>
> Again, variable naming issues. Let's go with dsa_pointer dp,
> TBMSharedIterator *iterator, TBMSharedIteratorState *istate. Please
> try to go through and always use the same variable name for the same
> kind of object. It's confusing when "iterator" sometimes means a
> TBMSharedIterator * and sometimes a TBMSharedIteratorState * and
> sometimes a dsa_pointer. Make it all consistent (and hopefully
> logical).

Done.

> + shared_iterator = (TBMSharedIterator *) palloc(sizeof(TBMIterator) +
> +     MAX_TUPLES_PER_PAGE * sizeof(OffsetNumber));
>
> Wrong sizeof. (This isn't just a style issue - it's a bug!)

Right, it's a bug. Fixed.

> + * Callback function for allocating the memory for hashtable elements.
> + * It allocates memory from DSA if tbm holds a reference to a dsa.
>
> Inconsistent capitalization of "DSA" again.
>
> Maybe change the whole comment to "Allocate memory for hashtable
> elements, using DSA if available."

Fixed.

> + if (tbm->dsapagetable)
>
> Test against InvalidDsaPointer. Or maybe just get rid of the "if" and
> do it unconditionally; if dsapagetable is InvalidDsaPointer then
> dsapagetableold will be too, so no harm done.

Right.

> + LWTRANCHE_PARALLEL_TBM_ITERATOR,
>
> Change to LWTRANCHE_TBM, add a call to LWLockRegisterTranche in
> lwlock.c, and patch the docs with the new tranche name.

Fixed.
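As an illustration, the schunkbitp pattern suggested above would make
tbm_advance_schunkbit look roughly like this (a sketch; WORDNUM, BITNUM, and
PAGES_PER_CHUNK are pre-existing tidbitmap.c internals):

static inline void
tbm_advance_schunkbit(PagetableEntry *chunk, int *schunkbitp)
{
    int         schunkbit = *schunkbitp;

    /* Scan forward to the next set bit in the chunk's bitmap words */
    while (schunkbit < PAGES_PER_CHUNK)
    {
        int         wordnum = WORDNUM(schunkbit);
        int         bitnum = BITNUM(schunkbit);

        if ((chunk->words[wordnum] & ((bitmapword) 1 << bitnum)) != 0)
            break;
        schunkbit++;
    }

    /* Write the result back through the out parameter */
    *schunkbitp = schunkbit;
}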
> Despite this fairly extensive list of mostly-trivial cosmetic
> problems, I think this is in fairly good shape now. I think the basic
> design is right, so once we get these loose ends tied up this should
> be good to go in.
>
>> 0002: actual parallel bitmap heap scan by using interfaces of 0001.
>
> This needs a lot more extensive review than I quite have the time and
> energy for right now, but here are a few comments.
>
> + <entry><literal>ParallelBitmapScan</></entry>
>
> ParallelBitmapPopulate, maybe? You're not waiting for the scan;
> you're waiting for the bitmap to be populated.

Done.

> + <entry>Waiting for leader backend to complete the TidBitmap.</entry>
>
> Likewise, complete -> populate?

Done.

> + /* If any limit was set to zero, the user doesn't want a parallel scan. */
>
> If we got zero? There aren't multiple limits mentioned anywhere
> nearby, so this reads oddly.

Removed.

> + /*
> +  * It may be possible to amortize some of the I/O cost, but probably
> +  * not very much, because most operating systems already do aggressive
> +  * prefetching. For now, we assume that the disk run cost can't be
> +  * amortized at all.
> +  */
>
> This looks like it was cut-and-pasted from the sequential scan case,
> but it doesn't really apply here. This presupposes we're relying on
> OS prefetching, but in the case of a bitmap heap scan we do our own
> prefetching. Maybe change to "Since we do prefetching in both the
> parallel and non-parallel cases, we don't expect to get substantially
> more I/O parallelism by spreading the scan across multiple backends.
> So, assume the disk run cost can't be amortized at all."

Removed this comment.

Apart from these, there are some more changes.

In 0001:
- Improved the comments for TBMSharedIteratorState and TBMSharedIterator.
- Added error handling for the pointer returned by dsa_allocate.
- Moved function declarations in tidbitmap.h so that functionally related
  ones are kept together.

In 0002:
- Improved comments.
- Refactored code in BitmapHeapNext.
- Removed the local tbm creation in BitmapHeapNext: with the new tidbitmap
  interfaces it is no longer needed.
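For context, a minimal sketch of how a worker ends up driving the 0001
interfaces from BitmapHeapNext. The pstate->tbmiterator field name is an
assumption here, and error handling is omitted:

    TBMSharedIterator *iterator;
    TBMIterateResult *tbmres;

    /* Attach to the shared iterator state published by the leader */
    iterator = tbm_attach_shared_iterate(dsa, pstate->tbmiterator);

    /* Each call hands out the next page, synchronized via the LWLock */
    while ((tbmres = tbm_shared_iterate(iterator)) != NULL)
    {
        /* fetch heap page tbmres->blockno and emit its tuples ... */
    }

    /* Free only our backend-private state */
    tbm_end_shared_iterate(iterator);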
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers