Thread: So, is COUNT(*) fast now?
Laments at:
http://wiki.postgresql.org/wiki/FAQ#Why_is_.22SELECT_count.28.2A.29_FROM_bigtable.3B.22_slow.3F
http://wiki.postgresql.org/wiki/Slow_Counting

I tried this on my MacBook Pro this morning, using pgbench -i -s 500 to create a database about 7.5GB in size, and then using "SELECT sum(1) FROM pgbench_accounts" as a test query, on a build WITHOUT --enable-cassert. This machine has 4GB of memory, and I set shared_buffers = 400MB. (No, I'm not sure whether that's the optimal setting for shared_buffers for this machine.)

With enable_indexonlyscan = false, times for this query are: 96799.747 ms, 89108.987 ms, 114433.664 ms.
With enable_indexonlyscan = true, times were: 16779.325 ms, 16537.726 ms, 16703.229 ms.

That's about six times faster. It's worth noting that the pgbench_accounts table has relatively narrow rows. On a table with wider rows (but not so wide that they get toasted and become narrow again), the benefit might be larger. On the other hand, this is also a table that's just been vacuumed, and you've got the happy case where the table (6404 MB) does not fit in memory but the index (1071 MB) does.

What happens on a smaller test case? Here are the results at scale factor = 100:

enable_indexonlyscan = false times: 1774.945 ms, 1784.985 ms, 1836.099 ms
enable_indexonlyscan = true times: 1450.445 ms, 1452.407 ms, 1452.426 ms

That's about a 23% speedup. At this scale factor everything fits into memory, but only the index (214 MB) fits into shared_buffers, while the table (1281 MB) does not.

Let's crank the scale factor down some more. Here are the results at scale_factor = 20:

enable_indexonlyscan = false times: 352.213 ms, 353.988 ms, 350.859 ms
enable_indexonlyscan = true times: 300.623 ms, 301.355 ms, 302.346 ms

Now the entire database fits into shared_buffers, but we're still seeing a 17% speedup. But this turns out to be misleading. The ring-buffer logic is actually preventing shared_buffers from getting all of the heap blocks in cache quickly. If I run the query over and over again until the whole table actually makes it into shared buffers, the sequential scan gets much faster:

enable_indexonlyscan = false times after lots and lots of prewarming: 215.487 ms, 219.006 ms, 218.490 ms

That's a bit disappointing - it's now more than a third faster to do the sequential scan, even though the sequential scan has to touch six times as many blocks (at scale factor 20, the index is 43 MB and the table is 256 MB), all of which are in cache. Of course, touching that many fewer blocks does have some advantages if there is concurrent activity on the system, but it still seems unfortunate that the ratio of runtime to blocks touched is more than 8x higher for the index-only case.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
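A minimal sketch of the kind of test being described, for anyone who wants to reproduce it. This assumes a 9.2-era development build that has the enable_indexonlyscan GUC, and the default pgbench primary-key index name pgbench_accounts_pkey; it is not Robert's exact script.

    -- shell: pgbench -i -s 500   (or -s 100 / -s 20 for the smaller cases)
    -- then, in psql:
    vacuum analyze pgbench_accounts;   -- sets the visibility map bits

    set enable_indexonlyscan = off;
    explain (analyze, buffers) select sum(1) from pgbench_accounts;

    set enable_indexonlyscan = on;
    explain (analyze, buffers) select sum(1) from pgbench_accounts;

    -- the size figures quoted above come from something like:
    select pg_size_pretty(pg_relation_size('pgbench_accounts'))      as table_size,
           pg_size_pretty(pg_relation_size('pgbench_accounts_pkey')) as index_size;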
Robert Haas <robertmhaas@gmail.com> writes: > That's a bit disappointing - it's now more than a third faster to do > the sequential scan, even though the sequential scan has to touch six > times as many blocks (at scale factor 20, index is 43 MB, table is 256 > MB) all of which are in cache. Of course, touching that many fewer > blocks does have some advantages if there is concurrent activity on > the system, but it still seems unfortunate that the ratio of runtime > to blocks touched is more than 8x higher for the index-only case. I don't know why you'd imagine that touching an index is free, or even cheap, CPU-wise. The whole point of the index-only optimization is to avoid I/O. When you try it on a case where there's no I/O to be saved, *and* no shared-buffers contention to be avoided, there's no way it's going to be a win. regards, tom lane
On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> That's a bit disappointing - it's now more than a third faster to do >> the sequential scan, even though the sequential scan has to touch six >> times as many blocks (at scale factor 20, index is 43 MB, table is 256 >> MB) all of which are in cache. Of course, touching that many fewer >> blocks does have some advantages if there is concurrent activity on >> the system, but it still seems unfortunate that the ratio of runtime >> to blocks touched is more than 8x higher for the index-only case. > > I don't know why you'd imagine that touching an index is free, or even > cheap, CPU-wise. The whole point of the index-only optimization is to > avoid I/O. When you try it on a case where there's no I/O to be saved, > *and* no shared-buffers contention to be avoided, there's no way it's > going to be a win. Well, call me naive, but I would have thought touching six times less data would make the operation run faster, not slower. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I don't know why you'd imagine that touching an index is free, or even >> cheap, CPU-wise. The whole point of the index-only optimization is to >> avoid I/O. When you try it on a case where there's no I/O to be saved, >> *and* no shared-buffers contention to be avoided, there's no way it's >> going to be a win. > Well, call me naive, but I would have thought touching six times less > data would make the operation run faster, not slower. It's not "touching six times less data". It's touching the exact same number of tuples either way, just index tuples in one case and heap tuples in the other. You've arranged the test case so that all these tuples are in shared buffers already, so there's no data movement to be avoided. What this test case proves is that btree's overhead per index tuple touched is significantly more than the cost of the fastest path through HeapTupleSatisfiesMVCC, which I don't find surprising considering how much sweat has been expended on that code path over the years. (It may well be that it's not even btree at fault but the per-tuple visits to the visibility map ... did you do any oprofiling yet?) regards, tom lane
On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I don't know why you'd imagine that touching an index is free, or even >>> cheap, CPU-wise. The whole point of the index-only optimization is to >>> avoid I/O. When you try it on a case where there's no I/O to be saved, >>> *and* no shared-buffers contention to be avoided, there's no way it's >>> going to be a win. > >> Well, call me naive, but I would have thought touching six times less >> data would make the operation run faster, not slower. > > It's not "touching six times less data". It's touching the exact same > number of tuples either way, just index tuples in one case and heap > tuples in the other. Yeah, but it works out to fewer pages. > You've arranged the test case so that all these > tuples are in shared buffers already, so there's no data movement to be > avoided. What this test case proves is that btree's overhead per index > tuple touched is significantly more than the cost of the fastest path > through HeapTupleSatisfiesMVCC, which I don't find surprising > considering how much sweat has been expended on that code path over the > years. I think HeapTupleSatisfiesMVCC is probably being skipped anyway in this case, since all the heap pages should be PD_ALL_VISIBLE. > (It may well be that it's not even btree at fault but the per-tuple > visits to the visibility map ... did you do any oprofiling yet?) No, but I think that might be a good idea. Maybe I'll go do that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> What this test case proves is that btree's overhead per index >> tuple touched is significantly more than the cost of the fastest path >> through HeapTupleSatisfiesMVCC, which I don't find surprising >> considering how much sweat has been expended on that code path over the >> years. > I think HeapTupleSatisfiesMVCC is probably being skipped anyway in > this case, since all the heap pages should be PD_ALL_VISIBLE. Proves my point ;-) ... you're comparing a code path that's been beat on for *years* with one that just got written. regards, tom lane
On Fri, Oct 21, 2011 at 2:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think HeapTupleSatisfiesMVCC is probably being skipped anyway in >> this case, since all the heap pages should be PD_ALL_VISIBLE. > > Proves my point ;-) ... you're comparing a code path that's been beat on > for *years* with one that just got written.

I know. I wrote a chunk of it. :-) My point is just that it'd be nice to make it better.

Anyhow, here's the scoop. On my desktop machine running F14, running SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of oprofile data:

    176830  13.0801  postgres         postgres         ExecProject
    170028  12.5770  postgres         postgres         IndexOnlyNext
     96631   7.1478  postgres         postgres         visibilitymap_test
     86019   6.3628  postgres         postgres         advance_aggregates
     74366   5.5009  postgres         postgres         ExecScan
     72428   5.3575  postgres         postgres         ExecClearTuple
     68483   5.0657  postgres         postgres         btgettuple
     60614   4.4836  postgres         postgres         advance_transition_function
     59680   4.4145  postgres         postgres         ExecProcNode
     52295   3.8683  postgres         postgres         _bt_checkkeys
     52078   3.8522  libc-2.12.90.so  libc-2.12.90.so  __memcpy_sse2
     49548   3.6651  postgres         postgres         index_getnext_tid
     48265   3.5702  postgres         postgres         ExecEvalConst
     42989   3.1799  postgres         postgres         _bt_next
     40544   2.9990  postgres         postgres         _bt_readpage
     35162   2.6009  no-vmlinux       no-vmlinux       /no-vmlinux
     33639   2.4883  postgres         postgres         MemoryContextReset

And without index-only scans, but with everything in shared_buffers:

    169515  18.4261  postgres         postgres         ExecProject
     94827  10.3076  postgres         postgres         heapgettup_pagemode
     84850   9.2231  postgres         postgres         advance_aggregates
     57998   6.3043  postgres         postgres         advance_transition_function
     55638   6.0478  postgres         postgres         ExecEvalConst
     53684   5.8354  postgres         postgres         heapgetpage
     51411   5.5883  postgres         postgres         ExecScan
     48387   5.2596  postgres         postgres         ExecProcNode
     44129   4.7968  postgres         postgres         ExecStoreTuple
     30759   3.3435  postgres         postgres         heap_getnext
     25923   2.8178  postgres         postgres         SeqNext
     24145   2.6245  postgres         postgres         CheckForSerializableConflictOut
     23155   2.5169  postgres         postgres         ExecAgg
     18864   2.0505  postgres         postgres         heap_page_prune_opt
     18784   2.0418  no-vmlinux       no-vmlinux       /no-vmlinux

The index-only scan takes about 385 ms, while the non-index-only version takes about 284 ms.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 21, 2011 at 3:07 PM, Robert Haas <robertmhaas@gmail.com> wrote: > [ oprofile results ]

*grovels through the line-by-line results*

Hmm, I guess there is a bit of a hotspot in StoreIndexTuple, which is probably being folded into IndexOnlyNext in the per-function timings:

    ExecClearTuple(slot);
    for (i = 0; i < nindexatts; i++)
        values[i] = index_getattr(itup, i + 1, itupdesc, &isnull[i]);
    ExecStoreVirtualTuple(slot);

If I'm reading these results right, that section is about 3% of the total number of samples. Also, this line is kind of expensive:

    if (!visibilitymap_test(scandesc->heapRelation,
                            ItemPointerGetBlockNumber(tid),
                            &node->ioss_VMBuffer))

Around 2%. But I don't see any way to avoid that, or even make it cheaper.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Anyhow, here's the scoop. On my desktop machine running F14, running > SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of > oprofile data: > 176830 13.0801 postgres postgres ExecProject Hm, that's weird. In both these cases, I'd have expected that ExecProject would get optimized away thanks to selection of a physical tlist for the scan node. Wonder if that got broken ... regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > Hmm, I guess there is a bit of a hotspot in StoreIndexTuple, which is > probably being folded into IndexOnlyNext in the per-function timings: > ExecClearTuple(slot); > for (i = 0; i < nindexatts; i++) > values[i] = index_getattr(itup, i + 1, itupdesc, &isnull[i]); > ExecStoreVirtualTuple(slot); I had wondered whether it'd be worth optimizing that along the lines of slot_getallattrs(). But most indexes probably have only one column, or anyway not enough to make for a useful savings. regards, tom lane
On Fri, Oct 21, 2011 at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Anyhow, here's the scoop. On my desktop machine running F14, running >> SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of >> oprofile data: > >> 176830 13.0801 postgres postgres ExecProject > > Hm, that's weird. In both these cases, I'd have expected that > ExecProject would get optimized away thanks to selection of a physical > tlist for the scan node. Wonder if that got broken ... If it did, it looks like it wasn't recent. I set up the same test case on my MacBook using REL9_1_STABLE and REL9_0_STABLE and set a breakpoint on ExecProject(). Both back-branches appear to also call ExecProject() for every tuple. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 21, 2011 at 11:14 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> I don't know why you'd imagine that touching an index is free, or even >>>> cheap, CPU-wise. The whole point of the index-only optimization is to >>>> avoid I/O. When you try it on a case where there's no I/O to be saved, >>>> *and* no shared-buffers contention to be avoided, there's no way it's >>>> going to be a win. >> >>> Well, call me naive, but I would have thought touching six times less >>> data would make the operation run faster, not slower. >> >> It's not "touching six times less data". It's touching the exact same >> number of tuples either way, just index tuples in one case and heap >> tuples in the other. > > Yeah, but it works out to fewer pages. But since those pages are already in RAM, why would it matter all that much? (Other than in the case of highly concurrent access, which you don't seem to be testing?) One of Tom's commits that made it not lock the same index page over and over again (once for each tuple on it) made me think it should be much faster than the seq scan, but a bit of random flailing about convinced me that any savings from this were compensated for by the high overhead of FunctionCall2Coll and all of the hokey-pokey that that call entails, which a seqscan can skip entirely. If count(*) could cause the index-only scan to happen in physical order of the index, rather than logical order, that might be a big win. Both for all in memory and for not-all-in-memory. Cheers, Jeff
On Friday, October 21, 2011 08:14:12 PM Robert Haas wrote: > On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>> I don't know why you'd imagine that touching an index is free, or even > >>> cheap, CPU-wise. The whole point of the index-only optimization is to > >>> avoid I/O. When you try it on a case where there's no I/O to be saved, > >>> and no shared-buffers contention to be avoided, there's no way it's > >>> going to be a win. > >> > >> Well, call me naive, but I would have thought touching six times less > >> data would make the operation run faster, not slower. > > > > It's not "touching six times less data". It's touching the exact same > > number of tuples either way, just index tuples in one case and heap > > tuples in the other. > > Yeah, but it works out to fewer pages. But access to those is not sequential. I guess if you measure cache hit ratios the index scan will come out significantly worse. Andres
2011/10/22 Andres Freund <andres@anarazel.de>

"But access to those is not sequential" - yes, I agree.

On Friday, October 21, 2011 08:14:12 PM Robert Haas wrote:
> On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> I don't know why you'd imagine that touching an index is free, or even
> >>> cheap, CPU-wise. The whole point of the index-only optimization is to
> >>> avoid I/O. When you try it on a case where there's no I/O to be saved,
> >>> and no shared-buffers contention to be avoided, there's no way it's
> >>> going to be a win.
> >>
> >> Well, call me naive, but I would have thought touching six times less
> >> data would make the operation run faster, not slower.
> >
> > It's not "touching six times less data". It's touching the exact same
> > number of tuples either way, just index tuples in one case and heap
> > tuples in the other.
>
> Yeah, but it works out to fewer pages.

But access to those is not sequential. I guess if you measure cache hit ratios
the index scan will come out significantly worse.

Andres
In my opinion the problem is this: if the query needs to scan the whole b-tree index without
accessing the table rows, the better way to read the index is sequentially.
In fact, queries like count(*), and others that don't need the data in "order", mean we
could read all the blocks (better, "only" the leaf blocks) without touching the branch blocks much.

For example, for a query like this:

select column_a from table;

it is better to read the data from the index sequentially. For a query like this:

select column_a from table order by column_a;

it is better to read the data from the index with a range scan, from the root block to the first branch blocks and their leaf blocks, so we could "save"
the sorting.

Mat
Andres Freund <andres@anarazel.de> writes: > On Friday, October 21, 2011 08:14:12 PM Robert Haas wrote: >> On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> It's not "touching six times less data". It's touching the exact same >>> number of tuples either way, just index tuples in one case and heap >>> tuples in the other. >> Yeah, but it works out to fewer pages. > But access to those is not sequential. I guess if you measure cache hit ratios > the index scan will come out significantly worse. Huh? In the case he's complaining about, the index is all in RAM. Sequentiality of access is not an issue (at least not at the page level --- within a page I suppose there could be cache-line effects). regards, tom lane
On Saturday, October 22, 2011 05:20:26 PM Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On Friday, October 21, 2011 08:14:12 PM Robert Haas wrote: > >> On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>> It's not "touching six times less data". It's touching the exact same > >>> number of tuples either way, just index tuples in one case and heap > >>> tuples in the other. > >> > >> Yeah, but it works out to fewer pages. > > > > But access to those is not sequential. I guess if you measure cache hit > > ratios the index scan will come out significantly worse. > > Huh? In the case he's complaining about, the index is all in RAM. > Sequentiality of access is not an issue (at least not at the page > level --- within a page I suppose there could be cache-line effects). I was talking about L2/L3 caches... Andres
Andres Freund <andres@anarazel.de> writes: > On Saturday, October 22, 2011 05:20:26 PM Tom Lane wrote: >> Huh? In the case he's complaining about, the index is all in RAM. >> Sequentiality of access is not an issue (at least not at the page >> level --- within a page I suppose there could be cache-line effects). > I was talking about L2/L3 caches... Yeah, but unless you think cache lines cross page boundaries (and we do take pains to align the buffers on 8K addresses), there's not going to be any sequentiality effect. Even if there were, it would only apply if the pages chanced to be adjacent in the buffer array, and there is no reason to expect that to be the case, for either seqscans or indexscans. regards, tom lane
On Fri, Oct 21, 2011 at 10:57 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> Yeah, but it works out to fewer pages. > > But since those pages are already in RAM, why would it matter all that > much? (Other than in the case of highly concurrent access, which you > don't seem to be testing?) Well, because memory access takes time, and accessing more memory takes more time. In the testing that I've done recently, performance on in-memory workloads seems to be extremely sensitive to memory speed, so you'd think that cutting down on memory access would be a win. > One of Tom's commits that made it not lock the same index page over > and over again (once for each tuple on it) made me think it should be > much faster than the seq scan, but a bit of random flailing about > convinced me that any saving from this were compensated for by the > high over head of FunctionCall2Coll and all of the hokey-pokey that > that call entails, which a seqscan can skip entirely. Huh. Not sure whether that's significant or not. > If count(*) could cause the index-only scan to happen in physical > order of the index, rather than logical order, that might be a big > win. Both for all in memory and for not-all-in-memory. That's an interesting point. I sort of assumed that would only help for not-all-in-memory, but maybe not. The trouble is that I think there are some problematic concurrency issues there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
----- Quote from Tom Lane (tgl@sss.pgh.pa.us), on 22.10.2011 at 19:19 -----

> Andres Freund writes:
>> On Saturday, October 22, 2011 05:20:26 PM Tom Lane wrote:
>>> Huh? In the case he's complaining about, the index is all in RAM.
>>> Sequentiality of access is not an issue (at least not at the page
>>> level --- within a page I suppose there could be cache-line effects).
>
>> I was talking about L2/L3 caches...
>
> Yeah, but unless you think cache lines cross page boundaries (and we do
> take pains to align the buffers on 8K addresses), there's not going to
> be any sequentiality effect. Even if there were, it would only apply
> if the pages chanced to be adjacent in the buffer array, and there is no
> reason to expect that to be the case, for either seqscans or indexscans.
>
> regards, tom lane

I worked on the in-memory hash tables of the Parrot project. It is not the same as btrees, but the structure and memory layout are not that different - tuples go into pages, etc.

I have benchmarked iterating over such hash tables - a sequential scan of the same table comes out 20-30% faster than a scan ordered by the hash value of the key. And this is overhead of CPU cache lines only - the numbers of instructions executed on the processor are pretty much the same (counted by valgrind).

So I do think that if we have sequential scan of indexes (physical order) it will help even when all the data is in the buffer cache.

Best regards

--
Luben Karavelov
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 21, 2011 at 10:57 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> If count(*) could cause the index-only scan to happen in physical >> order of the index, rather than logical order, that might be a big >> win. Both for all in memory and for not-all-in-memory. > That's an interesting point. I sort of assumed that would only help > for not-all-in-memory, but maybe not. The trouble is that I think > there are some problematic concurrency issues there. Yeah. We managed to make physical-order scanning work for VACUUM because it's okay if VACUUM sometimes sees the same index tuple twice; it'll just make the same decision about (not) deleting it. That will not fly for regular querying. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 21, 2011 at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> Anyhow, here's the scoop. On my desktop machine running F14, running >>> SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of >>> oprofile data: >>> 176830 13.0801 postgres postgres ExecProject >> Hm, that's weird. In both these cases, I'd have expected that >> ExecProject would get optimized away thanks to selection of a physical >> tlist for the scan node. Wonder if that got broken ... > If it did, it looks like it wasn't recent. I set up the same test > case on my MacBook using REL9_1_STABLE and REL9_0_STABLE and set a > breakpoint on ExecProject(). Both back-branches appear to also call > ExecProject() for every tuple. Oh, the ExecProject calls are coming from advance_aggregates(). Move along, nothing to see here ... regards, tom lane
On Fri, Oct 21, 2011 at 12:07 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Oct 21, 2011 at 2:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I think HeapTupleSatisfiesMVCC is probably being skipped anyway in >>> this case, since all the heap pages should be PD_ALL_VISIBLE. >> >> Proves my point ;-) ... you're comparing a code path that's been beat on >> for *years* with one that just got written. > > I know. I wrote a chunk of it. :-) My point is just that it'd be > nice to make it better. > > Anyhow, here's the scoop. On my desktop machine running F14, running > SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of > oprofile data: > > 176830 13.0801 postgres postgres ExecProject Hi Robert, count(*) and sum(1) do different things internally, and in my hands sum(1) is ~10% slower. I don't know how to dump the output of ExecBuildProjectionInfo into a human readable form, so I don't know the basis of the difference. But I wonder if using count(*) would lower the weight of the ExecProject function. Cheers, Jeff
Jeff Janes <jeff.janes@gmail.com> writes: > count(*) and sum(1) do different things internally, and in my hands > sum(1) is ~10% slower. > I don't know how to dump the output of ExecBuildProjectionInfo into a > human readable form, so I don't know the basis of the difference. But > I wonder if using count(*) would lower the weight of the ExecProject > function. Probably. count() doesn't actually have any arguments, so there's nothing for ExecProject to do. sum(1) invokes the generic case there (ExecTargetList). I suppose we could add another special-case path for constant tlist elements, but I suspect that would mostly be optimizing for benchmarks rather than helping real-world cases. regards, tom lane
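A quick way to see the difference Jeff and Tom are describing on one's own system - just a sketch, not a benchmark from this thread. The plans should come out identical, so any gap is per-row aggregate and projection overhead rather than the scan itself:

    \timing on
    -- count(*) takes no arguments, so there is nothing for ExecProject to
    -- evaluate per row; sum(1) evaluates a constant target list entry each time
    select count(*) from pgbench_accounts;
    select sum(1) from pgbench_accounts;

    -- confirm both use the same plan shape:
    explain (analyze) select count(*) from pgbench_accounts;
    explain (analyze) select sum(1) from pgbench_accounts;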
On Fri, Oct 21, 2011 at 12:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > Also, this line is kind of expensive: > > if (!visibilitymap_test(scandesc->heapRelation, > ItemPointerGetBlockNumber(tid), > &node->ioss_VMBuffer)) > > Around 2%. But I don't see any way to avoid that, or even make it cheaper. Could we cache by ItemPointerGetBlockNumber(tid) the results of those tests, for groups of tids on the same index page? How useful this would be would depend on how well-clustered the table and index are. Cheers, Jeff
On Sun, Oct 23, 2011 at 7:01 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Fri, Oct 21, 2011 at 12:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> Also, this line is kind of expensive: >> >> if (!visibilitymap_test(scandesc->heapRelation, >> ItemPointerGetBlockNumber(tid), >> &node->ioss_VMBuffer)) >> >> Around 2%. But I don't see any way to avoid that, or even make it cheaper. > > Could we cache by ItemPointerGetBlockNumber(tid) the results of those > tests, for groups of tids on the same index page? > > How useful this would be would depend on how well-clustered the table > and index are. I thought about that, but the existing code is so ridiculously cheap that it's hard to believe a caching layer would save enough to pay for itself. I mean, I presume that the cost attributed to that line has to be associated with either (a) one of the pointer dereferences, (b) the expense of evaluating ItemPointerGetBlockNumber(), (c) setting up the function call, or perhaps (d) overhead incurred as a result of branch mis-prediction. The actual time spent *inside* visibilitymap_test will be attributed to that function, not this one. If you add up the time for this line and visibilitymap_test(), it's like 10% of the runtime, which seems pretty significant. But it's 10% of the runtime that is spent on basically a handful of arithmetic operations and then reading a byte from shared memory. It's astonishing to find that so expensive on a test with just one backend running. If you stick some kind of cache in there, it's going to involve adding a branch that isn't there now, and I think that's going to be pricey given how hot this code apparently is. Also, I'm not sure it's safe. Suppose that we lock the index page, return a tuple, check the visibility map, and find the page all visible. Another transaction comes along, adds a tuple to the index page, and clears the visibility map bit. We then go back, relock the index page, and return another tuple. We'd better notice that the visibility map bit has now been cleared, or we're in trouble. I wonder if it might help to create some method for the index to return all the matching keys on a single index page in one call. If you're dealing with an MVCC snapshot, any new tuples added after the start of the scan can't be visible anyway. That would save the overhead of unlocking and relocking the buffer once per tuple, and probably some overhead associated with validating and decoding the possibly-changed page contents each time. If you did it that way, it would also be safe to do what you're proposing - if a bunch of the tuples on the index page are also on the same heap page, you could do one visibility map check for all of them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I had wondered whether it'd be worth optimizing that along the
> lines of slot_getallattrs(). But most indexes probably have only
> one column, or anyway not enough to make for a useful savings.

From a heavily-used production database:

cir=> select indnatts, count(*) from pg_index group by indnatts order by indnatts;
 indnatts | count
----------+-------
        1 |   200
        2 |   684
        3 |   155
        4 |    76
        5 |    43
        6 |    13
        7 |     2
        9 |     1
(8 rows)

This includes system table and TOAST table indexes (which seem to have two columns). There are over 400 user tables, each of which has a primary key, so most primary keys in our database are more than one column.

-Kevin

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I had wondered whether it'd be worth optimizing that along the
>> lines of slot_getallattrs(). But most indexes probably have only
>> one column, or anyway not enough to make for a useful savings.

> From a heavily-used production database:
> cir=> select indnatts, count(*) from pg_index group by indnatts
> order by indnatts;
>  indnatts | count
> ----------+-------
>         1 |   200
>         2 |   684
>         3 |   155
>         4 |    76
>         5 |    43
>         6 |    13
>         7 |     2
>         9 |     1
> (8 rows)

> This includes system table and TOAST table indexes (which seem to
> have two columns).

Yeah, TOAST indexes are 2-column. It would be best to exclude those from your counts, since it seems pretty unlikely that anyone will care how fast nodeIndexonlyscan.c is for scans on toast tables.

> There are over 400 user tables, each of which
> has a primary key, so most primary keys in our database are more
> than one column.

It doesn't look to me like the mean is above 2 (unless you have many fewer toast tables than I suspect), so trying to optimize many-column cases isn't going to help.

regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, TOAST indexes are 2-column. It would be best to exclude
> those from your counts, since it seems pretty unlikely that anyone
> will care how fast nodeIndexonlyscan.c is for scans on toast
> tables.

User indexes (excluding toast):

 indnatts | count
----------+-------
        1 |   200
        2 |   222
        3 |   155
        4 |    76
        5 |    43
        6 |    13
        7 |     2
        9 |     1
(8 rows)

System indexes (excluding toast):

 indnatts | count
----------+-------
        1 |    46
        2 |    24
        3 |     9
        4 |     5
(4 rows)

> It doesn't look to me like the mean is above 2 (unless you have
> many fewer toast tables than I suspect), so trying to optimize
> many-column cases isn't going to help.

The mean is 2.4 (give or take a little depending on whether you include system tables). I have no idea where the optimization becomes worthwhile, but the assertion that most indexes probably have a single column worried me. I'm sure there are databases where that is true (especially for those who insist on adding a meaningless surrogate key column to every table), but there are many where it isn't true. I would guess that our average of 2.4 is higher than most, though.

-Kevin
Copy/paste problems -- the first set includes the system tables except for toast. User tables would be the difference between the results below. Sorry.

-Kevin

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> User indexes (excluding toast):
>
>  indnatts | count
> ----------+-------
>         1 |   200
>         2 |   222
>         3 |   155
>         4 |    76
>         5 |    43
>         6 |    13
>         7 |     2
>         9 |     1
> (8 rows)
>
> System indexes (excluding toast):
>
>  indnatts | count
> ----------+-------
>         1 |    46
>         2 |    24
>         3 |     9
>         4 |     5
> (4 rows)
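One way to get those per-column counts with TOAST indexes excluded, as Tom suggests - a sketch against the system catalogs; Kevin's exact query isn't shown in the thread:

    select i.indnatts, count(*)
      from pg_index i
      join pg_class c on c.oid = i.indexrelid
      join pg_namespace n on n.oid = c.relnamespace
     where n.nspname <> 'pg_toast'               -- drop TOAST-table indexes
       -- add:  and n.nspname <> 'pg_catalog'    -- to see user indexes only
     group by i.indnatts
     order by i.indnatts;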
On Mon, Oct 24, 2011 at 2:15 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> It doesn't look to me like the mean is above 2 (unless you have >> many fewer toast tables than I suspect), so trying to optimize >> many-column cases isn't going to help. > > The mean is 2.4 (give or take a little depending on whether you > include system tables). I have no idea where the optimization > becomes worthwhile, but the assertion that most indexes probably > have a single column worried me. I'm sure there are databases where > that is true (especially for those who insist on adding a > meaningless surrogate key column to every table), but there are many > where it isn't true. I would guess that our average of 2.4 is > higher than most, though. Keep in mind that the existence of index-only scans is going to encourage people to try to create covering indexes on columns that aren't indexed today. It's not clear how much of a win that will be; for certain workloads, HOT might make it backfire spectacularly. But even though Tom's statement that most indexes are one column might be a slight exaggeration, I suspect it probably is true that the optimizations he's talking about for large numbers of columns won't produce any material benefit even for a 3 or 4 column index. Which makes me think maybe we should focus our efforts elsewhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > But even though Tom's statement that most indexes are one column might > be a slight exaggeration, I suspect it probably is true that the > optimizations he's talking about for large numbers of columns won't > produce any material benefit even for a 3 or 4 column index. Which > makes me think maybe we should focus our efforts elsewhere. Right. If we thought the average was something like ten, it might be worth pursuing optimizations similar to slot_getallattrs. If it's around two or three, almost certainly not. Your point about people trying to create wider indexes to exploit index-only scans is an interesting one, but I think it's premature to optimize on the basis of hypotheses about what people might do in future. Not sure about your other idea of returning multiple tuples per amgettuple call. The trouble with that is that it will add complexity (and hence cycles) at the nodeIndexscan level, because now nodeIndexscan will have to buffer those tuples, keep track of whether it's fetching forward or backward, etc etc. Plus another layer of the same in indexam.c (index_getnext etc). I'm not at all convinced that it's likely to be a net win. I wonder how trustworthy the measure of the visibilitymap_test call site as a consumer of cycles really is. I've frequently noticed that oprofile blames remarkably large fractions of the runtime on individual statements that appear to be quite trivial. I'm not sure if that represents real hardware-level effects such as cache line switching, or whether it's just measurement artifacts. Keep in mind that sampling-based measurements are always subject to sampling artifacts. regards, tom lane
On 10/24/11 12:35 PM, Tom Lane wrote: > Your point about people trying to create wider indexes to exploit > index-only scans is an interesting one, but I think it's premature to > optimize on the basis of hypotheses about what people might do in > future. I don't think that this is hypothetical at all. I know *I'll* be doing it, and we can expect users who are familiar with MySQL and Oracle to do it as well. No, it won't be the majority of our users, who are using ORMs and thus don't really think about indexing at all. But it will be a significant number of users who are performance-sensitive ... such as most or all of our data warehousing users. Mind you, we're pretty much talking exclusively about users whose tables don't fit in memory ... usually tables which are 10X or more the size of memory. One case which is going to be critical to test is the "join" table, i.e. the table which supports many-to-many joins and consists only of keys from the respective two other tables. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
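For the many-to-many "join" table case, the kind of covering index people are likely to try looks something like this - a sketch with made-up table and column names, not something taken from the thread:

    -- hypothetical many-to-many join table
    create table book_author (
        book_id   integer not null,
        author_id integer not null,
        primary key (book_id, author_id)
    );

    -- the primary key already covers lookups by book_id; the reverse-order
    -- index lets "which books does this author have?" be answered from the
    -- index alone once the table has been vacuumed and its pages are
    -- all-visible
    create index book_author_author_book_idx on book_author (author_id, book_id);

    explain select book_id from book_author where author_id = 42;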
On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Your point about people trying to create wider indexes to exploit > index-only scans is an interesting one, but I think it's premature to > optimize on the basis of hypotheses about what people might do in > future. Well, I don't think it's too much of a stretch to guess that people will try to use covering indexes; that's common practice on other products and a not-uncommon heartfelt request from people with large (non-memory-resident) databases they want to migrate to PostgreSQL. Exactly to what degree they'll do that, and how well it will work, is another question. But I have little doubt that it will be tried. > Not sure about your other idea of returning multiple tuples per > amgettuple call. The trouble with that is that it will add complexity > (and hence cycles) at the nodeIndexscan level, because now nodeIndexscan > will have to buffer those tuples, keep track of whether it's fetching > forward or backward, etc etc. Plus another layer of the same in > indexam.c (index_getnext etc). I'm not at all convinced that it's > likely to be a net win. I definitely agree that you don't want two layers of caching, but I don't see why we'd need that. I wasn't thinking of changing index_getnext() at all, but rather adding a new API that fills a buffer (or a pair of buffers, maybe) with index tuples and heap TIDs. It should spit them out in the same order that multiple index_getnext() calls would have done and leave the scan position wherever it would have ended up after a number of index_getnext_tid() calls equal to the number of TIDs returned. Any user of the API (and it might be just nodeIndexonlyscan.c) would just need to keep track of the number of tuples returned and the number consumed to date. This actually gets into a wider architectural discussion, which is whether the fact that the whole executor (modulo bitmap scans and a few other special cases) operates on one tuple at a time is a good design... but my brain hurts just thinking about that. > I wonder how trustworthy the measure of the visibilitymap_test call site > as a consumer of cycles really is. I've frequently noticed that > oprofile blames remarkably large fractions of the runtime on individual > statements that appear to be quite trivial. I'm not sure if that > represents real hardware-level effects such as cache line switching, > or whether it's just measurement artifacts. Keep in mind that > sampling-based measurements are always subject to sampling artifacts. I'm not sure either. I guess we could try short-circuiting visibilitymap_test and see what that does to performance (let's leave correct answers out of it). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Josh Berkus <josh@agliodbs.com> wrote: > On 10/24/11 12:35 PM, Tom Lane wrote: >> Your point about people trying to create wider indexes to exploit >> index-only scans is an interesting one, but I think it's >> premature to optimize on the basis of hypotheses about what >> people might do in future. > > I don't think that this is hypothetical at all. I know *I'll* be > doing it, and we can expect users who are familiar with MySQL and > Oracle to do it as well. And Sybase, and MS SQL Server. And others, most likely. We've never gotten around to narrowing the indexes to which we added extra columns to overcome performance problems through "covering index" techniques when we were using Sybase, so they're already here. :-) > One case which is going to be critical to test is the "join" > table, i.e. the table which supports many-to-many joins and > consists only of keys from the respective two other tables. Yeah, that is an important use of covering indexes for us. -Kevin
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I wonder how trustworthy the measure of the visibilitymap_test call site >> as a consumer of cycles really is. > I'm not sure either. I guess we could try short-circuiting > visibilitymap_test and see what that does to performance (let's leave > correct answers out of it). That would conflate the cost of the call with the cost of the function. Maybe you could try manually inlining the visibility test? regards, tom lane
Hello,
my experience is that as soon as index-only scans are available, they are used - sometimes just because of the simple logic that a user thinks it is faster. Even when the index is made ridiculously wide just to have all the info in the index...
Regards
Wolfgang Wilhelm
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Robert Haas <robertmhaas@gmail.com>
Cc: Kevin Grittner <Kevin.Grittner@wicourts.gov>; pgsql-hackers@postgresql.org
Sent: Monday, 24 October 2011, 21:35
Subject: Re: [HACKERS] So, is COUNT(*) fast now?
Tom Lane wrote: > I wonder how trustworthy the measure of the visibilitymap_test call site > as a consumer of cycles really is. I've frequently noticed that > oprofile blames remarkably large fractions of the runtime on individual > statements that appear to be quite trivial. I'm not sure if that Are these statements perhaps near kernel calls? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Mon, Oct 24, 2011 at 4:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I wonder how trustworthy the measure of the visibilitymap_test call site >>> as a consumer of cycles really is. > >> I'm not sure either. I guess we could try short-circuiting >> visibilitymap_test and see what that does to performance (let's leave >> correct answers out of it). > > That would conflate the cost of the call with the cost of the function. > Maybe you could try manually inlining the visibility test?

I fooled around with this some more on my Fedora 14 desktop machine. pgbench database, scale factor 20, shared_buffers=400MB. I ran the query "select count(*) from pgbench_accounts". I'm having a hard time getting completely reproducible results, but it appears that warming the cache makes the sequential scan drop from maybe 390 ms to 245 ms, while an index-only scan takes about 350 ms, so which one is better depends a lot on your assumptions about what is going on on the system at the same time (which means maybe we ought not to sweat it).

If I wipe out the whole if-block that calls visibilitymap_test(), the index-only scan drops down to about 300 ms (and delivers potentially wrong answers, of course). Inlining visibilitymap_test (see attached vismap-inline.patch) causes the index-only scan to drop to about 330 ms.

I also tried changing the BufferIsValid() tests in visibilitymap_test() to use BufferIsInvalid() instead, with the sense of the tests reversed (see attached vismap-test-invalid.patch). Since BufferIsInvalid() just checks for InvalidBuffer instead of also doing the sanity checks, it's significantly cheaper. This also reduced the time to about 330 ms, so seems clearly worth doing. Apparently these changes don't stack, because doing both things only gets me down to about 320 ms, which is fairly unexciting for the amount of ugliness that inlining entails.

I tried sprinkling a little branch-prediction magic on this code using GCC's __builtin_expect(). That initially seemed to help, but once I changed the BufferIsValid() test to !BufferIsInvalid() essentially all of the savings disappeared.

I also spent some time poking through the opannotate -s -a output, which shows where time is being spent by individual assembler instruction, but also annotates the assembler listing with the original source code. At least according to oprofile, the time that is being spent in IndexOnlyNext() is mostly being spent on seemingly innocuous operations like saving and restoring registers. For example, much of the time being attributed to the visibilitymap_test() call in IndexOnlyNext() is actually attributable to the instructions that are calculating what address to pass for scandesc->heapRelation. Many but not all of the pointer dereferences at the top of IndexOnlyNext() have a chunk of cycles attributed to them, and while they're not that significant individually, they add up. Similarly, the long series of instructions to which index_getattr() resolves bleeds cycles at just about every step. There's not much to optimize there, though, unless we want to add some code that avoids decoding the tuple altogether in the particular case of a zero-argument aggregate, or maybe something more general that only pulls out the columns that are actually needed.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I also tried changing the BufferIsValid() tests in > visibilitymap_test() to use BufferIsInvalid() instead, with the sense > of the tests reversed (see attached vismap-test-invalid.patch). Since > BufferIsInvalid() just checks for InvalidBuffer instead of also doing > the sanity checks, it's significantly cheaper. This also reduced the > time to about 330 ms, so seems clearly worth doing. Hmm. I wonder whether it wouldn't be better to get rid of the range checks in BufferIsValid, or better convert them into Asserts. It seems less than intuitive that BufferIsValid and BufferIsInvalid aren't simple inverses. regards, tom lane
On Fri, Oct 28, 2011 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I also tried changing the BufferIsValid() tests in >> visibilitymap_test() to use BufferIsInvalid() instead, with the sense >> of the tests reversed (see attached vismap-test-invalid.patch). Since >> BufferIsInvalid() just checks for InvalidBuffer instead of also doing >> the sanity checks, it's significantly cheaper. This also reduced the >> time to about 330 ms, so seems clearly worth doing. > > Hmm. I wonder whether it wouldn't be better to get rid of the range > checks in BufferIsValid, or better convert them into Asserts. It seems > less than intuitive that BufferIsValid and BufferIsInvalid aren't simple > inverses. Seems reasonable. It would break if anyone is using an out-of-range buffer number in lieu of InvalidBuffer, but I doubt that anyone is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 28, 2011 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hmm. I wonder whether it wouldn't be better to get rid of the range >> checks in BufferIsValid, or better convert them into Asserts. It seems >> less than intuitive that BufferIsValid and BufferIsInvalid aren't simple >> inverses. > Seems reasonable. It would break if anyone is using an out-of-range > buffer number in lieu of InvalidBuffer, but I doubt that anyone is. Yeah, I find that unlikely as well. But leaving Asserts in place would tell us. regards, tom lane
On Fri, Oct 28, 2011 at 3:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Oct 28, 2011 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Hmm. I wonder whether it wouldn't be better to get rid of the range >>> checks in BufferIsValid, or better convert them into Asserts. It seems >>> less than intuitive that BufferIsValid and BufferIsInvalid aren't simple >>> inverses. > >> Seems reasonable. It would break if anyone is using an out-of-range >> buffer number in lieu of InvalidBuffer, but I doubt that anyone is. > > Yeah, I find that unlikely as well. But leaving Asserts in place would > tell us. OK. Should I go do that, or are you all over it? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 28, 2011 at 3:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Yeah, I find that unlikely as well. But leaving Asserts in place would >> tell us. > OK. Should I go do that, or are you all over it? Go for it. regards, tom lane
On Fri, Oct 28, 2011 at 3:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Oct 28, 2011 at 3:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Yeah, I find that unlikely as well. But leaving Asserts in place would >>> tell us. > >> OK. Should I go do that, or are you all over it? > > Go for it. OK, done. Any other thoughts on the rest of what I wrote? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Quoting Robert Haas:

"""
I tried this on my MacBook Pro this morning, using pgbench -i -s 500 to create a database about 7.5GB in size, and then using "SELECT sum(1) FROM pgbench_accounts" as a test query, on a build WITHOUT --enable-cassert. This machine has 4GB of memory, and I set shared_buffers = 400MB. (No, I'm not sure whether that's the optimal setting for shared_buffers for this machine.)
"""

I did some tests where I tried to compare the effect of having the index-ordered tuples not be in the same order they are in the base table. The idea is to test what effect accessing the VM randomly, as opposed to in sequential order, has on performance. I suspect the above test accesses the VM in order (the accounts table is effectively clustered on the index used in the test). I might be mistaken here.

My test setup is this:

    drop table if exists test;
    drop table if exists test2;
    create unlogged table test /* with (fillfactor = 10) */ as
      select generate_series(0, 20*1000*1000) as id;
    create index idx1 on test(id);
    vacuum test;
    create unlogged table test2 /* with (fillfactor = 10) */ as
      (select * from test order by random());
    create index idx2 on test2(id);
    vacuum test2;

Table size is around 600MB, index size is around 350MB, and the VM's on-disk size is 16kB with the default fillfactor. With fillfactor = 10, the VM size is 104kB and the table size is around 6GB. The index size is the same.

Results for the randomly ordered table:

    # select count(*) from test2;
    14822.045 ms
    14826.253 ms
    14815.450 ms

Results for the effectively clustered table:

    # select count(*) from test;
    11761.890 ms
    11767.926 ms
    11810.900 ms

Now, this test still has the benefit of fitting the VM easily into the L1 cache. Next, I did an ugly hack to get the table size large enough that the VM will trash the L1 cache, while still having a somewhat reasonable test setup creation time. My hardware is old: 1GB of memory, and the processor is a Genuine Intel(R) CPU L2400 @ 1.66GHz. The L1 data cache size is 32kB on my machine. The hack is to simply set fillfactor to 10. The VM size is now 104kB and the table size is about 6.3 GB, while the index size is still the same as in the above test.

Results for the randomly ordered table:

    # select count(*) from test2;
    21606.683 ms
    21829.063 ms
    21637.434 ms

Results for the effectively clustered table:

    # select count(*) from test;
    11714.663 ms
    11449.264 ms
    11658.534 ms

Now, the next step would be to trash the L2 cache (a 20GB table size should do this on Sandy Bridge, where the L2 cache is 256KB). I don't have the hardware to do that test. It is worth noting that the L2 cache is shared on Sandy Bridge, so it is likely that an index-only scan of a large enough table would slow down other processes, too. Without tests this is only FUD, though. The test would be to scan a 20GB table's index repeatedly in one process, and see how it affects standard in-memory pgbench results for other processes. Compare this with doing the same with a sequential scan process.

Lessons learned (or what I learned, at least):

- Clustering is important for index-only scans. Picking a clustered index over a non-clustered index will have a big performance effect.
- Large-table index-only scans are going to be more expensive compared to a sequential scan than the pgbench accounts test suggests. I assume that the accounts table is effectively clustered on the index used. I haven't verified this.
- There is the possibility that index-only scans will trash the caches for other processes, too. Not tested, though.

I am sure these results will vary significantly based on the hardware used.
I am also notorious for screwing up benchmarks, so verifying these results is recommended. You will need around 16GB of disk space for the fillfactor = 10 test. I would recommend you have more than 1GB of memory, otherwise creating the test setup can take some time... - Anssi Kääriäinen
Sorry, I forgot to include the version used and some information about my setup:

PostgreSQL version: Git HEAD as of:
  Date: Fri Oct 28 21:18:36 2011 -0400
  Commit: 51eba98cf4595e90730dedd9305da8aa84b649ee

Compiled with defaults (the only change was --with-pgport=5431). I used default settings; the shared_buffers size is 24MB. The system is Linux Mint Debian Edition (kernel 3.0.0, gcc 4.6.1). The interesting parts about my hardware were in the original post.

- Anssi
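Anssi's first lesson - that clustering matters a lot for index-only scans - can be checked directly; a sketch reusing the test/test2 tables from his setup (pg_stats.correlation near 1.0 means heap order follows index order, so the visibility map is walked nearly sequentially):

    analyze test;
    analyze test2;

    -- correlation near 1.0 = heap order follows the index order
    select tablename, attname, correlation
      from pg_stats
     where tablename in ('test', 'test2') and attname = 'id';

    -- rewriting the randomly ordered table in index order restores the
    -- sequential access pattern (CLUSTER rewrites the table and takes an
    -- exclusive lock)
    cluster test2 using idx2;
    vacuum analyze test2;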
On Sun, Oct 30, 2011 at 8:02 AM, Kääriäinen Anssi <anssi.kaariainen@thl.fi> wrote: > Table size is around 600MB, index size is around 350MB and VM on-disk > size is 16kB with default fillfactor. With fillfactor = 10, the VM size is 104 > KB, and table size is around 6GB. The index size is the same.

What I think you're probably measuring here (oprofile would tell us for sure) is that once the size of the table goes beyond about half a gigabyte, it will have more than one page in the visibility map. The index-only scan code keeps the most recently used visibility map page pinned to save on overhead, but if you're bouncing back and forth between data in the first ~500MB of the table and data in the last ~100MB, each switch will result in dropping the current pin and getting a new one, which figures to be fairly expensive. When the table is only a little over 500MB, you're probably only changing VM pages every couple of tuples, but with a 6GB table just about every tuple will switch to a new VM page.

Now, maybe you're right and the CPU caches are the more significant effect. But I wouldn't like to bet on it without seeing how much the drop-and-get-new-pin operations are costing us.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
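The "about half a gigabyte" figure falls out of the visibility map geometry; a back-of-the-envelope check, assuming 8kB pages, a roughly 24-byte page header, and (as in this era of the code) one visibility bit per heap page:

    -- heap bytes covered by one visibility-map page:
    -- (usable bytes per VM page) * (8 heap pages per byte) * (8kB per heap page)
    select pg_size_pretty(((8192 - 24) * 8)::bigint * 8192) as heap_per_vm_page;
    -- roughly 510 MB, which is why a table a bit over ~500MB already spans
    -- two VM pages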
On 10/31/2011 02:44 PM, Robert Haas wrote: > What I think you're probably measuring here (oprofile would tell us > for sure) is that once the size of the table goes beyond about half a > gigabyte, it will have more than one page in the visibility map. The > index-only scan code keeps the most recently used visibility map page > pinned to save on overhead, but if you're bouncing back and forth > between data in the first ~500MB of the table and data in the last > ~100MB, each switch will result in dropping the current pin and > getting a new one, which figures to be fairly expensive. With the > table is only a little over 500GB, you're probably only changing VM > pages every couple of tuples, but with a 6GB table just about every > tuple will switch to a new VM page. > > Now, maybe you're right and the CPU caches are the more significant > effect. But I wouldn't like to bet on it without seeing how much the > drop-and-get-new-pin operations are costing us. > Maybe I should have left the analysis part out of the post, I don't know the internals, so my analysis is likely to be wrong. Now that I think of it, claiming that the cache effect is 50% of the runtime is likely a little wrong... However the part about clustering being important is still correct. According to the test, you can get 50% overhead because of random access to the VM. Stupid question, but why not keep the whole VM pinned? - Anssi
On Mon, Oct 31, 2011 at 9:51 AM, Anssi Kääriäinen <anssi.kaariainen@thl.fi> wrote: > Stupid question, but why not keep the whole VM pinned? It might be that keeping more than one VM page pinned is a good idea, but we'd have to think carefully about it. For example, if we pin too many pages in shared_buffers, other queries could start erroring out for failure to find an evictable buffer. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company