Thread: So, is COUNT(*) fast now?
Laments at:
http://wiki.postgresql.org/wiki/FAQ#Why_is_.22SELECT_count.28.2A.29_FROM_bigtable.3B.22_slow.3F
http://wiki.postgresql.org/wiki/Slow_Counting

I tried this on my MacBook Pro this morning, using pgbench -i -s 500 to create a database about 7.5GB in size, and then using "SELECT sum(1) FROM pgbench_accounts" as a test query, on a build WITHOUT --enable-cassert. This machine has 4GB of memory, and I set shared_buffers = 400MB. (No, I'm not sure whether that's the optimal setting for shared_buffers for this machine.)

With enable_indexonlyscan = false, times for this query are: 96799.747 ms, 89108.987 ms, 114433.664 ms.
With enable_indexonlyscan = true, times were: 16779.325 ms, 16537.726 ms, 16703.229 ms.

That's about six times faster. It's worth noting that the pgbench_accounts table has relatively narrow rows. On a table with wider rows (but not so wide that they get toasted and become narrow again), the benefit might be larger. On the other hand, this is also a table that's just been vacuumed, and you've got the happy case where the table (6404 MB) does not fit in memory but the index (1071 MB) does.

What happens on a smaller test case? Here are the results at scale factor = 100:

enable_indexonlyscan = false times: 1774.945 ms, 1784.985 ms, 1836.099 ms
enable_indexonlyscan = true times: 1450.445 ms, 1452.407 ms, 1452.426 ms

That's about a 23% speedup. At this scale factor everything fits into memory, but only the index (214 MB) fits into shared_buffers, while the table (1281 MB) does not.

Let's crank the scale factor down some more. Here are the results at scale_factor = 20:

enable_indexonlyscan = false times: 352.213 ms, 353.988 ms, 350.859 ms
enable_indexonlyscan = true times: 300.623 ms, 301.355 ms, 302.346 ms

Now the entire database fits into shared_buffers, but we're still seeing a 17% speedup. But this turns out to be misleading. The ring-buffer logic is actually preventing shared_buffers from getting all of the heap blocks in cache quickly. If I run the query over and over again until the whole table actually makes it into shared buffers, the sequential scan gets much faster:

enable_indexonlyscan = false times after lots and lots of prewarming: 215.487 ms, 219.006 ms, 218.490 ms

That's a bit disappointing - it's now more than a third faster to do the sequential scan, even though the sequential scan has to touch six times as many blocks (at scale factor 20, the index is 43 MB and the table is 256 MB), all of which are in cache. Of course, touching that many fewer blocks does have some advantages if there is concurrent activity on the system, but it still seems unfortunate that the ratio of runtime to blocks touched is more than 8x higher for the index-only case.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
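A minimal sketch of the kind of test being described, for anyone who wants to reproduce it. This assumes a 9.2-era development build that has the enable_indexonlyscan GUC, and the default pgbench primary-key index name pgbench_accounts_pkey; it is not Robert's exact script.

    -- shell: pgbench -i -s 500   (or -s 100 / -s 20 for the smaller cases)
    -- then, in psql:
    vacuum analyze pgbench_accounts;   -- sets the visibility map bits

    set enable_indexonlyscan = off;
    explain (analyze, buffers) select sum(1) from pgbench_accounts;

    set enable_indexonlyscan = on;
    explain (analyze, buffers) select sum(1) from pgbench_accounts;

    -- the size figures quoted above come from something like:
    select pg_size_pretty(pg_relation_size('pgbench_accounts'))      as table_size,
           pg_size_pretty(pg_relation_size('pgbench_accounts_pkey')) as index_size;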
Robert Haas <robertmhaas@gmail.com> writes: > That's a bit disappointing - it's now more than a third faster to do > the sequential scan, even though the sequential scan has to touch six > times as many blocks (at scale factor 20, index is 43 MB, table is 256 > MB) all of which are in cache. Of course, touching that many fewer > blocks does have some advantages if there is concurrent activity on > the system, but it still seems unfortunate that the ratio of runtime > to blocks touched is more than 8x higher for the index-only case. I don't know why you'd imagine that touching an index is free, or even cheap, CPU-wise. The whole point of the index-only optimization is to avoid I/O. When you try it on a case where there's no I/O to be saved, *and* no shared-buffers contention to be avoided, there's no way it's going to be a win. regards, tom lane
On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> That's a bit disappointing - it's now more than a third faster to do >> the sequential scan, even though the sequential scan has to touch six >> times as many blocks (at scale factor 20, index is 43 MB, table is 256 >> MB) all of which are in cache. Of course, touching that many fewer >> blocks does have some advantages if there is concurrent activity on >> the system, but it still seems unfortunate that the ratio of runtime >> to blocks touched is more than 8x higher for the index-only case. > > I don't know why you'd imagine that touching an index is free, or even > cheap, CPU-wise. The whole point of the index-only optimization is to > avoid I/O. When you try it on a case where there's no I/O to be saved, > *and* no shared-buffers contention to be avoided, there's no way it's > going to be a win. Well, call me naive, but I would have thought touching six times less data would make the operation run faster, not slower. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I don't know why you'd imagine that touching an index is free, or even >> cheap, CPU-wise. The whole point of the index-only optimization is to >> avoid I/O. When you try it on a case where there's no I/O to be saved, >> *and* no shared-buffers contention to be avoided, there's no way it's >> going to be a win. > Well, call me naive, but I would have thought touching six times less > data would make the operation run faster, not slower. It's not "touching six times less data". It's touching the exact same number of tuples either way, just index tuples in one case and heap tuples in the other. You've arranged the test case so that all these tuples are in shared buffers already, so there's no data movement to be avoided. What this test case proves is that btree's overhead per index tuple touched is significantly more than the cost of the fastest path through HeapTupleSatisfiesMVCC, which I don't find surprising considering how much sweat has been expended on that code path over the years. (It may well be that it's not even btree at fault but the per-tuple visits to the visibility map ... did you do any oprofiling yet?) regards, tom lane
On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I don't know why you'd imagine that touching an index is free, or even >>> cheap, CPU-wise. The whole point of the index-only optimization is to >>> avoid I/O. When you try it on a case where there's no I/O to be saved, >>> *and* no shared-buffers contention to be avoided, there's no way it's >>> going to be a win. > >> Well, call me naive, but I would have thought touching six times less >> data would make the operation run faster, not slower. > > It's not "touching six times less data". It's touching the exact same > number of tuples either way, just index tuples in one case and heap > tuples in the other. Yeah, but it works out to fewer pages. > You've arranged the test case so that all these > tuples are in shared buffers already, so there's no data movement to be > avoided. What this test case proves is that btree's overhead per index > tuple touched is significantly more than the cost of the fastest path > through HeapTupleSatisfiesMVCC, which I don't find surprising > considering how much sweat has been expended on that code path over the > years. I think HeapTupleSatisfiesMVCC is probably being skipped anyway in this case, since all the heap pages should be PD_ALL_VISIBLE. > (It may well be that it's not even btree at fault but the per-tuple > visits to the visibility map ... did you do any oprofiling yet?) No, but I think that might be a good idea. Maybe I'll go do that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> What this test case proves is that btree's overhead per index >> tuple touched is significantly more than the cost of the fastest path >> through HeapTupleSatisfiesMVCC, which I don't find surprising >> considering how much sweat has been expended on that code path over the >> years. > I think HeapTupleSatisfiesMVCC is probably being skipped anyway in > this case, since all the heap pages should be PD_ALL_VISIBLE. Proves my point ;-) ... you're comparing a code path that's been beat on for *years* with one that just got written. regards, tom lane
On Fri, Oct 21, 2011 at 2:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think HeapTupleSatisfiesMVCC is probably being skipped anyway in >> this case, since all the heap pages should be PD_ALL_VISIBLE. > > Proves my point ;-) ... you're comparing a code path that's been beat on > for *years* with one that just got written.

I know. I wrote a chunk of it. :-) My point is just that it'd be nice to make it better.

Anyhow, here's the scoop. On my desktop machine running F14, running SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of oprofile data:

    176830  13.0801  postgres         postgres         ExecProject
    170028  12.5770  postgres         postgres         IndexOnlyNext
     96631   7.1478  postgres         postgres         visibilitymap_test
     86019   6.3628  postgres         postgres         advance_aggregates
     74366   5.5009  postgres         postgres         ExecScan
     72428   5.3575  postgres         postgres         ExecClearTuple
     68483   5.0657  postgres         postgres         btgettuple
     60614   4.4836  postgres         postgres         advance_transition_function
     59680   4.4145  postgres         postgres         ExecProcNode
     52295   3.8683  postgres         postgres         _bt_checkkeys
     52078   3.8522  libc-2.12.90.so  libc-2.12.90.so  __memcpy_sse2
     49548   3.6651  postgres         postgres         index_getnext_tid
     48265   3.5702  postgres         postgres         ExecEvalConst
     42989   3.1799  postgres         postgres         _bt_next
     40544   2.9990  postgres         postgres         _bt_readpage
     35162   2.6009  no-vmlinux       no-vmlinux       /no-vmlinux
     33639   2.4883  postgres         postgres         MemoryContextReset

And without index-only scans, but with everything in shared_buffers:

    169515  18.4261  postgres         postgres         ExecProject
     94827  10.3076  postgres         postgres         heapgettup_pagemode
     84850   9.2231  postgres         postgres         advance_aggregates
     57998   6.3043  postgres         postgres         advance_transition_function
     55638   6.0478  postgres         postgres         ExecEvalConst
     53684   5.8354  postgres         postgres         heapgetpage
     51411   5.5883  postgres         postgres         ExecScan
     48387   5.2596  postgres         postgres         ExecProcNode
     44129   4.7968  postgres         postgres         ExecStoreTuple
     30759   3.3435  postgres         postgres         heap_getnext
     25923   2.8178  postgres         postgres         SeqNext
     24145   2.6245  postgres         postgres         CheckForSerializableConflictOut
     23155   2.5169  postgres         postgres         ExecAgg
     18864   2.0505  postgres         postgres         heap_page_prune_opt
     18784   2.0418  no-vmlinux       no-vmlinux       /no-vmlinux

The index-only scan takes about 385 ms, while the non-index-only version takes about 284 ms.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 21, 2011 at 3:07 PM, Robert Haas <robertmhaas@gmail.com> wrote: > [ oprofile results ]

*grovels through the line-by-line results*

Hmm, I guess there is a bit of a hotspot in StoreIndexTuple, which is probably being folded into IndexOnlyNext in the per-function timings:

    ExecClearTuple(slot);
    for (i = 0; i < nindexatts; i++)
        values[i] = index_getattr(itup, i + 1, itupdesc, &isnull[i]);
    ExecStoreVirtualTuple(slot);

If I'm reading these results right, that section is about 3% of the total number of samples. Also, this line is kind of expensive:

    if (!visibilitymap_test(scandesc->heapRelation,
                            ItemPointerGetBlockNumber(tid),
                            &node->ioss_VMBuffer))

Around 2%. But I don't see any way to avoid that, or even make it cheaper.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Anyhow, here's the scoop. On my desktop machine running F14, running > SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of > oprofile data: > 176830 13.0801 postgres postgres ExecProject Hm, that's weird. In both these cases, I'd have expected that ExecProject would get optimized away thanks to selection of a physical tlist for the scan node. Wonder if that got broken ... regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > Hmm, I guess there is a bit of a hotspot in StoreIndexTuple, which is > probably being folded into IndexOnlyNext in the per-function timings: > ExecClearTuple(slot); > for (i = 0; i < nindexatts; i++) > values[i] = index_getattr(itup, i + 1, itupdesc, &isnull[i]); > ExecStoreVirtualTuple(slot); I had wondered whether it'd be worth optimizing that along the lines of slot_getallattrs(). But most indexes probably have only one column, or anyway not enough to make for a useful savings. regards, tom lane
On Fri, Oct 21, 2011 at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Anyhow, here's the scoop. On my desktop machine running F14, running >> SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of >> oprofile data: > >> 176830 13.0801 postgres postgres ExecProject > > Hm, that's weird. In both these cases, I'd have expected that > ExecProject would get optimized away thanks to selection of a physical > tlist for the scan node. Wonder if that got broken ... If it did, it looks like it wasn't recent. I set up the same test case on my MacBook using REL9_1_STABLE and REL9_0_STABLE and set a breakpoint on ExecProject(). Both back-branches appear to also call ExecProject() for every tuple. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 21, 2011 at 11:14 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> I don't know why you'd imagine that touching an index is free, or even >>>> cheap, CPU-wise. The whole point of the index-only optimization is to >>>> avoid I/O. When you try it on a case where there's no I/O to be saved, >>>> *and* no shared-buffers contention to be avoided, there's no way it's >>>> going to be a win. >> >>> Well, call me naive, but I would have thought touching six times less >>> data would make the operation run faster, not slower. >> >> It's not "touching six times less data". It's touching the exact same >> number of tuples either way, just index tuples in one case and heap >> tuples in the other. > > Yeah, but it works out to fewer pages. But since those pages are already in RAM, why would it matter all that much? (Other than in the case of highly concurrent access, which you don't seem to be testing?) One of Tom's commits that made it not lock the same index page over and over again (once for each tuple on it) made me think it should be much faster than the seq scan, but a bit of random flailing about convinced me that any savings from this were compensated for by the high overhead of FunctionCall2Coll and all of the hokey-pokey that that call entails, which a seqscan can skip entirely. If count(*) could cause the index-only scan to happen in physical order of the index, rather than logical order, that might be a big win. Both for all in memory and for not-all-in-memory. Cheers, Jeff
On Friday, October 21, 2011 08:14:12 PM Robert Haas wrote: > On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: > >> On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>> I don't know why you'd imagine that touching an index is free, or even > >>> cheap, CPU-wise. The whole point of the index-only optimization is to > >>> avoid I/O. When you try it on a case where there's no I/O to be saved, > >>> and no shared-buffers contention to be avoided, there's no way it's > >>> going to be a win. > >> > >> Well, call me naive, but I would have thought touching six times less > >> data would make the operation run faster, not slower. > > > > It's not "touching six times less data". It's touching the exact same > > number of tuples either way, just index tuples in one case and heap > > tuples in the other. > > Yeah, but it works out to fewer pages. But access to those is not sequential. I guess if you measure cache hit ratios the index scan will come out significantly worse. Andres
2011/10/22 Andres Freund <andres@anarazel.de>

"But access to those is not sequential" - yes, I agree.

On Friday, October 21, 2011 08:14:12 PM Robert Haas wrote:
> On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> On Fri, Oct 21, 2011 at 1:18 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >>> I don't know why you'd imagine that touching an index is free, or even
> >>> cheap, CPU-wise. The whole point of the index-only optimization is to
> >>> avoid I/O. When you try it on a case where there's no I/O to be saved,
> >>> and no shared-buffers contention to be avoided, there's no way it's
> >>> going to be a win.
> >>
> >> Well, call me naive, but I would have thought touching six times less
> >> data would make the operation run faster, not slower.
> >
> > It's not "touching six times less data". It's touching the exact same
> > number of tuples either way, just index tuples in one case and heap
> > tuples in the other.
>
> Yeah, but it works out to fewer pages.

But access to those is not sequential. I guess if you measure cache hit ratios
the index scan will come out significantly worse.

Andres
In my opinion the problem is this: if the query needs to scan the whole b-tree index without
accessing the table rows, the better way to read the index is sequentially.
In fact, queries like count(*), and others that don't need the data in "order", mean we
could read all the blocks (better, "only" the leaf blocks) without touching the branch blocks much.

For example, for a query like this:

select column_a from table;

it is better to read the data from the index sequentially. For a query like this:

select column_a from table order by column_a;

it is better to read the data from the index with a range scan, from the root block to the first branch blocks and their leaf blocks, so we could "save"
the sorting.

Mat
Andres Freund <andres@anarazel.de> writes: > On Friday, October 21, 2011 08:14:12 PM Robert Haas wrote: >> On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> It's not "touching six times less data". It's touching the exact same >>> number of tuples either way, just index tuples in one case and heap >>> tuples in the other. >> Yeah, but it works out to fewer pages. > But access to those is not sequential. I guess if you measure cache hit ratios > the index scan will come out significantly worse. Huh? In the case he's complaining about, the index is all in RAM. Sequentiality of access is not an issue (at least not at the page level --- within a page I suppose there could be cache-line effects). regards, tom lane
On Saturday, October 22, 2011 05:20:26 PM Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On Friday, October 21, 2011 08:14:12 PM Robert Haas wrote: > >> On Fri, Oct 21, 2011 at 2:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > >>> It's not "touching six times less data". It's touching the exact same > >>> number of tuples either way, just index tuples in one case and heap > >>> tuples in the other. > >> > >> Yeah, but it works out to fewer pages. > > > > But access to those is not sequential. I guess if you measure cache hit > > ratios the index scan will come out significantly worse. > > Huh? In the case he's complaining about, the index is all in RAM. > Sequentiality of access is not an issue (at least not at the page > level --- within a page I suppose there could be cache-line effects). I was talking about L2/L3 caches... Andres
Andres Freund <andres@anarazel.de> writes: > On Saturday, October 22, 2011 05:20:26 PM Tom Lane wrote: >> Huh? In the case he's complaining about, the index is all in RAM. >> Sequentiality of access is not an issue (at least not at the page >> level --- within a page I suppose there could be cache-line effects). > I was talking about L2/L3 caches... Yeah, but unless you think cache lines cross page boundaries (and we do take pains to align the buffers on 8K addresses), there's not going to be any sequentiality effect. Even if there were, it would only apply if the pages chanced to be adjacent in the buffer array, and there is no reason to expect that to be the case, for either seqscans or indexscans. regards, tom lane
On Fri, Oct 21, 2011 at 10:57 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> Yeah, but it works out to fewer pages. > > But since those pages are already in RAM, why would it matter all that > much? (Other than in the case of highly concurrent access, which you > don't seem to be testing?) Well, because memory access takes time, and accessing more memory takes more time. In the testing that I've done recently, performance on in-memory workloads seems to be extremely sensitive to memory speed, so you'd think that cutting down on memory access would be a win. > One of Tom's commits that made it not lock the same index page over > and over again (once for each tuple on it) made me think it should be > much faster than the seq scan, but a bit of random flailing about > convinced me that any saving from this were compensated for by the > high over head of FunctionCall2Coll and all of the hokey-pokey that > that call entails, which a seqscan can skip entirely. Huh. Not sure whether that's significant or not. > If count(*) could cause the index-only scan to happen in physical > order of the index, rather than logical order, that might be a big > win. Both for all in memory and for not-all-in-memory. That's an interesting point. I sort of assumed that would only help for not-all-in-memory, but maybe not. The trouble is that I think there are some problematic concurrency issues there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
----- Quote from Tom Lane (tgl@sss.pgh.pa.us), on 22.10.2011 at 19:19 -----

> Andres Freund writes:
>> On Saturday, October 22, 2011 05:20:26 PM Tom Lane wrote:
>>> Huh? In the case he's complaining about, the index is all in RAM.
>>> Sequentiality of access is not an issue (at least not at the page
>>> level --- within a page I suppose there could be cache-line effects).
>
>> I was talking about L2/L3 caches...
>
> Yeah, but unless you think cache lines cross page boundaries (and we do
> take pains to align the buffers on 8K addresses), there's not going to
> be any sequentiality effect. Even if there were, it would only apply
> if the pages chanced to be adjacent in the buffer array, and there is no
> reason to expect that to be the case, for either seqscans or indexscans.
>
> regards, tom lane

I worked on the in-memory hash tables of the Parrot project. It is not the same as btrees, but the structure and memory layout are not that different - tuples go into pages, etc.

I have benchmarked iterating over such hash tables - a sequential scan of the same table comes out 20-30% faster than a scan ordered by the hash value of the key. And this is overhead of CPU cache lines only - the numbers of instructions executed on the processor are pretty much the same (counted by valgrind).

So I do think that if we have sequential scan of indexes (physical order) it will help even when all the data is in the buffer cache.

Best regards

--
Luben Karavelov
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 21, 2011 at 10:57 PM, Jeff Janes <jeff.janes@gmail.com> wrote: >> If count(*) could cause the index-only scan to happen in physical >> order of the index, rather than logical order, that might be a big >> win. Both for all in memory and for not-all-in-memory. > That's an interesting point. I sort of assumed that would only help > for not-all-in-memory, but maybe not. The trouble is that I think > there are some problematic concurrency issues there. Yeah. We managed to make physical-order scanning work for VACUUM because it's okay if VACUUM sometimes sees the same index tuple twice; it'll just make the same decision about (not) deleting it. That will not fly for regular querying. regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 21, 2011 at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Robert Haas <robertmhaas@gmail.com> writes: >>> Anyhow, here's the scoop. On my desktop machine running F14, running >>> SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of >>> oprofile data: >>> 176830 13.0801 postgres postgres ExecProject >> Hm, that's weird. In both these cases, I'd have expected that >> ExecProject would get optimized away thanks to selection of a physical >> tlist for the scan node. Wonder if that got broken ... > If it did, it looks like it wasn't recent. I set up the same test > case on my MacBook using REL9_1_STABLE and REL9_0_STABLE and set a > breakpoint on ExecProject(). Both back-branches appear to also call > ExecProject() for every tuple. Oh, the ExecProject calls are coming from advance_aggregates(). Move along, nothing to see here ... regards, tom lane
On Fri, Oct 21, 2011 at 12:07 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Oct 21, 2011 at 2:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I think HeapTupleSatisfiesMVCC is probably being skipped anyway in >>> this case, since all the heap pages should be PD_ALL_VISIBLE. >> >> Proves my point ;-) ... you're comparing a code path that's been beat on >> for *years* with one that just got written. > > I know. I wrote a chunk of it. :-) My point is just that it'd be > nice to make it better. > > Anyhow, here's the scoop. On my desktop machine running F14, running > SELECT sum(1) FROM pgbench_accounts in a tight loop, 60 s worth of > oprofile data: > > 176830 13.0801 postgres postgres ExecProject Hi Robert, count(*) and sum(1) do different things internally, and in my hands sum(1) is ~10% slower. I don't know how to dump the output of ExecBuildProjectionInfo into a human readable form, so I don't know the basis of the difference. But I wonder if using count(*) would lower the weight of the ExecProject function. Cheers, Jeff
Jeff Janes <jeff.janes@gmail.com> writes: > count(*) and sum(1) do different things internally, and in my hands > sum(1) is ~10% slower. > I don't know how to dump the output of ExecBuildProjectionInfo into a > human readable form, so I don't know the basis of the difference. But > I wonder if using count(*) would lower the weight of the ExecProject > function. Probably. count() doesn't actually have any arguments, so there's nothing for ExecProject to do. sum(1) invokes the generic case there (ExecTargetList). I suppose we could add another special-case path for constant tlist elements, but I suspect that would mostly be optimizing for benchmarks rather than helping real-world cases. regards, tom lane
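A quick way to see the difference Jeff and Tom are describing on one's own system - just a sketch, not a benchmark from this thread. The plans should come out identical, so any gap is per-row aggregate and projection overhead rather than the scan itself:

    \timing on
    -- count(*) takes no arguments, so there is nothing for ExecProject to
    -- evaluate per row; sum(1) evaluates a constant target list entry each time
    select count(*) from pgbench_accounts;
    select sum(1) from pgbench_accounts;

    -- confirm both use the same plan shape:
    explain (analyze) select count(*) from pgbench_accounts;
    explain (analyze) select sum(1) from pgbench_accounts;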
On Fri, Oct 21, 2011 at 12:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > Also, this line is kind of expensive: > > if (!visibilitymap_test(scandesc->heapRelation, > ItemPointerGetBlockNumber(tid), > &node->ioss_VMBuffer)) > > Around 2%. But I don't see any way to avoid that, or even make it cheaper. Could we cache by ItemPointerGetBlockNumber(tid) the results of those tests, for groups of tids on the same index page? How useful this would be would depend on how well-clustered the table and index are. Cheers, Jeff
On Sun, Oct 23, 2011 at 7:01 PM, Jeff Janes <jeff.janes@gmail.com> wrote: > On Fri, Oct 21, 2011 at 12:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> Also, this line is kind of expensive: >> >> if (!visibilitymap_test(scandesc->heapRelation, >> ItemPointerGetBlockNumber(tid), >> &node->ioss_VMBuffer)) >> >> Around 2%. But I don't see any way to avoid that, or even make it cheaper. > > Could we cache by ItemPointerGetBlockNumber(tid) the results of those > tests, for groups of tids on the same index page? > > How useful this would be would depend on how well-clustered the table > and index are. I thought about that, but the existing code is so ridiculously cheap that it's hard to believe a caching layer would save enough to pay for itself. I mean, I presume that the cost attributed to that line has to be associated with either (a) one of the pointer dereferences, (b) the expense of evaluating ItemPointerGetBlockNumber(), (c) setting up the function call, or perhaps (d) overhead incurred as a result of branch mis-prediction. The actual time spent *inside* visibilitymap_test will be attributed to that function, not this one. If you add up the time for this line and visibilitymap_test(), it's like 10% of the runtime, which seems pretty significant. But it's 10% of the runtime that is spent on basically a handful of arithmetic operations and then reading a byte from shared memory. It's astonishing to find that so expensive on a test with just one backend running. If you stick some kind of cache in there, it's going to involve adding a branch that isn't there now, and I think that's going to be pricey given how hot this code apparently is. Also, I'm not sure it's safe. Suppose that we lock the index page, return a tuple, check the visibility map, and find the page all visible. Another transaction comes along, adds a tuple to the index page, and clears the visibility map bit. We then go back, relock the index page, and return another tuple. We'd better notice that the visibility map bit has now been cleared, or we're in trouble. I wonder if it might help to create some method for the index to return all the matching keys on a single index page in one call. If you're dealing with an MVCC snapshot, any new tuples added after the start of the scan can't be visible anyway. That would save the overhead of unlocking and relocking the buffer once per tuple, and probably some overhead associated with validating and decoding the possibly-changed page contents each time. If you did it that way, it would also be safe to do what you're proposing - if a bunch of the tuples on the index page are also on the same heap page, you could do one visibility map check for all of them. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I had wondered whether it'd be worth optimizing that along the
> lines of slot_getallattrs(). But most indexes probably have only
> one column, or anyway not enough to make for a useful savings.

From a heavily-used production database:

cir=> select indnatts, count(*) from pg_index group by indnatts order by indnatts;
 indnatts | count
----------+-------
        1 |   200
        2 |   684
        3 |   155
        4 |    76
        5 |    43
        6 |    13
        7 |     2
        9 |     1
(8 rows)

This includes system table and TOAST table indexes (which seem to have two columns). There are over 400 user tables, each of which has a primary key, so most primary keys in our database are more than one column.

-Kevin

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I had wondered whether it'd be worth optimizing that along the
>> lines of slot_getallattrs(). But most indexes probably have only
>> one column, or anyway not enough to make for a useful savings.

> From a heavily-used production database:
> cir=> select indnatts, count(*) from pg_index group by indnatts
> order by indnatts;
>  indnatts | count
> ----------+-------
>         1 |   200
>         2 |   684
>         3 |   155
>         4 |    76
>         5 |    43
>         6 |    13
>         7 |     2
>         9 |     1
> (8 rows)

> This includes system table and TOAST table indexes (which seem to
> have two columns).

Yeah, TOAST indexes are 2-column. It would be best to exclude those from your counts, since it seems pretty unlikely that anyone will care how fast nodeIndexonlyscan.c is for scans on toast tables.

> There are over 400 user tables, each of which
> has a primary key, so most primary keys in our database are more
> than one column.

It doesn't look to me like the mean is above 2 (unless you have many fewer toast tables than I suspect), so trying to optimize many-column cases isn't going to help.

regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah, TOAST indexes are 2-column. It would be best to exclude
> those from your counts, since it seems pretty unlikely that anyone
> will care how fast nodeIndexonlyscan.c is for scans on toast
> tables.

User indexes (excluding toast):

 indnatts | count
----------+-------
        1 |   200
        2 |   222
        3 |   155
        4 |    76
        5 |    43
        6 |    13
        7 |     2
        9 |     1
(8 rows)

System indexes (excluding toast):

 indnatts | count
----------+-------
        1 |    46
        2 |    24
        3 |     9
        4 |     5
(4 rows)

> It doesn't look to me like the mean is above 2 (unless you have
> many fewer toast tables than I suspect), so trying to optimize
> many-column cases isn't going to help.

The mean is 2.4 (give or take a little depending on whether you include system tables). I have no idea where the optimization becomes worthwhile, but the assertion that most indexes probably have a single column worried me. I'm sure there are databases where that is true (especially for those who insist on adding a meaningless surrogate key column to every table), but there are many where it isn't true. I would guess that our average of 2.4 is higher than most, though.

-Kevin
Copy/paste problems -- the first set includes the system tables except for toast. User tables would be the difference between the results below. Sorry.

-Kevin

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> User indexes (excluding toast):
>
>  indnatts | count
> ----------+-------
>         1 |   200
>         2 |   222
>         3 |   155
>         4 |    76
>         5 |    43
>         6 |    13
>         7 |     2
>         9 |     1
> (8 rows)
>
> System indexes (excluding toast):
>
>  indnatts | count
> ----------+-------
>         1 |    46
>         2 |    24
>         3 |     9
>         4 |     5
> (4 rows)
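One way to get those per-column counts with TOAST indexes excluded, as Tom suggests - a sketch against the system catalogs; Kevin's exact query isn't shown in the thread:

    select i.indnatts, count(*)
      from pg_index i
      join pg_class c on c.oid = i.indexrelid
      join pg_namespace n on n.oid = c.relnamespace
     where n.nspname <> 'pg_toast'               -- drop TOAST-table indexes
       -- add:  and n.nspname <> 'pg_catalog'    -- to see user indexes only
     group by i.indnatts
     order by i.indnatts;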
On Mon, Oct 24, 2011 at 2:15 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: >> It doesn't look to me like the mean is above 2 (unless you have >> many fewer toast tables than I suspect), so trying to optimize >> many-column cases isn't going to help. > > The mean is 2.4 (give or take a little depending on whether you > include system tables). I have no idea where the optimization > becomes worthwhile, but the assertion that most indexes probably > have a single column worried me. I'm sure there are databases where > that is true (especially for those who insist on adding a > meaningless surrogate key column to every table), but there are many > where it isn't true. I would guess that our average of 2.4 is > higher than most, though. Keep in mind that the existence of index-only scans is going to encourage people to try to create covering indexes on columns that aren't indexed today. It's not clear how much of a win that will be; for certain workloads, HOT might make it backfire spectacularly. But even though Tom's statement that most indexes are one column might be a slight exaggeration, I suspect it probably is true that the optimizations he's talking about for large numbers of columns won't produce any material benefit even for a 3 or 4 column index. Which makes me think maybe we should focus our efforts elsewhere. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > But even though Tom's statement that most indexes are one column might > be a slight exaggeration, I suspect it probably is true that the > optimizations he's talking about for large numbers of columns won't > produce any material benefit even for a 3 or 4 column index. Which > makes me think maybe we should focus our efforts elsewhere. Right. If we thought the average was something like ten, it might be worth pursuing optimizations similar to slot_getallattrs. If it's around two or three, almost certainly not. Your point about people trying to create wider indexes to exploit index-only scans is an interesting one, but I think it's premature to optimize on the basis of hypotheses about what people might do in future. Not sure about your other idea of returning multiple tuples per amgettuple call. The trouble with that is that it will add complexity (and hence cycles) at the nodeIndexscan level, because now nodeIndexscan will have to buffer those tuples, keep track of whether it's fetching forward or backward, etc etc. Plus another layer of the same in indexam.c (index_getnext etc). I'm not at all convinced that it's likely to be a net win. I wonder how trustworthy the measure of the visibilitymap_test call site as a consumer of cycles really is. I've frequently noticed that oprofile blames remarkably large fractions of the runtime on individual statements that appear to be quite trivial. I'm not sure if that represents real hardware-level effects such as cache line switching, or whether it's just measurement artifacts. Keep in mind that sampling-based measurements are always subject to sampling artifacts. regards, tom lane
On 10/24/11 12:35 PM, Tom Lane wrote: > Your point about people trying to create wider indexes to exploit > index-only scans is an interesting one, but I think it's premature to > optimize on the basis of hypotheses about what people might do in > future. I don't think that this is hypothetical at all. I know *I'll* be doing it, and we can expect users who are familiar with MySQL and Oracle to do it as well. No, it won't be the majority of our users, who are using ORMs and thus don't really think about indexing at all. But it will be a significant number of users who are performance-sensitive ... such as most or all of our data warehousing users. Mind you, we're pretty much talking exclusively about users whose tables don't fit in memory ... usually tables which are 10X or more the size of memory. One case which is going to be critical to test is the "join" table, i.e. the table which supports many-to-many joins and consists only of keys from the respective two other tables. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
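For the many-to-many "join" table case, the kind of covering index people are likely to try looks something like this - a sketch with made-up table and column names, not something taken from the thread:

    -- hypothetical many-to-many join table
    create table book_author (
        book_id   integer not null,
        author_id integer not null,
        primary key (book_id, author_id)
    );

    -- the primary key already covers lookups by book_id; the reverse-order
    -- index lets "which books does this author have?" be answered from the
    -- index alone once the table has been vacuumed and its pages are
    -- all-visible
    create index book_author_author_book_idx on book_author (author_id, book_id);

    explain select book_id from book_author where author_id = 42;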
On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Your point about people trying to create wider indexes to exploit > index-only scans is an interesting one, but I think it's premature to > optimize on the basis of hypotheses about what people might do in > future. Well, I don't think it's too much of a stretch to guess that people will try to use covering indexes; that's common practice on other products and a not-uncommon heartfelt request from people with large (non-memory-resident) databases they want to migrate to PostgreSQL. Exactly to what degree they'll do that, and how well it will work, is another question. But I have little doubt that it will be tried. > Not sure about your other idea of returning multiple tuples per > amgettuple call. The trouble with that is that it will add complexity > (and hence cycles) at the nodeIndexscan level, because now nodeIndexscan > will have to buffer those tuples, keep track of whether it's fetching > forward or backward, etc etc. Plus another layer of the same in > indexam.c (index_getnext etc). I'm not at all convinced that it's > likely to be a net win. I definitely agree that you don't want two layers of caching, but I don't see why we'd need that. I wasn't thinking of changing index_getnext() at all, but rather adding a new API that fills a buffer (or a pair of buffers, maybe) with index tuples and heap TIDs. It should spit them out in the same order that multiple index_getnext() calls would have done and leave the scan position wherever it would have ended up after a number of index_getnext_tid() calls equal to the number of TIDs returned. Any user of the API (and it might be just nodeIndexonlyscan.c) would just need to keep track of the number of tuples returned and the number consumed to date. This actually gets into a wider architectural discussion, which is whether the fact that the whole executor (modulo bitmap scans and a few other special cases) operates on one tuple at a time is a good design... but my brain hurts just thinking about that. > I wonder how trustworthy the measure of the visibilitymap_test call site > as a consumer of cycles really is. I've frequently noticed that > oprofile blames remarkably large fractions of the runtime on individual > statements that appear to be quite trivial. I'm not sure if that > represents real hardware-level effects such as cache line switching, > or whether it's just measurement artifacts. Keep in mind that > sampling-based measurements are always subject to sampling artifacts. I'm not sure either. I guess we could try short-circuiting visibilitymap_test and see what that does to performance (let's leave correct answers out of it). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Josh Berkus <josh@agliodbs.com> wrote: > On 10/24/11 12:35 PM, Tom Lane wrote: >> Your point about people trying to create wider indexes to exploit >> index-only scans is an interesting one, but I think it's >> premature to optimize on the basis of hypotheses about what >> people might do in future. > > I don't think that this is hypothetical at all. I know *I'll* be > doing it, and we can expect users who are familiar with MySQL and > Oracle to do it as well. And Sybase, and MS SQL Server. And others, most likely. We've never gotten around to narrowing the indexes to which we added extra columns to overcome performance problems through "covering index" techniques when we were using Sybase, so they're already here. :-) > One case which is going to be critical to test is the "join" > table, i.e. the table which supports many-to-many joins and > consists only of keys from the respective two other tables. Yeah, that is an important use of covering indexes for us. -Kevin
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I wonder how trustworthy the measure of the visibilitymap_test call site >> as a consumer of cycles really is. > I'm not sure either. I guess we could try short-circuiting > visibilitymap_test and see what that does to performance (let's leave > correct answers out of it). That would conflate the cost of the call with the cost of the function. Maybe you could try manually inlining the visibility test? regards, tom lane
Hello,
my experience is that as soon as index-only scans are available, they are used - sometimes just because of the simple logic that a user thinks it is faster. Even when the index is made ridiculously wide just to have all the info in the index...
Regards
Wolfgang Wilhelm
From: Tom Lane <tgl@sss.pgh.pa.us>
To: Robert Haas <robertmhaas@gmail.com>
Cc: Kevin Grittner <Kevin.Grittner@wicourts.gov>; pgsql-hackers@postgresql.org
Sent: Monday, 24 October 2011, 21:35
Subject: Re: [HACKERS] So, is COUNT(*) fast now?
Tom Lane wrote: > I wonder how trustworthy the measure of the visibilitymap_test call site > as a consumer of cycles really is. I've frequently noticed that > oprofile blames remarkably large fractions of the runtime on individual > statements that appear to be quite trivial. I'm not sure if that Are these statements perhaps near kernel calls? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
On Mon, Oct 24, 2011 at 4:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Oct 24, 2011 at 3:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I wonder how trustworthy the measure of the visibilitymap_test call site >>> as a consumer of cycles really is. > >> I'm not sure either. I guess we could try short-circuiting >> visibilitymap_test and see what that does to performance (let's leave >> correct answers out of it). > > That would conflate the cost of the call with the cost of the function. > Maybe you could try manually inlining the visibility test?

I fooled around with this some more on my Fedora 14 desktop machine. pgbench database, scale factor 20, shared_buffers=400MB. I ran the query "select count(*) from pgbench_accounts". I'm having a hard time getting completely reproducible results, but it appears that warming the cache makes the sequential scan drop from maybe 390 ms to 245 ms, while an index-only scan takes about 350 ms, so which one is better depends a lot on your assumptions about what is going on on the system at the same time (which means maybe we ought not to sweat it).

If I wipe out the whole if-block that calls visibilitymap_test(), the index-only scan drops down to about 300 ms (and delivers potentially wrong answers, of course). Inlining visibilitymap_test (see attached vismap-inline.patch) causes the index-only scan to drop to about 330 ms.

I also tried changing the BufferIsValid() tests in visibilitymap_test() to use BufferIsInvalid() instead, with the sense of the tests reversed (see attached vismap-test-invalid.patch). Since BufferIsInvalid() just checks for InvalidBuffer instead of also doing the sanity checks, it's significantly cheaper. This also reduced the time to about 330 ms, so seems clearly worth doing. Apparently these changes don't stack, because doing both things only gets me down to about 320 ms, which is fairly unexciting for the amount of ugliness that inlining entails.

I tried sprinkling a little branch-prediction magic on this code using GCC's __builtin_expect(). That initially seemed to help, but once I changed the BufferIsValid() test to !BufferIsInvalid() essentially all of the savings disappeared.

I also spent some time poking through the opannotate -s -a output, which shows where time is being spent by individual assembler instruction, but also annotates the assembler listing with the original source code. At least according to oprofile, the time that is being spent in IndexOnlyNext() is mostly being spent on seemingly innocuous operations like saving and restoring registers. For example, much of the time being attributed to the visibilitymap_test() call in IndexOnlyNext() is actually attributable to the instructions that are calculating what address to pass for scandesc->heapRelation. Many but not all of the pointer dereferences at the top of IndexOnlyNext() have a chunk of cycles attributed to them, and while they're not that significant individually, they add up. Similarly, the long series of instructions to which index_getattr() resolves bleeds cycles at just about every step. There's not much to optimize there, though, unless we want to add some code that avoids decoding the tuple altogether in the particular case of a zero-argument aggregate, or maybe something more general that only pulls out the columns that are actually needed.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I also tried changing the BufferIsValid() tests in > visibilitymap_test() to use BufferIsInvalid() instead, with the sense > of the tests reversed (see attached vismap-test-invalid.patch). Since > BufferIsInvalid() just checks for InvalidBuffer instead of also doing > the sanity checks, it's significantly cheaper. This also reduced the > time to about 330 ms, so seems clearly worth doing. Hmm. I wonder whether it wouldn't be better to get rid of the range checks in BufferIsValid, or better convert them into Asserts. It seems less than intuitive that BufferIsValid and BufferIsInvalid aren't simple inverses. regards, tom lane
On Fri, Oct 28, 2011 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I also tried changing the BufferIsValid() tests in >> visibilitymap_test() to use BufferIsInvalid() instead, with the sense >> of the tests reversed (see attached vismap-test-invalid.patch). Since >> BufferIsInvalid() just checks for InvalidBuffer instead of also doing >> the sanity checks, it's significantly cheaper. This also reduced the >> time to about 330 ms, so seems clearly worth doing. > > Hmm. I wonder whether it wouldn't be better to get rid of the range > checks in BufferIsValid, or better convert them into Asserts. It seems > less than intuitive that BufferIsValid and BufferIsInvalid aren't simple > inverses. Seems reasonable. It would break if anyone is using an out-of-range buffer number in lieu of InvalidBuffer, but I doubt that anyone is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 28, 2011 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hmm. I wonder whether it wouldn't be better to get rid of the range >> checks in BufferIsValid, or better convert them into Asserts. It seems >> less than intuitive that BufferIsValid and BufferIsInvalid aren't simple >> inverses. > Seems reasonable. It would break if anyone is using an out-of-range > buffer number in lieu of InvalidBuffer, but I doubt that anyone is. Yeah, I find that unlikely as well. But leaving Asserts in place would tell us. regards, tom lane
On Fri, Oct 28, 2011 at 3:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Oct 28, 2011 at 2:48 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Hmm. I wonder whether it wouldn't be better to get rid of the range >>> checks in BufferIsValid, or better convert them into Asserts. It seems >>> less than intuitive that BufferIsValid and BufferIsInvalid aren't simple >>> inverses. > >> Seems reasonable. It would break if anyone is using an out-of-range >> buffer number in lieu of InvalidBuffer, but I doubt that anyone is. > > Yeah, I find that unlikely as well. But leaving Asserts in place would > tell us. OK. Should I go do that, or are you all over it? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Oct 28, 2011 at 3:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Yeah, I find that unlikely as well. But leaving Asserts in place would >> tell us. > OK. Should I go do that, or are you all over it? Go for it. regards, tom lane
On Fri, Oct 28, 2011 at 3:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Oct 28, 2011 at 3:27 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Yeah, I find that unlikely as well. But leaving Asserts in place would >>> tell us. > >> OK. Should I go do that, or are you all over it? > > Go for it. OK, done. Any other thoughts on the rest of what I wrote? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Quoting Robert Haas:

"""
I tried this on my MacBook Pro this morning, using pgbench -i -s 500 to create a database about 7.5GB in size, and then using "SELECT sum(1) FROM pgbench_accounts" as a test query, on a build WITHOUT --enable-cassert. This machine has 4GB of memory, and I set shared_buffers = 400MB. (No, I'm not sure whether that's the optimal setting for shared_buffers for this machine.)
"""

I did some tests where I tried to compare the effect of having the index-ordered tuples not be in the same order they are in the base table. The idea is to test what effect accessing the VM randomly, as opposed to in sequential order, has on performance. I suspect the above test accesses the VM in order (the accounts table is effectively clustered on the index used in the test). I might be mistaken here.

My test setup is this:

    drop table if exists test;
    drop table if exists test2;
    create unlogged table test /* with (fillfactor = 10) */ as
      select generate_series(0, 20*1000*1000) as id;
    create index idx1 on test(id);
    vacuum test;
    create unlogged table test2 /* with (fillfactor = 10) */ as
      (select * from test order by random());
    create index idx2 on test2(id);
    vacuum test2;

Table size is around 600MB, index size is around 350MB, and the VM's on-disk size is 16kB with the default fillfactor. With fillfactor = 10, the VM size is 104kB and the table size is around 6GB. The index size is the same.

Results for the randomly ordered table:

    # select count(*) from test2;
    14822.045 ms
    14826.253 ms
    14815.450 ms

Results for the effectively clustered table:

    # select count(*) from test;
    11761.890 ms
    11767.926 ms
    11810.900 ms

Now, this test still has the benefit of fitting the VM easily into the L1 cache. Next, I did an ugly hack to get the table size large enough that the VM will trash the L1 cache, while still having a somewhat reasonable test setup creation time. My hardware is old: 1GB of memory, and the processor is a Genuine Intel(R) CPU L2400 @ 1.66GHz. The L1 data cache size is 32kB on my machine. The hack is to simply set fillfactor to 10. The VM size is now 104kB and the table size is about 6.3 GB, while the index size is still the same as in the above test.

Results for the randomly ordered table:

    # select count(*) from test2;
    21606.683 ms
    21829.063 ms
    21637.434 ms

Results for the effectively clustered table:

    # select count(*) from test;
    11714.663 ms
    11449.264 ms
    11658.534 ms

Now, the next step would be to trash the L2 cache (a 20GB table size should do this on Sandy Bridge, where the L2 cache is 256KB). I don't have the hardware to do that test. It is worth noting that the L2 cache is shared on Sandy Bridge, so it is likely that an index-only scan of a large enough table would slow down other processes, too. Without tests this is only FUD, though. The test would be to scan a 20GB table's index repeatedly in one process, and see how it affects standard in-memory pgbench results for other processes. Compare this with doing the same with a sequential scan process.

Lessons learned (or what I learned, at least):

- Clustering is important for index-only scans. Picking a clustered index over a non-clustered index will have a big performance effect.
- Large-table index-only scans are going to be more expensive compared to a sequential scan than the pgbench accounts test suggests. I assume that the accounts table is effectively clustered on the index used. I haven't verified this.
- There is the possibility that index-only scans will trash the caches for other processes, too. Not tested, though.

I am sure these results will vary significantly based on the hardware used.
I am also notorious for screwing up benchmarks, so verifying these results is recommended. You will need around 16GB of disk space for the fillfactor = 10 test. I would recommend you have more than 1GB of memory, otherwise creating the test setup can take some time... - Anssi Kääriäinen
Sorry, I forgot to include the version used and some information about my setup:

PostgreSQL version: Git HEAD as of:
  Date: Fri Oct 28 21:18:36 2011 -0400
  Commit: 51eba98cf4595e90730dedd9305da8aa84b649ee

Compiled with defaults (the only change was --with-pgport=5431). I used default settings; the shared_buffers size is 24MB. The system is Linux Mint Debian Edition (kernel 3.0.0, gcc 4.6.1). The interesting parts about my hardware were in the original post.

- Anssi
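Anssi's first lesson - that clustering matters a lot for index-only scans - can be checked directly; a sketch reusing the test/test2 tables from his setup (pg_stats.correlation near 1.0 means heap order follows index order, so the visibility map is walked nearly sequentially):

    analyze test;
    analyze test2;

    -- correlation near 1.0 = heap order follows the index order
    select tablename, attname, correlation
      from pg_stats
     where tablename in ('test', 'test2') and attname = 'id';

    -- rewriting the randomly ordered table in index order restores the
    -- sequential access pattern (CLUSTER rewrites the table and takes an
    -- exclusive lock)
    cluster test2 using idx2;
    vacuum analyze test2;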
On Sun, Oct 30, 2011 at 8:02 AM, Kääriäinen Anssi <anssi.kaariainen@thl.fi> wrote: > Table size is around 600MB, index size is around 350MB and VM on-disk > size is 16kB with default fillfactor. With fillfactor = 10, the VM size is 104 > KB, and table size is around 6GB. The index size is the same.

What I think you're probably measuring here (oprofile would tell us for sure) is that once the size of the table goes beyond about half a gigabyte, it will have more than one page in the visibility map. The index-only scan code keeps the most recently used visibility map page pinned to save on overhead, but if you're bouncing back and forth between data in the first ~500MB of the table and data in the last ~100MB, each switch will result in dropping the current pin and getting a new one, which figures to be fairly expensive. When the table is only a little over 500MB, you're probably only changing VM pages every couple of tuples, but with a 6GB table just about every tuple will switch to a new VM page.

Now, maybe you're right and the CPU caches are the more significant effect. But I wouldn't like to bet on it without seeing how much the drop-and-get-new-pin operations are costing us.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
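The "about half a gigabyte" figure falls out of the visibility map geometry; a back-of-the-envelope check, assuming 8kB pages, a roughly 24-byte page header, and (as in this era of the code) one visibility bit per heap page:

    -- heap bytes covered by one visibility-map page:
    -- (usable bytes per VM page) * (8 heap pages per byte) * (8kB per heap page)
    select pg_size_pretty(((8192 - 24) * 8)::bigint * 8192) as heap_per_vm_page;
    -- roughly 510 MB, which is why a table a bit over ~500MB already spans
    -- two VM pages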
On 10/31/2011 02:44 PM, Robert Haas wrote: > What I think you're probably measuring here (oprofile would tell us > for sure) is that once the size of the table goes beyond about half a > gigabyte, it will have more than one page in the visibility map. The > index-only scan code keeps the most recently used visibility map page > pinned to save on overhead, but if you're bouncing back and forth > between data in the first ~500MB of the table and data in the last > ~100MB, each switch will result in dropping the current pin and > getting a new one, which figures to be fairly expensive. With the > table is only a little over 500GB, you're probably only changing VM > pages every couple of tuples, but with a 6GB table just about every > tuple will switch to a new VM page. > > Now, maybe you're right and the CPU caches are the more significant > effect. But I wouldn't like to bet on it without seeing how much the > drop-and-get-new-pin operations are costing us. > Maybe I should have left the analysis part out of the post, I don't know the internals, so my analysis is likely to be wrong. Now that I think of it, claiming that the cache effect is 50% of the runtime is likely a little wrong... However the part about clustering being important is still correct. According to the test, you can get 50% overhead because of random access to the VM. Stupid question, but why not keep the whole VM pinned? - Anssi
On Mon, Oct 31, 2011 at 9:51 AM, Anssi Kääriäinen <anssi.kaariainen@thl.fi> wrote: > Stupid question, but why not keep the whole VM pinned? It might be that keeping more than one VM page pinned is a good idea, but we'd have to think carefully about it. For example, if we pin too many pages in shared_buffers, other queries could start erroring out for failure to find an evictable buffer. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company