Thread: Accounting of zero-filled buffers in EXPLAIN (BUFFERS)
Hi, In EXPLAIN (BUFFERS), there are two kinds of cache misses that show up as "reads" when in fact they are not reads at all: 1. Relation extension, which in fact writes a zero-filled block. 2. The RBM_ZERO_* modes, which provoke neither read nor write. Here's a suggested fix. I noticed this because I have some patches in development that hit these cases a bit more and the numbers didn't match my expectation. I suppose someone might want a separate counter for zero-filled buffers (they're still buffer interactions and cache misses) but it seems like it's probably below the kind of thing we're interested in counting (though in passing, I notice recently that some kernels keep some free pages pre-zeroed so they can supply them faster, which is curious). -- Thomas Munro http://www.enterprisedb.com
Attachment
On 2018-04-30 14:59:31 +1200, Thomas Munro wrote: > Hi, > > In EXPLAIN (BUFFERS), there are two kinds of cache misses that show up > as "reads" when in fact they are not reads at all: > > 1. Relation extension, which in fact writes a zero-filled block. > 2. The RBM_ZERO_* modes, which provoke neither read nor write. Just for understanding: 2) can never happen for buffers showing up in EXPLAIN, right? I'm not saying you shouldn't fix the accounting... Greetings, Andres Freund
On Mon, Apr 30, 2018 at 3:13 PM, Andres Freund <andres@anarazel.de> wrote: > On 2018-04-30 14:59:31 +1200, Thomas Munro wrote: >> In EXPLAIN (BUFFERS), there are two kinds of cache misses that show up >> as "reads" when in fact they are not reads at all: >> >> 1. Relation extension, which in fact writes a zero-filled block. >> 2. The RBM_ZERO_* modes, which provoke neither read nor write. > > Just for understanding: 2) can never happen for buffers showing up in > EXPLAIN, right? > > I'm not saying you shouldn't fix the accounting... Maybe the hash AM can reach that in _hash_getinitbuf() while adding overflow pages to bucket chains? Admittedly case 2 is obscure and rare if not unreachable and probably no one would care too much about that in practice (whereas case 1 can be seen by simply inserting stuff into a new empty table). Other patches I'm working on for later proposal do it more often (think accessing known-empty pages in a kind of preallocated extent), and it occurred to me that it's clearly a bug on principle, hence this patch. -- Thomas Munro http://www.enterprisedb.com
On Mon, Apr 30, 2018 at 1:38 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Mon, Apr 30, 2018 at 3:13 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-04-30 14:59:31 +1200, Thomas Munro wrote:
>> In EXPLAIN (BUFFERS), there are two kinds of cache misses that show up
>> as "reads" when in fact they are not reads at all:
>>
>> 1. Relation extension, which in fact writes a zero-filled block.
>> 2. The RBM_ZERO_* modes, which provoke neither read nor write.
>
> Just for understanding: 2) can never happen for buffers showing up in
> EXPLAIN, right?
>
> I'm not saying you shouldn't fix the accounting...
Maybe the hash AM can reach that in _hash_getinitbuf() while adding
overflow pages to bucket chains? Admittedly case 2 is obscure and
rare if not unreachable and probably no one would care too much about
that in practice (whereas case 1 can be seen by simply inserting stuff
into a new empty table). Other patches I'm working on for later
proposal do it more often (think accessing known-empty pages in a kind
of preallocated extent), and it occurred to me that it's clearly a bug
on principle, hence this patch.
I checked the patch and I agree with the change 1). And regarding change 2)
whether it is zeroing the contents of the page or not, it does read?
because if it exists in the buffer pool, we are counting them as hits irrespective
of the mode? Am I missing something?
Regards,
Haribabu Kommi
Fujitsu Australia
On Thu, Jul 12, 2018 at 12:46 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote: >> > On 2018-04-30 14:59:31 +1200, Thomas Munro wrote: >> >> In EXPLAIN (BUFFERS), there are two kinds of cache misses that show up >> >> as "reads" when in fact they are not reads at all: >> >> >> >> 1. Relation extension, which in fact writes a zero-filled block. >> >> 2. The RBM_ZERO_* modes, which provoke neither read nor write. > > I checked the patch and I agree with the change 1). And regarding change 2) > whether it is zeroing the contents of the page or not, it does read? > because if it exists in the buffer pool, we are counting them as hits > irrespective > of the mode? Am I missing something? Further down in the function you can see that there is no read() system call for the RBM_ZERO_* modes: if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK) MemSet((char *) bufBlock, 0, BLCKSZ); else { ... smgrread(smgr, forkNum, blockNum, (char *) bufBlock); ... } I suppose someone might argue that even when it's not a hit and it's not a read, we might still want to count this buffer interaction in some other way. Perhaps there should be a separate counter? It may technically be a kind of cache miss, but it's nowhere near as expensive as a synchronous system call like read() so I didn't propose that. Some more on my motivation: In our zheap prototype, when the system is working well and we have enough space, we constantly allocate zeroed buffer pages at the insert point (= head) of an undo log and drop pages at the discard point (= tail) in the background; effectively a few pages just go round and round via the freelist and no read() or write() syscalls happen. That's something I'm very happy about and it's one of our claimed advantages over the traditional heap (which tends to read and dirty more pages), but EXPLAIN (BUFFERS) hides this virtuous behaviour when comparing with the traditional heap: it falsely and slanderously reports that zheap is reading undo pages when it is not. Of course I don't intent to litigate zheap design in this thread, I just I figured that since this accounting is wrong on principle and affects current PostgreSQL too (at least in theory) I would propose this little patch independently. It's subtle enough that I wouldn't bother to back-patch it though. -- Thomas Munro http://www.enterprisedb.com
On Thu, Jul 12, 2018 at 8:32 AM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Thu, Jul 12, 2018 at 12:46 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>> > On 2018-04-30 14:59:31 +1200, Thomas Munro wrote:
>> >> In EXPLAIN (BUFFERS), there are two kinds of cache misses that show up
>> >> as "reads" when in fact they are not reads at all:
>> >>
>> >> 1. Relation extension, which in fact writes a zero-filled block.
>> >> 2. The RBM_ZERO_* modes, which provoke neither read nor write.
>
> I checked the patch and I agree with the change 1). And regarding change 2)
> whether it is zeroing the contents of the page or not, it does read?
> because if it exists in the buffer pool, we are counting them as hits
> irrespective
> of the mode? Am I missing something?
Further down in the function you can see that there is no read()
system call for the RBM_ZERO_* modes:
if (mode == RBM_ZERO_AND_LOCK || mode ==
RBM_ZERO_AND_CLEANUP_LOCK)
MemSet((char *) bufBlock, 0, BLCKSZ);
else
{
...
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
...
}
Thanks for the details. I got your point. But we need to include
RBM_ZERO_ON_ERROR case read operations, excluding others
are fine.
I suppose someone might argue that even when it's not a hit and it's
not a read, we might still want to count this buffer interaction in
some other way. Perhaps there should be a separate counter? It may
technically be a kind of cache miss, but it's nowhere near as
expensive as a synchronous system call like read() so I didn't propose
that.
Yes, I agree that we may need a new counter that counts the buffers that
are just allocated (no read or no write). But currently, may be the counter
value is very less, so people are not interested.
Some more on my motivation: In our zheap prototype, when the system
is working well and we have enough space, we constantly allocate
zeroed buffer pages at the insert point (= head) of an undo log and
drop pages at the discard point (= tail) in the background;
effectively a few pages just go round and round via the freelist and
no read() or write() syscalls happen. That's something I'm very happy
about and it's one of our claimed advantages over the traditional heap
(which tends to read and dirty more pages), but EXPLAIN (BUFFERS)
hides this virtuous behaviour when comparing with the traditional
heap: it falsely and slanderously reports that zheap is reading undo
pages when it is not. Of course I don't intent to litigate zheap
design in this thread, I just I figured that since this accounting is
wrong on principle and affects current PostgreSQL too (at least in
theory) I would propose this little patch independently. It's subtle
enough that I wouldn't bother to back-patch it though.
OK. May be it is better to implement the buffer allocate counter along with
zheap to provide better buffer results?
Regards,
Haribabu Kommi
Fujitsu Australia
On 12 July 2018 at 12:19, Haribabu Kommi <kommi.haribabu@gmail.com> wrote: > Yes, I agree that we may need a new counter that counts the buffers that > are just allocated (no read or no write). But currently, may be the counter > value is very less, so people are not interested. The existing counters don't show up in EXPLAIN (ANALYZE, BUFFERS) when they're zero. The new counter would surely work the same way, so the users wouldn't be burdened by additional explain output it when they're not affected by it. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
At Thu, 12 Jul 2018 10:31:46 +1200, Thomas Munro <thomas.munro@enterprisedb.com> wrote in <CAEepm=2yGnw6dJ5wx8tuSRmcpDZ5uziiH_qQ0ptdha8mMLSVqw@mail.gmail.com> > I suppose someone might argue that even when it's not a hit and it's > not a read, we might still want to count this buffer interaction in > some other way. Perhaps there should be a separate counter? It may > technically be a kind of cache miss, but it's nowhere near as > expensive as a synchronous system call like read() so I didn't propose > that. At Thu, 12 Jul 2018 10:19:29 +1000, Haribabu Kommi <kommi.haribabu@gmail.com> wrote in <CAJrrPGduEpxwtu9VFxT21DNK=WRP=LUjK4GjPfm+4+PCjpcAxA@mail.gmail.com> > Thanks for the details. I got your point. But we need to include > RBM_ZERO_ON_ERROR case read operations, excluding others > are fine. FWIW I agree to Haribabu's comment that the case of RBM_ZERO_* other than ON_ERROR is a hit. It seems to show zheap's disk I /O reduction by itself in certain extent. At Thu, 12 Jul 2018 13:28:37 +1200, David Rowley <david.rowley@2ndquadrant.com> wrote in <CAKJS1f_kem=OzVpFADzC2=VpzcXuSQ+HCW=Hp67GAXrnQ-EVTw@mail.gmail.com> > On 12 July 2018 at 12:19, Haribabu Kommi <kommi.haribabu@gmail.com> wrote: > > Yes, I agree that we may need a new counter that counts the buffers that > > are just allocated (no read or no write). But currently, may be the counter > > value is very less, so people are not interested. > The existing counters don't show up in EXPLAIN (ANALYZE, BUFFERS) when > they're zero. The new counter would surely work the same way, so the > users wouldn't be burdened by additional explain output it when > they're not affected by it. I don't object strongly neither to the new counter but I'm not sure it is enough distinctive from hits, in the view that "a hit is where we found a buffer for the requested page". regards. -- Kyotaro Horiguchi NTT Open Source Software Center
Is this patch committable now? -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
At Sat, 17 Nov 2018 11:15:54 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20181117141554.4dkx2u4j6md3bqdh@alvherre.pgsql> > Is this patch committable now? I don't think so. We should make a decision on a point. I was a bit confused (sorry) but IIUIC Haribabu suggested that the RBM_ZERO_ON_ERROR case should be included to read (not just ignored). I'm on it. I think that RPM_ZERO_* other than ON_ERROR cases could be a kind of hit but I don't insist on it. So, I think we should decide on at least the ON_ERROR case before this becomes commttable. The another counter could be another issue. I don't object to add the counter but I'm not sure what is the suitable name. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On Mon, Nov 19, 2018 at 5:24 PM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Sat, 17 Nov 2018 11:15:54 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20181117141554.4dkx2u4j6md3bqdh@alvherre.pgsql> > > Is this patch committable now? > > I don't think so. We should make a decision on a point. > > I was a bit confused (sorry) but IIUIC Haribabu suggested that > the RBM_ZERO_ON_ERROR case should be included to read (not just > ignored). I'm on it. I think that RPM_ZERO_* other than ON_ERROR > cases could be a kind of hit but I don't insist on it. > > So, I think we should decide on at least the ON_ERROR case before > this becomes commttable. Agreed, RBM_ZERO_ON_ERROR needs to be counted as a read. Here's a new version like that. > The another counter could be another issue. I don't object to add > the counter but I'm not sure what is the suitable name. I think that might be interesting information in the future, but let's consider that for a later patch. -- Thomas Munro http://www.enterprisedb.com
Attachment
On Fri, Nov 23, 2018 at 6:39 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Mon, Nov 19, 2018 at 5:24 PM Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Sat, 17 Nov 2018 11:15:54 -0300, Alvaro Herrera <alvherre@2ndquadrant.com> wrote in <20181117141554.4dkx2u4j6md3bqdh@alvherre.pgsql>
> > Is this patch committable now?
>
> I don't think so. We should make a decision on a point.
>
> I was a bit confused (sorry) but IIUIC Haribabu suggested that
> the RBM_ZERO_ON_ERROR case should be included to read (not just
> ignored). I'm on it. I think that RPM_ZERO_* other than ON_ERROR
> cases could be a kind of hit but I don't insist on it.
>
> So, I think we should decide on at least the ON_ERROR case before
> this becomes commttable.
Agreed, RBM_ZERO_ON_ERROR needs to be counted as a read. Here's a new
version like that.
> The another counter could be another issue. I don't object to add
> the counter but I'm not sure what is the suitable name.
I think that might be interesting information in the future, but let's
consider that for a later patch.
Thanks for the updated patch. It looks good.
I marked it in the commitfest as ready for committer.
Regards,
Haribabu Kommi
Fujitsu Australia
On Tue, Nov 27, 2018 at 2:53 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote: > Thanks for the updated patch. It looks good. > I marked it in the commitfest as ready for committer. Pushed. Thanks for the reviews! -- Thomas Munro http://www.enterprisedb.com