Thread: Process local hint bit cache
In a previous thread (http://postgresql.1045698.n5.nabble.com/Set-hint-bits-upon-eviction-from-BufMgr-td4264323.html) I was playing with the idea of granting the bgwriter the ability to do last-chance hint bit setting before evicting a page. I still think it might be a good idea, especially in scenarios where you get to set the PD_ALL_VISIBLE bit when writing out the page, a huge win in some cases. That said, it bugged me that it didn't really fix the general case of large data insertions and the subsequent i/o problems from writing out the bits, which is my primary objective in the short term. So I went back to the drawing board, reviewed the archives, and came up with a new proposal.

I'd like to see a process local clog page cache of around 1-4 pages (8-32kb typically) that would replace the current TransactionLogFetch last-xid cache. It should be small, because I doubt more would really help much (see below), and we want to keep this data in the tight cpu caches since it's going to be constantly scanned.

The cache itself will be the clog pages plus a small header per page which will contain the minimum information necessary to match an xid against a page to determine a hit, and a count of hits. Additionally we keep a separate small array (say 100 entries) of type xid (int) that we insert into on a cache miss.

So, the cache lookup algorithm basically is:

  scan clog cache headers and check for hit
  if found (xid in range covered by a cached clog page)
      header.hit_count++;
  else
      miss_array[miss_count++] = xid;

A cache hit is defined as getting useful information from the page, that is, a transaction being committed or invalid. When the miss array fills, we sort it and determine the most commonly hit clog page that is not in the cache, and use that information to replace pages in the cache if necessary, then reset the counts. Maybe we can add a minimum threshold of hits, say 5-10% of the miss_array size, for a page to be deemed interesting enough to be loaded into the cache.

Interaction with setting hint bits:

*) If a clog lookup faults through the cache, we basically keep the current behavior. That is, the hint bits are set, the page is marked BM_DIRTY, and the hint bits get written back.

*) If we get a clog cache hit -- that is, the hint bits are not set but we pulled the transaction status from the cache -- the hint bits are recorded on the page *but the page is not written back*, at least not on a hint bit basis alone. This behavior branch is more or less the BM_UNTIDY suggested by Haas (see archives), except it's only taken in 'cache hit' scenarios. We are not writing pages back because the cache is suggesting there is little/no benefit to writing them back.

Thus, if a single backend is scanning a lot of pages whose transactions touch a very small number of clog pages, hint bits are generally not written back because they are not needed and in fact not helping. However, if the xids are spread around a large number of clog pages, we get the current behavior more or less (plus the overhead of cache maintenance).

With the current code base, hint bits are very beneficial when the xid entropy is high and the number of repeated scans is high, and not so good when the xid entropy is low and the number of repeated scans is low. The process local cache attempts to redress this without disadvantaging the already good cases. Furthermore, if it can be proven that the cache overhead is epsilon, it's pretty unlikely to impact anyone negatively; at least, that's my hope.
Traffic to clog will be reduced (although not by much, since I'd wager the current 'last xid' cache works pretty well), but i/o should be reduced, in some cases quite significantly, for a tiny cpu cost (although that remains to be proven).

merlin
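A minimal, self-contained sketch of the lookup and miss-rollup flow described above (the constants, type names, the 2-bits-per-xid layout, and the stand-in rollup are illustrative assumptions only; Merlin's actual cache code appears at the end of the thread):

#include <stdbool.h>

#define CLOG_PAGE_SIZE   8192                  /* assumed BLCKSZ */
#define XIDS_PER_PAGE    (CLOG_PAGE_SIZE * 4)  /* 2 status bits per xid */
#define CACHE_PAGES      4                     /* "1-4 pages" from the proposal */
#define MISS_ARRAY_SIZE  100

typedef unsigned int xid_t;

typedef struct
{
    xid_t first_xid;                 /* first xid covered by the cached page */
    int   hit_count;
    bool  valid;
    char  data[CLOG_PAGE_SIZE];
} clog_cache_page;

static clog_cache_page cache[CACHE_PAGES];
static xid_t miss_array[MISS_ARRAY_SIZE];
static int   miss_count;

/* Look xid up in the backend-local cache; on a miss remember which page we
 * wanted so the rollup step can decide whether it is worth caching. */
static bool
clog_cache_lookup(xid_t xid, int *status)
{
    for (int i = 0; i < CACHE_PAGES; i++)
    {
        clog_cache_page *p = &cache[i];

        if (p->valid && xid >= p->first_xid && xid < p->first_xid + XIDS_PER_PAGE)
        {
            int off = (int) (xid - p->first_xid);

            *status = (p->data[off / 4] >> ((off % 4) * 2)) & 0x03;
            if (*status != 0)            /* committed or aborted: useful info */
            {
                p->hit_count++;
                return true;             /* hit: caller must not dirty the heap page */
            }
            return false;                /* in progress: no useful info here */
        }
    }

    /* page not cached: remember which xid we wanted, roll up when full */
    miss_array[miss_count++] = xid;
    if (miss_count == MISS_ARRAY_SIZE)
    {
        /* sort the misses, find the most-requested uncached clog page, and
         * replace the coldest cached page if it clears the hit threshold */
        miss_count = 0;
    }
    return false;                        /* fall through to the real clog */
}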
On Tue, Mar 29, 2011 at 4:34 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> In a previous thread
> (http://postgresql.1045698.n5.nabble.com/Set-hint-bits-upon-eviction-from-BufMgr-td4264323.html)
> I was playing with the idea of granting the bgwriter the ability to
> do last-chance hint bit setting before evicting a page.
> [... rest of the original proposal quoted in full, snipped ...]
A couple of details I missed:

*) clog lookups that return no cacheable information will not have their xid inserted into the 'missed' array -- this will prevent a clog page returning 'in progress' type states for transactions from pushing out pages that are returning useful information. In other words, an in-progress transaction is neither a hit nor a miss from the point of view of the cache -- it's nothing.

*) If we fault to the clog and get useful information (commit/invalid), and the clog page is already cached -- either the particular bit is set or the entire cached page is refreshed (not sure which is the better way to go yet).

*) The clog cache might be useful in other places, like during the page eviction hint bit setting scenarios I mentioned earlier. In non-bgwriter scenarios it's almost certainly a win to at least check the cache and set hint bits for BM_HEAP pages, since you are leveraging work already paid for during scans. In the bgwriter case, you would have to build the cache by checking clog pages. I'm somewhat skeptical that this would actually help the bgwriter though, since it involves tradeoffs that are hard to estimate.

*) Maybe the shared buffer cache currently being maintained over the clog can be scrapped. I'm going to leave it alone for now, but I'm quite skeptical it provides much benefit even without the local process cache. clog pages have a very nice property that you don't have to worry about what else is going on in other processes, and thus no complicated locking or invalidation issues when considering cache structure. IMNSHO -- this makes a local cache a much better fit even if you have to keep it smaller for memory usage reasons.

merlin
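In cache terms, the first bullet above makes the lookup a three-way outcome rather than a boolean; a trivial sketch, with invented names:

typedef enum
{
    CLOG_CACHE_HIT,     /* cached page gave commit/invalid: use it, and do
                         * not dirty the heap page */
    CLOG_CACHE_MISS,    /* useful info came from the real clog but the page
                         * is not cached: record the xid in the miss array */
    CLOG_CACHE_IGNORE   /* 'in progress' style answer: neither hit nor miss,
                         * so it can never push a useful page out */
} ClogCacheResult;

static ClogCacheResult
clog_cache_classify(bool found_in_cache, bool status_is_useful)
{
    if (!status_is_useful)
        return CLOG_CACHE_IGNORE;          /* don't count it at all */
    return found_in_cache ? CLOG_CACHE_HIT : CLOG_CACHE_MISS;
}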
On 30.03.2011 17:05, Merlin Moncure wrote:
> *) Maybe the shared buffer cache currently being maintained over the
> clog can be scrapped. I'm going to leave it alone for now, but I'm
> quite skeptical it provides much benefit even without the local process
> cache. clog pages have a very nice property that you don't have to
> worry about what else is going on in other processes, and thus no
> complicated locking or invalidation issues when considering cache
> structure. IMNSHO -- this makes a local cache a much better fit even
> if you have to keep it smaller for memory usage reasons.

A related idea I've sometimes pondered is to mmap() the clog in every process.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
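A bare-bones, user-space illustration of that idea: map one clog segment file read-only and decode a transaction's two status bits. The file name, segment layout, and bit order follow the usual clog conventions but are assumptions here, and real in-backend code would keep the mapping around rather than remap per lookup:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define CLOG_XACTS_PER_BYTE 4        /* two status bits per transaction */

/* Read one xid's status straight out of a clog segment file (e.g.
 * "pg_clog/0000") via a read-only mapping.  Returns the raw 2-bit status
 * (0 = in progress, 1 = committed, 2 = aborted), or -1 on error. */
static int
clog_status_via_mmap(const char *segpath, unsigned int xid_in_segment)
{
    int            fd = open(segpath, O_RDONLY);
    struct stat    st;
    unsigned char *base;
    unsigned int   byteno = xid_in_segment / CLOG_XACTS_PER_BYTE;
    int            shift  = (xid_in_segment % CLOG_XACTS_PER_BYTE) * 2;
    int            status;

    if (fd < 0 || fstat(fd, &st) < 0 || (off_t) byteno >= st.st_size)
    {
        if (fd >= 0)
            close(fd);
        return -1;
    }

    base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping survives the close */
    if (base == MAP_FAILED)
        return -1;

    status = (base[byteno] >> shift) & 0x03;

    munmap(base, st.st_size);
    return status;
}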
On Tue, Mar 29, 2011 at 10:34 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> So I went back to the drawing board, reviewed the archives, and came
> up with a new proposal. I'd like to see a process local clog page
> cache of around 1-4 pages (8-32kb typically) that would replace the
> current TransactionLogFetch last-xid cache. It should be small,
> because I doubt more would really help much (see below), and we want to
> keep this data in the tight cpu caches since it's going to be
> constantly scanned.

How is this different from the existing clog SLRU? It seems like the fundamental difference is that you plan to defer updating the hint bits rather than update them every time the row is accessed. That doesn't seem like a net win, it'll just defer the i/o, not eliminate it.

I suppose the existing clog SLRU is page-based whereas this could cache individual xids in a btree so that it could have a higher hit rate. Or we could just increase the clog SLRU size if there's any evidence there are often cache misses on it. I suggested having the SLRU share memory pages with the buffer cache so it would automatically size itself rather than having to be statically sized.

There is something to be gained by trying to update *all* the hint bits on a page whenever any row is updated. And there might be something to be gained by not dirtying the page so we only update the hint bits on disk if the page is dirtied for some other reason. But one way or another the hint bits have to get set sometime. The sooner that happens the less clog i/o has to happen in the meantime.

-- 
greg
On Wed, Mar 30, 2011 at 10:40 AM, Greg Stark <gsstark@mit.edu> wrote:
> But one way or another the hint bits have to get set sometime. The
> sooner that happens the less clog i/o has to happen in the meantime.

I talked about this with Merlin a bit yesterday. I think that his thought is that most transactions will access a small enough number of distinct CLOG pages, and the cache accesses might be fast enough, that we really wouldn't need to get the hint bits set, or at least that vacuum/freeze time would be soon enough. I'm not sure if that's actually true, though - I think the overhead of the cache might be higher than he's imagining. However, there's a sure-fire way to find out... code it up and see how it plays.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Mar 30, 2011 at 9:40 AM, Greg Stark <gsstark@mit.edu> wrote:
> How is this different from the existing clog SLRU? It seems like the
> fundamental difference is that you plan to defer updating the hint
> bits rather than update them every time the row is accessed. That
> doesn't seem like a net win, it'll just defer the i/o, not eliminate
> it.

It is very different -- the slru layer is in shared memory and requires locks to access. The entire point is trying to avoid accessing this structure in tight code paths. I'm actually very skeptical the slru layer provides any benefit at all. I bet it would be cheaper to just mmap the pages in directly (as Heikki is also speculating).

Regarding deferring i/o, you have a good chance of eliminating the i/o if the tuples are deleted or touched again before a particular backend gets to the page without already having seen the transaction (a very high chance under certain workloads). Worst case scenario, you get the old behavior plus the overhead of maintaining the cache (which is why I'm keen on bucketing at the page level -- so it's ultra cheap to scan and memory efficient).

merlin
Excerpts from Merlin Moncure's message of mié mar 30 12:14:20 -0300 2011:

> It is very different -- the slru layer is in shared memory and
> requires locks to access. The entire point is trying to avoid
> accessing this structure in tight code paths. I'm actually very
> skeptical the slru layer provides any benefit at all. I bet it would
> be cheaper to just mmap the pages in directly (as Heikki is also
> speculating).

Maybe it would be useful to distinguish the last SLRU page(s) (the one where clog writing actually takes place) from the older ones (which only ever see reading). You definitely need locks to be able to access the active pages, but all the rest could be held as mmapped and accessed without locks because they never change (except to be truncated away). You know that any page behind RecentXmin is not going to be written anymore, so why go through all the locking hoops?

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
On Wed, Mar 30, 2011 at 10:43 AM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
> Maybe it would be useful to distinguish the last SLRU page(s) (the one
> where clog writing actually takes place) from the older ones (which only
> ever see reading). You definitely need locks to be able to access the
> active pages, but all the rest could be held as mmapped and accessed
> without locks because they never change (except to be truncated away).
> You know that any page behind RecentXmin is not going to be written
> anymore, so why go through all the locking hoops?

hm, that's a good thought -- cheap and easy -- and good to implement irrespective of other changes. Any improvement helps here, although from my perspective the largest hobgoblins are shared memory access generally and the extra i/o.

merlin
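A sketch of the page-level test Alvaro's idea implies, assuming the clog.c constant CLOG_XACTS_PER_PAGE and the usual PostgreSQL xid types; the wraparound-safe comparison is glossed over:

/* Illustrative only: a clog page whose xids all precede RecentXmin will see
 * no further writes, so it could be read without CLogControlLock (or through
 * a read-only mapping).  Real code would use TransactionIdPrecedes() instead
 * of a plain "<" to stay safe across xid wraparound. */
static bool
clog_page_is_read_only(TransactionId xid, TransactionId recent_xmin)
{
    uint32 xid_page  = xid / CLOG_XACTS_PER_PAGE;
    uint32 xmin_page = recent_xmin / CLOG_XACTS_PER_PAGE;

    return xid_page < xmin_page;     /* strictly older page: lock-free path */
}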
On 30.03.2011 18:02, Robert Haas wrote:
> On Wed, Mar 30, 2011 at 10:40 AM, Greg Stark<gsstark@mit.edu> wrote:
>> But one way or another the hint bits have to get set sometime. The
>> sooner that happens the less clog i/o has to happen in the meantime.
>
> I talked about this with Merlin a bit yesterday. I think that his
> thought is that most transactions will access a small enough number of
> distinct CLOG pages, and the cache accesses might be fast enough, that
> we really wouldn't need to get the hint bits set, or at least that
> vacuum/freeze time would be soon enough. I'm not sure if that's
> actually true, though - I think the overhead of the cache might be
> higher than he's imagining. However, there's a sure-fire way to find
> out... code it up and see how it plays.

I did a little experiment: I hacked SetHintBits() to do nothing, so that hint bits are never set. I then created a table with 100000 rows in it:

postgres=# \d foo
     Table "public.foo"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |

postgres=# INSERT INTO foo SELECT generate_series(1, 100000);
INSERT 0 100000

And ran queries on the table:

postgres=# do $$
declare
  i int4;
begin
  loop
    perform COUNT(*) FROM foo;
  end loop;
end;
$$;

This test case is designed to exercise the visibility tests as much as possible. However, all the tuples are loaded in one transaction, so the one-element cache in TransactionLogFetch is 100% effective.

I ran oprofile on that. It shows that about 15% of the time is spent in HeapTupleSatisfiesMVCC and its subroutines. 6.6% is spent in HeapTupleSatisfiesMVCC itself. Here's the breakdown of that:

$ opreport -c --global-percent

CPU: Intel Architectural Perfmon, speed 2266 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        app name      symbol name
...
-------------------------------------------------------------------------------
  2143    0.4419  postgres      postgres      heapgettup_pagemode
 73277   15.1091  postgres      postgres      heapgetpage
31858     6.5688  postgres      postgres      HeapTupleSatisfiesMVCC
 31858    6.5688  postgres      postgres      HeapTupleSatisfiesMVCC [self]
 12809    2.6411  postgres      postgres      TransactionIdIsInProgress
 12091    2.4931  postgres      postgres      XidInMVCCSnapshot
  7150    1.4743  postgres      postgres      TransactionIdIsCurrentTransactionId
  7056    1.4549  postgres      postgres      TransactionIdDidCommit
  1839    0.3792  postgres      postgres      TransactionIdPrecedes
  1467    0.3025  postgres      postgres      SetHintBits
  1155    0.2382  postgres      postgres      TransactionLogFetch
-------------------------------------------------------------------------------
...

I then ran the same test again with an unpatched version, to set the hint bits. After the hint bits were set, I ran oprofile again:

-------------------------------------------------------------------------------
   275    0.4986  postgres      heapgettup_pagemode
  4459    8.0851  postgres      heapgetpage
 3005     5.4487  postgres      HeapTupleSatisfiesMVCC
  3005    5.4487  postgres      HeapTupleSatisfiesMVCC [self]
  1620    2.9374  postgres      XidInMVCCSnapshot
   110    0.1995  postgres      TransactionIdPrecedes
-------------------------------------------------------------------------------

So with hint bits set, HeapTupleSatisfiesMVCC accounts for 8% of the total CPU time, and without hint bits, 15%.

Even if clog access was free, hint bits still give a significant speedup thanks to skipping all the other overhead like TransactionIdIsInProgress() and TransactionIdIsCurrentTransactionId().
Speeding up clog access is important; when the one-element cache isn't saving you, the clog access becomes a dominant factor. But you have to address that other overhead too before you can get rid of hint bits.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
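For reference, the "hacked SetHintBits() to do nothing" step amounts to something like the following against 9.1-era tqual.c. This is a sketch only; the signature is reproduced from memory, and the real function also checks the commit LSN before marking anything:

static inline void
SetHintBits(HeapTupleHeader tuple, Buffer buffer,
            uint16 infomask, TransactionId xid)
{
    return;     /* experiment: never set hint bits, never dirty the page */
}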
On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> Even if clog access was free, hint bits still give a significant speedup
> thanks to skipping all the other overhead like TransactionIdIsInProgress()
> and TransactionIdIsCurrentTransactionId(). Speeding up clog access is
> important; when the one-element cache isn't saving you, the clog access
> becomes a dominant factor. But you have to address that other overhead too
> before you can get rid of hint bits.

Yes, note I am not getting rid of the hint bits -- either you get them directly from the tuple or from the cache. The clog access layers are:

1. hint bit
2. backend local clog cache (my proposal)
==== shared memory layer ====
3. slru
4. os file cache
5. clog file

My idea's working hinges on rigging a cache (#2) that is not materially slower than the raw hint bit check. If you think out all the cases around what I'm suggesting, there is no way you hit #3 where you wouldn't otherwise with the old behavior, since when you fault through #2 you mark the page dirty, but there are cases where full faults to clog are avoided. I'm not optimizing clog accesses, I'm avoiding them.

Basically, I'm extending the last-xid cache check in TransactionLogFetch so it holds >1 xid, and not setting BM_DIRTY *if and only if* you got the xid status from that cache.

merlin
On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote:
> [... test case and oprofile results quoted in full, snipped ...]
> Even if clog access was free, hint bits still give a significant speedup
> thanks to skipping all the other overhead like TransactionIdIsInProgress()
> and TransactionIdIsCurrentTransactionId(). Speeding up clog access is
> important; when the one-element cache isn't saving you, the clog access
> becomes a dominant factor. But you have to address that other overhead too
> before you can get rid of hint bits.

Here is a patch demonstrating the caching action, but without the cache table, which isn't done yet -- it only works using the very last transaction id fetched. I used macros so I could keep the changes quite localized.

The payoff is obvious:

stock postgres:
postgres=# create table v as select generate_series(1,50000000) v;
SELECT 50000000
Time: 70010.160 ms

select count(*) from v;
run 1: 64.5 seconds  <-- !
run 2: 21.3 seconds
run 3: 19.3 seconds

hint bit patch:
run 1: 19.2 seconds  <-- the 'money shot'
run 2: 20.7 seconds
run 3: 19.3 seconds

Of course, until I get the cache table mechanism finished, you only see real benefit if you have a significant # of pages all from the same transaction. otoh, checking the last transaction has a cost of 0, so your worst case performance is the old behavior. I'm pretty sure I can make a cache that is cheap to check and maintain because I'm completely free from locks, side effects, etc.

btw I haven't forgotten your idea to move TransactionIdIsInProgress down. I think this is a good idea, and will experiment with it pre and post cache.

merlin
Attachment
On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
> btw I haven't forgotten your idea to move TransactionIdIsInProgress
> down. I think this is a good idea, and will experiment with it pre and
> post cache.

aside:
Moving TransactionIdIsInProgress below TransactionIdDidCommit can help in one sense: TransactionIdDidCommit grabs the XidStatus but discards the knowledge if the transaction is known aborted. If it is in fact aborted you can immediately set the hint bits and punt. This should save an awful lot of calls to TransactionIdIsInProgress when scanning unhinted dead tuples.

otoh, checking commit status first can burn you if you are repeatedly scanning tuples that are in transaction, especially your own. Probably this could be mitigated by splitting TransactionIdIsInProgress so that you can do just the local checks and not shared memory, so that you could:

1. is this transaction mine? (if so, we are done)
2. get the xid status if commit/abort is done
3. do the ProcArray portions of TransactionIdIsInProgress

merlin
On Wed, Mar 30, 2011 at 6:30 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Moving TransactionIdIsInProgress below TransactionIdDidCommit can help
> in one sense: TransactionIdDidCommit grabs the XidStatus but discards
> the knowledge if the transaction is known aborted. If it is in fact
> aborted you can immediately set the hint bits and punt. This should
> save an awful lot of calls to TransactionIdIsInProgress when scanning
> unhinted dead tuples.

Yeah, it might make sense to have a function that returns commit/abort/unsure, where unsure can only happen if the transaction ID follows RecentXmin. You might also want to rearrange things so that that function starts by checking the passed-in XID against cachedFetchXid so we don't lose the benefit of that one-element cache.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
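A sketch of what such a helper might look like, sitting next to the existing one-element cache (cachedFetchXid/cachedFetchXidStatus) in transam.c. The enum and the function name are invented; the other symbols are the existing 9.1-era ones, and this is only an illustration of the commit/abort/unsure idea, not a drop-in patch:

typedef enum
{
    XID_COMMITTED,
    XID_ABORTED,
    XID_UNSURE          /* only possible when xid follows RecentXmin */
} XidCommitStatus;

/* hypothetical wrapper over the existing transam.c machinery */
static XidCommitStatus
TransactionIdGetCommitStatus(TransactionId xid)
{
    /* keep the benefit of the one-element cache by checking it first */
    if (TransactionIdEquals(xid, cachedFetchXid))
        return (cachedFetchXidStatus == TRANSACTION_STATUS_COMMITTED)
            ? XID_COMMITTED : XID_ABORTED;

    /* anything at or after RecentXmin may still be running */
    if (TransactionIdFollowsOrEquals(xid, RecentXmin))
        return XID_UNSURE;      /* caller falls back to the full
                                 * TransactionIdIsInProgress() path */

    /* older than RecentXmin: definitely finished, so clog is authoritative */
    return TransactionIdDidCommit(xid) ? XID_COMMITTED : XID_ABORTED;
}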
On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Mar 30, 2011 at 11:23 AM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> [... test case and oprofile results quoted in full, snipped ...]
>
> Here is a patch demonstrating the caching action, but without the
> cache table, which isn't done yet -- it only works using the very last
> transaction id fetched. I used macros so I could keep the changes
> quite localized.

While playing with the data structure to implement an xid cache inside TransactionLogFetch, I learned a nasty lesson. HeapTupleSatisfiesMVCC is super sensitive to any changes, and even jumping through a couple of calls to TransactionLogFetch was measurable in a couple of performance test cases I came up with. I hauntingly recalled Tom's warnings to beware unexpected performance downsides.

In short, regardless of implementation, I unfortunately don't think any caching at or under the clog abstraction is going to work without impacting at least some cases.

OTOH, maybe some micro-optimization of HeapTupleSatisfiesMVCC could provide the same benefit I'm going after here. If you save off the last xid that came back committed in a static variable, you can check that at the same level as if it were a hint bit. From my testing, this is well and truly 'free' in the sense that it can defer/eliminate clog checks and hint bit i/o.

The patch attached gives all the advantages I was after in the previous patch without the downsides I found (extra calls to TransactionIdIsInProgress and inefficient calls out to TransactionLogFetch). It remains to be determined if it's possible to make a structure that is fast enough to expand the above to more than 1 xid. For starters, all the calls should be inline. Considering we are only after the commit bits, a super tight hashing/bucketing algorithm might fit the bill.
merlin

diff -x '*.sql' -x '*.o' -x '*.txt' -rupN postgresql-9.1alpha4/src/backend/utils/time/tqual.c postgresql-9.1alpha4_hbcache/src/backend/utils/time/tqual.c
--- postgresql-9.1alpha4/src/backend/utils/time/tqual.c	2011-03-09 08:19:24.000000000 -0600
+++ postgresql-9.1alpha4_hbcache/src/backend/utils/time/tqual.c	2011-03-31 17:03:43.135195002 -0500
@@ -912,7 +912,10 @@ bool
 HeapTupleSatisfiesMVCC(HeapTupleHeader tuple, Snapshot snapshot,
 					   Buffer buffer)
 {
-	if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))
+	static TransactionId LastCommitXid = InvalidTransactionId;
+
+	if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED)
+		&& LastCommitXid != HeapTupleHeaderGetXmin(tuple))
 	{
 		if (tuple->t_infomask & HEAP_XMIN_INVALID)
 			return false;
@@ -985,8 +988,11 @@ HeapTupleSatisfiesMVCC(HeapTupleHeader t
 		else if (TransactionIdIsInProgress(HeapTupleHeaderGetXmin(tuple)))
 			return false;
 		else if (TransactionIdDidCommit(HeapTupleHeaderGetXmin(tuple)))
+		{
+			LastCommitXid = HeapTupleHeaderGetXmin(tuple);
 			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
 						HeapTupleHeaderGetXmin(tuple));
+		}
 		else
 		{
 			/* it must have aborted or crashed */
On Thu, Mar 31, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> [... earlier discussion and oprofile results quoted in full, snipped ...]
> OTOH, maybe some micro-optimization of HeapTupleSatisfiesMVCC could
> provide the same benefit I'm going after here. If you save off the
> last xid that came back committed in a static variable, you can
> check that at the same level as if it were a hint bit. From my
> testing, this is well and truly 'free' in the sense that it can
> defer/eliminate clog checks and hint bit i/o.
>
> The patch attached gives all the advantages I was after in the

hm, this isn't quite right -- if you get a 'last xid cache hit', the hint bits on the tuple should be set if they are not already (but the page should not be dirtied).

working on expanding the cache to # xid > 1.

merlin
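A sketch of that corrected behavior against the 9.1-era sources; the helper name and the flag are invented, and the real SetHintBits() always calls SetBufferCommitInfoNeedsSave():

/* On a backend-local cache hit we still stamp the hint bit -- it helps every
 * later visibility check of this tuple while the page sits in shared buffers
 * -- but we skip marking the buffer dirty, so the hint alone never causes a
 * write. */
static inline void
SetHintBitsMaybeDirty(HeapTupleHeader tuple, Buffer buffer,
                      uint16 infomask, bool from_local_cache)
{
    tuple->t_infomask |= infomask;

    if (!from_local_cache)
        SetBufferCommitInfoNeedsSave(buffer);   /* old behavior: the page gets
                                                 * written back eventually */
}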
On Thu, Mar 31, 2011 at 5:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Mar 31, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> [... earlier discussion, oprofile results, and patch description quoted in full, snipped ...]
> hm, this isn't quite right -- if you get a 'last xid cache hit', the
> hint bits on the tuple should be set if they are not already (but the
> page should not be dirtied).
>
> working on expanding the cache to # xid > 1.

patch attached. this is essentially my original idea except it's injected directly into tqual.c as a kind of expansion of the 'last seen' concept. Because the cache loops are inlined and very tight (4 int compares), it's actually cheaper than jumping out of line into clog to test this_int = that_int;

Still needs some more code docs and refactoring/cleanup (I'm not even gonna pretend this is up to pg coding standards), but it passes regression and gets the performance I'm looking for w/o the downsides I was seeing with the other implementation. Maybe some extra cycles could be eked out by doing aligned reads at the machine bit size (32 or 64). It's pretty fast as is though, at least on x86.

Feel free to take a look.

merlin
Attachment
Merlin Moncure <mmoncure@gmail.com> writes:
> On Wed, Mar 30, 2011 at 2:35 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> btw I haven't forgotten your idea to move TransactionIdIsInProgress
>> down. I think this is a good idea, and will experiment with it pre and
>> post cache.

The reason it's done in that order is to avoid race conditions. If you change the order you will get wrong behavior if the other transaction ends between the TransactionIdDidCommit and the TransactionIdIsInProgress tests. I suppose you could recheck TransactionIdDidCommit a second time, but that hardly seems likely to result in performance gains.

> aside:
> Moving TransactionIdIsInProgress below TransactionIdDidCommit can help
> in one sense: TransactionIdDidCommit grabs the XidStatus but discards
> the knowledge if the transaction is known aborted.

Doesn't the single-entry cache help with that?

			regards, tom lane
Merlin Moncure <mmoncure@gmail.com> writes:
> On Thu, Mar 31, 2011 at 5:38 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> working on expanding the cache to # xid > 1.

> patch attached. this is essentially my original idea except it's
> injected directly into tqual.c as a kind of expansion of the 'last
> seen' concept. Because the cache loops are inlined and very tight (4
> int compares), it's actually cheaper than jumping out of line into
> clog to test this_int = that_int;

This seems like a mess. Why is the criterion for caching commit states different from whether the hint bit can be set? And why are you cluttering tqual.c with it? Based on the initial description I'd expected to find this enlarging the single-entry cache in transam.c to have multiple entries.

			regards, tom lane
On Sat, Apr 2, 2011 at 8:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> patch attached. this is essentially my original idea except it's
>> injected directly into tqual.c as a kind of expansion of the 'last
>> seen' concept. Because the cache loops are inlined and very tight (4
>> int compares), it's actually cheaper than jumping out of line into
>> clog to test this_int = that_int;
>
> This seems like a mess. Why is the criterion for caching commit states
> different from whether the hint bit can be set? And why are you

The hint bits are always set. The page is flagged dirty if and only if you have to dip below the process local cache to get the status. The rationale for this is simple: it caps your downside, because unlike hint bit removal or BM_UNTIDY, you only have to dip into clog and pay the i/o once per page -- so your downside is limited to pre-cache behavior (minus the overhead of maintaining the cache).

> cluttering tqual.c with it? Based on the initial description I'd
> expected to find this enlarging the single-entry cache in transam.c
> to have multiple entries.

This was my original idea --- abstracting it all under transam. For simplicity's sake I was playing with a dynahash of the transaction commit status, hit count, and log position keyed on the xid, and a small sweeper that came around every N misses and purged out the entries that did not have enough hit counts (below a defined minimum). The problem with that approach, at least according to my initial testing, was that putting two non-inline functions after the hint bit test in HeapTupleSatisfiesMVCC ate up a lot of the savings I was looking for. I was testing on a VM, which maybe penalizes out-of-line code execution more than real hardware, but I wasn't happy with it and tried working the hint bit check angle. If you look upthread you can see my skepticism that a non-inline cache lookup is going to work.

You can trivially test this yourself if you're curious; my test was to just run:

drop table t;
create table t as select v from generate_series(1,500000) v;

and repeatedly do

select count(*) from t;

If you disable the hint bit action, you should see a not-insignificant difference in time even though the current last-xid cache in transam.c is there.

hm. If the exposure to transam is not to your liking (I'm not happy with it either), I could make an inline stub for TransactionIdDidCommit which checks the cache and, failing anything useful there, moves to the out-of-line logic.

merlin
On Sun, Apr 3, 2011 at 6:40 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sat, Apr 2, 2011 at 8:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> This seems like a mess. Why is the criterion for caching commit states
>> different from whether the hint bit can be set? And why are you
>
> The hint bits are always set. The page is flagged dirty if and only
> if you have to dip below the process local cache to get the status. The
> rationale for this is simple: it caps your downside, because unlike
> hint bit removal or BM_UNTIDY, you only have to dip into clog and pay
> the i/o once per page -- so your downside is limited to pre-cache
> behavior (minus the overhead of maintaining the cache).

I might have missed part of the thrust of your question. In the patch, the commit status is only stored if the LSN for the xid is known safe (instead of checking it just before setting the bit as SetHintBits does). This is indeed different, and it was done so as to avoid having to a. repeatedly check XLogNeedsFlush or b. cache the xid LSN as transam.c does. Perhaps this is weak sauce, so I think the first thing to do is to have another go at doing it at the transam level. Stay tuned...

merlin
On Mon, Apr 4, 2011 at 9:25 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> I might have missed part of the thrust of your question. In the
> patch, the commit status is only stored if the LSN for the xid is
> known safe (instead of checking it just before setting the bit as
> SetHintBits does). This is indeed different, and it was done so as to
> avoid having to a. repeatedly check XLogNeedsFlush or b. cache the xid
> LSN as transam.c does. Perhaps this is weak sauce, so I think the
> first thing to do is to have another go at doing it at the transam
> level. Stay tuned...

Ok, after having done some more testing, I don't think this is gonna work without at least some changes to tqual.c. It's exactly the stuff that was bugging you, or some variant thereof, that needs to happen to avoid penalizing some cases. Letting transam.c handle 100% of the cache management has the following issues:

*) There is no way to signal tqual.c not to dirty the page if we got a cache hit -- the i/o mitigation strategy rests on being able to do this.

*) The transam.c cache implementation would also have to cache the transaction LSN position, which is expensive memory-wise for less benefit.

*) Lots of non-inline calls. Jumping out of HeapTupleSatisfiesMVCC into a non-inline function is expensive, at least on my test machine. Since there are more cases now where you fall through the hint bit check, you are significantly elevating the number of extra out-of-line calls you make, which eats up all the savings you get -- sometimes more. In particular:

*) Both TransactionIdIsCurrentTransactionId and TransactionIdIsInProgress are called *before* TransactionIdDidCommit, so even if you are going to get a cache 'hit', you still have to pay for jumping out to these functions and running them. TransactionIdIsInProgress could possibly be made inline if you put it under the transam.c cache umbrella, but it still has material cost in that case.

*) Setting the commit cache if and only if the xid's LSN is known safe is an important innovation and must be preserved -- if you are getting lots of 'commit cache hit' tuples, it saves a lot of time not having to constantly recheck the LSN flush status of the same transactions over and over.

Here are the times I'm seeing on my workstation (all timings are somewhat noisy, accurate to within 10ms or so).
test:
create table v as select generate_series(1,500000) v;
select count(*) from v;

stock:
  run 1:  3000 ms (setting hint bits)
  run 2+:  665 ms (hint bit optimized)

transam.c cache, non-inline:
  run 1+: 1000 ms

transam.c cache, inline (an inline stub that proxies to the existing
function on cache miss):
  run 1+:  800 ms

tqual.c cache, checked at TransactionIdDidCommit:
  run 1+:  790 ms (1-2 fewer 'if' statements)

tqual.c cache, checked at TransactionIdDidCommit, no LSN flush check:
  run 1+:  740 ms

tqual.c cache, checked right after the hint bit check, no LSN flush check:
  run 1+:  670 ms

The last case is near hint bit performance, because the check happens
more or less at the same place, and the check itself is a small number
of integer math operations (and can be optimized further still). I
haven't yet timed the degenerate case where the commit cache is
receiving tons of misses and is constantly flushing out pages. I'm not
expecting it to be bad, but I need to mention it -- I have some ideas
about mitigation if it turns out to be a problem.

So the question is: what is the way forward? Below is the commit cache
code I'm hooking into various places, with some fixes from the previous
patch. I'm not stuck on the actual implementation of the cache
structure, other than that it has to be completely inline (no dynahash)
and very tight. Where it gets called from is much more important.

merlin

#define COMMIT_CACHE_BUCKET_COUNT 4
#define COMMIT_CACHE_BUCKET_BLOCKSZ BLCKSZ
#define COMMIT_CACHE_ROLLUPSZ 100
#define COMMIT_CACHE_HIT_THRESHOLD 5

typedef struct
{
    int XidBucket;
    int HitCount;
    bool Clear;
} CommitCacheBucketHeader;

/* I know the static definition here is lousy, but it's convenient for
 * testing purposes */
static CommitCacheBucketHeader CommitCacheBucketList[COMMIT_CACHE_BUCKET_COUNT] =
    {{-1, 0, false}, {-1, 0, false}, {-1, 0, false}, {-1, 0, false}};
static char CommitCacheData[COMMIT_CACHE_BUCKET_COUNT][COMMIT_CACHE_BUCKET_BLOCKSZ];
static int CommitCacheBucketRollupList[COMMIT_CACHE_ROLLUPSZ];
static int CommitCacheMissCount = 0;

/* qsort comparison function */
static int
int_cmp(const void *p1, const void *p2)
{
    int v1 = *((const int *) p1);
    int v2 = *((const int *) p2);

    if (v1 < v2)
        return -1;
    if (v1 > v2)
        return 1;
    return 0;
}

static void
RollUpCommitCache()
{
    TransactionId LastBucket = CommitCacheBucketRollupList[0];
    int BucketIdx;
    int HitCount = 0;
    int Bucket;
    int BucketMinHits = COMMIT_CACHE_ROLLUPSZ + 1;
    int BucketMinIdx;
    int XidBucket;

    qsort(CommitCacheBucketRollupList, COMMIT_CACHE_ROLLUPSZ,
          sizeof(int), int_cmp);

    for (BucketIdx = 0; BucketIdx < COMMIT_CACHE_ROLLUPSZ; BucketIdx++)
    {
        if (CommitCacheBucketRollupList[BucketIdx] == LastBucket)
            HitCount++;
        else
        {
            if (HitCount >= COMMIT_CACHE_HIT_THRESHOLD)
            {
CheckLastXidBlock:
                for (Bucket = 0; Bucket < COMMIT_CACHE_BUCKET_COUNT; Bucket++)
                {
                    if (CommitCacheBucketList[Bucket].HitCount < BucketMinHits)
                    {
                        BucketMinIdx = Bucket;
                        BucketMinHits = CommitCacheBucketList[Bucket].HitCount;
                    }
                }

                XidBucket = CommitCacheBucketRollupList[BucketIdx];

                if (HitCount > BucketMinHits)
                {
                    if (XidBucket != CommitCacheBucketList[BucketMinIdx].XidBucket)
                    {
                        CommitCacheBucketList[BucketMinIdx].XidBucket = XidBucket;
                        CommitCacheBucketList[BucketMinIdx].HitCount = HitCount;
                        CommitCacheBucketList[BucketMinIdx].Clear = true;
                    }
                }
            }

            HitCount = 1;
            LastBucket = CommitCacheBucketRollupList[BucketIdx];
        }
    }

    if (HitCount >= COMMIT_CACHE_HIT_THRESHOLD)
    {
        BucketIdx--;
        goto CheckLastXidBlock;
    }

    for (Bucket = 0; Bucket < COMMIT_CACHE_BUCKET_COUNT; Bucket++)
    {
        if (CommitCacheBucketList[BucketMinIdx].Clear)
            memset(CommitCacheData[Bucket], 0, COMMIT_CACHE_BUCKET_BLOCKSZ);

        CommitCacheBucketList[BucketMinIdx].Clear = 0;
        CommitCacheBucketList[BucketMinIdx].HitCount = 0;
    }

    CommitCacheMissCount = 0;

    return;
}

static inline bool
IsXidInCommitCacheI(TransactionId xid)
{
    int XidBucket = xid / COMMIT_CACHE_BUCKET_BLOCKSZ;
    int Bucket;

    for (Bucket = 0; Bucket < COMMIT_CACHE_BUCKET_COUNT; Bucket++)
    {
        if (CommitCacheBucketList[Bucket].XidBucket == XidBucket)
        {
            int ByteOffset = xid / BITS_PER_BYTE;
            int BitOffset = xid % BITS_PER_BYTE;

            if (CommitCacheData[Bucket][ByteOffset] & (1 << BitOffset))
            {
                CommitCacheBucketList[Bucket].HitCount++;
                return true;
            }

            break;
        }
    }

    CommitCacheBucketRollupList[CommitCacheMissCount++] = XidBucket;

    if (CommitCacheMissCount == COMMIT_CACHE_ROLLUPSZ)
        RollUpCommitCache();

    return false;
}

static inline void
SetXidInCommitCacheI(TransactionId xid)
{
    int XidBucket = xid / COMMIT_CACHE_BUCKET_BLOCKSZ;
    int Bucket;

    for (Bucket = 0; Bucket < COMMIT_CACHE_BUCKET_COUNT; Bucket++)
    {
        if (XidBucket == CommitCacheBucketList[Bucket].XidBucket)
        {
            int ByteOffset = xid / BITS_PER_BYTE;
            int BitOffset = xid % BITS_PER_BYTE;

            CommitCacheData[Bucket][ByteOffset] |= (1 << BitOffset);
            break;
        }
    }

    return;
}

static inline bool
TransactionIdDidCommitCacheI(TransactionId transactionId)
{
    return IsXidInCommitCacheI(transactionId);
}
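For orientation, here is a hedged sketch of where the inline check is
intended to sit in the visibility path -- the "checked right after the
hint bit check" variant from the timings above. This is my reading of
the approach, not code from the patch; only the cache functions above
and existing PostgreSQL symbols (HEAP_XMIN_COMMITTED,
HeapTupleHeaderGetXmin) appear, and the branch bodies are abbreviated
to comments:

    /* Sketch of the call-site ordering, not the actual patch. */
    TransactionId xmin = HeapTupleHeaderGetXmin(tuple);

    if (tuple->t_infomask & HEAP_XMIN_COMMITTED)
    {
        /* hint bit already set: cheapest path, unchanged */
    }
    else if (TransactionIdDidCommitCacheI(xmin))
    {
        /* inline cache hit: treat as committed and set the hint bit,
         * but do not mark the buffer dirty */
    }
    else
    {
        /* cache miss: fall through to the existing out-of-line checks
         * (TransactionIdIsCurrentTransactionId, TransactionIdIsInProgress,
         * TransactionIdDidCommit), dirty the page via SetHintBits() as
         * stock PostgreSQL does, and populate the local cache as sketched
         * earlier when the commit LSN is known flushed. */
    }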
On Thu, Apr 7, 2011 at 1:28 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> int ByteOffset = xid / BITS_PER_BYTE;

whoops, I just noticed this was wrong -- the byte offset needs to take
the bucket into account. I need to clean this up some more, obviously,
but the issues at play remain the same....

merlin
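One possible correction, assuming each bucket's bitmap is meant to
cover COMMIT_CACHE_BUCKET_BLOCKSZ consecutive xids (my reading of the
posted bucketing, not a statement of what the cleaned-up patch did),
would be to compute the offset relative to the bucket rather than
relative to all of clog:

    /* hypothetical fix: offset within this bucket's data */
    int ByteOffset = (xid % COMMIT_CACHE_BUCKET_BLOCKSZ) / BITS_PER_BYTE;
    int BitOffset  = xid % BITS_PER_BYTE;

If a bucket is instead meant to cover BLCKSZ * BITS_PER_BYTE xids, the
XidBucket calculation would have to change in the same way.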
On Thu, Apr 7, 2011 at 2:49 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Thu, Apr 7, 2011 at 1:28 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> int ByteOffset = xid / BITS_PER_BYTE;
>
> whoops, I just noticed this was wrong -- the byte offset needs to take
> the bucket into account. I need to clean this up some more, obviously,
> but the issues at play remain the same....

So, where is the latest version of this patch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 28, 2011 at 10:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 7, 2011 at 2:49 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> On Thu, Apr 7, 2011 at 1:28 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> int ByteOffset = xid / BITS_PER_BYTE;
>>
>> whoops, I just noticed this was wrong -- the byte offset needs to take
>> the bucket into account. I need to clean this up some more, obviously,
>> but the issues at play remain the same....
>
> So, where is the latest version of this patch?

https://commitfest.postgresql.org/action/patch_view?id=553

merlin
On Wed, Jun 29, 2011 at 9:39 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Jun 28, 2011 at 10:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Thu, Apr 7, 2011 at 2:49 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>> On Thu, Apr 7, 2011 at 1:28 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>>> int ByteOffset = xid / BITS_PER_BYTE;
>>>
>>> whoops, I just noticed this was wrong -- the byte offset needs to take
>>> the bucket into account. I need to clean this up some more, obviously,
>>> but the issues at play remain the same....
>>
>> So, where is the latest version of this patch?
>
> https://commitfest.postgresql.org/action/patch_view?id=553

Oh, shoot. When I looked at this last night, I thought that link went
somewhere else. Woops.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Apr 2, 2011 at 8:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Merlin Moncure <mmoncure@gmail.com> writes:
>> aside:
>> Moving TransactionIdInProgress below TransactionIdDidCommit can help
>> in one sense: TransactionIdDidCommit grabs the XidStatus but discards
>> the knowledge if the transaction is known aborted.
>
> Doesn't the single-entry cache help with that?

Yes, it does.

Merlin, I'm a little worried by some of the statements around this
patch. It would be good to get a single clear analysis, then build the
patch from that.

* TransactionIdDidCommit grabs XidStatus and then keeps it -- whether
the xact is known aborted or known committed.

* At the top of the patch you mention that "It's tempting to try and
expand the simple last transaction status in transam.c but this check
happens too late in the critical visibility routines to provide much
benefit. For the fastest possible cache performance with the least
amount of impact on scan heavy cases, you have to maintain the cache
here if you want to avoid dirtying the page based on interaction with
the cache."

We call TransactionIdIsKnownCompleted() before we touch shared memory
in TransactionIdIsInProgress(), which seems early enough for me. Maybe
I've misunderstood your comments.

Also, the reason we set the hint bits is (in my understanding) so that
other users will avoid having to do clog lookups. If we only cache xid
status locally, we would actually generate more clog lookups, not less.
I realise that reduces I/O, so it seems to be a straight tradeoff
between I/O and clog access. I think we need to be clear what the goals
of this are -- different people may have different tuning goals --
perhaps we need a parameter for that, because in different situations
either one might make sense.

For me, it makes sense for us to set hint bits on already-dirty pages
when they are written out by the bgwriter. At the moment we do very
little to amortise the I/O that must already happen.

You could easily get frustrated here and lose interest. I hope not.
This is interesting stuff and you are well qualified to find a
solution. I just want to be sure we define exactly what the problem is,
and publish all of the analysis first. Writing the final patch is
probably the easiest part of that task.

Anyway, I don't think this patch is ready for commit this time around,
but good hunting. There is some treasure to be found here, I'm certain.

-- 
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
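For readers following along, the early exit Simon refers to has roughly
this shape; it is paraphrased from memory rather than verbatim backend
source, and the stand-in for the slow path is a name of my own
choosing:

    /* Paraphrase of the ordering, not actual PostgreSQL source. */
    bool
    TransactionIdIsInProgress(TransactionId xid)
    {
        /*
         * Cheap backend-local checks come first; TransactionIdIsKnownCompleted()
         * consults the single-entry cache, so a recently looked-up xid is
         * answered without touching shared memory at all.
         */
        if (TransactionIdIsKnownCompleted(xid))
            return false;

        /* only now take locks and scan the ProcArray in shared memory */
        return TransactionIdIsInProgressSlow(xid);  /* hypothetical stand-in */
    }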
On Wed, Jun 29, 2011 at 9:09 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Sat, Apr 2, 2011 at 8:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Merlin Moncure <mmoncure@gmail.com> writes:
>
>>> aside:
>>> Moving TransactionIdInProgress below TransactionIdDidCommit can help
>>> in one sense: TransactionIdDidCommit grabs the XidStatus but discards
>>> the knowledge if the transaction is known aborted.
>>
>> Doesn't the single-entry cache help with that?
>
> Yes, it does.

correct. that line of analysis was refuted both in terms of benefit and
on structural grounds by Tom.

> I'm a little worried by some of the statements around this patch. It
> would be good to get a single clear analysis, then build the patch
> from that.
>
> * TransactionIdDidCommit grabs XidStatus and then keeps it -- whether
> the xact is known aborted or known committed.
>
> * At the top of the patch you mention that "It's tempting to try and
> expand the simple last transaction status in transam.c but this check
> happens too late in the critical visibility routines to provide much
> benefit. For the fastest possible cache performance with the least
> amount of impact on scan heavy cases, you have to maintain the cache
> here if you want to avoid dirtying the page based on interaction with
> the cache."
>
> We call TransactionIdIsKnownCompleted() before we touch shared memory
> in TransactionIdIsInProgress(), which seems early enough for me.
> Maybe I've misunderstood your comments.

This is correct. If you are scanning tuples all written by the same
transaction, the transam.c cache will prevent some of the worst
problems. However, going through all the mechanics in the
transam.c-covered case is still fairly expensive. You have to call
several non-inlined functions, including XLogNeedsFlush(), over and
over. This is more expensive than it looks, and a transam.c cache
implementation is hamstrung out of the gate if you are trying to avoid
i/o... it isn't really all that much cheaper than jumping out to the
slru, and it will penalize repetitive scans (I can prove this with
benchmarks). Bear in mind I started out with the intent of working in
transam.c, and dropped it when it turned out not to work.

> Also, the reason we set the hint bits is (in my understanding) so that
> other users will avoid having to do clog lookups. If we only cache xid
> status locally, we would actually generate more clog lookups, not
> less. I realise that reduces I/O, so it seems to be a straight
> tradeoff between I/O and clog access.

That is not correct. Any cache 'miss' on a page continues to fall
through to SetHintBitsCache(), which dirties the page as it always has.
This is a key innovation of the patch, IMNSHO. Your worst case is the
old behavior -- the maximum number of times the fault to clog can
happen for a page is once. It's impossible to get even one more clog
access than you did with stock postgres.

> For me, it makes sense for us to set hint bits on already-dirty pages
> when they are written out by the bgwriter. At the moment we do very
> little to amortise the I/O that must already happen.

I could agree with this. However, if the bgwriter is forcing out pages
before the hint bits can be set, you have a problem. Also, adding more
work to the bgwriter may cause other secondary performance issues.

merlin
On Wed, Jun 29, 2011 at 10:09 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> That is not correct. Any cache 'miss' on a page continues to fall
> through to SetHintBitsCache(), which dirties the page as it always has.

er, SetHintBits(), not SetHintBitsCache()

merlin
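To spell out the two branches being described, here is a minimal
sketch; the function name and the extra parameter are mine, not the
patch's, while SetBufferCommitInfoNeedsSave() is the existing routine
that stock SetHintBits() uses to dirty the buffer:

    /* Sketch only: how a cache-aware hint bit setter might split the cases. */
    static inline void
    SetHintBitsSketch(HeapTupleHeader tuple, Buffer buffer,
                      uint16 infomask, bool from_local_cache)
    {
        tuple->t_infomask |= infomask;      /* the hint bit is always set */

        if (!from_local_cache)
            SetBufferCommitInfoNeedsSave(buffer);   /* cache miss: dirty the
                                                     * page, as stock postgres */
        /*
         * Cache hit: leave the buffer clean.  The hint survives only if the
         * page is dirtied for some other reason before eviction, so the clog
         * fault can still happen at most once per page -- never more than
         * pre-cache behavior.
         */
    }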