Thread: [GENERAL] Logical decoding CPU-bound w/ large number of tables
Heyo,
I'm attempting to use logical decoding with the streaming replication protocol to perform change-data-capture on PostgreSQL 9.5.4. I'm seeing the replication stream "stall" for long periods of time where the walsender process will be pinned at 100% CPU utilization, but no data is being sent to my client.
The stalls occur unpredictably on my production system, but generally seem to be correlated with schema operations. My source database has about 100,000 tables; it's a one-schema-per-tenant multi-tenant SaaS system.
I've reproduced the same symptoms with two different approaches on my local machine. With both, I have a replication client connected via the streaming protocol.
In reproduction approach 1, I've created a thread that inserts small sets of data, and a thread that creates a schema w/ 500 tables and then drops it. This approach pins CPU usage, but data does come out of it -- just excruciatingly slowly compared to the same test without the schema create & drop.
In reproduction approach 2, I've created a database w/ 100,000 tables on it and performed a VACUUM FULL. The walsender goes to 100% CPU and no data comes out of the replication stream for hours.
I've performed a CPU sampling with the OSX `sample` tool based upon reproduction approach #1: https://gist.github.com/mfenniak/366d7ed19b2d804f41180572dc1600d8 It appears that most of the time is spent in the RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache invalidation callbacks, both of which appear to be invalidating caches based upon the cache value.
Has anyone else run into this kind of performance problem? Any thoughts on how it might be resolved? I don't mind putting in the work if someone could describe what is happening here, and have a discussion with me about what kind of changes might be necessary to improve the performance.
Thanks all,
Mathieu Fenniak
Mathieu Fenniak <mathieu.fenniak@replicon.com> writes:
> I'm attempting to use logical decoding with the streaming replication
> protocol to perform change-data-capture on PostgreSQL 9.5.4.  I'm seeing
> the replication stream "stall" for long periods of time where the walsender
> process will be pinned at 100% CPU utilization, but no data is being sent
> to my client.

> I've performed a CPU sampling with the OSX `sample` tool based upon
> reproduction approach #1:
> https://gist.github.com/mfenniak/366d7ed19b2d804f41180572dc1600d8
> It appears that most of the time is spent in the
> RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache
> invalidation callbacks, both of which appear to be invalidating caches
> based upon the cache value.

Hmm ... as for RelfilenodeMapInvalidateCallback, the lack of calls to
hash_search() from it in your trace says that it usually isn't doing
anything useful.  All the time is being spent in hash_seq_search,
uselessly iterating over the hashtable.  I'm inclined to think that
we need a smarter data structure there, maybe an independent hashtable
tracking the reverse map from relation OID to filenode map entry.

As for CatalogCacheIdInvalidate, I wonder how many of those cycles are
doing something useful, and how many are being wasted in the outer loop
that just iterates over the cache list.  We could trivially get rid of
that outer search by using syscache.c's array, as in the attached patch.
It'd be interesting to see if this patch helps your scenario #1.
(Patch is against HEAD but seems to apply cleanly to 9.5)

Most likely, your scenario #2 is completely stuck on the
RelfilenodeMapInvalidateCallback issue, though it would be good to get
a trace to confirm that.

			regards, tom lane

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index c27186f..b19044c 100644
*** a/src/backend/utils/cache/catcache.c
--- b/src/backend/utils/cache/catcache.c
*************** CatCacheRemoveCList(CatCache *cache, Cat
*** 422,428 ****
  /*
! *		CatalogCacheIdInvalidate
   *
   *	Invalidate entries in the specified cache, given a hash value.
   *
--- 422,428 ----
  /*
! *		CatCacheInvalidate
   *
   *	Invalidate entries in the specified cache, given a hash value.
   *
*************** CatCacheRemoveCList(CatCache *cache, Cat
*** 440,510 ****
   *	This routine is only quasi-public: it should only be used by inval.c.
   */
  void
! CatalogCacheIdInvalidate(int cacheId, uint32 hashValue)
  {
! 	slist_iter	cache_iter;

! 	CACHE1_elog(DEBUG2, "CatalogCacheIdInvalidate: called");

  	/*
! 	 * inspect caches to find the proper cache
  	 */
- 	slist_foreach(cache_iter, &CacheHdr->ch_caches)
- 	{
- 		CatCache   *ccp = slist_container(CatCache, cc_next, cache_iter.cur);
- 		Index		hashIndex;
- 		dlist_mutable_iter iter;

! 		if (cacheId != ccp->id)
! 			continue;
!
! 		/*
! 		 * We don't bother to check whether the cache has finished
! 		 * initialization yet; if not, there will be no entries in it so no
! 		 * problem.
! 		 */
!
! 		/*
! 		 * Invalidate *all* CatCLists in this cache; it's too hard to tell
! 		 * which searches might still be correct, so just zap 'em all.
! 		 */
! 		dlist_foreach_modify(iter, &ccp->cc_lists)
! 		{
! 			CatCList   *cl = dlist_container(CatCList, cache_elem, iter.cur);
!
! 			if (cl->refcount > 0)
! 				cl->dead = true;
! 			else
! 				CatCacheRemoveCList(ccp, cl);
! 		}
!
! 		/*
! 		 * inspect the proper hash bucket for tuple matches
! 		 */
! 		hashIndex = HASH_INDEX(hashValue, ccp->cc_nbuckets);
! 		dlist_foreach_modify(iter, &ccp->cc_bucket[hashIndex])
  		{
! 			CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
!
! 			if (hashValue == ct->hash_value)
  			{
! 				if (ct->refcount > 0 ||
! 					(ct->c_list && ct->c_list->refcount > 0))
! 				{
! 					ct->dead = true;
! 					/* list, if any, was marked dead above */
! 					Assert(ct->c_list == NULL || ct->c_list->dead);
! 				}
! 				else
! 					CatCacheRemoveCTup(ccp, ct);
! 				CACHE1_elog(DEBUG2, "CatalogCacheIdInvalidate: invalidated");
  #ifdef CATCACHE_STATS
! 				ccp->cc_invals++;
  #endif
! 				/* could be multiple matches, so keep looking! */
! 			}
  		}
- 		break;					/* need only search this one cache */
  	}
  }
--- 440,496 ----
   *	This routine is only quasi-public: it should only be used by inval.c.
   */
  void
! CatCacheInvalidate(CatCache *cache, uint32 hashValue)
  {
! 	Index		hashIndex;
! 	dlist_mutable_iter iter;

! 	CACHE1_elog(DEBUG2, "CatCacheInvalidate: called");

  	/*
! 	 * We don't bother to check whether the cache has finished initialization
! 	 * yet; if not, there will be no entries in it so no problem.
  	 */

! 	/*
! 	 * Invalidate *all* CatCLists in this cache; it's too hard to tell which
! 	 * searches might still be correct, so just zap 'em all.
! 	 */
! 	dlist_foreach_modify(iter, &cache->cc_lists)
! 	{
! 		CatCList   *cl = dlist_container(CatCList, cache_elem, iter.cur);
!
! 		if (cl->refcount > 0)
! 			cl->dead = true;
! 		else
! 			CatCacheRemoveCList(cache, cl);
! 	}
!
! 	/*
! 	 * inspect the proper hash bucket for tuple matches
! 	 */
! 	hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
! 	dlist_foreach_modify(iter, &cache->cc_bucket[hashIndex])
! 	{
! 		CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
!
! 		if (hashValue == ct->hash_value)
  		{
! 			if (ct->refcount > 0 ||
! 				(ct->c_list && ct->c_list->refcount > 0))
  			{
! 				ct->dead = true;
! 				/* list, if any, was marked dead above */
! 				Assert(ct->c_list == NULL || ct->c_list->dead);
  			}
! 			else
! 				CatCacheRemoveCTup(cache, ct);
! 			CACHE1_elog(DEBUG2, "CatCacheInvalidate: invalidated");
  #ifdef CATCACHE_STATS
! 			cache->cc_invals++;
  #endif
! 			/* could be multiple matches, so keep looking! */
  		}
  	}
  }
*************** build_dummy_tuple(CatCache *cache, int n
*** 1823,1829 ****
   * the specified relation, find all catcaches it could be in, compute the
   * correct hash value for each such catcache, and call the specified
   * function to record the cache id and hash value in inval.c's lists.
!  * CatalogCacheIdInvalidate will be called later, if appropriate,
   * using the recorded information.
   *
   * For an insert or delete, tuple is the target tuple and newtuple is NULL.
--- 1809,1815 ----
   * the specified relation, find all catcaches it could be in, compute the
   * correct hash value for each such catcache, and call the specified
   * function to record the cache id and hash value in inval.c's lists.
!  * SysCacheInvalidate will be called later, if appropriate,
   * using the recorded information.
   *
   * For an insert or delete, tuple is the target tuple and newtuple is NULL.
diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 55e5c8c..347fdde 100644
*** a/src/backend/utils/cache/inval.c
--- b/src/backend/utils/cache/inval.c
*************** LocalExecuteInvalidationMessage(SharedIn
*** 552,558 ****
  	{
  		InvalidateCatalogSnapshot();

! 		CatalogCacheIdInvalidate(msg->cc.id, msg->cc.hashValue);

  		CallSyscacheCallbacks(msg->cc.id, msg->cc.hashValue);
  	}
--- 552,558 ----
  	{
  		InvalidateCatalogSnapshot();

! 		SysCacheInvalidate(msg->cc.id, msg->cc.hashValue);

  		CallSyscacheCallbacks(msg->cc.id, msg->cc.hashValue);
  	}
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index edbc151..066ce72 100644
*** a/src/backend/utils/cache/syscache.c
--- b/src/backend/utils/cache/syscache.c
*************** SearchSysCacheList(int cacheId, int nkey
*** 1339,1344 ****
--- 1339,1365 ----
  }

  /*
+  * SysCacheInvalidate
+  *
+  *	Invalidate entries in the specified cache, given a hash value.
+  *	See CatCacheInvalidate() for more info.
+  *
+  *	This routine is only quasi-public: it should only be used by inval.c.
+  */
+ void
+ SysCacheInvalidate(int cacheId, uint32 hashValue)
+ {
+ 	if (cacheId < 0 || cacheId >= SysCacheSize)
+ 		elog(ERROR, "invalid cache ID: %d", cacheId);
+
+ 	/* if this cache isn't initialized yet, no need to do anything */
+ 	if (!PointerIsValid(SysCache[cacheId]))
+ 		return;
+
+ 	CatCacheInvalidate(SysCache[cacheId], hashValue);
+ }
+
+ /*
   * Certain relations that do not have system caches send snapshot invalidation
   * messages in lieu of catcache messages.  This is for the benefit of
   * GetCatalogSnapshot(), which can then reuse its existing MVCC snapshot
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 299d246..5add424 100644
*** a/src/include/utils/catcache.h
--- b/src/include/utils/catcache.h
*************** extern void ReleaseCatCacheList(CatCList
*** 190,196 ****
  extern void ResetCatalogCaches(void);
  extern void CatalogCacheFlushCatalog(Oid catId);

! extern void CatalogCacheIdInvalidate(int cacheId, uint32 hashValue);
  extern void PrepareToInvalidateCacheTuple(Relation relation,
  							  HeapTuple tuple,
  							  HeapTuple newtuple,
--- 190,196 ----
  extern void ResetCatalogCaches(void);
  extern void CatalogCacheFlushCatalog(Oid catId);

! extern void CatCacheInvalidate(CatCache *cache, uint32 hashValue);
  extern void PrepareToInvalidateCacheTuple(Relation relation,
  							  HeapTuple tuple,
  							  HeapTuple newtuple,
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 36805eb..73991dd 100644
*** a/src/include/utils/syscache.h
--- b/src/include/utils/syscache.h
*************** struct catclist;
*** 140,145 ****
--- 140,147 ----
  extern struct catclist *SearchSysCacheList(int cacheId, int nkeys,
  				  Datum key1, Datum key2, Datum key3, Datum key4);

+ extern void SysCacheInvalidate(int cacheId, uint32 hashValue);
+
  extern bool RelationInvalidatesSnapshotsOnly(Oid relid);
  extern bool RelationHasSysCache(Oid relid);
  extern bool RelationSupportsSysCache(Oid relid);

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Hi,

On 2017-05-05 14:24:07 -0600, Mathieu Fenniak wrote:
> The stalls occur unpredictably on my production system, but generally seem
> to be correlated with schema operations.  My source database has about
> 100,000 tables; it's a one-schema-per-tenant multi-tenant SaaS system.

I'm unfortunately not entirely surprised you're seeing some issues in
that case.  We're invalidating internal caches a bit overjudiciously,
and that invalidation is triggered by schema changes.

> I've performed a CPU sampling with the OSX `sample` tool based upon
> reproduction approach #1:
> https://gist.github.com/mfenniak/366d7ed19b2d804f41180572dc1600d8
> It appears that most of the time is spent in the
> RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache
> invalidation callbacks, both of which appear to be invalidating caches
> based upon the cache value.

I think optimizing those has some value (and I see Tom is looking at
that aspect), but the bigger thing would probably be to do fewer lookups.

> Has anyone else run into this kind of performance problem?  Any thoughts on
> how it might be resolved?  I don't mind putting in the work if someone
> could describe what is happening here, and have a discussion with me about
> what kind of changes might be necessary to improve the performance.

If you could provide an easily runnable sql script that reproduces the
issue, I'll have a look.  I think I have a rough idea what to do.

Greetings,

Andres Freund
Hi,

On 2017-05-05 20:59:09 -0400, Tom Lane wrote:
> Hmm ... as for RelfilenodeMapInvalidateCallback, the lack of calls to
> hash_search() from it in your trace says that it usually isn't doing
> anything useful.  All the time is being spent in hash_seq_search,
> uselessly iterating over the hashtable.  I'm inclined to think that
> we need a smarter data structure there, maybe an independent hashtable
> tracking the reverse map from relation OID to filenode map entry.

Yea, that might be worthwhile.  Let me try to address the issue that we
do way too much invalidation, then we can check whether this is still
exercised hotly.  On the other hand, it's still a dumb invalidation
approach, so if somebody feels like working on this...

> As for CatalogCacheIdInvalidate, I wonder how many of those cycles
> are doing something useful, and how many are being wasted in the outer
> loop that just iterates over the cache list.  We could trivially get
> rid of that outer search by using syscache.c's array, as in the
> attached patch.  It'd be interesting to see if this patch helps your
> scenario #1.  (Patch is against HEAD but seems to apply cleanly to 9.5)

I've seen this be a significant fraction of CPU time completely
independent of logical decoding, so I'd guess this is worthwhile
independently.  Not sure what a good benchmark for this would be.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2017-05-05 14:24:07 -0600, Mathieu Fenniak wrote:
>> It appears that most of the time is spent in the
>> RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache
>> invalidation callbacks, both of which appear to be invalidating caches
>> based upon the cache value.

> I think optimizing those has some value (and I see Tom is looking at
> that aspect), but the bigger thing would probably be to do fewer lookups.

I'm confused --- the lookup side of things is down in the noise in
Mathieu's trace.  Further reducing the number of lookups doesn't seem
like it helps this scenario at all.  It might matter once we've whacked
down the cost of invalidations ...

			regards, tom lane
On 2017-05-05 21:32:27 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-05-05 14:24:07 -0600, Mathieu Fenniak wrote:
> >> It appears that most of the time is spent in the
> >> RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache
> >> invalidation callbacks, both of which appear to be invalidating caches
> >> based upon the cache value.
>
> > I think optimizing those has some value (and I see Tom is looking at
> > that aspect), but the bigger thing would probably be to do fewer lookups.
>
> I'm confused --- the lookup side of things is down in the noise in
> Mathieu's trace.

Err, sorry.  Completely mangled that sentence.  Executing fewer
invalidations.  We currently are likely re-executing the same set of
invalidations constantly in Mathieu's case.

Background: When decoding a transaction during logical decoding we're
currently re-executing *all* of a transaction's own cache invalidations,
if it has any, at every new command-id observed in the WAL stream.
That's because currently invalidations are only sent at commit, so we
don't know from "when" they are.  But I think there's some very
low-hanging fruit in reducing the frequency at which those are executed.

In many cases where there's just a few schema changes in a transaction,
this doesn't hurt badly.  But if you have a transaction that does a
boatload of schema changes *and* has a lot of other changes, it gets
expensive.

Mathieu: The above also indicates a possible workaround: you can try
separating larger amounts of data manipulation from schema changes,
into separate transactions.

Greetings,

Andres Freund
Hi Andres, Tom,
Andres, it seems like the problem is independent of having large data manipulations mixed with schema changes. The below test case demonstrates it with just schema changes.
Tom, I've tested your patch, and it seems to have a positive impact for sure. I've documented a test case to reproduce this issue (below), and your avoid-search-in-catcache-invalidate.patch reduces the test case time from 63 seconds per run to 27 seconds per run. Here's an updated sample capture with the patch applied: https://gist.github.com/mfenniak/cef36699deb426efd8e433027553a37e
Here's the test case I've been using for the above measurements:
a) creating 10,000 tables in 200 schemas
DO $$
DECLARE
i INTEGER;
j INTEGER;
schema TEXT;
BEGIN
FOR i IN 1..200 LOOP
schema := 'test_' || i;
EXECUTE format('CREATE SCHEMA IF NOT EXISTS %I', schema);
FOR j IN 1..50 LOOP
EXECUTE format('CREATE TABLE IF NOT EXISTS %I.%I (f1 int, f2 int)', schema, 'test_' || j);
END LOOP;
END LOOP;
END;
$$;
b) start logical replication w/ pg_recvlogical, printing out timestamps of messages and timestamps of transactions:
pg_recvlogical --create-slot --start -f - --if-not-exists --plugin=test_decoding --slot=cputest --dbname=mathieu.fenniak --username=postgres --option=include-timestamp | gawk '{ print strftime("%Y-%m-%d %H:%M:%S"), $0; fflush(); }'
c) drop a few schemas in a transaction
begin;drop schema test_1 cascade;drop schema test_2 cascade;drop schema test_3 cascade;drop schema test_4 cascade;drop schema test_5 cascade;drop schema test_6 cascade;commit;
Six schema drops had a 63-second delay between the actual commit and pg_recvlogical outputting it w/ stock 9.5.4:
2017-05-10 08:35:12 BEGIN 956
2017-05-10 08:36:15 COMMIT 956 (at 2017-05-10 08:35:12.106773-06)
Six schema drops w/ Tom's patch (applied to 9.5.4) had a 27 second delay:
2017-05-10 08:48:08 BEGIN 942
2017-05-10 08:48:35 COMMIT 942 (at 2017-05-10 08:48:08.182015-06)
(other test details: wal_level=logical; max_wal_senders=5; wal_sender_timeout = 3600s; max_locks_per_transaction = 65536; max_replication_slots = 5)
On Fri, May 5, 2017 at 7:38 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-05-05 21:32:27 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2017-05-05 14:24:07 -0600, Mathieu Fenniak wrote:
> > >> It appears that most of the time is spent in the
> > >> RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache
> > >> invalidation callbacks, both of which appear to be invalidating caches
> > >> based upon the cache value.
> >
> > > I think optimizing those has some value (and I see Tom is looking at
> > > that aspect), but the bigger thing would probably be to do fewer lookups.
> >
> > I'm confused --- the lookup side of things is down in the noise in
> > Mathieu's trace.
> Err, sorry.  Completely mangled that sentence.  Executing fewer
> invalidations.  We currently are likely re-executing the same set of
> invalidations constantly in Mathieu's case.
>
> Background: When decoding a transaction during logical decoding we're
> currently re-executing *all* of a transaction's own cache invalidations,
> if it has any, at every new command-id observed in the WAL stream.
> That's because currently invalidations are only sent at commit, so we
> don't know from "when" they are.  But I think there's some very
> low-hanging fruit in reducing the frequency at which those are executed.
>
> In many cases where there's just a few schema changes in a transaction,
> this doesn't hurt badly.  But if you have a transaction that does a
> boatload of schema changes *and* has a lot of other changes, it gets
> expensive.
>
> Mathieu: The above also indicates a possible workaround: you can try
> separating larger amounts of data manipulation from schema changes,
> into separate transactions.
>
> Greetings,
>
> Andres Freund
Mathieu Fenniak <mathieu.fenniak@replicon.com> writes:
> Andres, it seems like the problem is independent of having large data
> manipulations mixed with schema changes.  The below test case demonstrates
> it with just schema changes.

> Tom, I've tested your patch, and it seems to have a positive impact for
> sure.  I've documented a test case to reproduce this issue (below), and
> your avoid-search-in-catcache-invalidate.patch reduces the test case time
> from 63 seconds per run to 27 seconds per run.

Thanks for the test case.  Your machine is apparently a bit faster than
mine, or else Assert overhead is really hurting in this case, because on
a debug-enabled build of HEAD I see the replication time as about 115
seconds, with the principal culprits according to "perf" being

+   53.11%  52.96%  253596  postmaster  postgres  [.] CatalogCacheIdInvalidate
+   31.15%  31.06%  148779  postmaster  postgres  [.] hash_seq_search
+    5.70%   5.69%   27223  postmaster  postgres  [.] CallSyscacheCallbacks
+    3.36%   3.35%   16057  postmaster  postgres  [.] LocalExecuteInvalidationMessage
+    1.32%   1.32%    6306  postmaster  postgres  [.] ReorderBufferCommit

I concur that the previously-posted patch to avoid searching in
CatalogCacheIdInvalidate makes for about a 2X speedup; for me, that
brings the test case down to 55 seconds, and now the top hits are

+   59.28%  59.05%  150349  postmaster  postgres  [.] hash_seq_search
+    8.40%   8.36%   21250  postmaster  postgres  [.] CallSyscacheCallbacks
+    6.37%   6.34%   16115  postmaster  postgres  [.] LocalExecuteInvalidationMessage
+    5.69%   5.67%   14418  postmaster  postgres  [.] CatCacheInvalidate
+    3.14%   3.12%    7942  postmaster  postgres  [.] SysCacheInvalidate
+    1.72%   1.71%    4354  postmaster  postgres  [.] ReorderBufferCommit
+    1.37%   1.33%    3512  postmaster  postgres  [.] hash_search_with_hash_value
+    1.15%   1.15%    2913  postmaster  postgres  [.] InvalidateCatalogSnapshot

I looked at the hash_seq_search time a bit more, and realized that
actually there's a pretty easy fix for that, which is to reduce the
initial size of RelfilenodeMapHash from 1024 entries to 64.  You could
quibble about where to set that exactly, but 1024 is just way too many
--- in your example, there are never more than 5 entries in the hash,
despite the presence of 10000 tables in the database.  I also checked it
while running the regression tests, and while a few transactions get up
to several hundred entries, they're mostly less than 50.  So rather than
build a second index structure for that hashtable, I propose we just do
what's in the first attached patch.  With that, I was down to about 21s
run time, and now perf says

+   23.17%  23.08%   21254  postmaster  postgres  [.] CallSyscacheCallbacks
+   16.97%  16.91%   15579  postmaster  postgres  [.] LocalExecuteInvalidationMessage
+   16.10%  16.03%   14766  postmaster  postgres  [.] CatCacheInvalidate
+   12.45%  12.42%   11474  postmaster  postgres  [.] hash_seq_search
+    8.66%   8.64%    7959  postmaster  postgres  [.] SysCacheInvalidate
+    4.72%   4.70%    4331  postmaster  postgres  [.] ReorderBufferCommit
+    3.16%   3.14%    2894  postmaster  postgres  [.] InvalidateCatalogSnapshot
+    2.56%   2.50%    2344  postmaster  postgres  [.] hash_search_with_hash_value
+    1.27%   1.27%    1169  postmaster  postgres  [.] RelfilenodeMapInvalidateCallback

Looking at CallSyscacheCallbacks, it's got exactly the same disease of
linearly scanning a list to find relevant entries, when we could easily
pre-index the list.  The second patch attached gets me to 13s, with
profile

+   22.78%  22.70%   14568  postmaster  postgres  [.] CatCacheInvalidate
+   17.75%  17.69%   11355  postmaster  postgres  [.] LocalExecuteInvalidationMessage
+   17.11%  17.04%   10997  postmaster  postgres  [.] hash_seq_search
+    7.25%   7.22%    4634  postmaster  postgres  [.] SysCacheInvalidate
+    5.05%   5.03%    3229  postmaster  postgres  [.] CallSyscacheCallbacks
+    4.09%   4.08%    2615  postmaster  postgres  [.] ReorderBufferCommit
+    3.56%   3.55%    2276  postmaster  postgres  [.] InvalidateCatalogSnapshot
+    3.51%   3.40%    2240  postmaster  postgres  [.] hash_search_with_hash_value
+    1.76%   1.75%    1122  postmaster  postgres  [.] RelfilenodeMapInvalidateCallback

We're at a point of diminishing returns here; I think any further
improvement would require reducing the number of invalidation calls, as
Andres was speculating about upthread.  Still, this shows that it
doesn't take very much work to get a 10X improvement in the overhead
associated with inval activities.  We've never seen this overhead stick
out quite this much before, and maybe logical replication will always be
an atypical workload, but I think this may be worth committing even if
Andres does manage to cut the number of calls.

It would be interesting to see how much these patches help for your real
use-case, as opposed to this toy example.  Assuming that the results are
positive, I would advocate for back-patching these changes as far as
9.4, where logical decoding came in.

BTW, I also noticed that we're getting scarily close to exceeding
MAX_SYSCACHE_CALLBACKS.  There are 30-some calls to
CacheRegisterSyscacheCallback in our code, and while I think not all of
them can be reached in a single process, we demonstrably get as high as
21 registered callbacks in some regression test runs.  That's not
leaving a lot of daylight for add-on modules.  The second patch attached
includes increasing MAX_SYSCACHE_CALLBACKS from 32 to 64.  I think we'd
be well advised to apply and back-patch that, even if we don't use the
rest of the patch.

			regards, tom lane

diff --git a/src/backend/utils/cache/relfilenodemap.c b/src/backend/utils/cache/relfilenodemap.c
index c790309..612f0f3 100644
--- a/src/backend/utils/cache/relfilenodemap.c
+++ b/src/backend/utils/cache/relfilenodemap.c
@@ -123,7 +123,7 @@ InitializeRelfilenodeMap(void)
 	 * error.
 	 */
 	RelfilenodeMapHash =
-		hash_create("RelfilenodeMap cache", 1024, &ctl,
+		hash_create("RelfilenodeMap cache", 64, &ctl,
 					HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 
 	/* Watch for invalidation events. */

diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index 347fdde..6714a30 100644
*** a/src/backend/utils/cache/inval.c
--- b/src/backend/utils/cache/inval.c
*************** static int maxSharedInvalidMessagesArray
*** 178,195 ****
  /*
   * Dynamically-registered callback functions.  Current implementation
   * assumes there won't be very many of these at once; could improve if needed.
   */

! #define MAX_SYSCACHE_CALLBACKS 32
  #define MAX_RELCACHE_CALLBACKS 10

  static struct SYSCACHECALLBACK
  {
  	int16		id;				/* cache number */
  	SyscacheCallbackFunction function;
  	Datum		arg;
  }	syscache_callback_list[MAX_SYSCACHE_CALLBACKS];

  static int	syscache_callback_count = 0;

  static struct RELCACHECALLBACK
--- 178,202 ----
  /*
   * Dynamically-registered callback functions.  Current implementation
   * assumes there won't be very many of these at once; could improve if needed.
+  *
+  * To avoid searching in CallSyscacheCallbacks, all callbacks for a given
+  * syscache are linked into a list pointed to by syscache_callback_links[id].
+  * The link values are syscache_callback_list[] index plus 1, or 0 for none.
   */

! #define MAX_SYSCACHE_CALLBACKS 64
  #define MAX_RELCACHE_CALLBACKS 10

  static struct SYSCACHECALLBACK
  {
  	int16		id;				/* cache number */
+ 	int16		link;			/* next callback index+1 for same cache */
  	SyscacheCallbackFunction function;
  	Datum		arg;
  }	syscache_callback_list[MAX_SYSCACHE_CALLBACKS];

+ static int16 syscache_callback_links[SysCacheSize];
+
  static int	syscache_callback_count = 0;

  static struct RELCACHECALLBACK
*************** CacheRegisterSyscacheCallback(int cachei
*** 1386,1392 ****
--- 1393,1417 ----
  	if (syscache_callback_count >= MAX_SYSCACHE_CALLBACKS)
  		elog(FATAL, "out of syscache_callback_list slots");

+ 	Assert(cacheid >= 0 && cacheid < SysCacheSize);
+
+ 	if (syscache_callback_links[cacheid] == 0)
+ 	{
+ 		/* first callback for this cache */
+ 		syscache_callback_links[cacheid] = syscache_callback_count + 1;
+ 	}
+ 	else
+ 	{
+ 		/* add to end of chain, so that older callbacks are called first */
+ 		int			i = syscache_callback_links[cacheid] - 1;
+
+ 		while (syscache_callback_list[i].link > 0)
+ 			i = syscache_callback_list[i].link - 1;
+ 		syscache_callback_list[i].link = syscache_callback_count + 1;
+ 	}
+
  	syscache_callback_list[syscache_callback_count].id = cacheid;
+ 	syscache_callback_list[syscache_callback_count].link = 0;
  	syscache_callback_list[syscache_callback_count].function = func;
  	syscache_callback_list[syscache_callback_count].arg = arg;
*************** CallSyscacheCallbacks(int cacheid, uint3
*** 1426,1436 ****
  {
  	int			i;

! 	for (i = 0; i < syscache_callback_count; i++)
  	{
  		struct SYSCACHECALLBACK *ccitem = syscache_callback_list + i;

! 		if (ccitem->id == cacheid)
! 			(*ccitem->function) (ccitem->arg, cacheid, hashvalue);
  	}
  }
--- 1451,1464 ----
  {
  	int			i;

! 	Assert(cacheid >= 0 && cacheid < SysCacheSize);
! 	i = syscache_callback_links[cacheid] - 1;
! 	while (i >= 0)
  	{
  		struct SYSCACHECALLBACK *ccitem = syscache_callback_list + i;

! 		Assert(ccitem->id == cacheid);
! 		(*ccitem->function) (ccitem->arg, cacheid, hashvalue);
! 		i = ccitem->link - 1;
  	}
  }
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index 066ce72..f0a16e3 100644
*** a/src/backend/utils/cache/syscache.c
--- b/src/backend/utils/cache/syscache.c
*************** static const struct cachedesc cacheinfo[
*** 971,978 ****
  	}
  };

- #define SysCacheSize ((int) lengthof(cacheinfo))
-
  static CatCache *SysCache[SysCacheSize];

  static bool CacheInitialized = false;
--- 971,976 ----
*************** InitCatalogCache(void)
*** 1003,1008 ****
--- 1001,1009 ----
  	int			i,
  				j;

+ 	StaticAssertStmt(SysCacheSize == (int) lengthof(cacheinfo),
+ 					 "SysCacheSize does not match syscache.c's array");
+
  	Assert(!CacheInitialized);

  	SysCacheRelationOidSize = SysCacheSupportingRelOidSize = 0;
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 73991dd..e20284d 100644
*** a/src/include/utils/syscache.h
--- b/src/include/utils/syscache.h
*************** enum SysCacheIdentifier
*** 108,113 ****
--- 108,115 ----
  	TYPEOID,
  	USERMAPPINGOID,
  	USERMAPPINGUSERSERVER
+
+ #define SysCacheSize (USERMAPPINGUSERSERVER + 1)
  };

  extern void InitCatalogCache(void);
Hi Tom, Mathieu,

On 2017-05-10 17:02:11 -0400, Tom Lane wrote:
> Mathieu Fenniak <mathieu.fenniak@replicon.com> writes:
>> Andres, it seems like the problem is independent of having large data
>> manipulations mixed with schema changes.  The below test case
>> demonstrates it with just schema changes.
>
>> Tom, I've tested your patch, and it seems to have a positive impact for
>> sure.  I've documented a test case to reproduce this issue (below), and
>> your avoid-search-in-catcache-invalidate.patch reduces the test case
>> time from 63 seconds per run to 27 seconds per run.
>
> Thanks for the test case.

Indeed!

> +   59.28%  59.05%  150349  postmaster  postgres  [.] hash_seq_search
> +    8.40%   8.36%   21250  postmaster  postgres  [.] CallSyscacheCallbacks
> +    6.37%   6.34%   16115  postmaster  postgres  [.] LocalExecuteInvalidationMessage
> +    5.69%   5.67%   14418  postmaster  postgres  [.] CatCacheInvalidate
> +    3.14%   3.12%    7942  postmaster  postgres  [.] SysCacheInvalidate
> +    1.72%   1.71%    4354  postmaster  postgres  [.] ReorderBufferCommit
> +    1.37%   1.33%    3512  postmaster  postgres  [.] hash_search_with_hash_value
> +    1.15%   1.15%    2913  postmaster  postgres  [.] InvalidateCatalogSnapshot
>
> I looked at the hash_seq_search time a bit more, and realized that
> actually there's a pretty easy fix for that, which is to reduce
> the initial size of RelfilenodeMapHash from 1024 entries to 64.

Heh, that's not a bad idea.  Given the table resizes automatically,
there seems little reason not to go there.

> You could quibble about where to set that exactly, but 1024 is
> just way too many --- in your example, there are never more than
> 5 entries in the hash, despite the presence of 10000 tables in
> the database.

That's likely because there's no DML.

> We're at a point of diminishing returns here; I think any further
> improvement would require reducing the number of invalidation calls,
> as Andres was speculating about upthread.  Still, this shows that
> it doesn't take very much work to get a 10X improvement in the
> overhead associated with inval activities.  We've never seen this
> overhead stick out quite this much before, and maybe logical replication
> will always be an atypical workload, but I think this may be worth
> committing even if Andres does manage to cut the number of calls.

I've seen these callsites prominently in profiles not using logical
decoding.  Never quite as big as here, but still.  Temp-table-heavy
workloads IIRC are one of the easy ways to trigger it.

> It would be interesting to see how much these patches help for your real
> use-case, as opposed to this toy example.  Assuming that the results are
> positive, I would advocate for back-patching these changes as far as
> 9.4, where logical decoding came in.

+1.

> BTW, I also noticed that we're getting scarily close to exceeding
> MAX_SYSCACHE_CALLBACKS.  There are 30-some calls to
> CacheRegisterSyscacheCallback in our code, and while I think not all of
> them can be reached in a single process, we demonstrably get as high as
> 21 registered callbacks in some regression test runs.  That's not
> leaving a lot of daylight for add-on modules.  The second patch attached
> includes increasing MAX_SYSCACHE_CALLBACKS from 32 to 64.  I think we'd
> be well advised to apply and back-patch that, even if we don't use the
> rest of the patch.

+1 - I'd advocate for doing so all the way.

Greetings,

Andres Freund
Hi Tom, Andres,
I've taken your patches, Tom, and applied them to a test deployment of my actual application.
The most accessible way I have to reproduce this issue is to run a maintenance task that we typically run during a software deployment, which will remove some tenant schemas from our database system. I ran that task with and without the patches, and monitored how long logical replication stalled. This test was on a database with 22 schemas containing 400 tables each (~8800 tables).
As a baseline, PostgreSQL 9.5.6 stalled completely for 21m 13s.
PostgreSQL 9.5.6 + all three patches stalled completely for 4m 11s.
Definitely a fantastic improvement. :-)
This test scenario (performing the maintenance task) is unfortunately very similar to the synthetic test... but it's the most consistent manner I've found to reproduce this stall. I've observed the replication stall in production at somewhat arbitrary times, so I'm not sure if there may be different performance profiles in more production-like usage.
Here's a CPU performance sampling from the 4m period w/ the patched PostgreSQL. This is from the `perf` tool on Linux, rather than the earlier samples from my development machine on OSX. `perf report`: https://gist.github.com/mfenniak/9a3484c426a497f3903051d9fdf8b221 Raw perf data (49kb bzip2): https://www.dropbox.com/s/3jfxtg9kjzjztmp/linux-pg-walsender-patched.perf.data.bz2?dl=0
Mathieu
On Wed, May 10, 2017 at 3:10 PM, Andres Freund <andres@anarazel.de> wrote:
Hi Tom, Mathieu,
On 2017-05-10 17:02:11 -0400, Tom Lane wrote:
> Mathieu Fenniak <mathieu.fenniak@replicon.com> writes:
> > Andres, it seems like the problem is independent of having large data
> > manipulations mixed with schema changes. The below test case demonstrates
> > it with just schema changes.
>
> > Tom, I've tested your patch, and it seems to have a positive impact for
> > sure. I've documented a test case to reproduce this issue (below), and
> > your avoid-search-in-catcache-invalidate.patch reduces the test case time
> > from 63 seconds per run to 27 seconds per run.
>
> Thanks for the test case.
Indeed!
> + 59.28% 59.05% 150349 postmaster postgres [.] hash_seq_search
> + 8.40% 8.36% 21250 postmaster postgres [.] CallSyscacheCallbacks
> + 6.37% 6.34% 16115 postmaster postgres [.] LocalExecuteInvalidationMessage
> + 5.69% 5.67% 14418 postmaster postgres [.] CatCacheInvalidate
> + 3.14% 3.12% 7942 postmaster postgres [.] SysCacheInvalidate
> + 1.72% 1.71% 4354 postmaster postgres [.] ReorderBufferCommit
> + 1.37% 1.33% 3512 postmaster postgres [.] hash_search_with_hash_value
> + 1.15% 1.15% 2913 postmaster postgres [.] InvalidateCatalogSnapshot
>
> I looked at the hash_seq_search time a bit more, and realized that
> actually there's a pretty easy fix for that, which is to reduce
> the initial size of RelfilenodeMapHash from 1024 entries to 64.
Heh, that's not a bad idea. Given the table resizes automatically, there
seems little reason not to go there.
> You could quibble about where to set that exactly, but 1024 is
> just way too many --- in your example, there are never more than
> 5 entries in the hash, despite the presence of 10000 tables in
> the database.
That's likely because there's no DML.
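[Editor's note: to spell out why the initial-size change matters -- hash_seq_search walks every bucket, empty or not, so its cost tracks the table's bucket count rather than its live-entry count. A minimal sketch of that behavior follows; the struct and function names are illustrative, not dynahash's actual implementation.]

```c
#include <assert.h>
#include <stdlib.h>

struct entry {
    int key;
    struct entry *next;     /* bucket overflow chain */
};

struct table {
    size_t nbuckets;
    struct entry **buckets;
};

static struct table *make_table(size_t nbuckets)
{
    struct table *t = malloc(sizeof *t);
    t->nbuckets = nbuckets;
    t->buckets = calloc(nbuckets, sizeof *t->buckets);
    return t;
}

static void insert(struct table *t, int key)
{
    struct entry *e = malloc(sizeof *e);
    e->key = key;
    e->next = t->buckets[key % t->nbuckets];
    t->buckets[key % t->nbuckets] = e;
}

/* A sequential scan must visit every bucket, empty or not, so its cost
 * is O(nbuckets) no matter how few entries are live -- which is what
 * made hash_seq_search dominate the profile. */
static size_t seq_scan(const struct table *t, size_t *buckets_touched)
{
    size_t found = 0;
    *buckets_touched = 0;
    for (size_t i = 0; i < t->nbuckets; i++) {
        (*buckets_touched)++;
        for (const struct entry *e = t->buckets[i]; e; e = e->next)
            found++;
    }
    return found;
}
```

With 5 live entries, a 1024-bucket table touches 1024 buckets per scan while a 64-bucket one touches 64 -- and RelfilenodeMapInvalidateCallback performs such a scan for every relevant invalidation message, so the savings multiply across the invalidation storm.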
> We're at a point of diminishing returns here; I think any further
> improvement would require reducing the number of invalidation calls,
> as Andres was speculating about upthread. Still, this shows that
> it doesn't take very much work to get a 10X improvement in the
> overhead associated with inval activities. We've never seen this
> overhead stick out quite this much before, and maybe logical replication
> will always be an atypical workload, but I think this may be worth
> committing even if Andres does manage to cut the number of calls.
I've seen these callsites prominently in profiles not using logical
decoding. Never quite as big as here, but still. Temp table heavy
workloads IIRC are one of the easy ways to trigger it.
> It would be interesting to see how much these patches help for your real
> use-case, as opposed to this toy example. Assuming that the results are
> positive, I would advocate for back-patching these changes as far as 9.4
> where logical decoding came in.
+1.
> BTW, I also noticed that we're getting scarily close to exceeding
> MAX_SYSCACHE_CALLBACKS. There are 30-some calls to
> CacheRegisterSyscacheCallback in our code, and while I think not all of
> them can be reached in a single process, we demonstrably get as high as 21
> registered callbacks in some regression test runs. That's not leaving a
> lot of daylight for add-on modules. The second patch attached includes
> increasing MAX_SYSCACHE_CALLBACKS from 32 to 64. I think we'd be well
> advised to apply and back-patch that, even if we don't use the rest of
> the patch.
+1 - I'd advocate for doing so all the way.
Greetings,
Andres Freund
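[Editor's note: for context on why the MAX_SYSCACHE_CALLBACKS ceiling discussed above matters -- syscache invalidation callbacks live in a fixed-size array filled at registration time, so once it is full, any further registration fails hard, leaving no room for extensions. A simplified sketch of that scheme, with hypothetical names; the real inval.c code differs in detail:]

```c
#include <stdint.h>

#define MAX_CALLBACKS 64   /* the patch raises the real limit from 32 to 64 */

typedef void (*inval_callback)(int cacheid, uint32_t hashvalue);

static struct {
    int cacheid;
    inval_callback func;
} callbacks[MAX_CALLBACKS];

static int n_callbacks = 0;

/* Register a callback for one cache.  With a fixed array, running out of
 * slots is a hard failure (the real code reports a FATAL error), which is
 * why the limit needs headroom for add-on modules. */
static int register_callback(int cacheid, inval_callback func)
{
    if (n_callbacks >= MAX_CALLBACKS)
        return -1;
    callbacks[n_callbacks].cacheid = cacheid;
    callbacks[n_callbacks].func = func;
    return n_callbacks++;
}

/* On each invalidation message, every registered callback for that cache
 * is invoked -- this linear walk is the kind of work that showed up as
 * CallSyscacheCallbacks in the profiles above. */
static void fire_callbacks(int cacheid, uint32_t hashvalue)
{
    for (int i = 0; i < n_callbacks; i++)
        if (callbacks[i].cacheid == cacheid)
            callbacks[i].func(cacheid, hashvalue);
}
```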
Mathieu Fenniak <mathieu.fenniak@replicon.com> writes:
> I've taken your patches, Tom, and applied them to a test deployment of my
> actual application.

> The most accessible way I have to reproduce this issue is to run a
> maintenance task that we typically run during a software deployment, which
> will remove some tenant schemas from our database system.  I ran that task
> with and without the patches, and monitored how long logical replication
> stalled.  This test was on a database with 22 schemas containing 400 tables
> each (~8800 tables).

> As a baseline, PostgreSQL 9.5.6 stalled completely for 21m 13s.
> PostgreSQL 9.5.6 + all three patches stalled completely for 4m 11s.

> Definitely a fantastic improvement. :-)

Thanks for confirming that these patches help in a real-world scenario.
Obviously there's still room for improvement, but I think it's worth
pushing what I've done, so I'll go do that.  I hope Andres will follow up
with trying to improve matters from the other angle.

			regards, tom lane