Re: logical decoding and replication of sequences, take 2 - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: logical decoding and replication of sequences, take 2 |
Date | |
Msg-id | 8a2f25d8-23d4-4ab8-2a23-7c8d2d976209@enterprisedb.com |
In response to | RE: logical decoding and replication of sequences, take 2 ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>) |
Responses | Re: logical decoding and replication of sequences, take 2 (3 replies) |
List | pgsql-hackers |
On 12/3/23 13:55, Hayato Kuroda (Fujitsu) wrote:
> Dear Tomas,
>
>>> I did also performance tests (especially case 3). First of all, there are some
>>> variants from yours.
>>>
>>> 1. patch 0002 was reverted because it has an issue. So this test checks whether
>>> refactoring around ReorderBufferSequenceIsTransactional seems really
>> needed.
>>
>> FWIW I also did the benchmarks without the 0002 patch, for the same
>> reason. I forgot to mention that.
>
> Oh, good news. So your benchmarks are quite meaningful.
>
>>
>> Interesting, so what exactly does the transaction do?
>
> It is quite simple - PSA the script file. It was executed with 64 multiplicity.
> The definition of alter_sequence() is the same as you said.
> (I used a normal bash script for running them, but your approach may be smarter)
>
>> Anyway, I don't
>> think this is very surprising - I believe it behaves like this because
>> of having to search in many hash tables (one in each toplevel xact). And
>> I think the solution I explained before (maintaining a single toplevel
>> hash, instead of many per-top-level hashes) should help with that.
>
> Agreed. And I can benchmark again for new ones, maybe when we decide new
> approach.
>

Thanks for the script. Are you also measuring the time it takes to decode this using test_decoding?

FWIW I did a more comprehensive suite of tests over the weekend, with a couple more variations. I'm attaching the updated scripts; running them should be as simple as

    ./run.sh BRANCH TRANSACTIONS RUNS

so perhaps

    ./run.sh master 1000 3

to do 3 runs with 1000 transactions per client. It'll run a bunch of combinations hard-coded in the script and write the timings into a CSV file (with "master" in each row).

I did this on two machines (an i5 with 4 cores, a xeon with 16/32 cores), with current master, the basic patch (without the 0002 part), and then with the optimized approach (single global hash table, see the 0004 part). That's what master / patched / optimized in the results refers to. Interestingly enough, the i5 handled this much faster; it seems to be better at single-core tasks. The xeon is still running, so the results for "optimized" only have one run (out of 3), but that shouldn't change much.

Attached is also a table summarizing this and visualizing the timing change (vs. master) in the last couple of columns. Green means faster than master (but we don't really expect that), and red means slower than master (the more red, the slower). The results are grouped by script (see the attached .tgz), with either 32 or 96 clients (which does affect the timing, but not the difference between master and patch). Some executions have no pg_sleep() calls, some have a 0.001 wait (but that doesn't seem to make much difference).

Overall, I'd group the results into about three groups:

1) good cases [nextval, nextval-40, nextval-abort]

These are cases that slow down a bit, but the slowdown is mostly within reasonable bounds (we're making the decoding do more stuff, so it'd be a bit silly to require that extra work to have no impact). And I do think this is reasonable, because this is pretty much an extreme / worst-case behavior - people don't really do just nextval() calls without doing anything else, not to mention aborting 100% of the transactions. So in practice this is going to be within noise (and in those cases the results even show a speedup, which seems a bit surprising). It's somewhat dependent on the CPU too - on the xeon there's hardly any regression.
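To give an idea what these cases actually run, the nextval scripts boil down to something like the sketch below (a simplified illustration only - the exact scripts are in the attached .tgz, and the sequence name here is made up):

    -- assumes a sequence created beforehand: CREATE SEQUENCE s;
    -- nextval case: each transaction does nothing but bump the sequence
    -- (nextval-40 presumably does 40 nextval() calls per transaction,
    --  nextval-abort ends with ROLLBACK instead of COMMIT, and some
    --  variants add a SELECT pg_sleep(0.001) before the commit)
    BEGIN;
    SELECT nextval('s');
    COMMIT;

The decoding time is then simply how long it takes to consume those changes from a test_decoding slot (e.g. with pg_logical_slot_get_changes()).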
2) nextval-40-abort

Here the slowdown is clear, but I'd argue it generally falls into the same group as (1). Yes, I'd be happier if it didn't behave like this, but if someone can show me a practical workload affected by this ...

3) irrelevant cases [all the alters taking insane amounts of time]

I absolutely refuse to care about these extreme cases where decoding 100k transactions takes 5-10 minutes (on the i5), or up to 30 minutes (on the xeon). If this was a problem for some practical workload, we'd have already heard about it, I guess. And even if there was such a workload, it wouldn't be up to this patch to fix it. There's clearly something misbehaving in the snapshot builder.

I was hopeful the global hash table would be an improvement, but that doesn't seem to be the case. I haven't done much profiling yet, but I'd guess most of the overhead is due to ReorderBufferQueueSequence() starting and aborting a transaction in the non-transactional case. Which is unfortunate, but I don't know if there's a way to optimize that.

Some time ago I floated the idea of maybe "queuing" the sequence changes and only replaying them on the next commit, somehow. But we ran into problems with which snapshot to use, which I didn't know how to solve. Maybe we should try again. The idea is we'd queue the non-transactional changes somewhere (it can't be in the transaction, because we must keep them even if it aborts), and then "inject" them into the next commit. That'd mean we wouldn't do the separate start/abort for each change.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company