Thread: [PATCH] pg_dump: lock tables in batches
Hi hackers,

A colleague of mine reported a slight inconvenience with pg_dump.

He is dumping the data from a remote server. There are several
thousands of tables in the database. Making a dump locally and/or
using pg_basebackup and/or logical replication is not an option. So
what pg_dump currently does is sending LOCK TABLE queries one after
another. Every query needs an extra round trip. So if we have let's
say 2000 tables and every round trip takes 100 ms then ~3.5 minutes
are spent in the not most useful way.

What he proposes is taking the locks in batches. I.e. instead of:

LOCK TABLE foo IN ACCESS SHARE MODE;
LOCK TABLE bar IN ACCESS SHARE MODE;

do:

LOCK TABLE foo, bar, ... IN ACCESS SHARE MODE;

The proposed patch makes this change. It's pretty straightforward and
as a side effect saves a bit of network traffic too.

Thoughts?

--
Best regards,
Aleksander Alekseev
Attachment
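For illustration only (this is not the patch itself; the helper names lock_tables_in_batches/flush_lock_batch and the 100K flush threshold are assumptions made for this sketch), the idea amounts to something like the following libpq-level C code. Table names are assumed to be already schema-qualified and quoted, and the caller is assumed to have an open transaction, as pg_dump does.

/*
 * Illustrative sketch only, not the committed patch: accumulate table
 * names into a single LOCK TABLE statement and flush it once the query
 * text grows past a threshold, instead of issuing one LOCK per table.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <libpq-fe.h>

#define LOCK_BATCH_MAX 100000       /* flush threshold in bytes (assumption) */

static void
flush_lock_batch(PGconn *conn, char *buf)
{
    PGresult   *res;

    if (buf[0] == '\0')
        return;                     /* nothing accumulated yet */
    strcat(buf, " IN ACCESS SHARE MODE");
    res = PQexec(conn, buf);
    if (PQresultStatus(res) != PGRES_COMMAND_OK)
    {
        fprintf(stderr, "LOCK TABLE failed: %s", PQerrorMessage(conn));
        exit(1);
    }
    PQclear(res);
}

static void
lock_tables_in_batches(PGconn *conn, const char *const *tables, int ntables)
{
    size_t      cap = LOCK_BATCH_MAX + 1024;    /* headroom for one more name */
    char       *buf = malloc(cap);
    size_t      len = 0;

    buf[0] = '\0';
    for (int i = 0; i < ntables; i++)
    {
        /* start a new statement, or append another table to the current one */
        len += snprintf(buf + len, cap - len, "%s%s",
                        len == 0 ? "LOCK TABLE " : ", ",
                        tables[i]);             /* names assumed already quoted */

        if (len > LOCK_BATCH_MAX)
        {
            flush_lock_batch(conn, buf);
            len = 0;
            buf[0] = '\0';
        }
    }
    flush_lock_batch(conn, buf);                /* flush whatever is left */
    free(buf);
}

Tom's final message in the thread mentions capping the generated LOCK TABLE query at 100K; the flush threshold above loosely stands in for that kind of limit.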
On Wed, Dec 7, 2022 at 12:09 PM Aleksander Alekseev <aleksander@timescale.com> wrote:
>
> Hi hackers,
>
> A colleague of mine reported a slight inconvenience with pg_dump.
>
> He is dumping the data from a remote server. There are several
> thousands of tables in the database. Making a dump locally and/or
> using pg_basebackup and/or logical replication is not an option. So
> what pg_dump currently does is sending LOCK TABLE queries one after
> another. Every query needs an extra round trip. So if we have let's
> say 2000 tables and every round trip takes 100 ms then ~3.5 minutes
> are spent in the not most useful way.
>
> What he proposes is taking the locks in batches. I.e. instead of:
>
> LOCK TABLE foo IN ACCESS SHARE MODE;
> LOCK TABLE bar IN ACCESS SHARE MODE;
>
> do:
>
> LOCK TABLE foo, bar, ... IN ACCESS SHARE MODE;
>
> The proposed patch makes this change. It's pretty straightforward and
> as a side effect saves a bit of network traffic too.
>
+1 for that change. It will improve the dump for databases with thousands of relations.
The code LGTM; it passes all tests and compiles without any warnings.
Regards,
Fabrízio de Royes Mello
Aleksander Alekseev <aleksander@timescale.com> writes:
> What he proposes is taking the locks in batches.

I have a strong sense of deja vu here. I'm pretty sure I experimented
with this idea last year and gave up on it. I don't recall exactly
why, but either it didn't show any meaningful performance improvement
for me or there was some actual downside (that I'm not remembering
right now).

This would've been in the leadup to 989596152 and adjacent commits.
I took a quick look through the threads cited in those commit messages
and didn't find anything about it, but I think the discussion had
gotten scattered across more threads. Some digging in the archives
could be useful.

        regards, tom lane
Hi,

On 2022-12-07 10:44:33 -0500, Tom Lane wrote:
> Aleksander Alekseev <aleksander@timescale.com> writes:
> > What he proposes is taking the locks in batches.
>
> I have a strong sense of deja vu here. I'm pretty sure I experimented
> with this idea last year and gave up on it. I don't recall exactly
> why, but either it didn't show any meaningful performance improvement
> for me or there was some actual downside (that I'm not remembering
> right now).
>
> This would've been in the leadup to 989596152 and adjacent commits.
> I took a quick look through the threads cited in those commit messages
> and didn't find anything about it, but I think the discussion had
> gotten scattered across more threads. Some digging in the archives
> could be useful.

IIRC the case we were looking at around 989596152 were CPU bound
workloads, rather than latency bound workloads. It'd not be surprising
to have cases where batching LOCKs helps latency, but not CPU bound.

I wonder if "manual" batching is the best answer. Alexander, have you
considered using libpq level pipelining?

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-12-07 10:44:33 -0500, Tom Lane wrote:
>> I have a strong sense of deja vu here. I'm pretty sure I experimented
>> with this idea last year and gave up on it. I don't recall exactly
>> why, but either it didn't show any meaningful performance improvement
>> for me or there was some actual downside (that I'm not remembering
>> right now).

> IIRC the case we were looking at around 989596152 were CPU bound workloads,
> rather than latency bound workloads. It'd not be surprising to have cases
> where batching LOCKs helps latency, but not CPU bound.

Yeah, perhaps. Anyway my main point is that I don't want to just assume
this is a win; I want to see some actual performance tests.

> I wonder if "manual" batching is the best answer. Alexander, have you
> considered using libpq level pipelining?

I'd be a bit nervous about how well that works with older servers.

        regards, tom lane
Hi,

On 2022-12-07 12:28:03 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-12-07 10:44:33 -0500, Tom Lane wrote:
> >> I have a strong sense of deja vu here. I'm pretty sure I experimented
> >> with this idea last year and gave up on it. I don't recall exactly
> >> why, but either it didn't show any meaningful performance improvement
> >> for me or there was some actual downside (that I'm not remembering
> >> right now).
>
> > IIRC the case we were looking at around 989596152 were CPU bound workloads,
> > rather than latency bound workloads. It'd not be surprising to have cases
> > where batching LOCKs helps latency, but not CPU bound.
>
> Yeah, perhaps. Anyway my main point is that I don't want to just assume
> this is a win; I want to see some actual performance tests.

FWIW, one can simulate network latency with 'netem' on linux. Works even
for 'lo'.

ping -c 3 -n localhost
64 bytes from ::1: icmp_seq=1 ttl=64 time=0.035 ms
64 bytes from ::1: icmp_seq=2 ttl=64 time=0.049 ms
64 bytes from ::1: icmp_seq=3 ttl=64 time=0.043 ms

tc qdisc add dev lo root netem delay 10ms
64 bytes from ::1: icmp_seq=1 ttl=64 time=20.1 ms
64 bytes from ::1: icmp_seq=2 ttl=64 time=20.2 ms
64 bytes from ::1: icmp_seq=3 ttl=64 time=20.2 ms

tc qdisc delete dev lo root netem
64 bytes from ::1: icmp_seq=1 ttl=64 time=0.036 ms
64 bytes from ::1: icmp_seq=2 ttl=64 time=0.047 ms
64 bytes from ::1: icmp_seq=3 ttl=64 time=0.050 ms

> > I wonder if "manual" batching is the best answer. Alexander, have you
> > considered using libpq level pipelining?
>
> I'd be a bit nervous about how well that works with older servers.

I don't think there should be any problem - E.g. pgjdbc has been using
pipelining for ages.

Not sure if it's the right answer, just to be clear. I suspect that
eventually we're going to need to have a special "acquire pg_dump locks"
function that is cheaper than retail lock acquisition and perhaps deals
more gracefully with exceeding max_locks_per_transaction. Which would
presumably not be pipelined.

Greetings,

Andres Freund
On Wed, Dec 7, 2022 at 2:28 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-12-07 10:44:33 -0500, Tom Lane wrote:
> >> I have a strong sense of deja vu here. I'm pretty sure I experimented
> >> with this idea last year and gave up on it. I don't recall exactly
> >> why, but either it didn't show any meaningful performance improvement
> >> for me or there was some actual downside (that I'm not remembering
> >> right now).
>
> > IIRC the case we were looking at around 989596152 were CPU bound workloads,
> > rather than latency bound workloads. It'd not be surprising to have cases
> > where batching LOCKs helps latency, but not CPU bound.
>
> Yeah, perhaps. Anyway my main point is that I don't want to just assume
> this is a win; I want to see some actual performance tests.
>
Here we have some numbers about the Aleksander's patch:
1) Setup script
CREATE DATABASE t1000;
CREATE DATABASE t10000;
CREATE DATABASE t100000;
\c t1000
SELECT format('CREATE TABLE t%s(c1 INTEGER PRIMARY KEY, c2 TEXT, c3 TIMESTAMPTZ);', i) FROM generate_series(1, 1000) AS i \gexec
\c t10000
SELECT format('CREATE TABLE t%s(c1 INTEGER PRIMARY KEY, c2 TEXT, c3 TIMESTAMPTZ);', i) FROM generate_series(1, 10000) AS i \gexec
\c t100000
SELECT format('CREATE TABLE t%s(c1 INTEGER PRIMARY KEY, c2 TEXT, c3 TIMESTAMPTZ);', i) FROM generate_series(1, 100000) AS i \gexec
2) Execution script
time pg_dump -s t1000 > /dev/null
time pg_dump -s t10000 > /dev/null
time pg_dump -s t100000 > /dev/null
3) HEAD execution
$ time pg_dump -s t1000 > /dev/null
0.02user 0.01system 0:00.36elapsed 8%CPU (0avgtext+0avgdata 11680maxresident)k
0inputs+0outputs (0major+1883minor)pagefaults 0swaps
$ time pg_dump -s t10000 > /dev/null
0.30user 0.10system 0:05.04elapsed 8%CPU (0avgtext+0avgdata 57772maxresident)k
0inputs+0outputs (0major+14042minor)pagefaults 0swaps
$ time pg_dump -s t100000 > /dev/null
3.42user 2.13system 7:50.09elapsed 1%CPU (0avgtext+0avgdata 517900maxresident)k
0inputs+0outputs (0major+134636minor)pagefaults 0swaps
4) PATCH execution
$ time pg_dump -s t1000 > /dev/null
0.02user 0.00system 0:00.28elapsed 9%CPU (0avgtext+0avgdata 11700maxresident)k
0inputs+0outputs (0major+1886minor)pagefaults 0swaps
$ time pg_dump -s t10000 > /dev/null
0.18user 0.03system 0:02.17elapsed 10%CPU (0avgtext+0avgdata 57592maxresident)k
0inputs+0outputs (0major+14072minor)pagefaults 0swaps
$ time pg_dump -s t100000 > /dev/null
1.97user 0.32system 0:21.39elapsed 10%CPU (0avgtext+0avgdata 517932maxresident)k
0inputs+0outputs (0major+134892minor)pagefaults 0swaps
5) Summary
              HEAD      patch
1k tables     0:00.36   0:00.28
10k tables    0:05.04   0:02.17
100k tables   7:50.09   0:21.39
It seems we get a very good performance gain with Aleksander's patch. I used "-s" so as not to waste time issuing a COPY for each relation (they are all empty anyway) and to make the round-trip cost of LOCK TABLE visible. This patch will also improve pg_upgrade on databases with thousands of relations.
Regards,
--
Fabrízio de Royes Mello
Hi,

On 2022-12-07 18:14:01 -0300, Fabrízio de Royes Mello wrote:
> Here we have some numbers about the Aleksander's patch:

Hm. Were they taken in an assertion enabled build or such? Just testing
the t10000 case on HEAD, I get 0:01.23 elapsed for an unpatched pg_dump
in an optimized build. And that's on a machine with not all that fast
cores.

Greetings,

Andres Freund
Hi,

On 2022-12-07 13:32:42 -0800, Andres Freund wrote:
> On 2022-12-07 18:14:01 -0300, Fabrízio de Royes Mello wrote:
> > Here we have some numbers about the Aleksander's patch:
>
> Hm. Were they taken in an assertion enabled build or such? Just testing the
> t10000 case on HEAD, I get 0:01.23 elapsed for an unpatched pg_dump in an
> optimized build. And that's on a machine with not all that fast cores.

Comparing the overhead on the server side.

CREATE OR REPLACE FUNCTION exec(v_sql text) RETURNS text LANGUAGE plpgsql AS
$$BEGIN EXECUTE v_sql;RETURN v_sql;END;$$;

Acquire locks in separate statements, three times:
SELECT exec(string_agg(format('LOCK t%s;', i), '')) FROM generate_series(1, 10000) AS i;
1267.909 ms
1116.008 ms
1108.383 ms

Acquire all locks in a single statement, three times:
SELECT exec('LOCK '||string_agg(format('t%s', i), ', ')) FROM generate_series(1, 10000) AS i;
1210.732 ms
1101.390 ms
1105.331 ms

So there's some performance difference that's independent of the avoided
roundtrips - but it's pretty small.

With an artificial delay of 100ms, the perf difference between the
batching patch and not using the batching patch is huge. Huge enough
that I don't have the patience to wait for the non-batched case to
complete. With batching pg_dump -s -h localhost t10000 took 0:16.23
elapsed, without I cancelled after 603 tables had been locked, which
took 2:06.43.

This made me try out using pipeline mode. Seems to work nicely from what
I can tell. The timings are a tad slower than the "manual" batch mode -
I think that's largely due to the extended protocol just being
overcomplicated.

Greetings,

Andres Freund
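For comparison with the manual-batching sketch near the top of the thread, here is a rough sketch of what a pipeline-mode variant could look like, assuming a PostgreSQL 14+ libpq; the function name and the minimal error handling are illustrative assumptions, not the code that was actually benchmarked. Table names are again assumed to be pre-quoted and a transaction is assumed to be open.

/*
 * Sketch only: queue one LOCK TABLE per table using libpq pipeline mode
 * (PostgreSQL 14+), so the statements share round trips instead of each
 * waiting for its own reply.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

static void
lock_tables_pipelined(PGconn *conn, const char *const *tables, int ntables)
{
    if (PQenterPipelineMode(conn) != 1)
    {
        fprintf(stderr, "could not enter pipeline mode: %s", PQerrorMessage(conn));
        exit(1);
    }

    /* queue all LOCK statements; nothing waits on the server yet */
    for (int i = 0; i < ntables; i++)
    {
        char        query[1024];

        snprintf(query, sizeof(query),
                 "LOCK TABLE %s IN ACCESS SHARE MODE", tables[i]);
        if (PQsendQueryParams(conn, query, 0, NULL, NULL, NULL, NULL, 0) != 1)
        {
            fprintf(stderr, "could not dispatch LOCK: %s", PQerrorMessage(conn));
            exit(1);
        }
    }
    if (PQpipelineSync(conn) != 1)
    {
        fprintf(stderr, "could not send sync: %s", PQerrorMessage(conn));
        exit(1);
    }

    /* drain results up to the sync point; each LOCK should report COMMAND_OK */
    for (;;)
    {
        PGresult   *res = PQgetResult(conn);

        if (res == NULL)
            continue;               /* separator between queries' results */
        if (PQresultStatus(res) == PGRES_PIPELINE_SYNC)
        {
            PQclear(res);
            break;                  /* all queued statements processed */
        }
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
        {
            fprintf(stderr, "LOCK TABLE failed: %s", PQerrorMessage(conn));
            exit(1);
        }
        PQclear(res);
    }
    PQexitPipelineMode(conn);
}

As noted in the message above, the pipeline-mode timings Andres saw were a tad slower than manual batching, which he attributed to extended-protocol overhead: pipelining keeps one statement per table while collapsing the round trips, whereas batching collapses both.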
Andres Freund <andres@anarazel.de> writes:
> With an artificial delay of 100ms, the perf difference between the batching
> patch and not using the batching patch is huge. Huge enough that I don't have
> the patience to wait for the non-batched case to complete.

Clearly, if you insert a sufficiently large artificial round-trip delay,
even squeezing a single command out of a pg_dump run will appear
worthwhile. What I'm unsure about is whether it's worthwhile at
realistic round-trip delays (where "realistic" means that the dump
performance would otherwise be acceptable). I think the reason I didn't
pursue this last year is that experimentation convinced me the answer
was "no".

> With batching pg_dump -s -h localhost t10000 took 0:16.23 elapsed, without I
> cancelled after 603 tables had been locked, which took 2:06.43.

Is "-s" mode actually a relevant criterion here? With per-table COPY
commands added into the mix you could not possibly get better than 2x
improvement, and likely a good deal less.

        regards, tom lane
On Wed, Dec 7, 2022 at 2:53 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Is "-s" mode actually a relevant criterion here? With per-table COPY
> commands added into the mix you could not possibly get better than 2x
> improvement, and likely a good deal less.

Don't we hit this code path in pg_upgrade? You won't see huge round-trip
times, of course, but you also won't see COPY.

Absolute performance aside, isn't there another concern that, the longer
it takes for us to lock the tables, the bigger the window there is for
someone to interfere with them between our catalog query and the lock
itself?

--Jacob
Hi,

On 2022-12-07 17:53:05 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > With an artificial delay of 100ms, the perf difference between the batching
> > patch and not using the batching patch is huge. Huge enough that I don't have
> > the patience to wait for the non-batched case to complete.
>
> Clearly, if you insert a sufficiently large artificial round-trip delay,
> even squeezing a single command out of a pg_dump run will appear
> worthwhile. What I'm unsure about is whether it's worthwhile at
> realistic round-trip delays (where "realistic" means that the dump
> performance would otherwise be acceptable). I think the reason I didn't
> pursue this last year is that experimentation convinced me the answer
> was "no".

It seems to be a win even without any artificial delay. Not a *huge* win,
but a noticeable win. And even just a few ms make it quite painful.

> > With batching pg_dump -s -h localhost t10000 took 0:16.23 elapsed, without I
> > cancelled after 603 tables had been locked, which took 2:06.43.
>
> Is "-s" mode actually a relevant criterion here? With per-table COPY
> commands added into the mix you could not possibly get better than 2x
> improvement, and likely a good deal less.

Well, -s isn't something used all that rarely, so it'd not be insane to
optimize it in isolation. But more importantly, I think the potential win
without -s is far bigger than 2x, because the COPYs can be done in
parallel, whereas the locking happens in the non-parallel stage.

With just a 5ms delay, very well within normal network latency range, I get:

pg_dump.master -h localhost -j10 -f /tmp/pg_dump_backup -Fd t10000
2m7.830s

pg_dump.pipeline -h localhost -j10 -f /tmp/pg_dump_backup -Fd t10000
0m24.183s

pg_dump.batch -h localhost -j10 -f /tmp/pg_dump_backup -Fd t10000
0m24.321s

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2022-12-07 17:53:05 -0500, Tom Lane wrote:
>> Is "-s" mode actually a relevant criterion here? With per-table COPY
>> commands added into the mix you could not possibly get better than 2x
>> improvement, and likely a good deal less.

> Well, -s isn't something used all that rarely, so it'd not be insane to
> optimize it in isolation. But more importantly, I think the potential win
> without -s is far bigger than 2x, because the COPYs can be done in parallel,
> whereas the locking happens in the non-parallel stage.

True, and there's the reduce-the-lock-window issue that Jacob mentioned.

> With just a 5ms delay, very well within normal network latency range, I get:
> [ a nice win ]

OK. I'm struggling to figure out why I rejected this idea last year.
I know that I thought about it and I'm fairly certain I actually
tested it. Maybe I only tried it with near-zero-latency local
loopback; but I doubt that, because the potential for network
latency was certainly a factor in that whole discussion.

One idea is that I might've tried it before getting rid of all the
other per-object queries, at which point it wouldn't have stood out
quite so much. But I'm just guessing. I have a nagging feeling
there was something else.

Oh well, I guess we can always revert it if we discover a problem later.

        regards, tom lane
On 08/12/2022 01:03, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2022-12-07 17:53:05 -0500, Tom Lane wrote:
>>> Is "-s" mode actually a relevant criterion here? With per-table COPY
>>> commands added into the mix you could not possibly get better than 2x
>>> improvement, and likely a good deal less.
>> Well, -s isn't something used all that rarely, so it'd not be insane to
>> optimize it in isolation. But more importantly, I think the potential win
>> without -s is far bigger than 2x, because the COPYs can be done in parallel,
>> whereas the locking happens in the non-parallel stage.
> True, and there's the reduce-the-lock-window issue that Jacob mentioned.
>
>> With just a 5ms delay, very well within normal network latency range, I get:
>> [ a nice win ]
> OK. I'm struggling to figure out why I rejected this idea last year.
> I know that I thought about it and I'm fairly certain I actually
> tested it. Maybe I only tried it with near-zero-latency local
> loopback; but I doubt that, because the potential for network
> latency was certainly a factor in that whole discussion.
>
> One idea is that I might've tried it before getting rid of all the
> other per-object queries, at which point it wouldn't have stood out
> quite so much. But I'm just guessing. I have a nagging feeling
> there was something else.
>
> Oh well, I guess we can always revert it if we discover a problem later.
>
>         regards, tom lane

Hi,

I have done a review of this patch; it applies well on current master
and compiles without problem. make check/installcheck and check-world
run without failure, and the pg_dump tests with TAP enabled work fine
too.

I have also given a try to the benchmark mentioned in the previous posts
and I see the same performance gains with the -s option.

As it seems to have a consensus to apply this patch I will change the
commitfest status to ready for committers.

Regards,

--
Gilles Darold
Gilles Darold <gilles@migops.com> writes:
> As it seems to have a consensus to apply this patch I will change the
> commitfest status to ready for committers.

Yeah. I still have a nagging worry about why I didn't do this already,
but without evidence I can't fairly block the patch. Hence, pushed.

I did cut the LOCK TABLE query length limit from 1MB to 100K, because
I doubt there is any measurable performance difference, and I'm a little
worried about overstressing smaller servers.

        regards, tom lane