Thread: [COMMITTERS] pgsql: Add support for temporary replication slots
Add support for temporary replication slots This allows creating temporary replication slots that are removed automatically at the end of the session or on error. From: Petr Jelinek <petr.jelinek@2ndquadrant.com> Branch ------ master Details ------- http://git.postgresql.org/pg/commitdiff/a924c327e2793d2025b19e18de7917110dc8afd8 Modified Files -------------- contrib/test_decoding/Makefile | 2 +- contrib/test_decoding/expected/ddl.out | 4 +- contrib/test_decoding/expected/slot.out | 58 +++++++++++++++++++++++++++ contrib/test_decoding/sql/slot.sql | 20 ++++++++++ doc/src/sgml/func.sgml | 16 ++++++-- doc/src/sgml/protocol.sgml | 13 ++++++- src/backend/catalog/system_views.sql | 11 ++++++ src/backend/replication/repl_gram.y | 22 +++++++---- src/backend/replication/repl_scanner.l | 1 + src/backend/replication/slot.c | 69 ++++++++++++++++++++++++++------- src/backend/replication/slotfuncs.c | 24 ++++++++---- src/backend/replication/walsender.c | 28 +++++++------ src/backend/storage/lmgr/proc.c | 3 ++ src/backend/tcop/postgres.c | 3 ++ src/include/catalog/pg_proc.h | 6 +-- src/include/nodes/replnodes.h | 1 + src/include/replication/slot.h | 4 +- src/test/regress/expected/rules.out | 3 +- 18 files changed, 237 insertions(+), 51 deletions(-)
Peter Eisentraut <peter_e@gmx.net> writes: > Add support for temporary replication slots Some of the slower buildfarm members are failing test-decoding-check since this went in. At least on prairiedog, it looks like a race condition in the test: ** /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/expected/slot.out Mon Dec 12 09:24:32 2016 --- /Users/buildfarm/bf-data/HEAD/pgsql.build/contrib/test_decoding/./regression_output/results/slot.out Mon Dec 12 10:01:312016 *************** *** 32,38 **** -- should fail because the temporary slot was dropped automatically SELECT pg_drop_replication_slot('regression_slot_t'); ! ERROR: replication slot "regression_slot_t" does not exist -- test switching between slots in a session SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot1', 'test_decoding', true); ?column? --- 32,38 ---- -- should fail because the temporary slot was dropped automatically SELECT pg_drop_replication_slot('regression_slot_t'); ! ERROR: replication slot "regression_slot_t" is active for PID 17615 -- test switching between slots in a session SELECT 'init' FROM pg_create_logical_replication_slot('regression_slot1', 'test_decoding', true); ?column? regards, tom lane
I wrote: > Peter Eisentraut <peter_e@gmx.net> writes: >> Add support for temporary replication slots > Some of the slower buildfarm members are failing test-decoding-check since > this went in. At least on prairiedog, it looks like a race condition in > the test: Having now looked a bit more closely, that's almost certainly what it is: the test is assuming that the old backend exits instantaneously, which it doesn't. You might want to borrow the wait-for-previous-session-to-die loop I put into src/test/modules/test_extensions/sql/test_extensions.sql recently. (Make sure you get the fixed version ;-)) regards, tom lane
On 12/12/16 16:55, Tom Lane wrote: > I wrote: >> Peter Eisentraut <peter_e@gmx.net> writes: >>> Add support for temporary replication slots > >> Some of the slower buildfarm members are failing test-decoding-check since >> this went in. At least on prairiedog, it looks like a race condition in >> the test: > > Having now looked a bit more closely, that's almost certainly what it is: > the test is assuming that the old backend exits instantaneously, which > it doesn't. You might want to borrow the wait-for-previous-session-to-die > loop I put into src/test/modules/test_extensions/sql/test_extensions.sql > recently. (Make sure you get the fixed version ;-)) > Yes you are correct, we tried naively to first drop another slot in hopes that it will give the other backend enough time to close, but apparently that wasn't robust enough for slower machines. Attached is the fixed test using the loop from test_extensions (and yes I would have missed the pg_stat_clear_snapshot() call without the hint, thanks :) ). -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Petr Jelinek <petr.jelinek@2ndquadrant.com> writes: > Yes you are correct, we tried naively to first drop another slot in > hopes that it will give the other backend enough time to close, but > apparently that wasn't robust enough for slower machines. > Attached is the fixed test using the loop from test_extensions (and yes > I would have missed the pg_stat_clear_snapshot() call without the hint, > thanks :) ). Pushed, thanks. regards, tom lane
I wrote: > Petr Jelinek <petr.jelinek@2ndquadrant.com> writes: >> Attached is the fixed test using the loop from test_extensions (and yes >> I would have missed the pg_stat_clear_snapshot() call without the hint, >> thanks :) ). > Pushed, thanks. Hm, buildfarm says this didn't fix it. Where exactly does the dropping of the slot happen ... is it not complete by the time the backend exits? regards, tom lane
On 13/12/16 00:39, Tom Lane wrote: > I wrote: >> Petr Jelinek <petr.jelinek@2ndquadrant.com> writes: >>> Attached is the fixed test using the loop from test_extensions (and yes >>> I would have missed the pg_stat_clear_snapshot() call without the hint, >>> thanks :) ). > >> Pushed, thanks. > > Hm, buildfarm says this didn't fix it. Where exactly does the dropping > of the slot happen ... is it not complete by the time the backend exits? > Yes, I've seen. The cleanup of slots is done in ProcKill(), I wonder, since it's on_shmem_exit hook, and the pgstat_beshutdown_hook is as well, maybe the pgstat_beshutdown_hook is called before ProcKill and the query is lucky to hit right in between. I am not quite sure what would be the best way to work around that. We could wait for the slot to disappear instead of trying to drop it, but then the test would hang on failure. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Petr Jelinek <petr.jelinek@2ndquadrant.com> writes: > On 13/12/16 00:39, Tom Lane wrote: >> Hm, buildfarm says this didn't fix it. Where exactly does the dropping >> of the slot happen ... is it not complete by the time the backend exits? > Yes, I've seen. The cleanup of slots is done in ProcKill(), I wonder, > since it's on_shmem_exit hook, and the pgstat_beshutdown_hook is as > well, maybe the pgstat_beshutdown_hook is called before ProcKill and the > query is lucky to hit right in between. Hm. That seems like a pretty bogus place to do it. An awful lot of the backend infrastructure is already gone by then, if I recall the ordering correctly. Maybe ShutdownPostgres would be a saner place; but it really depends on what you think the module layering is for this facility. I would definitely not think it is proc.c's responsibility, though. regards, tom lane
On 13/12/16 01:40, Tom Lane wrote: > Petr Jelinek <petr.jelinek@2ndquadrant.com> writes: >> On 13/12/16 00:39, Tom Lane wrote: >>> Hm, buildfarm says this didn't fix it. Where exactly does the dropping >>> of the slot happen ... is it not complete by the time the backend exits? > >> Yes, I've seen. The cleanup of slots is done in ProcKill(), I wonder, >> since it's on_shmem_exit hook, and the pgstat_beshutdown_hook is as >> well, maybe the pgstat_beshutdown_hook is called before ProcKill and the >> query is lucky to hit right in between. > > Hm. That seems like a pretty bogus place to do it. An awful lot of the > backend infrastructure is already gone by then, if I recall the ordering > correctly. Maybe ShutdownPostgres would be a saner place; but it really > depends on what you think the module layering is for this facility. > I would definitely not think it is proc.c's responsibility, though. > Well, the problem is that that's the place where the currently active slot is released (if there was an active one). So we'd need to move that part somewhere else as well. I am not sure what's the reasoning for releasing it at that specific spot so CCing Andres. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2016-12-12 19:40:37 -0500, Tom Lane wrote: > Petr Jelinek <petr.jelinek@2ndquadrant.com> writes: > > Yes, I've seen. The cleanup of slots is done in ProcKill(), I wonder, > > since it's on_shmem_exit hook, and the pgstat_beshutdown_hook is as > > well, maybe the pgstat_beshutdown_hook is called before ProcKill and the > > query is lucky to hit right in between. > > Hm. That seems like a pretty bogus place to do it. Well, that'd not be Pet[e]r's fault. Robert and I had some preexisting slot cleanpucode there (which I'd like to consolidate btw). > backend infrastructure is already gone by then, if I recall the ordering > correctly. Maybe ShutdownPostgres would be a saner place; but it really > depends on what you think the module layering is for this facility. > I would definitely not think it is proc.c's responsibility, though. Slots quite possibly can used by bgworkers, so I don't think ShutdownPostgres would be right. I don't think there's a perfect place for it atm, so it's ProcKill()... - Andres
On 2016-12-13 01:42:48 +0100, Petr Jelinek wrote: > On 13/12/16 01:40, Tom Lane wrote: > > Petr Jelinek <petr.jelinek@2ndquadrant.com> writes: > >> On 13/12/16 00:39, Tom Lane wrote: > >>> Hm, buildfarm says this didn't fix it. Where exactly does the dropping > >>> of the slot happen ... is it not complete by the time the backend exits? > > > >> Yes, I've seen. The cleanup of slots is done in ProcKill(), I wonder, > >> since it's on_shmem_exit hook, and the pgstat_beshutdown_hook is as > >> well, maybe the pgstat_beshutdown_hook is called before ProcKill and the > >> query is lucky to hit right in between. > > > > Hm. That seems like a pretty bogus place to do it. An awful lot of the > > backend infrastructure is already gone by then, if I recall the ordering > > correctly. Maybe ShutdownPostgres would be a saner place; but it really > > depends on what you think the module layering is for this facility. > > I would definitely not think it is proc.c's responsibility, though. > > > > Well, the problem is that that's the place where the currently active > slot is released (if there was an active one). So we'd need to move that > part somewhere else as well. I am not sure what's the reasoning for > releasing it at that specific spot so CCing Andres. Why don't we just instead make the loop over pg_replication_slots? Andres
On 13/12/16 01:49, Andres Freund wrote: > On 2016-12-13 01:42:48 +0100, Petr Jelinek wrote: >> On 13/12/16 01:40, Tom Lane wrote: >>> Petr Jelinek <petr.jelinek@2ndquadrant.com> writes: >>>> On 13/12/16 00:39, Tom Lane wrote: >>>>> Hm, buildfarm says this didn't fix it. Where exactly does the dropping >>>>> of the slot happen ... is it not complete by the time the backend exits? >>> >>>> Yes, I've seen. The cleanup of slots is done in ProcKill(), I wonder, >>>> since it's on_shmem_exit hook, and the pgstat_beshutdown_hook is as >>>> well, maybe the pgstat_beshutdown_hook is called before ProcKill and the >>>> query is lucky to hit right in between. >>> >>> Hm. That seems like a pretty bogus place to do it. An awful lot of the >>> backend infrastructure is already gone by then, if I recall the ordering >>> correctly. Maybe ShutdownPostgres would be a saner place; but it really >>> depends on what you think the module layering is for this facility. >>> I would definitely not think it is proc.c's responsibility, though. >>> >> >> Well, the problem is that that's the place where the currently active >> slot is released (if there was an active one). So we'd need to move that >> part somewhere else as well. I am not sure what's the reasoning for >> releasing it at that specific spot so CCing Andres. > > Why don't we just instead make the loop over pg_replication_slots? > I mentioned that as possible solution upthread, I am only worried that the failure scenario is basically infinite loop. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2016-12-13 01:57:01 +0100, Petr Jelinek wrote: > I mentioned that as possible solution upthread, I am only worried that > the failure scenario is basically infinite loop. I don't see the problem with that. If you're really concerned you can set a statement timeout. Andres
Andres Freund <andres@anarazel.de> writes: > On 2016-12-13 01:57:01 +0100, Petr Jelinek wrote: >> I mentioned that as possible solution upthread, I am only worried that >> the failure scenario is basically infinite loop. > I don't see the problem with that. If you're really concerned you can > set a statement timeout. I don't think "change the test" is the right response here. I think the problem is that we're disconnecting from the slot at the wrong step of backend shutdown, and that we need to fix that before it bites us on some more painful parts of our anatomies. You can't just throw darts at the code when deciding where to do things, and proc.c is NOT the place that should be concerned with replication slots. regards, tom lane
On 2016-12-12 20:08:01 -0500, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2016-12-13 01:57:01 +0100, Petr Jelinek wrote: > >> I mentioned that as possible solution upthread, I am only worried that > >> the failure scenario is basically infinite loop. > > > I don't see the problem with that. If you're really concerned you can > > set a statement timeout. > > I don't think "change the test" is the right response here. > I think the problem is that we're disconnecting from the slot at the > wrong step of backend shutdown, and that we need to fix that before > it bites us on some more painful parts of our anatomies. You can't > just throw darts at the code when deciding where to do things, and > proc.c is NOT the place that should be concerned with replication > slots. And why is that / where would be more appropriate? It's not ShutdownPostgres() (usable from bgworkers). And it has to be after LWLockReleaseAll(), because otherwise we'll potentially just deadlock against ourselves. Greetings, Andres Freund
On 13/12/16 02:08, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: >> On 2016-12-13 01:57:01 +0100, Petr Jelinek wrote: >>> I mentioned that as possible solution upthread, I am only worried that >>> the failure scenario is basically infinite loop. > >> I don't see the problem with that. If you're really concerned you can >> set a statement timeout. > > I don't think "change the test" is the right response here. > I think the problem is that we're disconnecting from the slot at the > wrong step of backend shutdown, and that we need to fix that before > it bites us on some more painful parts of our anatomies. You can't > just throw darts at the code when deciding where to do things, and > proc.c is NOT the place that should be concerned with replication > slots. > It has been concerned with replication slots since 9.4 so it's not like it's newly invented concern. As Andres says, we don't have better place to put it to at the moment. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 13/12/16 02:00, Andres Freund wrote: > On 2016-12-13 01:57:01 +0100, Petr Jelinek wrote: >> I mentioned that as possible solution upthread, I am only worried that >> the failure scenario is basically infinite loop. > > I don't see the problem with that. If you're really concerned you can > set a statement timeout. > Okay in case we decide it's the right way to go attached patch does that. I also added some more tests based on your feedback while I am at it. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 12/12/16 8:28 PM, Petr Jelinek wrote: > On 13/12/16 02:00, Andres Freund wrote: >> On 2016-12-13 01:57:01 +0100, Petr Jelinek wrote: >>> I mentioned that as possible solution upthread, I am only worried that >>> the failure scenario is basically infinite loop. >> >> I don't see the problem with that. If you're really concerned you can >> set a statement timeout. >> > > Okay in case we decide it's the right way to go attached patch does > that. I also added some more tests based on your feedback while I am at it. This looks mostly reasonable to me, but the location and xid output from pg_logical_slot_get_changes() is not portable/repeatable, so it should be omitted from the output. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 14/12/16 15:35, Peter Eisentraut wrote: > On 12/12/16 8:28 PM, Petr Jelinek wrote: >> On 13/12/16 02:00, Andres Freund wrote: >>> On 2016-12-13 01:57:01 +0100, Petr Jelinek wrote: >>>> I mentioned that as possible solution upthread, I am only worried that >>>> the failure scenario is basically infinite loop. >>> >>> I don't see the problem with that. If you're really concerned you can >>> set a statement timeout. >>> >> >> Okay in case we decide it's the right way to go attached patch does >> that. I also added some more tests based on your feedback while I am at it. > > This looks mostly reasonable to me, but the location and xid output from > pg_logical_slot_get_changes() is not portable/repeatable, so it should > be omitted from the output. > Sigh, yes you are right. -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-committers
Attachment
On 12/15/16 3:46 AM, Petr Jelinek wrote: >>> Okay in case we decide it's the right way to go attached patch does >>> that. I also added some more tests based on your feedback while I am at it. >> >> This looks mostly reasonable to me, but the location and xid output from >> pg_logical_slot_get_changes() is not portable/repeatable, so it should >> be omitted from the output. >> > > Sigh, yes you are right. committed -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services