Thread: BUG #15959: 'DROP EXTENSION pglogical' while an unused logical replication slot exists causes slot corruption
BUG #15959: 'DROP EXTENSION pglogical' while an unused logical replication slot exists causes slot corruption
From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference: 15959
Logged by: Matt W
Email address: wise@wiredgeek.net
PostgreSQL version: 10.6
Operating system: Linux
Description:

This is a problem we ran into on multiple production databases before we discovered the series of steps required to trigger it. The pattern showed up while we were migrating from a set of existing "source" databases to newer "replica" databases. The bug appears if you use a logical replication slot on a "replica master" database that uses pglogical to replicate from a "source master" database. The data flow looks like this:

```
[Source_DB] ----PGLogical----> [Replica_DB] ----LogicalReplicationSlot---> pg_recvlogical
```

Short Description:
If you have a logical replication slot created (but not being actively consumed from) and you issue a 'DROP EXTENSION pglogical', it puts the database into a bad state. Later, when the consumer for that slot comes in and tries to start replicating, it will receive the following error:

pg_recvlogical: unexpected termination of replication stream: ERROR: could not find pg_class entry for 16387

Detailed Setup:
To replicate the issue fully, check out the code at https://github.com/diranged/postgres-logical-replication-pgclass-bug and follow the instructions.

Business Impact:
As soon as the logical replication slot is broken, there are two critical impacts.

First, if you rely on a fully intact stream of data replicating out of your database into some other data path (for example, with https://github.com/Nextdoor/pg-bifrost), you start losing data at the moment the slot breaks. There is no way that we know of to "skip" the broken record and move forward.

Second, as soon as the replication slot breaks, Postgres begins backing up WAL data on disk. If this goes unnoticed, the database can run itself out of space and cause major problems. This is particularly painful in Amazon RDS, where you have no control over moving the WAL data onto different volumes.

Versions Affected:
I've tested this on Postgres 10.6 -> 10.10,
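The WAL backlog described under Business Impact can be watched for by comparing each slot's `restart_lsn` with the server's current WAL position. The following is a minimal Python sketch, not part of the original report. It assumes the rows have already been fetched with something like `SELECT slot_name, active, restart_lsn FROM pg_replication_slots` and the current position with `SELECT pg_current_wal_lsn()`; the function names are hypothetical, and the result is only an approximation of the WAL a slot holds back.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte offset.

    An LSN is printed as two hexadecimal numbers: the high 32 bits and
    the low 32 bits of a 64-bit WAL position, separated by a slash.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def retained_wal(slots, current_lsn):
    """Return {slot_name: approximate bytes of WAL held back} for inactive slots.

    `slots` is an iterable of (slot_name, active, restart_lsn) tuples.
    Slots that are active, or that have no restart_lsn yet, are skipped.
    """
    now = lsn_to_bytes(current_lsn)
    return {
        name: now - lsn_to_bytes(restart_lsn)
        for name, active, restart_lsn in slots
        if not active and restart_lsn is not None
    }
```

A monitoring job could call `retained_wal(rows, current_lsn)` periodically and alert when any value passes a threshold chosen for the available disk space.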
Re: BUG #15959: 'DROP EXTENSION pglogical' while an unused logical replication slot exists causes slot corruption
From
Peter Eisentraut
Date:
On 2019-08-14 23:35, PG Bug reporting form wrote:
> Short Description:
> If you have a logical replication slot created (but not being actively
> consumed from) and you issue a 'DROP EXTENSION pglogical', it puts the
> database into a bad state. Later when the consumer for that slot comes in
> and tries to start replicating they will receive the following error:
>
> pg_recvlogical: unexpected termination of replication stream: ERROR:
> could not find pg_class entry for 16387
>
> Detailed Setup:
> To replicate the issue fully, check out the code at
> https://github.com/diranged/postgres-logical-replication-pgclass-bug and
> follow the instructions.

What version of pglogical are you using?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #15959: 'DROP EXTENSION pglogical' while an unused logical replication slot exists causes slot corruption
From
Matt Wise
Date:
Sorry for the delay. In my test case, I don't pin the version of pglogical (https://github.com/diranged/postgres-logical-replication-pgclass-bug/blob/master/Dockerfile#L4). The case we hit in production was on an Amazon RDS database, so I am unsure of the specific pglogical version we were running. We were using Postgres 10.6, though.
On Mon, Aug 19, 2019 at 12:12 AM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 2019-08-14 23:35, PG Bug reporting form wrote:
> Short Description:
> If you have a logical replication slot created (but not being actively
> consumed from) and you issue a 'DROP EXTENSION pglogical', it puts the
> database into a bad state. Later when the consumer for that slot comes in
> and tries to start replicating they will receive the following error:
>
> pg_recvlogical: unexpected termination of replication stream: ERROR:
> could not find pg_class entry for 16387
>
> Detailed Setup:
> To replicate the issue fully, check out the code at
> https://github.com/diranged/postgres-logical-replication-pgclass-bug and
> follow the instructions.
What version of pglogical are you using?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
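The pglogical version asked about above is normally recorded in the `pg_extension` catalog (column `extversion`) while the extension is installed. The following is a minimal Python sketch, not taken from the thread. It assumes a DB-API cursor that uses the `%s` parameter style (as psycopg2 does), and the helper name `pglogical_version` is hypothetical.

```python
def pglogical_version(cursor):
    """Return the installed pglogical version string, or None if the
    extension is not installed in the current database.

    `cursor` is assumed to be a DB-API 2.0 cursor using the '%s'
    parameter style; it must be connected to the database in which
    the extension was created.
    """
    cursor.execute(
        "SELECT extversion FROM pg_extension WHERE extname = %s",
        ("pglogical",),
    )
    row = cursor.fetchone()
    return row[0] if row else None
```

Because the lookup reads the catalog of the current database, it has to run before 'DROP EXTENSION pglogical' is issued; afterwards it returns None.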