
From: shveta malik
Subject: Re: Proposal: Conflict log history table for Logical Replication
Msg-id: CAJpy0uB-ScuYXzKPuhvnZ2Dk0n5_sPHe9bDgSHMYswrkCH1iqg@mail.gmail.com
In response to: Re: Proposal: Conflict log history table for Logical Replication (Dilip Kumar <dilipbalaut@gmail.com>)
List: pgsql-hackers
On Thu, Aug 7, 2025 at 3:08 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Thu, Aug 7, 2025 at 1:43 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Thu, Aug 7, 2025 at 12:25 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> Thanks Shveta for your opinion on the design.
>
> > > On Tue, Aug 5, 2025 at 5:54 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > > >
>
> > > > This proposal aims to address these limitations by introducing a
> > > > conflict log history table, providing a structured and queryable
> > > > record of all logical replication conflicts.  Whether to log into the
> > > > conflict log history table, the server logs, or both should be a
> > > > configurable option.
> > > >
> > >
> > > +1 for the idea.
>
> Thanks
>
> > >
> > > > This proposal has two main design questions:
> > > > ===================================
> > > >
> > > > 1. How do we store conflicting tuples from different tables?
> > > > Using a JSON column to store the row data seems like the most flexible
> > > > solution, as it can accommodate different table schemas.
> > >
> > > Yes, that is one option. I have not looked into details myself, but
> > > you can also explore 'anyarray' as used in pg_statistic to store 'Column
> > > data values of the appropriate kind'.
>
> I think conversion from row to JSON and from JSON back to row is
> convenient, and other extensions like pgactive/bdr also expose the
> conflicting rows as JSON.

Okay. Agreed.
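
Just to make the JSON option concrete, below is a minimal sketch of what
such a table and the row<->JSON conversion could look like. All names here
(conflict_log_history, the columns, 'mytab') are placeholders I made up for
illustration, not a proposed schema:

-- hypothetical conflict log history table, one jsonb column per tuple
CREATE TABLE conflict_log_history (
    subid         oid         NOT NULL,  -- subscription that hit the conflict
    relid         oid         NOT NULL,  -- table where the conflict occurred
    conflict_type text        NOT NULL,  -- e.g. insert_exists, update_missing
    occurred_at   timestamptz NOT NULL DEFAULT now(),
    local_tuple   jsonb,                 -- existing local row, if any
    remote_tuple  jsonb                  -- incoming remote row
);

-- a row of any table can be converted to jsonb with to_jsonb() ...
INSERT INTO conflict_log_history (subid, relid, conflict_type, remote_tuple)
SELECT 1, 'mytab'::regclass, 'insert_exists', to_jsonb(t)
FROM mytab t WHERE t.id = 1;

-- ... and converted back when a conflict is inspected:
SELECT jsonb_populate_record(NULL::mytab, remote_tuple)
FROM conflict_log_history;

This works regardless of the target table's schema, which I think is the
main attraction of the JSON approach.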

> But we
> can explore these alternative options as well, thanks.
>
> > > > 2. Should this be a system table or a user table?
> > > > a) System Table: Storing this in a system catalog is simple, but
> > > > catalogs aren't designed for ever-growing data. While pg_largeobject
> > > > is an exception, this is not what we generally do IMHO.
> > > > b) User Table: This offers more flexibility. We could allow a user to
> > > > specify the table name during CREATE SUBSCRIPTION.  Then we could
> > > > either create the table internally or let the user create the table
> > > > with a predefined schema.
> > > >
> > > > A potential drawback is that a user might drop or alter the table.
> > > > However, we could mitigate this risk by simply logging a WARNING if
> > > > the table is configured but an insertion fails.
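
Just to make sure we are picturing the same thing for option (b): I imagine
the user-facing side could look roughly like the below. The option name
'conflict_log_table' is entirely made up here, only to illustrate the idea:

-- hypothetical syntax: user points the subscription at a log table
CREATE SUBSCRIPTION sub1
    CONNECTION 'host=primary dbname=postgres'
    PUBLICATION pub1
    WITH (conflict_log_table = 'app.conflict_history');

On an insertion failure (e.g. the table was dropped or altered), the apply
worker would then just emit a WARNING and continue, as you suggest.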
> > >
> > > I believe it makes more sense for this to be a catalog table rather
> > > than a user table. I wanted to check if we already have a large
> > > catalog table of this kind, and I think pg_statistic could be an
> > > example of a sizable catalog table. To get a rough idea of how size
> > > scales with data, I ran a quick experiment: I created 1000 tables,
> > > each with 2 JSON columns, 1 text column, and 2 integer columns. Then,
> > > I inserted 1000 rows into each table and ran ANALYZE to collect
> > > statistics. Here’s what I observed on a fresh database before and
> > > after:
> > >
> > > Before:
> > > pg_statistic row count: 412
> > > Table size: ~256 kB
> > >
> > > After:
> > > pg_statistic row count: 6,412
> > > Table size: ~5.3 MB
> > >
> > > Although it isn’t an exact comparison, this gives us some insight into
> > > how the statistics catalog table size grows with the number of rows.
> > > It doesn’t seem excessively large with 6k rows, given the fact that
> > > pg_statistic itself is a complex table having many 'anyarray'-type
> > > columns.
>
> Yeah, that's a good analysis. Apart from this, pg_largeobject is also a
> catalog which grows with each large object, and its growth rate will be
> very high because it stores the large object data in the catalog.
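
In case anyone wants to reproduce the numbers above, a script along these
lines should do it (a rough sketch, not necessarily the exact one I ran):

-- 1000 tables, each with 2 json columns, 1 text column, 2 int columns
DO $$
BEGIN
    FOR i IN 1..1000 LOOP
        EXECUTE format(
            'CREATE TABLE tbl_%s (j1 json, j2 json, t1 text, i1 int, i2 int)',
            i);
        -- 1000 rows per table
        EXECUTE format(
            $sql$INSERT INTO tbl_%s
                 SELECT '{"a": 1}', '{"b": 2}', 'val_' || g, g, g
                 FROM generate_series(1, 1000) g$sql$,
            i);
    END LOOP;
END $$;

ANALYZE;

-- the before/after measurements
SELECT count(*) FROM pg_statistic;
SELECT pg_size_pretty(pg_total_relation_size('pg_statistic'));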
>
> > >
> > > That said, irrespective of what we decide, it would be ideal to offer
> > > users an option for automatic purging, perhaps via a retention period
> > > parameter like conflict_stats_retention_period (say, defaulting to 30
> > > days), or a manual purge API such as purge_conflict_stats('older than
> > > date'). I wasn’t able to find any such purge mechanism for PostgreSQL
> > > stats tables, but Oracle does provide such purging options for some of
> > > their statistics tables (not related to conflicts), see [1], [2].
> > > And to manage it better, it could be range partitioned on timestamp.
>
> Yeah, that's an interesting suggestion, to partition it on timestamp
> for easier purging.
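
Right, that is what I had in mind: with range partitioning on the conflict
timestamp, purging a retention period reduces to dropping old partitions
instead of running a bulk DELETE. A rough sketch, again with made-up
placeholder names (and independent of the earlier sketch):

-- parent table partitioned by conflict time
CREATE TABLE conflict_history (
    conflict_type text,
    remote_tuple  jsonb,
    occurred_at   timestamptz NOT NULL
) PARTITION BY RANGE (occurred_at);

-- one partition per month
CREATE TABLE conflict_history_2025_08
    PARTITION OF conflict_history
    FOR VALUES FROM ('2025-08-01') TO ('2025-09-01');

-- a 30-day retention purge then simply drops the partitions that lie
-- entirely before the cutoff, which is cheap and avoids table bloat
DROP TABLE conflict_history_2025_08;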
>
> > It seems BDR also has one such conflict-log table which is a catalog
> > table and is also partitioned on time. It has a default retention
> > period of 30 days. See 'bdr.conflict_history' mentioned under
> > 'catalogs' in [1].
> >
> > [1]: https://www.enterprisedb.com/docs/pgd/latest/reference/tables-views-functions/#user-visible-catalogs-and-views
>
> Actually, bdr is an extension and this table is under the extension's
> namespace (bdr.conflict_history), so this is not really a catalog but
> an extension-managed table.

Yes, right. Sorry for the confusion.

> So logically, for PostgreSQL, it is a
> user table, but yeah, it is created and managed by the extension.
>

Any idea whether the user can ALTER/DROP it or perform any DML on it? I
could not find any details on this part.

> --
> Regards,
> Dilip Kumar
> Google


