Thread: Problem with transition tables on partitioned tables with foreign-table partitions

Hi,

While working on something else, I noticed that while we disallow
transition tables on foreign tables, we allow transition tables on
partitioned tables with foreign-table partitions, which produces
incorrect results.  Here is an example using postgres_fdw:

create table parent (a text, b int) partition by list (a);
create table loct (a text, b int);
create foreign table child (a text, b int)
  server loopback options (table_name 'loct');
alter table parent attach partition child for values in ('AAA');

create function dump_insert() returns trigger language plpgsql as
$$
  begin
    raise notice 'trigger = %, new table = %',
                 TG_NAME,
                 (select string_agg(new_table::text, ', ' order by a)
from new_table);
    return null;
  end;
$$;
create trigger parent_insert_trig
  after insert on parent referencing new table as new_table
  for each statement execute procedure dump_insert();

create function intercept_insert() returns trigger language plpgsql as
$$
  begin
    new.b = new.b + 1000;
    return new;
  end;
$$;
create trigger intercept_insert_loct
  before insert on loct
  for each row execute procedure intercept_insert();

insert into parent values ('AAA', 42);
NOTICE:  trigger = parent_insert_trig, new table = (AAA,42)
INSERT 0 1

The trigger shows the original tuple created by the core, not the
actual tuple inserted into the foreign-table partition, as
postgres_fdw does not collect the actual tuple, of course!

UPDATE/DELETE also produce incorrect results:

create function dump_update() returns trigger language plpgsql as
$$
  begin
    raise notice 'trigger = %, old table = %, new table = %',
                 TG_NAME,
                 (select string_agg(old_table::text, ', ' order by a)
from old_table),
                 (select string_agg(new_table::text, ', ' order by a)
from new_table);
    return null;
  end;
$$;
create trigger parent_update_trig
  after update on parent referencing old table as old_table new table
as new_table
  for each statement execute procedure dump_update();

update parent set b = b + 1;
NOTICE:  trigger = parent_update_trig, old table = <NULL>, new table = <NULL>
UPDATE 1

create function dump_delete() returns trigger language plpgsql as
$$
  begin
    raise notice 'trigger = %, old table = %',
                 TG_NAME,
                 (select string_agg(old_table::text, ', ' order by a)
from old_table);
    return null;
  end;
$$;
create trigger parent_delete_trig
  after delete on parent referencing old table as old_table
  for each statement execute procedure dump_delete();

delete from parent;
NOTICE:  trigger = parent_delete_trig, old table = <NULL>
DELETE 1

In both cases the triggers fail to show transition tuples.  The cause
of this is that postgres_fdw mistakenly performs direct modify for
UPDATE/DELETE on the partition, which skips
ExecARUpdateTriggers()/ExecARDeleteTriggers() entirely.

To fix, I think we could disallow creating transition-table triggers
on such partitioned tables, but I think that that is too restrictive
because some users might have been using such triggers, avoiding this
problem by e.g., modifying only plain-table partitions.  So I would
like to propose to fix this by the following: 1) disable using direct
modify to modify foreign-table partitions if there are any
transition-table triggers on the partitioned table, and then 2) throw
an error in ExecARInsertTriggers()/ExecARUpdateTriggers()/ExecARDeleteTriggers()
if they collects transition tuple(s) from a foreign-table partition.

Attached is a WIP patch for that.

Best regards,
Etsuro Fujita

Attachment
Hi Amit-san,

On Tue, Jul 1, 2025 at 4:42 PM Amit Langote <amitlangote09@gmail.com> wrote:
> On Tue, Jul 1, 2025 at 11:55 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > While working on something else, I noticed that while we disallow
> > transition tables on foreign tables, we allow transition tables on
> > partitioned tables with foreign-table partitions, which produces
> > incorrect results.  Here is an example using postgres_fdw:
> >
> > create table parent (a text, b int) partition by list (a);
> > create table loct (a text, b int);
> > create foreign table child (a text, b int)
> >   server loopback options (table_name 'loct');
> > alter table parent attach partition child for values in ('AAA');
> >
> > create function dump_insert() returns trigger language plpgsql as
> > $$
> >   begin
> >     raise notice 'trigger = %, new table = %',
> >                  TG_NAME,
> >                  (select string_agg(new_table::text, ', ' order by a)
> > from new_table);
> >     return null;
> >   end;
> > $$;
> > create trigger parent_insert_trig
> >   after insert on parent referencing new table as new_table
> >   for each statement execute procedure dump_insert();
> >
> > create function intercept_insert() returns trigger language plpgsql as
> > $$
> >   begin
> >     new.b = new.b + 1000;
> >     return new;
> >   end;
> > $$;
> > create trigger intercept_insert_loct
> >   before insert on loct
> >   for each row execute procedure intercept_insert();
> >
> > insert into parent values ('AAA', 42);
> > NOTICE:  trigger = parent_insert_trig, new table = (AAA,42)
> > INSERT 0 1
> >
> > The trigger shows the original tuple created by the core, not the
> > actual tuple inserted into the foreign-table partition, as
> > postgres_fdw does not collect the actual tuple, of course!
>
> Maybe I'm missing something, but given that the intercept_insert()
> function is applied during the "remote" operation, isn't it expected
> that the parent table's trigger for a "local" operation shows the
> original tuple?

That is the question of how we define the after image of a row
inserted into a foreign table, but consider the case where the
partition is a plain table:

create table parent (a text, b int) partition by list (a);
create table child partition of parent for values in ('AAA');
create trigger intercept_insert_child
  before insert on child
  for each row execute procedure intercept_insert();
insert into parent values ('AAA', 42);
NOTICE:  trigger = parent_insert_trig, new table = (AAA,1042)
INSERT 0 1

The trigger shows the final tuple, not the original tuple.  So from a
consistency perspective, I thought it would be good if the trigger
does so even in the case where the partition is a foreign table.

> >  So I would
> > like to propose to fix this by the following: 1) disable using direct
> > modify to modify foreign-table partitions if there are any
> > transition-table triggers on the partitioned table, and then 2) throw
> > an error in ExecARInsertTriggers()/ExecARUpdateTriggers()/ExecARDeleteTriggers()
> > if they collects transition tuple(s) from a foreign-table partition.
>
> Is (2) intended to catch cases that occur during a foreign insert and
> foreign/non-direct update/delete?

That is right; the patch forces the FDW to perform ExecForeign*
functions, and then throws an error in ExecAR* functions.  One good
thing about this is that we are able to avoid throwing the error when
local/remote row-level BEFORE triggers return NULL.

Thanks for the comments!

Best regards,
Etsuro Fujita



On Wed, Jul 2, 2025 at 7:05 PM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> On Tue, Jul 1, 2025 at 4:42 PM Amit Langote <amitlangote09@gmail.com> wrote:
> > On Tue, Jul 1, 2025 at 11:55 AM Etsuro Fujita <etsuro.fujita@gmail.com> wrote:
> > > While working on something else, I noticed that while we disallow
> > > transition tables on foreign tables, we allow transition tables on
> > > partitioned tables with foreign-table partitions, which produces
> > > incorrect results.  Here is an example using postgres_fdw:
> > >
> > > create table parent (a text, b int) partition by list (a);
> > > create table loct (a text, b int);
> > > create foreign table child (a text, b int)
> > >   server loopback options (table_name 'loct');
> > > alter table parent attach partition child for values in ('AAA');
> > >
> > > create function dump_insert() returns trigger language plpgsql as
> > > $$
> > >   begin
> > >     raise notice 'trigger = %, new table = %',
> > >                  TG_NAME,
> > >                  (select string_agg(new_table::text, ', ' order by a)
> > > from new_table);
> > >     return null;
> > >   end;
> > > $$;
> > > create trigger parent_insert_trig
> > >   after insert on parent referencing new table as new_table
> > >   for each statement execute procedure dump_insert();
> > >
> > > create function intercept_insert() returns trigger language plpgsql as
> > > $$
> > >   begin
> > >     new.b = new.b + 1000;
> > >     return new;
> > >   end;
> > > $$;
> > > create trigger intercept_insert_loct
> > >   before insert on loct
> > >   for each row execute procedure intercept_insert();
> > >
> > > insert into parent values ('AAA', 42);
> > > NOTICE:  trigger = parent_insert_trig, new table = (AAA,42)
> > > INSERT 0 1
> > >
> > > The trigger shows the original tuple created by the core, not the
> > > actual tuple inserted into the foreign-table partition, as
> > > postgres_fdw does not collect the actual tuple, of course!
> >
> > Maybe I'm missing something, but given that the intercept_insert()
> > function is applied during the "remote" operation, isn't it expected
> > that the parent table's trigger for a "local" operation shows the
> > original tuple?
>
> That is the question of how we define the after image of a row
> inserted into a foreign table, but consider the case where the
> partition is a plain table:
>
> create table parent (a text, b int) partition by list (a);
> create table child partition of parent for values in ('AAA');
> create trigger intercept_insert_child
>   before insert on child
>   for each row execute procedure intercept_insert();
> insert into parent values ('AAA', 42);
> NOTICE:  trigger = parent_insert_trig, new table = (AAA,1042)
> INSERT 0 1
>
> The trigger shows the final tuple, not the original tuple.  So from a
> consistency perspective, I thought it would be good if the trigger
> does so even in the case where the partition is a foreign table.

Ok, but if you define the trigger on the foreign table partition
(child) as follows, you do get what I think is the expected result?

create trigger intercept_insert_foreign_child
  before insert on child
  for each row execute procedure intercept_insert();

insert into parent values ('AAA', 42);
NOTICE:  trigger = parent_insert_trig, new table = (AAA,1042)

-- 2042, because row modified by both triggers
table parent;
  a  |  b
-----+------
 AAA | 2042
(1 row)

Or perhaps you're saying that the row returned by this line in ExecInsert():

       /*
         * insert into foreign table: let the FDW do it
         */
        slot = resultRelInfo->ri_FdwRoutine->ExecForeignInsert(estate,
                                                               resultRelInfo,
                                                               slot,
                                                               planSlot);

is not the expected "after image", and thus should not be added to the
parent's transition table?

IIUC, to prevent that, we now hit the following error in:

void
ExecARInsertTriggers(EState *estate, ResultRelInfo *relinfo,
                     TupleTableSlot *slot, List *recheckIndexes,
                     TransitionCaptureState *transition_capture)
{
    TriggerDesc *trigdesc = relinfo->ri_TrigDesc;

    if (relinfo->ri_FdwRoutine && transition_capture &&
        transition_capture->tcs_insert_new_table)
    {
        Assert(relinfo->ri_RootResultRelInfo);
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("cannot collect transition tuples from child
foreign tables")));
    }

> > >  So I would
> > > like to propose to fix this by the following: 1) disable using direct
> > > modify to modify foreign-table partitions if there are any
> > > transition-table triggers on the partitioned table, and then 2) throw
> > > an error in ExecARInsertTriggers()/ExecARUpdateTriggers()/ExecARDeleteTriggers()
> > > if they collects transition tuple(s) from a foreign-table partition.
> >
> > Is (2) intended to catch cases that occur during a foreign insert and
> > foreign/non-direct update/delete?
>
> That is right; the patch forces the FDW to perform ExecForeign*
> functions, and then throws an error in ExecAR* functions.  One good
> thing about this is that we are able to avoid throwing the error when
> local/remote row-level BEFORE triggers return NULL.

Given my question above, I’m not sure I fully understand the intention
behind adding these checks.

--
Thanks, Amit Langote