PoC: pg_dump --filter-data (like Oracle Where Clause on RMAN for specific tables)

From: Kirk Wolak
Subject: PoC: pg_dump --filter-data (like Oracle Where Clause on RMAN for specific tables)
Msg-id: CACLU5mS07WGPpq6=m8aC5tUMusNpL8FhBPnYem86iSawEgcavw@mail.gmail.com
List: pgsql-hackers
Hackers,
  I am asking for feedback on this.  We just migrated a client to PG and all is well, except that the developers requested a copy of recent production data for testing.

  We have a handful of tables that hold 90% of the data, going back 30 years.
We NEVER restore this data to Dev or Staging.  On Oracle, we used a special RMAN backup where these tables had a "WHERE clause" applied to them during the backup/dump process.

  It would be awesome if pg_dump offered something similar.  I am willing to code this up, but there are questions.  And I want to do it in a way that makes the most sense to the PG Community.

  My initial thought is a simple filtering file, of the format:
schema.table = WHERE ...
"schema"."Table" = WHERE ...
"schema"."t1" = LIMIT 50000

  If --filter-data <filename> is not specified, the code would not change anything.  If it is, pg_dump would check whether each table appears in the file with a filter; if so, it would read that filter and apply it to the resulting COPY command.
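
  For illustration, suppose the filter file contained the hypothetical entry "public.orders = WHERE created_at > now() - interval '2 years'" (table and columns made up).  The data step for that table would change roughly like this (a sketch, not actual pg_dump output):

-- current behavior:
COPY public.orders (id, created_at, total) TO stdout;
-- with --filter-data:
COPY (SELECT id, created_at, total FROM public.orders WHERE created_at > now() - interval '2 years') TO stdout;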

  I don't believe this impacts pg_restore.  But if we wanted the restore process to emit some kind of message, e.g. "pg_dump --filter-data was used, this is NOT a complete dump!", I would appreciate that and would include it in pg_restore.

  Just to make the point: the full pg_dump takes 60 minutes (during which we must turn off certain features to avoid errors/locking issues).  Excluding these tables takes 2-3 minutes, and separately dumping limited versions of them with \COPY takes another 3 minutes.  And frankly, we have enough backups of the many years of data; we don't need daily snapshots of them, or to carry them around.
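
  (That side dump is just a filtered \COPY per table, something like the following, with a made-up table and predicate:)

\copy (SELECT * FROM public.orders WHERE created_at > now() - interval '2 years') TO 'orders_recent.dat'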

Thanks in advance.  I am hoping that I am not the only one who would benefit from a filtered dump (as opposed to all or nothing).

Finally, I considered using an entire query, which could allow data-masking and more complex queries if there are FKs involved.  But that seemed like a much bigger ask (and a potential foot-gun).
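
As a sketch of what I mean by an entire query (column names are made up), such an entry could look like:

public.users = SELECT id, md5(email) AS email, created_at FROM public.users WHERE created_at > now() - interval '2 years'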

PS: A quick-hack feature we could leverage would be a flag telling --exclude-table-data to still generate an EMPTY .dat file, plus the internal toc.dat reference to load that file... THEN we would simply overwrite that file with our manual \COPY output.  This would be an almost trivial change, and would work for what we do/need.  BUT it feels a bit "off".
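
Rough sketch of the workflow that hack would give us, assuming the flag were changed as described (directory format, uncompressed; the database, table, filter, and the NNNN data-file number are placeholders for whatever the toc.dat entry points at):

pg_dump -Fd -Z0 -f dumpdir --exclude-table-data=public.orders mydb
psql mydb -c "\copy (SELECT * FROM public.orders WHERE created_at > now() - interval '2 years') TO 'dumpdir/NNNN.dat'"
pg_restore -d devdb dumpdir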
