Best way to delete big amount of records from big table - Mailing list pgsql-performance
From | Ekaterina Amez |
---|---|
Subject | Best way to delete big amount of records from big table |
Date | |
Msg-id | 6e456b07-a396-f9e3-6a22-23a96b73420c@zunibal.com |
Responses | Re: Best way to delete big amount of records from big table |
List | pgsql-performance |
Hello list,

I'm trying to clean up a database with millions of records of useless-but-don't-remove-just-in-case data. The database has all of its tables in the public schema, so I've created a new schema "old_data" to move all of this data into.

I have several tables with around 20 million records each that I've managed to clean up relatively quickly and without special effort (no need to drop indexes or constraints). These tables are easy because they are going to be emptied completely (I have to keep the tables themselves), so I only have to insert the data into the corresponding old_data table and then TRUNCATE ... CASCADE the original, roughly as in the sketch below.
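For those tables the pattern is basically the following (a sketch only; "some_table" is a placeholder, the real table names differ, and the old_data copy is assumed to already exist with the same structure):

BEGIN;
-- copy everything into the archive schema (old_data.some_table created beforehand)
INSERT INTO old_data.some_table SELECT * FROM public.some_table;
-- then empty the original table together with anything referencing it
TRUNCATE TABLE public.some_table CASCADE;
COMMIT;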
But I'm also cleaning up tables with around 150 million records, where I'm going to remove about 60% of the existing data, and after a few tests I'm not sure what the best approach is, as all of them seem to take a similar time to run. These tables are grouped in sets of four, each with a master, detail, master_history, detail_history structure. None of the tables has a primary key, a foreign key or any other constraint; the only related object is the sequence used for what should be the PK column, although that column is not actually defined as the PK.

I've decided to delete from the last table (detail_history) in chunks (10 days of data per chunk, but it could be any other amount), so I've created a function. I've tested it with indexes (on master_hist for filtering the data and on detail_hist for the FK and PK), without indexes, and after analyzing the tables, and no matter what, I always end up with more or less the same execution time. I can afford the time it currently takes to run, but I'd like to know whether there is a better way to do this.

I'm testing on version 9.2, BUT the production server is 8.4 (legacy application; it was supposed to be on at least 9.2, but we recently discovered it is 8.4; an upgrade is planned, but not for now). Config parameters are the default ones.

Table definitions:

CREATE TABLE master (
    id serial NOT NULL,
    device_id int4 NOT NULL,
    col1 int4 NULL DEFAULT 0,
    data_date bpchar(17) NULL,       -- field used to filter data
    data_file_date bpchar(14) NULL
);  -- 9 of 20 records to be removed

CREATE TABLE detail (
    id serial NOT NULL,
    parent_id int4 NOT NULL,
    col1 float8 NULL,
    col2 int4 NOT NULL
);  -- 2304 of 5120 records to be removed

CREATE TABLE master_history (
    id serial NOT NULL,
    device_id int4 NOT NULL,
    col1 int4 NULL DEFAULT 0,
    data_date bpchar(17) NULL,       -- field used to filter data
    data_file_date bpchar(14) NULL
);  -- 355687 of 586999 records to be removed

CREATE TABLE detail_history (
    id serial NOT NULL,
    parent_id int4 NOT NULL,
    col1 float8 NULL,
    col2 int4 NOT NULL
);  -- 91055872 of 150271744 records to be removed

And the function:

CREATE OR REPLACE FUNCTION delete_test() RETURNS integer AS $$
DECLARE
    _begin_date date;
    _end_date date := '2019-08-01';
    _begin_exec timestamp := clock_timestamp();
    _end_exec timestamp;
    _begin_exec_partial timestamp;
    _end_exec_partial timestamp;
    _time double precision;
    _num_regs integer;
BEGIN
    for _begin_date in (select '2018-05-01'::date + s.a * '10 days'::interval
                        from (select generate_series(0, 1000) as a) as s) loop
        if (_begin_date > _end_date) then
            raise log 'STOP!!!!!';
            exit;
        end if;
        raise log 'Date %', _begin_date;
        _begin_exec_partial := clock_timestamp();

        delete from public.detail_history t1
        where exists (select 1
                      from public.master_history t2
                      where t2.id = t1.parent_id
                        and t2.data_date >= rpad(to_char(_begin_date, 'YYYYMMDD'), 17, '0')
                        and t2.data_date < rpad(to_char((_begin_date + interval '10 days'), 'YYYYMMDD'), 17, '0'));

        GET DIAGNOSTICS _num_regs = ROW_COUNT;
        _end_exec_partial := clock_timestamp();
        _time := 1000 * ( extract(epoch from _end_exec_partial) - extract(epoch from _begin_exec_partial) );
        raise log 'Records removed % in % ms', _num_regs, _time;
    end loop;

    _end_exec := clock_timestamp();
    _time := 1000 * ( extract(epoch from _end_exec) - extract(epoch from _begin_exec) );
    raise log 'Total time: %', _time;
    return 0;
END;
$$ LANGUAGE plpgsql;

The DELETE execution plan on 8.4 is:

test_eka=# explain delete from public.detail_hist t1
test_eka-# where exists
test_eka-#     (select 1 from public.master_hist t2
test_eka(#      where t2.id = t1.parent_id
test_eka(#      and t2.data_date >= '20180501000000000000000'
test_eka(#      and t2.data_date < '20190101000000000000000');

                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=33431.46..5890182.88 rows=156649104 width=6)
   Hash Cond: (t1.parent_id = t2.id)
   ->  Seq Scan on detail_hist t1  (cost=0.00..2564256.04 rows=156649104 width=10)
   ->  Hash  (cost=30922.13..30922.13 rows=152906 width=4)
         ->  Unique  (cost=30157.60..30922.13 rows=152906 width=4)
               ->  Sort  (cost=30157.60..30539.87 rows=152906 width=4)
                     Sort Key: t2.id
                     ->  Seq Scan on master_hist t2  (cost=0.00..14897.65 rows=152906 width=4)
                           Filter: ((data_date >= '20180501000000000000000'::bpchar) AND (data_date < '20190101000000000000000'::bpchar))

After creating the PK and FK (with an index over the FK) the plan is the following (note that it shows the real table and column names rather than the simplified ones above):

                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=26678.41..5883424.77 rows=156648960 width=6)
   Hash Cond: (t1.id_param_espec_este = t2.id_param_espec_este_historico)
   ->  Seq Scan on param_espec_este_datos_historico_tbl t1  (cost=0.00..2564254.60 rows=156648960 width=10)
   ->  Hash  (cost=24169.09..24169.09 rows=152906 width=4)
         ->  Unique  (cost=23404.56..24169.09 rows=152906 width=4)
               ->  Sort  (cost=23404.56..23786.82 rows=152906 width=4)
                     Sort Key: t2.id_param_espec_este_historico
                     ->  Index Scan using fecha_gps_pe_este_hist_idx on param_espec_este_historico_tbl t2  (cost=0.00..8144.60 rows=152906 width=4)
                           Index Cond: ((fecha_gps_parametros_espectrales >= '20180501000000000000000'::bpchar) AND (fecha_gps_parametros_espectrales < '20190101000000000000000'::bpchar))
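For reference, the PK/FK/index creation behind that second plan was roughly the following, written against the simplified schema above (a sketch only; the constraint and index names here are illustrative, and the real objects use the production names visible in the plan, e.g. fecha_gps_pe_este_hist_idx):

-- illustrative names; the real tables/columns are the production ones shown in the plan
ALTER TABLE master_history ADD CONSTRAINT master_history_pk PRIMARY KEY (id);
ALTER TABLE detail_history ADD CONSTRAINT detail_history_pk PRIMARY KEY (id);
ALTER TABLE detail_history ADD CONSTRAINT detail_history_parent_fk
    FOREIGN KEY (parent_id) REFERENCES master_history (id);
-- index over the FK column
CREATE INDEX detail_history_parent_id_idx ON detail_history (parent_id);
-- index on the filter column (the equivalent of fecha_gps_pe_este_hist_idx in the plan)
CREATE INDEX master_history_data_date_idx ON master_history (data_date);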
Any ideas are welcome.

Kind regards,

Ekaterina.