Thread: Slow query, possibly not using index
CREATE TABLE media.block (
id uuid NOT NULL,
"size" int8 NOT NULL,
nrefs int8 NOT NULL DEFAULT 0,
block bytea NOT NULL,
hs256 bytea NOT NULL,
CONSTRAINT block_pkey PRIMARY KEY (id),
CONSTRAINT chk_nrefs CHECK ((nrefs >= 0))
)
WITH (
toast_tuple_target=8160
)
TABLESPACE data_slow
;
alter table media.block alter column block set storage main;
alter table media.block alter column hs256 set storage main;
CREATE INDEX idx_block_unused ON media.block USING btree (id) WHERE (nrefs = 0);
CREATE UNIQUE INDEX uidx_block_hs256 ON media.block USING btree (hs256);
SELECT i.relname "Table Name",indexrelname "Index Name",
pg_size_pretty(pg_total_relation_size(relid)) As "Total Size",
pg_size_pretty(pg_indexes_size(relid)) as "Total Size of all Indexes",
pg_size_pretty(pg_relation_size(relid)) as "Table Size",
pg_size_pretty(pg_relation_size(indexrelid)) "Index Size",
reltuples::bigint "Estimated table row count"
FROM pg_stat_all_indexes i JOIN pg_class c ON i.relid=c.oid
where i.relid ='media.block'::regclass
Table Name|Index Name |Total Size|Total Size of all Indexes|Table Size|Index Size|Estimated table row count|
----------+----------------+----------+-------------------------+----------+----------+-------------------------+
block |block_pkey |352 GB |5584 MB |347 GB |1986 MB | 38958848|
block |uidx_block_hs256|352 GB |5584 MB |347 GB |3226 MB | 38958848|
block |idx_block_unused|352 GB |5584 MB |347 GB |372 MB | 38958848|
explain analyze select id from media.block b where nrefs =0 limit 1   -- ran for more than 10 minutes before being cancelled
explain select id from media.block b where nrefs =0 limit 1
QUERY PLAN                                                                                     |
-----------------------------------------------------------------------------------------------+
Limit (cost=0.38..0.76 rows=1 width=16) |
-> Index Only Scan using idx_block_unused on block b (cost=0.38..869.83 rows=2274 width=16)|
Les <nagylzs@gmail.com> writes:
> If I try to select a single unused block this way:
> explain analyze select id from media.block b where nrefs =0 limit 1
> then it runs for more than 10 minutes (I'm not sure how long, I cancelled
> the query after 10 minutes).
Are you sure it isn't blocked on a lock?
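For reference, one quick way to check that (not shown in the thread) is to look at pg_stat_activity while the query appears stuck; pg_blocking_pids() reports which backends hold the conflicting locks:
-- Sketch: list sessions currently waiting on a lock and who is blocking them.
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       pg_blocking_pids(pid) AS blocked_by,
       query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';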
Another theory is that the index contains many thousands of references
to now-dead rows, and the query is vainly searching for a live entry.
Given that EXPLAIN thinks there are only about 2300 live entries,
and yet you say the index is 400MB, this seems pretty plausible.
Have you disabled autovacuum, or something like that? (REINDEX
could help here, at least till the index gets bloated again.)
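If it comes to that, a minimal sketch of the rebuild, assuming PostgreSQL 12 or later so it can be done without blocking writes:
-- Plain REINDEX takes a lock that blocks writes; CONCURRENTLY avoids that (PG 12+).
REINDEX INDEX CONCURRENTLY media.idx_block_unused;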
You might think that even so, it shouldn't take that long ... but
indexes on UUID columns are a well known performance antipattern.
The index entry order is likely to have precisely zip to do with
the table's physical order, resulting in exceedingly-random access
to the table, which'll be horribly expensive when the table is so
much bigger than RAM. Can you replace the UUID column with a simple
serial (identity) column?
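As an illustration only (the table name here is hypothetical), an identity key keeps index order roughly aligned with insertion order:
-- Illustrative sketch, not the thread's actual schema.
CREATE TABLE media.block_v2 (
    id    bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    hs256 bytea NOT NULL,
    block bytea NOT NULL,
    nrefs int8  NOT NULL DEFAULT 0
);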
CREATE OR REPLACE FUNCTION public.gen_ulid()
RETURNS uuid
LANGUAGE sql
AS $function$
SELECT (lpad(to_hex(floor(extract(epoch FROM clock_timestamp()) * 1000)::bigint), 12, '0') || encode(gen_random_bytes(10), 'hex'))::uuid;
$function$
;
> I believe it is not actually using the index, because reading a single
> (random?) entry from an index should not run for >10 minutes.
You should believe what EXPLAIN tells you about the plan shape.
(Its rowcount estimates are only estimates, though.)
Does write-only include updates? That would create the dead rows Tom is referring to.
> I believe it is not actually using the index, because reading a single (random?) entry from an index should not run for >10 minutes.
It is using the index. You can disable it and see how long your query takes ("set enable_indexonlyscan = off", optionally enable_indexscan too); chances are it will take even longer. If by luck it somehow does not, then no vacuum has run for a while for some reason (perhaps it keeps getting cancelled due to frequent table activity?). That too you can check.
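Concretely, something along these lines (the SET is session-local, so it affects nothing else):
-- Compare plan and timing with the index scans disabled for this session only.
SET enable_indexonlyscan = off;
SET enable_indexscan = off;   -- optional, as noted above
EXPLAIN ANALYZE SELECT id FROM media.block b WHERE nrefs = 0 LIMIT 1;
RESET enable_indexonlyscan;
RESET enable_indexscan;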
Les <nagylzs@gmail.com> writes:
> I'm aware of the problems with random UUID values. I was using this
> function to create ulids from the beginning:
Oh, well that would have been useful information to provide at the
outset.
Now that we know the index order is correlated with creation
time, I wonder if it is also correlated with nrefs, in such a way that
scanning in index order is disastrous because all the desired rows are
at the end of the index.
Also, you deny deleting any rows, but that wasn't the point. Do you
ever update nrefs from zero to nonzero? That'd also leave dead
entries behind in this index. If that is a routine event that is
correlated with creation time, it gets easier to believe that your
index could have lots of dead entries at the front.
CREATE OR REPLACE FUNCTION media.trg_aiud_file_block()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
begin
if TG_OP='INSERT' then
update media.block set nrefs = nrefs + 1 where id = new.block_id;
return new;
end if;
if TG_OP='UPDATE' then
if old.block_id is distinct from new.block_id then
update media.block set nrefs = nrefs + 1 where id = new.block_id;
update media.block set nrefs = nrefs - 1 where id = old.block_id;
end if;
return new;
end if;
if TG_OP='DELETE' then
update media.block set nrefs = nrefs - 1 where id = old.block_id;
return old;
end if;
end;
$function$
;
We'd still have to figure out why autovacuum is failing to clean out
those entries in a timely fashion, but that seems like a plausible
way for the performance problem to exist.
name | setting | unit | min_val | max_val | boot_val | reset_val | pending_restart
---------------------------------------+------------+------+---------+------------+------------+------------+-----------------
autovacuum | on | | | | on | on | f
autovacuum_analyze_scale_factor | 0.1 | | 0 | 100 | 0.1 | 0.1 | f
autovacuum_analyze_threshold | 50 | | 0 | 2147483647 | 50 | 50 | f
autovacuum_freeze_max_age | 200000000 | | 100000 | 2000000000 | 200000000 | 200000000 | f
autovacuum_max_workers | 3 | | 1 | 262143 | 3 | 3 | f
autovacuum_multixact_freeze_max_age | 400000000 | | 10000 | 2000000000 | 400000000 | 400000000 | f
autovacuum_naptime | 60 | s | 1 | 2147483 | 60 | 60 | f
autovacuum_vacuum_cost_delay | 2 | ms | -1 | 100 | 2 | 2 | f
autovacuum_vacuum_cost_limit | -1 | | -1 | 10000 | -1 | -1 | f
autovacuum_vacuum_insert_scale_factor | 0.2 | | 0 | 100 | 0.2 | 0.2 | f
autovacuum_vacuum_insert_threshold | 1000 | | -1 | 2147483647 | 1000 | 1000 | f
autovacuum_vacuum_scale_factor | 0.2 | | 0 | 100 | 0.2 | 0.2 | f
autovacuum_vacuum_threshold | 50 | | 0 | 2147483647 | 50 | 50 | f
autovacuum_work_mem | -1 | kB | -1 | 2147483647 | -1 | -1 | f
log_autovacuum_min_duration | 600000 | ms | -1 | 2147483647 | 600000 | 600000 | f
vacuum_cost_delay | 0 | ms | 0 | 100 | 0 | 0 | f
vacuum_cost_limit | 200 | | 1 | 10000 | 200 | 200 | f
vacuum_cost_page_dirty | 20 | | 0 | 10000 | 20 | 20 | f
vacuum_cost_page_hit | 1 | | 0 | 10000 | 1 | 1 | f
vacuum_cost_page_miss | 2 | | 0 | 10000 | 2 | 2 | f
vacuum_defer_cleanup_age | 0 | | 0 | 1000000 | 0 | 0 | f
vacuum_failsafe_age | 1600000000 | | 0 | 2100000000 | 1600000000 | 1600000000 | f
vacuum_freeze_min_age | 50000000 | | 0 | 1000000000 | 50000000 | 50000000 | f
vacuum_freeze_table_age | 150000000 | | 0 | 2000000000 | 150000000 | 150000000 | f
vacuum_multixact_failsafe_age | 1600000000 | | 0 | 2100000000 | 1600000000 | 1600000000 | f
vacuum_multixact_freeze_min_age | 5000000 | | 0 | 1000000000 | 5000000 | 5000000 | f
vacuum_multixact_freeze_table_age | 150000000 | | 0 | 2000000000 | 150000000 | 150000000 | f
(27 rows)
> I can try to do VACUUM on this table, but I'm limited on resources. I think
> it will take days to do this. Maybe I can try to dump the whole database
> and restore it on another machine.
Pretty hard to believe that dump-and-restore would be faster than
VACUUM.
> (Is there a way to check the number of dead rows?)
I think contrib/pgstattuple might help.
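For reference, the extension has to be installed once per database, and both functions scan the whole relation, so they can take a while on a table this size:
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('media.block');            -- live vs. dead tuples in the heap
SELECT * FROM pgstatindex('media.idx_block_unused'); -- b-tree health of the suspect index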
All right, I started pgstattuple() and I'll also do pgstatindex(), but it takes a while. I'll get back with the results.
table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
--------------+-------------+--------------+---------------+------------------+----------------+--------------------+-------------+--------------
372521984000 | 39652836 | 299148572428 | 80.3 | 3578319 | 26977942540 | 7.24 | 44638265312 | 11.98
(1 row)
=# select * from pgstatindex('media.idx_block_unused');
version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
       4 |          2 |  389677056 |           546 |            114 |      23069 |           0 |         24384 |            90.03 |                  0
(1 row)

After reindex:

=# select * from pgstatindex('media.idx_block_unused');
version | tree_level | index_size | root_block_no | internal_pages | leaf_pages | empty_pages | deleted_pages | avg_leaf_density | leaf_fragmentation
---------+------------+------------+---------------+----------------+------------+-------------+---------------+------------------+--------------------
       4 |          0 |       8192 |             0 |              0 |          0 |           0 |             0 |              NaN |                NaN
(1 row)

explain analyze select id from media.block b where nrefs =0 limit 1
QUERY PLAN                                                                                                                                |
-------------------------------------------------------------------------------------------------------------------------------------------+
Limit  (cost=0.14..0.46 rows=1 width=16) (actual time=0.010..0.011 rows=0 loops=1)                                                        |
  ->  Index Only Scan using idx_block_unused on block b  (cost=0.14..698.91 rows=2231 width=16) (actual time=0.008..0.009 rows=0 loops=1) |
        Heap Fetches: 0                                                                                                                   |
Planning Time: 0.174 ms                                                                                                                   |
Execution Time: 0.030 ms                                                                                                                  |

It is actually empty. Now I only need to figure out why autovacuum did not work on the index.

Thank you
Laszlo
On Mon, 28 Aug 2023 at 19:21, Les <nagylzs@gmail.com> wrote:
> More important question is, how can I find out why the index was not auto vacuumed.
You should have a look at pg_stat_user_tables. It'll let you know if
the table is being autovacuumed and how often. If you're concerned
about autovacuum not running properly, then you might want to lower
log_autovacuum_min_duration. Generally, anything that takes a
conflicting lock will cause autovacuum to cancel so that the
conflicting locker can get through. Things like ALTER TABLE or even
an ANALYZE running will cancel most autovacuum runs on tables.
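A minimal sketch of that, assuming you can reload the server configuration (this setting only needs a reload, not a restart):
-- Log every autovacuum run, including ones that get cancelled.
ALTER SYSTEM SET log_autovacuum_min_duration = 0;
SELECT pg_reload_conf();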
Also, this is a fairly large table and you do have the standard
autovacuum settings. Going by pgstattuple, the table has 39652836
tuples. Autovacuum will trigger when the statistics indicate that 20%
of tuples are dead, which is about 8 million tuples. Perhaps that's
enough for the index scan to have to skip over a large enough number
of dead tuples to make it slow. You might want to consider lowering
the autovacuum scale factor for this table.
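A sketch of a per-table override; the 0.005 value is only an example, meaning roughly 200k dead tuples on a ~40M row table instead of the default ~8 million:
ALTER TABLE media.block SET (
    autovacuum_vacuum_scale_factor = 0.005,
    autovacuum_analyze_scale_factor = 0.005
);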
Also, ensure you're not doing anything like calling pg_stat_reset();
It might be worth showing us the output of:
select * from pg_stat_user_tables where relid = 'media.block'::regclass;
select * from pg_stat_user_tables where relid = 'media.block'::regclass;
-------------------+-----------------------------+
relid |25872 |
schemaname |media |
relname |block |
seq_scan |8 |
seq_tup_read |139018370 |
idx_scan |45023556 |
idx_tup_fetch |37461539 |
n_tup_ins |7556051 |
n_tup_upd |7577720 |
n_tup_del |0 |
n_tup_hot_upd |0 |
n_live_tup |39782042 |
n_dead_tup |5938057 |
n_mod_since_analyze|1653427 |
n_ins_since_vacuum |5736676 |
last_vacuum | |
last_autovacuum |2023-08-17 22:39:29.383 +0200|
last_analyze | |
last_autoanalyze |2023-08-22 16:02:56.093 +0200|
vacuum_count |0 |
autovacuum_count |1 |
analyze_count |0 |