unoptimized nested loops - Mailing list pgsql-general

From:      Tim Kelly
Subject:   unoptimized nested loops
Msg-id:    62965CE8.7030500@dialectronics.com
Responses: Re: unoptimized nested loops
           Re: unoptimized nested loops
List:      pgsql-general
Hello Everyone,

I am stumped as to what I am doing wrong. I have two tables:

   metadata: parent table, 1.28m records
   data:     child table,  1.24m records

metadata contains descriptions of the records in data. data has two fields of concern: the id field, which is a foreign key to an identical field in metadata, and the content field, which contains text ranging from a few hundred to a few thousand characters. The id fields are alphanumeric for historical reasons. Table descriptions below, some fields omitted for brevity:

c_db=> \d metadata
          Table "public.metadata"
      Column      |           Type           | Modifiers
------------------+--------------------------+-----------
 id               | character varying(60)    | not null
 author           | character varying(90)    |
Indexes:
    "metadata_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "data" CONSTRAINT "fk_metadata" FOREIGN KEY (id) REFERENCES metadata(id)

c_db=> \d data
        Table "public.data"
 Column  |          Type           | Modifiers
---------+-------------------------+-----------
 id      | character varying(30)   | not null
 content | text                    |
Indexes:
    "data_pkey" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "fk_metadata" FOREIGN KEY (id) REFERENCES metadata(id)

I regularly need to search the content field in data for short text segments. I can reduce the number of rows that need to be searched by first looking for particular types of entries in metadata. However, PostgreSQL apparently refuses to do so, and insists on searching the entire content column in the data table for the text segment of interest. The attempt to reduce the search takes more time than scanning the entire data table straight up.
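In case it clarifies the shape of query I am after, the "in" form I show below could equally be written with an exists subquery (a sketch, not benchmarked; as far as I know the planner normally turns both forms into the same semi-join):

   select count(d.id)
   from data d
   where d.content like '%some text%'
     and exists (select 1
                 from metadata m
                 where m.id = d.id
                   and m.author like '%Kelly%');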
To be specific, with two approaches:

c_db=> select count(id) from data;
  count
---------
 1248954
(1 row)

Time: 648.358 ms

c_db=> select count(id) from data where content like '%some text%';
  count
---------
 1167810
(1 row)

Time: 180144.251 ms

c_db=> select count(id) from metadata where author like '%Kelly%';
 count
-------
  3558
(1 row)

Time: 1625.455 ms

c_db=> select count(id) from data where data.content like '%some text%' and
data.id in (select id from metadata where metadata.author like '%Kelly%');
 count
-------
   152
(1 row)

Time: 211010.598 ms

c_db=> explain analyze select count(id) from data where data.content like
'%some text%' and data.id in (select id from metadata where metadata.author
like '%Kelly%');
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=48203.30..48203.31 rows=1 width=124) (actual time=213021.968..213021.969 rows=1 loops=1)
   ->  Nested Loop Semi Join  (cost=0.00..48202.99 rows=125 width=124) (actual time=51392.697..213021.848 rows=152 loops=1)
         ->  Seq Scan on data  (cost=0.00..47132.93 rows=125 width=135) (actual time=0.176..183040.366 rows=1167810 loops=1)
               Filter: (content ~~ '%some text%'::text)
         ->  Index Scan using metadata_pkey on metadata  (cost=0.00..8.55 rows=1 width=11) (actual time=0.025..0.025 rows=0 loops=1167810)
               Index Cond: ((metadata.id)::text = (data.id)::text)
               Filter: ((metadata.author)::text ~~ '%Kelly%'::text)
 Total runtime: 213022.028 ms
(8 rows)

Time: 213024.273 ms

Alternatively, using an inner join:

c_db=> select count(id) from data inner join metadata on data.id =
metadata.id where data.content like '%some text%' and metadata.author like
'%Kelly%';
 count
-------
   152
(1 row)

Time: 212211.047 ms

c_db=> explain analyze select count(id) from data inner join metadata on
data.id = metadata.id where data.content like '%some text%' and
metadata.author like '%Kelly%';
                                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=48203.30..48203.31 rows=1 width=124) (actual time=212800.026..212800.026 rows=1 loops=1)
   ->  Nested Loop  (cost=0.00..48202.99 rows=125 width=124) (actual time=22463.361..212799.911 rows=152 loops=1)
         ->  Seq Scan on data  (cost=0.00..47132.93 rows=125 width=135) (actual time=0.542..182952.708 rows=1167810 loops=1)
               Filter: (content ~~ '%some text%'::text)
         ->  Index Scan using metadata_pkey on metadata  (cost=0.00..8.55 rows=1 width=11) (actual time=0.025..0.025 rows=0 loops=1167810)
               Index Cond: ((metadata.id)::text = (data.id)::text)
               Filter: ((metadata.author)::text ~~ '%Kelly%'::text)
 Total runtime: 212800.076 ms
(8 rows)

Time: 212805.008 ms

I do not see evidence that the nested loop is trying to reduce overhead by using the smaller set. It seems to want to scan data first either way. I have run vacuum analyze, in the hope that the optimizer was miscalculating, to no avail. I seem to be unable to force PostgreSQL to use the smaller set to reduce the search, even with the use of "in" on a subquery.

Does anyone have any ideas about what I am doing wrong on such a fundamentally normal operation? I am sure I am missing something really obvious, but I can't even find discussions about forcing the use of smaller sets.

I apologize for having to ask about something that is almost certainly trivial. I am not subscribed to the list; please reply-all or offline.

Thank you,
tim
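For completeness, one rewrite I have seen suggested for forcing the smaller set to be evaluated first (a sketch only, untested against this data) is an explicitly materialized CTE; note the MATERIALIZED keyword is PostgreSQL 12+ syntax, while versions 11 and earlier materialize every CTE automatically:

   with kelly as materialized (
       select id
       from metadata
       where author like '%Kelly%'
   )
   select count(data.id)
   from data
   join kelly on kelly.id = data.id
   where data.content like '%some text%';

Whether this actually helps presumably depends on the planner still having to check content with a sequential filter on the joined rows.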