Thread: Text pattern JOINs that use indexes
I'm having a lot of trouble getting JOINs to work well with text patterns. I have a smallish table (4000 rows) containing prefixes that I want to inner-join to a large table (500000 rows) of strings. In other words, I want to look up the few entries in the 500000 row table which start with the strings in the 4000 row table. PostgreSQL insists on doing a sequential scan of the large table, even though it is indexed on the field I'm using for the join. Anyone got a solution? Here are the "explains": explain select * from files where name like 'foo%'; QUERY PLAN ------------------------------------------------------------------------ ----- Index Scan using files_name_key on files (cost=0.00..6.01 rows=1 width=97) Index Cond: ((name >= 'foo'::text) AND (name < 'fop'::text)) Filter: (name ~~ 'foo%'::text) explain select * from test join files on files.name like test.filename || '%'; QUERY PLAN --------------------------------------------------------------------- Nested Loop (cost=20.00..9450496.28 rows=1888140 width=129) Join Filter: ("outer".name ~~ ("inner".filename || '%'::text)) -> Seq Scan on files (cost=0.00..9776.28 rows=377628 width=97) -> Materialize (cost=20.00..30.00 rows=1000 width=32) -> Seq Scan on test (cost=0.00..20.00 rows=1000 width=32)
Richard Brooksby <rb@ravenbrook.com> writes: > explain select * from files where name like 'foo%'; This is indexable because the LIKE pattern is a constant at plan time, and so the planner can see that there's a useful range constraint extractable from the pattern. > explain select * from test join files on files.name like test.filename > || '%'; This is not indexable, because the LIKE pattern is not constant. regards, tom lane
On 15 Mar 2004, at 19:29, Tom Lane wrote: > Richard Brooksby <rb@ravenbrook.com> writes: >> explain select * from files where name like 'foo%'; > > This is indexable because the LIKE pattern is a constant at plan time, > and so the planner can see that there's a useful range constraint > extractable from the pattern. > >> explain select * from test join files on files.name like test.filename >> || '%'; > > This is not indexable, because the LIKE pattern is not constant. Thanks, Tom, I can now see why the planner is making the choice it does. I suppose in theory if I could guarantee that "test.filename" didn't contain '%' then the planner could do better, if it was clever enough. Do you have a suggestion for how I achieve what I want? My current solution is a function with nested FOR loops, but it seems a great shame to have to write it all out by hand. > TIP 5: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faqs/FAQ.html Yes indeed I have, especially question 4.8! Thanks again.
Richard Brooksby, can you forward the solution as you have it now? I am very interested in how this question turns out. joe > > Thanks, Tom, I can now see why the planner is making the choice it > does. I suppose in theory if I could guarantee that "test.filename" > didn't contain '%' then the planner could do better, if it was clever > enough. > > Do you have a suggestion for how I achieve what I want? My current > solution is a function with nested FOR loops, but it seems a great > shame to have to write it all out by hand. > -- joe speigle www.sirfsup.com
Richard Brooksby, can you forward the solution as you have it now? I am very interested in how this question turns out. joe > > Thanks, Tom, I can now see why the planner is making the choice it > does. I suppose in theory if I could guarantee that "test.filename" > didn't contain '%' then the planner could do better, if it was clever > enough. > > Do you have a suggestion for how I achieve what I want? My current > solution is a function with nested FOR loops, but it seems a great > shame to have to write it all out by hand. > -- joe speigle www.sirfsup.com
On 16 Mar 2004, at 18:07, joseph speigle wrote: >> Thanks, Tom, I can now see why the planner is making the choice it >> does. I suppose in theory if I could guarantee that "test.filename" >> didn't contain '%' then the planner could do better, if it was clever >> enough. >> >> Do you have a suggestion for how I achieve what I want? My current >> solution is a function with nested FOR loops, but it seems a great >> shame to have to write it all out by hand. > > can you forward the solution as you have it now? I am very interested > in how this question turns out. I'm afraid I can't forward the exact code as it contains client confidential stuff, but here's basically what I do. Imagine the "foo_prefixes" table contains a small number (thousands) of prefixes with ids, and the bar_strings table contains a large number (millions) of strings with ids. You want a view showing the strings which match the prefixes. You can't write: CREATE VIEW foo_strings AS SELECT foo.id AS foo_id, bar.string AS bar_string FROM foo_prefixes, bar_strings WHERE bar_strings.string LIKE foo_prefixes.prefix || '%'; Well, you can write that, but it won't use a btree index on bar_strings(string) because the planned doesn't know that the prefix doesn't contain wildcards. So instead we have to plan each lookup with a constant string: CREATE OR REPLACE FUNCTION foo_strings() RETURNS SETOF record AS ' DECLARE r record; s record; BEGIN FOR r IN SELECT id, prefix FROM foo_prefixes LOOP FOR s IN EXECUTE '' SELECT '' || r.id || '' AS foo_id, string AS bar_string FROM bar_strings WHERE string LIKE '''''' || r.prefix || ''%'''''' LOOP RETURN NEXT s; END LOOP; END LOOP; RETURN; END; ' LANGUAGE 'plpgsql'; CREATE VIEW foo_strings AS SELECT * FROM foo_strings() AS (foo_id int, bar_string text); --- Richard Brooksby <rb@ravenbrook.com> Senior Consultant Ravenbrook Limited <http://www.ravenbrook.com/> PO Box 205, Cambridge CB2 1AN, United Kingdom Voice: +44 777 9996245 Fax: +44 870 1641432
Richard Brooksby <rb@ravenbrook.com> writes: > Well, you can write that, but it won't use a btree index on > bar_strings(string) because the planned doesn't know that the prefix > doesn't contain wildcards. So instead we have to plan each lookup with > a constant string: Another possibility is to manually hack up the query with the index boundary conditions that the planner won't generate because it's not sure about the wildcard situation. Something like select * from prefixes, strings where string like (prefix || '%') and string >= prefix and string <= (prefix || '~~~~~~~'); The trick here is to generate the upper bound string correctly --- I've cheated quite a lot in the above example by assuming that '~' sorts larger than any character you'd actually have in your strings. But if you know your data well you may be able to do this reliably. Beware that the above will almost certainly not work if you're not running in C locale. Other locales have bizarre sorting rules that will cause the >=/<= range to not match the prefix very well. However, if you are getting indexscan plans with plain constant patterns then you are in C locale, because the planner can't solve that problem either... regards, tom lane