Thread: Text pattern JOINs that use indexes

Text pattern JOINs that use indexes

From

Richard Brooksby

Date:

15 March 2004, 14:48:34

I'm having a lot of trouble getting JOINs to work well with text
patterns.  I have a smallish table (4000 rows) containing prefixes that
I want to inner-join to a large table (500000 rows) of strings.  In
other words, I want to look up the few entries in the 500000 row table
which start with the strings in the 4000 row table.  PostgreSQL insists
on doing a sequential scan of the large table, even though it is
indexed on the field I'm using for the join.

Anyone got a solution?

Here are the "explains":

explain select * from files where name like 'foo%';
                                  QUERY PLAN
------------------------------------------------------------------------
-----
  Index Scan using files_name_key on files  (cost=0.00..6.01 rows=1
width=97)
    Index Cond: ((name >= 'foo'::text) AND (name < 'fop'::text))
    Filter: (name ~~ 'foo%'::text)


explain select * from test join files on files.name like test.filename
|| '%';
                              QUERY PLAN
---------------------------------------------------------------------
  Nested Loop  (cost=20.00..9450496.28 rows=1888140 width=129)
    Join Filter: ("outer".name ~~ ("inner".filename || '%'::text))
    ->  Seq Scan on files  (cost=0.00..9776.28 rows=377628 width=97)
    ->  Materialize  (cost=20.00..30.00 rows=1000 width=32)
          ->  Seq Scan on test  (cost=0.00..20.00 rows=1000 width=32)

Re: Text pattern JOINs that use indexes

From

Tom Lane

Date:

15 March 2004, 15:29:53

Richard Brooksby <rb@ravenbrook.com> writes:
> explain select * from files where name like 'foo%';

This is indexable because the LIKE pattern is a constant at plan time,
and so the planner can see that there's a useful range constraint
extractable from the pattern.

> explain select * from test join files on files.name like test.filename
> || '%';

This is not indexable, because the LIKE pattern is not constant.

            regards, tom lane

Re: Text pattern JOINs that use indexes

From

Richard Brooksby

Date:

15 March 2004, 17:00:49

On 15 Mar 2004, at 19:29, Tom Lane wrote:

> Richard Brooksby <rb@ravenbrook.com> writes:
>> explain select * from files where name like 'foo%';
>
> This is indexable because the LIKE pattern is a constant at plan time,
> and so the planner can see that there's a useful range constraint
> extractable from the pattern.
>
>> explain select * from test join files on files.name like test.filename
>> || '%';
>
> This is not indexable, because the LIKE pattern is not constant.

Thanks, Tom, I can now see why the planner is making the choice it
does.  I suppose in theory if I could guarantee that "test.filename"
didn't contain '%' then the planner could do better, if it was clever
enough.

Do you have a suggestion for how I achieve what I want?  My current
solution is a function with nested FOR loops, but it seems a great
shame to have to write it all out by hand.

> TIP 5: Have you checked our extensive FAQ?
>
>                http://www.postgresql.org/docs/faqs/FAQ.html

Yes indeed I have, especially question 4.8!

Thanks again.

Re: Text pattern JOINs that use indexes

From

joseph speigle

Date:

16 March 2004, 14:06:12

Richard Brooksby,

can you forward the solution as you have it now?  I am very interested in how this question turns out.

joe

>
> Thanks, Tom, I can now see why the planner is making the choice it
> does.  I suppose in theory if I could guarantee that "test.filename"
> didn't contain '%' then the planner could do better, if it was clever
> enough.
>
> Do you have a suggestion for how I achieve what I want?  My current
> solution is a function with nested FOR loops, but it seems a great
> shame to have to write it all out by hand.
>

--
joe speigle
www.sirfsup.com

Re: Text pattern JOINs that use indexes

From

joseph speigle

Date:

16 March 2004, 14:07:17

Richard Brooksby,

can you forward the solution as you have it now?  I am very interested in how this question turns out.

joe

>
> Thanks, Tom, I can now see why the planner is making the choice it
> does.  I suppose in theory if I could guarantee that "test.filename"
> didn't contain '%' then the planner could do better, if it was clever
> enough.
>
> Do you have a suggestion for how I achieve what I want?  My current
> solution is a function with nested FOR loops, but it seems a great
> shame to have to write it all out by hand.
>

--
joe speigle
www.sirfsup.com

Re: Text pattern JOINs that use indexes

From

Richard Brooksby

Date:

16 March 2004, 14:32:26

On 16 Mar 2004, at 18:07, joseph speigle wrote:

>> Thanks, Tom, I can now see why the planner is making the choice it
>> does.  I suppose in theory if I could guarantee that "test.filename"
>> didn't contain '%' then the planner could do better, if it was clever
>> enough.
>>
>> Do you have a suggestion for how I achieve what I want?  My current
>> solution is a function with nested FOR loops, but it seems a great
>> shame to have to write it all out by hand.
>
> can you forward the solution as you have it now?  I am very interested
> in how this question turns out.

I'm afraid I can't forward the exact code as it contains client
confidential stuff, but here's basically what I do.  Imagine the
"foo_prefixes" table contains a small number (thousands) of prefixes
with ids, and the bar_strings table contains a large number (millions)
of strings with ids.  You want a view showing the strings which match
the prefixes.  You can't write:

CREATE VIEW foo_strings AS
     SELECT foo.id AS foo_id, bar.string AS bar_string
     FROM foo_prefixes, bar_strings
     WHERE bar_strings.string LIKE foo_prefixes.prefix || '%';

Well, you can write that, but it won't use a btree index on
bar_strings(string) because the planned doesn't know that the prefix
doesn't contain wildcards.  So instead we have to plan each lookup with
a constant string:

CREATE OR REPLACE FUNCTION foo_strings() RETURNS SETOF record AS '
     DECLARE
         r record;
         s record;
     BEGIN
         FOR r IN
             SELECT id, prefix FROM foo_prefixes
         LOOP
             FOR s IN EXECUTE ''
                 SELECT '' || r.id || '' AS foo_id,
                        string AS bar_string
                 FROM bar_strings
                 WHERE string LIKE '''''' || r.prefix || ''%''''''
             LOOP
                 RETURN NEXT s;
             END LOOP;
         END LOOP;
         RETURN;
     END;
' LANGUAGE 'plpgsql';

CREATE VIEW foo_strings AS
     SELECT * FROM foo_strings() AS (foo_id int, bar_string text);

---
Richard Brooksby <rb@ravenbrook.com>
Senior Consultant
Ravenbrook Limited <http://www.ravenbrook.com/>
PO Box 205, Cambridge CB2 1AN, United Kingdom
Voice: +44 777 9996245  Fax: +44 870 1641432

Re: Text pattern JOINs that use indexes

From

Tom Lane

Date:

16 March 2004, 19:06:31

Richard Brooksby <rb@ravenbrook.com> writes:
> Well, you can write that, but it won't use a btree index on
> bar_strings(string) because the planned doesn't know that the prefix
> doesn't contain wildcards.  So instead we have to plan each lookup with
> a constant string:

Another possibility is to manually hack up the query with the index
boundary conditions that the planner won't generate because it's not
sure about the wildcard situation.  Something like

    select * from prefixes, strings where
        string like (prefix || '%')
        and string >= prefix
        and string <= (prefix || '~~~~~~~');

The trick here is to generate the upper bound string correctly ---
I've cheated quite a lot in the above example by assuming that '~'
sorts larger than any character you'd actually have in your strings.
But if you know your data well you may be able to do this reliably.

Beware that the above will almost certainly not work if you're not
running in C locale.  Other locales have bizarre sorting rules that
will cause the >=/<= range to not match the prefix very well.  However,
if you are getting indexscan plans with plain constant patterns then
you are in C locale, because the planner can't solve that problem either...

            regards, tom lane