Thread: Finding rows with text columns beginning with other text columns

Finding rows with text columns beginning with other text columns

From

Christoph Zwerschke

Date:

09 May 2010, 19:06:52

Assume we have a table "a" with a text column "txt"
and an index on that column.

A query like the following will then be very perfomant
since it can use the index:

select * from a where txt like 'a%'

(Assume also that the server is using the C locale or the index
is set up with text_pattern_ops, so that this really works.)

Now take a second, similar table "b" (can be the same table).

We want to find all entries in b where txt begins with an
existing txt entry in a:

select * from b join a on b.txt like a.txt||'%'

On the first glance you would expect that this is performant
since it can use the index, but sadly it doesn't work.
The problem seems to be that Postgres can not guarantee that
column a.txt does not contain a '%', so it cannot optimize.

I feel there should be a performat way to query these entries,
but I can't come up with anything. Can anybody help me?

Thanks,
-- Christoph

Re: Finding rows with text columns beginning with other text columns

From

Alban Hertroys

Date:

10 May 2010, 06:50:47

On 10 May 2010, at 24:01, Christoph Zwerschke wrote:

> We want to find all entries in b where txt begins with an
> existing txt entry in a:
>
> select * from b join a on b.txt like a.txt||'%'
>
> On the first glance you would expect that this is performant
> since it can use the index, but sadly it doesn't work.
> The problem seems to be that Postgres can not guarantee that
> column a.txt does not contain a '%', so it cannot optimize.
>
> I feel there should be a performat way to query these entries,
> but I can't come up with anything. Can anybody help me?


Have you tried using substring instead of like?

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.


!DSPAM:737,4be7d6ec10411051620847!

Re: Finding rows with text columns beginning with other text columns

From

Christoph Zwerschke

Date:

10 May 2010, 16:29:51

Am 10.05.2010 11:50 schrieb Alban Hertroys:
 > On 10 May 2010, at 24:01, Christoph Zwerschke wrote:
 >
 >> select * from b join a on b.txt like a.txt||'%'
 >>
 >> I feel there should be a performat way to query these entries,
 >> but I can't come up with anything. Can anybody help me?
 >
 > Have you tried using substring instead of like?

How exactly? I tried this:

     substr(b.txt, 1, length(a.txt)) = a.txt

but it cannot be optimized and results in a nested loop, too.

It only works with a fixed length:

     substr(b.txt, 1, 3) = a.txt

So theoretically I could do something like

select * from b join a
on substr(b.txt, 1, 1) = a.txt and length(b.txt) = 1
union select * from b join a
on substr(b.txt, 1, 2) = a.txt and length(b.txt) = 2
union select * from b join a
on substr(b.txt, 1, 3) = a.txt and length(b.txt) = 3
union ...

... up to the maximum possible string length in a.txt. Not very elegant.

If the question is not finding text cols in b starting with text cols in
a, but text cols in b starting with text cols in a as their first word,
then the following join condition works very well:

     split_part(b.txt, ' ', 1) = a.txt

But I'm still looking for a simple solution to the original problem.

-- Christoph

Re: Finding rows with text columns beginning with other text columns

From

Alban Hertroys

Date:

10 May 2010, 18:35:16

On 10 May 2010, at 21:24, Christoph Zwerschke wrote:

> Am 10.05.2010 11:50 schrieb Alban Hertroys:
> > On 10 May 2010, at 24:01, Christoph Zwerschke wrote:
> >
> >> select * from b join a on b.txt like a.txt||'%'
> >>
> >> I feel there should be a performat way to query these entries,
> >> but I can't come up with anything. Can anybody help me?
> >
> > Have you tried using substring instead of like?
>
> How exactly? I tried this:
>
>    substr(b.txt, 1, length(a.txt)) = a.txt
>
> but it cannot be optimized and results in a nested loop, too.

I feared as much, but it was worth a try.

Thinking more on the issue, I don't see a way to prevent the nested loop as there's no way to decide beforehand what
partof the string to index for b.txt. It depends on a.txt after all. 

You would basically need a cross-table index, those are not supported. If it were, you could create a functional index
ofsubstrings of b.txt with string lengths from a.txt (eeps, that'd be a table product!). 

Your best solution is probably to add a column to b that contains the substring of b.txt that would match a.txt.

Alban Hertroys

--
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.

!DSPAM:737,4be87c0b10418212361837!

Re: Finding rows with text columns beginning with other text columns

From

Christoph Zwerschke

Date:

11 May 2010, 18:04:09

Am 10.05.2010 23:34 schrieb Alban Hertroys:
 > Thinking more on the issue, I don't see a way to prevent the nested
 > loop as there's no way to decide beforehand what part of the string to
 > index for b.txt. It depends on a.txt after all.

Yes, that seems to be the gist of the matter. I just felt I might have
missed something. Thanks for your answers.

-- Christoph