Thread: Finding rows with text columns beginning with other text columns
Assume we have a table "a" with a text column "txt" and an index on that column. A query like the following will then be very perfomant since it can use the index: select * from a where txt like 'a%' (Assume also that the server is using the C locale or the index is set up with text_pattern_ops, so that this really works.) Now take a second, similar table "b" (can be the same table). We want to find all entries in b where txt begins with an existing txt entry in a: select * from b join a on b.txt like a.txt||'%' On the first glance you would expect that this is performant since it can use the index, but sadly it doesn't work. The problem seems to be that Postgres can not guarantee that column a.txt does not contain a '%', so it cannot optimize. I feel there should be a performat way to query these entries, but I can't come up with anything. Can anybody help me? Thanks, -- Christoph
On 10 May 2010, at 24:01, Christoph Zwerschke wrote: > We want to find all entries in b where txt begins with an > existing txt entry in a: > > select * from b join a on b.txt like a.txt||'%' > > On the first glance you would expect that this is performant > since it can use the index, but sadly it doesn't work. > The problem seems to be that Postgres can not guarantee that > column a.txt does not contain a '%', so it cannot optimize. > > I feel there should be a performat way to query these entries, > but I can't come up with anything. Can anybody help me? Have you tried using substring instead of like? Alban Hertroys -- If you can't see the forest for the trees, cut the trees and you'll see there is no forest. !DSPAM:737,4be7d6ec10411051620847!
Am 10.05.2010 11:50 schrieb Alban Hertroys: > On 10 May 2010, at 24:01, Christoph Zwerschke wrote: > >> select * from b join a on b.txt like a.txt||'%' >> >> I feel there should be a performat way to query these entries, >> but I can't come up with anything. Can anybody help me? > > Have you tried using substring instead of like? How exactly? I tried this: substr(b.txt, 1, length(a.txt)) = a.txt but it cannot be optimized and results in a nested loop, too. It only works with a fixed length: substr(b.txt, 1, 3) = a.txt So theoretically I could do something like select * from b join a on substr(b.txt, 1, 1) = a.txt and length(b.txt) = 1 union select * from b join a on substr(b.txt, 1, 2) = a.txt and length(b.txt) = 2 union select * from b join a on substr(b.txt, 1, 3) = a.txt and length(b.txt) = 3 union ... ... up to the maximum possible string length in a.txt. Not very elegant. If the question is not finding text cols in b starting with text cols in a, but text cols in b starting with text cols in a as their first word, then the following join condition works very well: split_part(b.txt, ' ', 1) = a.txt But I'm still looking for a simple solution to the original problem. -- Christoph
On 10 May 2010, at 21:24, Christoph Zwerschke wrote: > Am 10.05.2010 11:50 schrieb Alban Hertroys: > > On 10 May 2010, at 24:01, Christoph Zwerschke wrote: > > > >> select * from b join a on b.txt like a.txt||'%' > >> > >> I feel there should be a performat way to query these entries, > >> but I can't come up with anything. Can anybody help me? > > > > Have you tried using substring instead of like? > > How exactly? I tried this: > > substr(b.txt, 1, length(a.txt)) = a.txt > > but it cannot be optimized and results in a nested loop, too. I feared as much, but it was worth a try. Thinking more on the issue, I don't see a way to prevent the nested loop as there's no way to decide beforehand what partof the string to index for b.txt. It depends on a.txt after all. You would basically need a cross-table index, those are not supported. If it were, you could create a functional index ofsubstrings of b.txt with string lengths from a.txt (eeps, that'd be a table product!). Your best solution is probably to add a column to b that contains the substring of b.txt that would match a.txt. Alban Hertroys -- If you can't see the forest for the trees, cut the trees and you'll see there is no forest. !DSPAM:737,4be87c0b10418212361837!
Am 10.05.2010 23:34 schrieb Alban Hertroys: > Thinking more on the issue, I don't see a way to prevent the nested > loop as there's no way to decide beforehand what part of the string to > index for b.txt. It depends on a.txt after all. Yes, that seems to be the gist of the matter. I just felt I might have missed something. Thanks for your answers. -- Christoph